AW: AW: AW: AW: [aseek-users] external converters, pdf files

Markus Rietzler Thu, 12 Sep 2002 02:18:35 -0700

Hmm,
with the charset added there is no exec line in the index log, so no
PDF conversion happens at all.
I tried

        Converter application/pdf text/html; charset=iso8859-1 /usr/bin/pdftohtml -i -noframes -stdout $in >$out

and

        Converter application/pdf "text/html; charset=iso8859-1" /usr/bin/pdftohtml -i -noframes -stdout $in >$out

Without the charset option the PDFs are indexed...

Best regards

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130



-----Original Message-----
From: Kir Kolyshkin [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 12 September 2002 11:26
To: [EMAIL PROTECTED]
Subject: Re: AW: AW: AW: [aseek-users] external converters, pdf files

Try to add ";charset=iso8859-1" (substitute with charset you need)
after "text/html" in Converter line.

Sorry, seems it is missing from man page; I will add it.

So, it will look like:

Converter application/pdf text/html; charset=iso8859-5 /usr/bin/pdftohtml -i -noframes -stdout $in >$out

Please report here if it works or not :)

[EMAIL PROTECTED] wrote:
> 
> That trick worked (halfway).
> The PDF file is being indexed, but I can't search for words containing,
> e.g., umlauts.
> In the excerpt I see "?" in the places where umlauts (ä, ö, ü) should be,
> so the charset of the document is wrong. Any ideas?
> 
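The "?" symptom described above is what typically shows up when bytes in one charset are decoded under a different one. The following is only a minimal sketch of that effect, not of what index does internally; the function name and the sample string are made up for illustration, and the thread does not say which charset pdftohtml actually emitted.

```python
def decode_converter_output(raw, declared_charset):
    """Decode converter output bytes under the declared charset.

    Bytes that are invalid in that charset become U+FFFD, the replacement
    character that many programs display as "?".
    """
    return raw.decode(declared_charset, errors="replace")


# Hypothetical sample text containing German umlauts and a sharp s.
sample = "Änderung für Übergröße"

# Pretend the converter wrote ISO-8859-1 bytes.
raw = sample.encode("iso-8859-1")

# Decoded with the matching charset, the text round-trips.
as_latin1 = decode_converter_output(raw, "iso-8859-1")

# Decoded as plain ASCII, each non-ASCII byte is replaced.
as_ascii = decode_converter_output(raw, "ascii")

if __name__ == "__main__":
    print(as_latin1)
    print(as_ascii)
```

If the excerpt shows replacement marks only where non-ASCII letters belong, a mismatch between the converter's output charset and the charset index assumes is a plausible first thing to check.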
> Best regards
> 
> Markus Rietzler
> * <rietzler_software/>
> * RZF NRW
> * Tel: 0211.4572-130
> 
> -----Original Message-----
> From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, 12 September 2002 02:48
> To: [EMAIL PROTECTED]
> Subject: Re: AW: AW: [aseek-users] external converters, pdf files
> 
> While trying out pdftohtml for myself (yeah, it's nicer than plain text
> and provides a title too), I figured it out.
> pdftohtml will add an ".html" extension to the output file; and hence
> index won't use it nor delete it afterwards. The solution I use is simply:
> 
> Converter application/pdf            text/html       /usr/bin/pdftohtml -i -noframes -stdout $in >$out
> 
> And so far it seems to be working...
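The file-name mismatch Gerrit describes can be sketched with a small simulation. This is an illustration under stated assumptions, not aspseek or pdftohtml code: fake_converter, indexer_read and simulate are hypothetical names, and the sketch assumes the indexer reads exactly the path it passed as $out and nothing else.

```python
import os
import tempfile


def fake_converter(in_path, out_path, appends_html):
    """Stand-in for an external converter (hypothetical, not pdftohtml).

    If appends_html is True it writes to out_path + ".html", mimicking a
    tool that adds its own extension to the output name it was given.
    """
    target = out_path + ".html" if appends_html else out_path
    with open(in_path, "rb") as src, open(target, "wb") as dst:
        dst.write(src.read())
    return target


def indexer_read(out_path):
    """Hypothetical indexer step: read exactly the file named by $out."""
    if not os.path.exists(out_path):
        return b""
    with open(out_path, "rb") as f:
        return f.read()


def simulate(appends_html):
    """Run the fake converter and return what the indexer would see."""
    with tempfile.TemporaryDirectory() as tmp:
        in_path = os.path.join(tmp, "in")
        out_path = os.path.join(tmp, "out")
        with open(in_path, "wb") as f:
            f.write(b"text extracted from the pdf")
        fake_converter(in_path, out_path, appends_html)
        return indexer_read(out_path)


if __name__ == "__main__":
    print(simulate(appends_html=True))
    print(simulate(appends_html=False))
```

Writing to stdout and redirecting into $out, as in the Converter line above, sidesteps the problem because the shell, not the converter, decides the output file name.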
> 
> Cheers,
> 
> [EMAIL PROTECTED] wrote:
> 
> >in aspseek.conf i have
> >
> >       Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
> >
> >and this is what index says, looks good for me
> >
> >www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
> >Loading configuration from /users/aspseek/etc/db.conf
> >Loading configuration from /users/aspseek/etc/ucharset.conf
> >Loading configuration from /users/aspseek/etc/stopwords.conf
> >Loading configuration from /users/aspseek/etc/server.url
> >Loading configuration from /users/aspseek/etc/allow.url
> >Loading configuration from /users/aspseek/etc/aspseek.conf
> >Adding URL: http://..../test.pdf
> >exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
> >Page-1
> >Page-2
> >Page-3
> >Page-4
> >Page-5
> >Page-6
> >Page-7
> >Page-8
> >Saving real-time database ... done.
> >Saving delta files [..................................................] done.
> >Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
> >Saving real-time ... done
> >Saving redirects ... done
> >Splitting href delta file ... done
> >Saving href delta files ... done
> >Saving direct href delta files ... done
> >Calculating ranks  [................................................] done.
> >Saving lastmods ... done
> >Generating word site ... done
> >Generating subset http://..../% ... done (193 URLs)
> >index process finished.
> >
> >btw: those two temp files are not deleted from /tmp, maybe another bug.
> >/tmp/asoOpsGUU.html is an HTML export of the PDF file, so the text is
> >recognized and exported correctly, but when I search for one of the words
> >from this file I get no results...
> >
> >urlword-table says
> >
> >*************************** 1. row ***************************
> >         url_id: 100
> >        site_id: 1
> >        deleted: 0
> >            url: http://.../versorgungsreform.pdf
> >next_index_time: 1031828953
> >         status: 200
> >            crc: d41d8cd98f00b204e9800998ecf8427e
> >  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
> >           etag: "1cc11a-cb26-3d7ea3ab"
> >last_index_time: 1031742553
> >       referrer: 23
> >            tag: 0
> >           hops: 3
> >          redir: 0
> >         origin: 0
> >1 row in set (0.00 sec)
> >
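One detail in the urlword row above may be worth a look: the crc value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input. Assuming that column holds an MD5 over the content index received (an assumption; nothing in this thread confirms how aspseek computes it), this would be consistent with index having seen no converter output at all. A minimal check, with a hypothetical helper name:

```python
import hashlib

# Value copied from the urlword row quoted above.
CRC_FROM_URLWORD = "d41d8cd98f00b204e9800998ecf8427e"


def is_md5_of_empty(hex_digest):
    """Return True if hex_digest equals the MD5 digest of zero bytes."""
    return hex_digest.lower() == hashlib.md5(b"").hexdigest()


if __name__ == "__main__":
    print(is_md5_of_empty(CRC_FROM_URLWORD))
```

The same comparison can be run against other rows to spot documents whose converted content may have arrived empty.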
> >Best regards
> >
> >Markus Rietzler
> >* <rietzler_software/>
> >* RZF NRW
> >* Tel: 0211.4572-130
> >
> >
> >
> >-----Original Message-----
> >From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
> >Sent: Wednesday, 11 September 2002 10:48
> >To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> >Subject: Re: AW: [aseek-users] external converters, pdf files
> >
> >Hi,
> >What does your 'converter' line in aspseek.conf look like?
> >Also try running index -a -m -u "%.pdf" and see what the output is
> >(perhaps an error message is displayed).
> >
> >Cheers,
> >
> >[EMAIL PROTECTED] wrote:
> >
> >
> >
> >>No, no,
> >>these are "plain" PDF files, mostly converted from WinWord, so there is a
> >>lot of text. When I use pdf2text or pdftohtml and look at the result, I
> >>get all the words/text from the PDF file. So something different is
> >>happening here...
> >>
> >>Best regards
> >>
> >>Markus Rietzler
> >>* <rietzler_software/>
> >>* RZF NRW
> >>* Tel: 0211.4572-130
> >>
> >>
> >>
> >>-----Original Message-----
> >>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
> >>Sent: Wednesday, 11 September 2002 10:07
> >>To: '[EMAIL PROTECTED]'
> >>Subject: RE: [aseek-users] external converters, pdf files
> >>
> >>Sometimes, what appears to be text in .pdf files is actually scanned
> >>images that cannot be indexed. Check for it.
> >>
> >>   Gregory Kozlovsky
> >>
> >>-----Original Message-----
> >>From: [EMAIL PROTECTED]
> >>[mailto:[EMAIL PROTECTED]]
> >>Sent: Wednesday, 11 September 2002 09:59
> >>To: [EMAIL PROTECTED]
> >>Subject: [aseek-users] external converters, pdf files
> >>
> >>
> >>Hi,
> >>I am trying to set up aspseek with external converter support. I installed
> >>pdftohtml; indexing works fine and the PDF files seem to be processed. I
> >>can find the URLs of the PDF files in the urlword table, even with status
> >>code 200. But when I search for words from the PDF files I get no results;
> >>the PDF files are not listed in the results...
> >>
> >>Any ideas?
> >>
> >>Thanks
> >>
> >>Best regards
> >>
> >>Markus Rietzler
> >>* <rietzler_software/>
> >>* RZF NRW
> >>* Tel: 0211.4572-130
> >>
> >>
> >>
> >>-----Original Message-----
> >>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
> >>Sent: Tuesday, 10 September 2002 23:35
> >>To: [EMAIL PROTECTED]
> >>Subject: [aseek-users] selective removal of urls
> >>
> >>Is there a way to selectively remove a url from our database after it
> >>has been indexed?  We would like to remove porn sites from a family
> >>friendly database.
> >>
> >>
> >>
> >>
> >>
> >
> >
> >
> >

-- [EMAIL PROTECTED]  ICQ7551596  [EMAIL PROTECTED] --
   Guinness a Day Keeps a Doctor Away (people's wisdom)
