That trick worked (halfway). The PDF file is now being indexed, but I can't search for words that contain umlauts, for example. In the excerpt I see "?" in the places where the umlauts (ä, ö, ü) should be, so the charset of the document is wrong. Any ideas?
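A minimal sketch of how a charset mismatch can produce this symptom. It assumes, purely for illustration, that the converter output is ISO-8859-1 and is read as UTF-8; the thread does not confirm which charsets are actually involved. The helper names `decode_as` and `guess_charset` are hypothetical.

```python
# Sketch only: bytes written in one charset and read in another
# lose their umlauts. The charsets are assumptions, not taken from the thread.

def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes, substituting U+FFFD for undecodable bytes."""
    return raw.decode(charset, errors="replace")


def guess_charset(raw: bytes) -> str:
    """Rough guess: valid UTF-8 is reported as utf-8, anything else as iso-8859-1."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"


# Hypothetical converter output: the word "über" stored as ISO-8859-1.
sample = "über".encode("iso-8859-1")

print(repr(decode_as(sample, "utf-8")))       # umlaut replaced by U+FFFD
print(repr(decode_as(sample, "iso-8859-1")))  # umlaut intact
print(guess_charset(sample))
```

Running the same kind of check on the HTML that pdftohtml writes would show whether its bytes match the charset the indexer expects for that document.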
Kind regards,

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130

-----Original Message-----
From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 12 September 2002 02:48
To: [EMAIL PROTECTED]
Subject: Re: AW: AW: [aseek-users] external converters, pdf files

While trying out pdftohtml for myself (yeah, it's nicer than plain text
and provides a title too), I figured it out. pdftohtml adds an ".html"
extension to the output file, so index neither uses that file nor
deletes it afterwards.

The solution I use is simply:

    Converter application/pdf text/html /usr/bin/pdftohtml -i -noframes -stdout $in >$out

And so far it seems to be working...

Cheers,

[EMAIL PROTECTED] wrote:

>in aspseek.conf i have
>
>    Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
>
>and this is what index says, looks good to me
>
>www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
>Loading configuration from /users/aspseek/etc/db.conf
>Loading configuration from /users/aspseek/etc/ucharset.conf
>Loading configuration from /users/aspseek/etc/stopwords.conf
>Loading configuration from /users/aspseek/etc/server.url
>Loading configuration from /users/aspseek/etc/allow.url
>Loading configuration from /users/aspseek/etc/aspseek.conf
>Adding URL: http://..../test.pdf
>exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
>Page-1
>Page-2
>Page-3
>Page-4
>Page-5
>Page-6
>Page-7
>Page-8
>Saving real-time database ... done.
>Saving delta files [..................................................] done.
>Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
>Saving real-time ... done
>Saving redirects ... done
>Splitting href delta file ... done
>Saving href delta files ... done
>Saving direct href delta files ... done
>Calculating ranks [................................................] done.
>Saving lastmods ... done
>Generating word site ... done
>Generating subset http://..../% ... done (193 URLs)
>index process finished.
>
>btw: those two tempfiles are not deleted in /tmp, maybe another bug.
>/tmp/asoOpsGUU.html is an html export of the pdf file, so the text is
>recognized and exported correctly, but when i search for one of the words
>from this file i get no results...
>
>the urlword table says
>
>*************************** 1. row ***************************
>         url_id: 100
>        site_id: 1
>        deleted: 0
>            url: http://.../versorgungsreform.pdf
>next_index_time: 1031828953
>         status: 200
>            crc: d41d8cd98f00b204e9800998ecf8427e
>  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
>           etag: "1cc11a-cb26-3d7ea3ab"
>last_index_time: 1031742553
>       referrer: 23
>            tag: 0
>           hops: 3
>          redir: 0
>         origin: 0
>1 row in set (0.00 sec)
>
>Kind regards,
>
>Markus Rietzler
>* <rietzler_software/>
>* RZF NRW
>* Tel: 0211.4572-130
>
>-----Original Message-----
>From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, 11 September 2002 10:48
>To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>Subject: Re: AW: [aseek-users] external converters, pdf files
>
>Hi,
>What does your 'converter' line in aspseek.conf look like?
>Also try running index -a -m -u "%.pdf" and see what the output is
>(perhaps an error message is displayed).
>
>Cheers,
>
>[EMAIL PROTECTED] wrote:
>
>>no, no,
>>these are "plain" pdf files, mostly converted from winword, so there is a
>>lot of text. when i use pdf2text or pdftohtml and look at the result, i get
>>all the words/text from the pdf file. so something different happens here...
>>
>>Kind regards,
>>
>>Markus Rietzler
>>* <rietzler_software/>
>>* RZF NRW
>>* Tel: 0211.4572-130
>>
>>-----Original Message-----
>>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, 11 September 2002 10:07
>>To: '[EMAIL PROTECTED]'
>>Subject: RE: [aseek-users] external converters, pdf files
>>
>>Sometimes, what appears to be text in .pdf files is actually scanned images
>>that cannot be indexed. Check for it.
>>
>>  Gregory Kozlovsky
>>
>>-----Original Message-----
>>From: [EMAIL PROTECTED]
>>[mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, 11 September 2002 09:59
>>To: [EMAIL PROTECTED]
>>Subject: [aseek-users] external converters, pdf files
>>
>>hi,
>>i am trying to set up aspseek with external converter support. i installed
>>pdftohtml, indexing works fine, and the pdf files seem to be processed: i can
>>find the urls of the pdf files in the urlword table, even with status code
>>200. but when i search for words from the pdf files i get no results; the
>>pdf files are not listed in the results...
>>
>>any idea?
>>
>>thanks
>>
>>Kind regards,
>>
>>Markus Rietzler
>>* <rietzler_software/>
>>* RZF NRW
>>* Tel: 0211.4572-130
>>
>>-----Original Message-----
>>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
>>Sent: Tuesday, 10 September 2002 23:35
>>To: [EMAIL PROTECTED]
>>Subject: [aseek-users] selective removal of urls
>>
>>Is there a way to selectively remove a url from our database after it
>>has been indexed? We would like to remove porn sites from a
>>family-friendly database.
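A side note on the urlword row quoted above: the crc value shown there is identical to the MD5 digest of zero bytes. Assuming the crc column holds an MD5 of the content the indexer received (an assumption, nothing in the thread says so), that would fit the diagnosis that index never read the converter output while pdftohtml was writing to $out.html instead of $out. The sketch below only checks a digest against the empty-input MD5; `looks_like_empty_document` is a hypothetical helper name.

```python
import hashlib

# MD5 digest of empty input, computed rather than hard-coded.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def looks_like_empty_document(crc: str) -> bool:
    """Return True when the stored digest equals the MD5 of zero bytes."""
    return crc.strip().lower() == EMPTY_MD5


# Value copied from the urlword row in the quoted message.
print(looks_like_empty_document("d41d8cd98f00b204e9800998ecf8427e"))  # True
```

If the assumption holds, a row whose crc still has this value after re-indexing would suggest that the converter output is still not reaching the indexer.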
