AW: AW: AW: [aseek-users] external converters, pdf files

Markus Rietzler Thu, 12 Sep 2002 01:53:53 -0700

that trick worked (halfway).
the pdf file is being indexed, but i can't search for words containing e.g. umlauts.
in the excerpt i see "?" in the places where the umlauts (ä, ö, ü) should be.
so the charset of the document is wrong. any ideas?
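
A "?" where an umlaut should be usually means that text bytes were read or re-encoded with the wrong charset somewhere between converter and index. The following is a minimal Python sketch of the three typical outcomes. The charsets used (UTF-8, ISO-8859-1, ASCII) and the function names are assumptions for illustration only; which charsets pdftohtml and aspseek actually use in this setup is not known from this thread.

```python
# Sample text with umlauts, written with escapes so this source stays ASCII.
TEXT = "\u00c4nderung f\u00fcr \u00dcberg\u00e4nge"


def utf8_read_as_latin1(text: str) -> str:
    """UTF-8 bytes decoded as ISO-8859-1: each umlaut turns into two wrong characters."""
    return text.encode("utf-8").decode("iso-8859-1")


def latin1_read_as_utf8(text: str) -> str:
    """ISO-8859-1 bytes decoded as UTF-8: each umlaut becomes U+FFFD."""
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")


def forced_to_ascii(text: str) -> str:
    """Text squeezed into ASCII: each umlaut becomes a literal '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


print(forced_to_ascii(TEXT))  # prints: ?nderung f?r ?berg?nge
```

If the excerpt looks like the last case, one thing to check is whether the charset declared in the converter's HTML output matches the charset the index is configured to expect.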


regards

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130



-----Original Message-----
From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
Sent: Thursday, 12 September 2002 02:48
To: [EMAIL PROTECTED]
Subject: Re: AW: AW: [aseek-users] external converters, pdf files

While trying out pdftohtml for myself (yeah, it's nicer than plain text 
and provides a title too), I figured it out.
pdftohtml will add an ".html" extension to the output file; and hence 
index won't use it nor delete it afterwards. The solution I use is simply:

Converter application/pdf            text/html       /usr/bin/pdftohtml -i -noframes -stdout $in >$out

And so far it seems to be working...
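
One detail in the quoted urlword row further down fits this diagnosis: the crc value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of zero bytes. Assuming the crc column holds an MD5 digest of the converted content (an assumption, not confirmed anywhere in this thread), index read an empty $out file while pdftohtml wrote its result to $out.html. A minimal Python check, with hypothetical helper names:

```python
import hashlib

# crc value copied from the quoted urlword row in this thread.
CRC_FROM_URLWORD = "d41d8cd98f00b204e9800998ecf8427e"


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of data."""
    return hashlib.md5(data).hexdigest()


def looks_like_empty_content(crc: str) -> bool:
    """True if crc equals the MD5 digest of empty input."""
    return crc.lower() == md5_hex(b"")


print(looks_like_empty_content(CRC_FROM_URLWORD))  # prints: True
```

A row with status 200 and this crc would then be a quick sign that the converter produced nothing index could read.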

Cheers,



[EMAIL PROTECTED] wrote:

>in aspseek.conf i have
>
>       Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
>
>and this is what index says, looks good to me
>
>www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
>Loading configuration from /users/aspseek/etc/db.conf
>Loading configuration from /users/aspseek/etc/ucharset.conf
>Loading configuration from /users/aspseek/etc/stopwords.conf
>Loading configuration from /users/aspseek/etc/server.url
>Loading configuration from /users/aspseek/etc/allow.url
>Loading configuration from /users/aspseek/etc/aspseek.conf
>Adding URL: http://..../test.pdf
>exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
>Page-1
>Page-2
>Page-3
>Page-4
>Page-5
>Page-6
>Page-7
>Page-8
>Saving real-time database ... done.
>Saving delta files [..................................................] done.
>Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
>Saving real-time ... done
>Saving redirects ... done
>Splitting href delta file ... done
>Saving href delta files ... done
>Saving direct href delta files ... done
>Calculating ranks  [................................................] done.
>Saving lastmods ... done
>Generating word site ... done
>Generating subset http://..../% ... done (193 URLs)
>index process finished.
>
>btw: those two temp files are not deleted from /tmp, maybe another bug.
>/tmp/asoOpsGUU.html is an html export of the pdf file. so the text is recognized
>and exported correctly, but when i search for one of the words from this file
>i get no results...
>
>urlword-table says
>
>*************************** 1. row ***************************
>         url_id: 100
>        site_id: 1
>        deleted: 0
>            url: http://.../versorgungsreform.pdf
>next_index_time: 1031828953
>         status: 200
>            crc: d41d8cd98f00b204e9800998ecf8427e
>  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
>           etag: "1cc11a-cb26-3d7ea3ab"
>last_index_time: 1031742553
>       referrer: 23
>            tag: 0
>           hops: 3
>          redir: 0
>         origin: 0
>1 row in set (0.00 sec)
>
>regards
>
>Markus Rietzler
>* <rietzler_software/>
>* RZF NRW
>* Tel: 0211.4572-130
>
>
>
>-----Original Message-----
>From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, 11 September 2002 10:48
>To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>Subject: Re: AW: [aseek-users] external converters, pdf files
>
>Hi,
>What does your 'converter' line in aspseek.conf look like?
>Also try running index -a -m -u "%.pdf" and see what the output is 
>(perhaps an error message is displayed).
>
>Cheers,
>
>[EMAIL PROTECTED] wrote:
>
>
>>nono,
>>these are "plain" pdf files, mostly converted from winword. so there is a
>>lot of text. when i use pdf2text or pdftohtml and look in the result, i get
>>all the words/text from the pdf file. so something different happens
>>here...
>>
>>regards
>>
>>Markus Rietzler
>>* <rietzler_software/>
>>* RZF NRW
>>* Tel: 0211.4572-130
>>
>>
>>
>>-----Original Message-----
>>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, 11 September 2002 10:07
>>To: '[EMAIL PROTECTED]'
>>Subject: RE: [aseek-users] external converters, pdf files
>>
>>Sometimes, what appears to be text in .pdf files is actually scanned
>>images that cannot be indexed. Check for it.
>>
>>   Gregory Kozlovsky
>>
>>-----Original Message-----
>>From: [EMAIL PROTECTED]
>>[mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, 11 September 2002 09:59
>>To: [EMAIL PROTECTED]
>>Subject: [aseek-users] external converters, pdf files
>>
>>
>>hi,
>>i am trying to set up aspseek with external converter support. i installed
>>pdftohtml, indexing works fine, pdf files seem to be processed, and i can find
>>the urls of the pdf files in the urlword table, even with status code 200. but
>>when i search for words from the pdf files i get no results; the pdf files
>>are not listed in the results...
>>
>>any idea?
>>
>>thanks
>>
>>regards
>>
>>Markus Rietzler
>>* <rietzler_software/>
>>* RZF NRW
>>* Tel: 0211.4572-130
>>
>>
>>
>>-----Original Message-----
>>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
>>Sent: Tuesday, 10 September 2002 23:35
>>To: [EMAIL PROTECTED]
>>Subject: [aseek-users] selective removal of urls
>>
>>Is there a way to selectively remove a url from our database after it
>>has been indexed?  We would like to remove porn sites from a family
>>friendly database.
>>
