Re: AW: AW: [aseek-users] external converters, pdf files

Gerrit Hannaert Wed, 11 Sep 2002 15:06:01 -0700

Puzzling. I've had a few problems of my own, but in my case it turned out 
the 'exec' never ran.


AFAIK, the converter line looks fine (as the exec statement seems to 
confirm).
Is clearing the database an option? You could try that.
Do other converters work? Stupid question but are html or text files 
indexed properly?

Otherwise I think this is a question for someone with a bit more 
knowledge of the internal workings of the indexer...

-G



[EMAIL PROTECTED] wrote:

>in aspseek.conf i have
>
>       Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html
>
>and this is what index says, looks good to me
>
>www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
>Loading configuration from /users/aspseek/etc/db.conf
>Loading configuration from /users/aspseek/etc/ucharset.conf
>Loading configuration from /users/aspseek/etc/stopwords.conf
>Loading configuration from /users/aspseek/etc/server.url
>Loading configuration from /users/aspseek/etc/allow.url
>Loading configuration from /users/aspseek/etc/aspseek.conf
>Adding URL: http://..../test.pdf
>exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
>Page-1
>Page-2
>Page-3
>Page-4
>Page-5
>Page-6
>Page-7
>Page-8
>Saving real-time database ... done.
>Saving delta files [..................................................]
>done.
>Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
>Saving real-time ... done
>Saving redirects ... done
>Splitting href delta file ... done
>Saving href delta files ... done
>Saving direct href delta files ... done
>Calculating ranks  [................................................] done.
>Saving lastmods ... done
>Generating word site ... done
>Generating subset http://..../% ... done (193 URLs)
>index process finished.
>
>btw: those two temp files are not deleted from /tmp, maybe another bug.
>/tmp/asoOpsGUU.html is an html export of the pdf file, so the text is recognized
>and exported correctly, but when i search for one of the words from this file
>i get no results...
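
One thing that might be worth checking here: the exec line shows the
converter being handed /tmp/asoOpsGUU.html, which suggests $out expanded to
/tmp/asoOpsGUU and the ".html" came from the config line. If index reads back
exactly the $out path (an assumption, nothing in this thread confirms it), it
would find an empty or missing file. A rough Python sketch to see which of the
two paths actually holds data; the function name and the sample file name are
made up for illustration:

```python
import os
import tempfile


def nonempty_candidates(out_path, suffixes=("", ".html")):
    """Return the candidate output paths that exist and hold at least one byte.

    out_path is the bare path substituted for $out; the ".html" suffix mirrors
    the "$out.html" in the Converter line (assumed to be the only variant).
    """
    found = []
    for suffix in suffixes:
        candidate = out_path + suffix
        if os.path.isfile(candidate) and os.path.getsize(candidate) > 0:
            found.append(candidate)
    return found


if __name__ == "__main__":
    # Simulate the situation described above with throwaway files:
    # an empty file at the bare path and real content at "<path>.html".
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "asoEXAMPLE")  # hypothetical temp name
        open(out, "w").close()
        with open(out + ".html", "w") as fh:
            fh.write("<html>converted text</html>")
        print(nonempty_candidates(out))
```

On the real machine you would call nonempty_candidates("/tmp/asoOpsGUU")
right after an index run. If only the ".html" path comes back, trying the
Converter line with plain $out would be a cheap experiment.
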
>
>urlword-table says
>
>*************************** 1. row ***************************
>         url_id: 100
>        site_id: 1
>        deleted: 0
>            url: http://.../versorgungsreform.pdf
>next_index_time: 1031828953
>         status: 200
>            crc: d41d8cd98f00b204e9800998ecf8427e
>  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
>           etag: "1cc11a-cb26-3d7ea3ab"
>last_index_time: 1031742553
>       referrer: 23
>            tag: 0
>           hops: 3
>          redir: 0
>         origin: 0
>1 row in set (0.00 sec)
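
A side observation on the crc value above: d41d8cd98f00b204e9800998ecf8427e
is the MD5 digest of zero bytes of input. Assuming the crc column holds an MD5
of the fetched or converted body (an assumption about the aspseek schema, not
confirmed in this thread), that would point to index having seen an empty
document. A small Python check; the helper name is hypothetical:

```python
import hashlib

# MD5 digest of the empty byte string, computed rather than hard-coded.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def looks_like_empty_body(crc):
    """Return True if the given hex digest equals the MD5 of empty input."""
    return crc.strip().lower() == EMPTY_MD5


if __name__ == "__main__":
    # crc value copied from the urlword row quoted above
    print(looks_like_empty_body("d41d8cd98f00b204e9800998ecf8427e"))  # True
```

If html or text URLs in the same table show different crc values, that would
fit the idea that only the converted PDF content arrives empty.
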
>
>Regards
>
>Markus Rietzler
>* <rietzler_software/>
>* RZF NRW
>* Tel: 0211.4572-130
>
>
>
>-----Original Message-----
>From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, 11 September 2002 10:48
>To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>Subject: Re: AW: [aseek-users] external converters, pdf files
>
>Hi,
>What does your 'converter' line in aspseek.conf look like?
>Also try running index -a -m -u "%.pdf" and see what the output is 
>(perhaps an error message is displayed).
>
>Cheers,
>
>[EMAIL PROTECTED] wrote:
>
>  
>
>>nono,
>>these are "plain" pdf files, mostly converted from winword. so there is a
>>lot of text. when i use pdf2text or pdftohtml and look in the result, i get
>>all the words/text from the pdf file. so something different happens
>>here...
>>
>>Regards
>>
>>Markus Rietzler
>>* <rietzler_software/>
>>* RZF NRW
>>* Tel: 0211.4572-130
>>
>>
>>
>>-----Original Message-----
>>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, 11 September 2002 10:07
>>To: '[EMAIL PROTECTED]'
>>Subject: RE: [aseek-users] external converters, pdf files
>>
>>Sometimes, what appears to be text in .pdf files is actually scanned images
>>that cannot be indexed. Check for it.
>>
>>   Gregory Kozlovsky
>>
>>-----Original Message-----
>>From: [EMAIL PROTECTED]
>>[mailto:[EMAIL PROTECTED]]
>>Sent: Wednesday, 11 September 2002 09:59
>>To: [EMAIL PROTECTED]
>>Subject: [aseek-users] external converters, pdf files
>>
>>
>>hi,
>>i am trying to setup aspseek with external converter support. i installed
>>pdftohtml, indexing works fine, pdf files seem to be processed, i can find
>>the urls to the pdf files in urlword table even with status code 200. but
>>when i do a search with words from the pdf files i get no results, the pdf
>>files are not listed in the results...
>>
>>any idea?
>>
>>thanxs
>>
>>Regards
>>
>>Markus Rietzler
>>* <rietzler_software/>
>>* RZF NRW
>>* Tel: 0211.4572-130
>>
>>
>>
>>-----Original Message-----
>>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
>>Sent: Tuesday, 10 September 2002 23:35
>>To: [EMAIL PROTECTED]
>>Subject: [aseek-users] selective removal of urls
>>
>>Is there a way to selectively remove a url from our database after it
>>has been indexed?  We would like to remove porn sites from a family
>>friendly database.
>>
