In aspseek.conf I have:

        Converter application/pdf text/html /users/aspseek/sbin/pdftohtml -i -noframes $in $out.html

And this is what index prints, which looks fine to me:

www@I011-32:/users/aspseek/sbin> ./index -a -m -u http://..../test.pdf
Loading configuration from /users/aspseek/etc/db.conf
Loading configuration from /users/aspseek/etc/ucharset.conf
Loading configuration from /users/aspseek/etc/stopwords.conf
Loading configuration from /users/aspseek/etc/server.url
Loading configuration from /users/aspseek/etc/allow.url
Loading configuration from /users/aspseek/etc/aspseek.conf
Adding URL: http://..../test.pdf
exec /users/aspseek/sbin/pdftohtml -i -noframes /tmp/asiLa7HF2 /tmp/asoOpsGUU.html
Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Saving real-time database ... done.
Saving delta files [..................................................] done.
Deleting 'deleted' records from urlword[s] ... done. (0 records deleted)
Saving real-time ... done
Saving redirects ... done
Splitting href delta file ... done
Saving href delta files ... done
Saving direct href delta files ... done
Calculating ranks  [................................................] done.
Saving lastmods ... done
Generating word site ... done
Generating subset http://..../% ... done (193 URLs)
index process finished.

By the way, those two temp files are not deleted from /tmp, which may be
another bug. /tmp/asoOpsGUU.html is an HTML export of the PDF file, so the
text is recognized and exported correctly, but when I search for one of the
words from this file I get no results...
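
Since the exec line above shows the converter being given /tmp/asoOpsGUU.html, the value substituted for $out was presumably /tmp/asoOpsGUU. Whether index then reads back the bare $out name rather than the .html one is an assumption here, not something the log confirms. A minimal sketch for checking which of the two candidate files exists and how large it is (the helper name converter_outputs is made up for illustration):

```python
import os


def converter_outputs(out_base):
    """Return {path: size in bytes, or None if the file is missing}
    for the bare $out name and the $out.html name."""
    candidates = [out_base, out_base + ".html"]
    sizes = {}
    for path in candidates:
        sizes[path] = os.path.getsize(path) if os.path.isfile(path) else None
    return sizes


if __name__ == "__main__":
    # Base name inferred from the exec line in the log; adjust as needed.
    for path, size in converter_outputs("/tmp/asoOpsGUU").items():
        print(path, "missing" if size is None else f"{size} bytes")
```

If only the .html file has content, the converter and index may simply disagree about the output file name.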

The urlword table says:

*************************** 1. row ***************************
         url_id: 100
        site_id: 1
        deleted: 0
            url: http://.../versorgungsreform.pdf
next_index_time: 1031828953
         status: 200
            crc: d41d8cd98f00b204e9800998ecf8427e
  last_modified: Wed, 11 Sep 2002 02:00:11 GMT
           etag: "1cc11a-cb26-3d7ea3ab"
last_index_time: 1031742553
       referrer: 23
            tag: 0
           hops: 3
          redir: 0
         origin: 0
1 row in set (0.00 sec)
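
One detail in the row above may be worth a look: the crc value d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input. Assuming the crc column holds an MD5 of the content that index received for this URL (an assumption, not confirmed in this thread), that would suggest index saw an empty document. A small sketch to check a crc value against the empty-input digest (the function name is hypothetical):

```python
import hashlib

# MD5 digest of zero bytes of input.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()


def is_md5_of_empty_input(crc):
    """True if the given hex digest equals the MD5 of empty input."""
    return crc.strip().lower() == EMPTY_MD5


if __name__ == "__main__":
    # crc value copied from the urlword row above.
    print(is_md5_of_empty_input("d41d8cd98f00b204e9800998ecf8427e"))
```

This only compares digests; it says nothing about why the content would be empty.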

Regards,

Markus Rietzler
* <rietzler_software/>
* RZF NRW
* Tel: 0211.4572-130



-----Original Message-----
From: Gerrit Hannaert [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 11 September 2002 10:48
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: AW: [aseek-users] external converters, pdf files

Hi,
What does your 'converter' line in aspseek.conf look like?
Also try running index -a -m -u "%.pdf" and see what the output is 
(perhaps an error message is displayed).

Cheers,

[EMAIL PROTECTED] wrote:

>No, no,
>these are "plain" PDF files, mostly converted from WinWord, so there is a
>lot of text. When I use pdf2text or pdftohtml and look at the result, I get
>all the words/text from the PDF file. So something different is happening
>here...
>
>Regards,
>
>Markus Rietzler
>* <rietzler_software/>
>* RZF NRW
>* Tel: 0211.4572-130
>
>
>
>-----Original Message-----
>From: Gregory Kozlovsky [mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, 11 September 2002 10:07
>To: '[EMAIL PROTECTED]'
>Subject: RE: [aseek-users] external converters, pdf files
>
>Sometimes, what appears to be text in .pdf files is actually scanned images
>that cannot be indexed. Check for it.
>
>    Gregory Kozlovsky
>
>-----Original Message-----
>From: [EMAIL PROTECTED]
>[mailto:[EMAIL PROTECTED]]
>Sent: Wednesday, 11 September 2002 09:59
>To: [EMAIL PROTECTED]
>Subject: [aseek-users] external converters, pdf files
>
>
>Hi,
>I am trying to set up ASPseek with external converter support. I installed
>pdftohtml; indexing works fine and the PDF files seem to be processed. I can
>find the URLs of the PDF files in the urlword table, even with status code
>200. But when I search for words from the PDF files I get no results; the
>PDF files are not listed in the results...
>
>Any idea?
>
>Thanks
>
>Regards,
>
>Markus Rietzler
>* <rietzler_software/>
>* RZF NRW
>* Tel: 0211.4572-130
>
>
>
>-----Original Message-----
>From: Charlie Farinella [mailto:[EMAIL PROTECTED]]
>Sent: Tuesday, 10 September 2002 23:35
>To: [EMAIL PROTECTED]
>Subject: [aseek-users] selective removal of urls
>
>Is there a way to selectively remove a url from our database after it
>has been indexed?  We would like to remove porn sites from a family
>friendly database.
>
>  
>

