Hello
I am running htdig 3.1.5 on Solaris 2.6.
Indexing works for some PDF files, such as
http://public.archi.fr/PubliMIARA/Communique_33.pdf
but not for others, such as
http://public.archi.fr/PubliMIARA/AthisMons.pdf
So I assume my configuration and software are OK. The htdig configuration contains:
external_parsers: application/pdf /usr/local/bin/parsepdf.pl
and parsepdf.pl has been edited to point to the correct paths:
$parser = "/usr/local/bin/pdftotext";
$info = "/usr/local/bin/pdfinfo";
Both programs come from xpdf for Solaris.
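
One way to separate an htdig problem from a PDF problem is to run the converter by hand on a local copy of each file and compare how much text comes out. The following is only a sketch under stated assumptions: the pdftotext path is the one from the parsepdf.pl setting above, the function names are hypothetical, and the PDFs are assumed to have been downloaded locally first.

```python
import subprocess

# Assumed path, copied from the $parser setting in parsepdf.pl.
PDFTOTEXT = "/usr/local/bin/pdftotext"


def build_command(pdf_path, parser=PDFTOTEXT):
    """Return the argv list that asks pdftotext to write text to stdout."""
    # Passing "-" as the output file name sends the text to standard output.
    return [parser, pdf_path, "-"]


def count_words(text):
    """Count whitespace-separated tokens in the extracted text."""
    return len(text.split())


def extracted_word_count(pdf_path, parser=PDFTOTEXT):
    """Run the converter on a local PDF and return how many words came out."""
    result = subprocess.run(
        build_command(pdf_path, parser),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,
    )
    # Decode leniently: the output encoding is not known in advance.
    text = result.stdout.decode("latin-1", errors="replace")
    return count_words(text)
```

Calling extracted_word_count("Communique_33.pdf") and extracted_word_count("AthisMons.pdf") on local copies would show whether the second file yields any text at all. A count of zero would suggest the cause lies in the PDF or in the converter rather than in the htdig configuration.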
When it works, I get the following log:
1:0:http://public.archi.fr/PubliMIARA/Communique_33.pdf
New server: public.archi.fr, 80
Retrieval command for http://public.archi.fr/robots.txt: GET /robots.txt HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: public.archi.fr
Header line: HTTP/1.1 404 Not Found
Header line: Date: Mon, 24 Mar 2003 14:15:54 GMT
Header line: Server: Apache/1.3.26 (Unix) PHP/4.2.2 mod_ssl/2.8.10 OpenSSL/0.9.6e
Header line: Connection: close
Header line: Content-Type: text/html; charset=iso-8859-1
Header line:
returnStatus = 1
pushed
pick: public.archi.fr, # servers = 1
0:0:0:http://public.archi.fr/PubliMIARA/Communique_33.pdf: Retrieval command for http://public.archi.fr/PubliMIARA/Communique_33.pdf: GET /PubliMIARA/Communique_33.pdf HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: public.archi.fr
Header line: HTTP/1.1 200 OK
Header line: Date: Mon, 24 Mar 2003 14:15:54 GMT
Header line: Server: Apache/1.3.26 (Unix) PHP/4.2.2 mod_ssl/2.8.10 OpenSSL/0.9.6e
Header line: Last-Modified: Fri, 07 Sep 2001 07:39:14 GMT
Translated Fri, 07 Sep 2001 07:39:14 GMT to 2001-09-07 07:39:14 (101)
And converted to Fri, 07 Sep 2001 07:39:14
Header line: ETag: "16bd12-72f1-3b9879a2"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 29425
Header line: Connection: close
Header line: Content-Type: application/pdf
Header line:
returnStatus = 0
Read 8192 from document
Read 8192 from document
Read 8192 from document
Read a total of 29425 bytes
title: Document PDF Communique_33.pdf
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
word: [EMAIL PROTECTED]
...
size = 29425
pick: public.archi.fr, # servers = 1
When it doesn't work, I get almost the same log, but without the "word:" lines.
Is the problem in the PDF itself, or do I need to configure something else?
Any suggestions would be appreciated.
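
Since the only visible difference is the missing "word:" lines, a saved copy of the debug output can be scanned for PDF documents that produced none. This is a rough sketch, not part of htdig: the function name is hypothetical, and it assumes the log has the line shapes shown above with no lines wrapped by the mail client.

```python
import re

# Matches the "Retrieval command for <URL>: GET ..." lines seen in the log.
RETRIEVAL = re.compile(r"Retrieval command for (\S+): GET ")


def pdfs_without_words(log_lines):
    """Return URLs served as application/pdf that logged no 'word:' lines."""
    counts = {}      # URL -> number of "word:" lines seen after its retrieval
    is_pdf = set()   # URLs whose response header said application/pdf
    current = None   # URL of the document currently being processed
    for line in log_lines:
        line = line.strip()
        match = RETRIEVAL.search(line)
        if match:
            current = match.group(1)
            counts.setdefault(current, 0)
        elif current is None:
            continue
        elif line == "Header line: Content-Type: application/pdf":
            is_pdf.add(current)
        elif line.startswith("word:"):
            counts[current] += 1
    return [url for url in counts if url in is_pdf and counts[url] == 0]
```

Feeding it the lines of a saved log, for example pdfs_without_words(open("htdig.log")) with a hypothetical file name, would list every PDF for which the external parser returned no words.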
Thanks
- [htdig] Indexing PDF Files Franck Collineau

