Re: [htdig] PDF parsing question

David Adams Mon, 19 Jan 2004 07:26:35 -0800

Jon,

Add the lines


DOC2HTML_LOG=''
export DOC2HTML_LOG

to your rundig script to get more diagnostics from doc2html. (That's a pair
of single quotes in the first line.)
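
As a rough illustration only (not part of doc2html or rundig), the sketch below shows that a variable assigned an empty string and then exported is still "set" as far as child processes are concerned, which is why the pair of single quotes matters. The helper name check_log_var is hypothetical and exists just for this demonstration.

```shell
#!/bin/sh
# The two lines suggested above: assign an empty string, then export.
DOC2HTML_LOG=''
export DOC2HTML_LOG

# Hypothetical helper: start a child shell (as rundig would start htdig
# and, through it, doc2html) and report whether the variable is visible.
# ${VAR+set} expands to "set" only when VAR is set, even if it is empty.
check_log_var() {
    sh -c 'if [ "${DOC2HTML_LOG+set}" = "set" ]; then echo "set"; else echo "unset"; fi'
}

check_log_var
```

Without the export line, the child shell would normally report "unset", so both lines are needed in the script.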

The FAQ (http://www.htdig.org/FAQ.html) may help.  In particular:

http://www.htdig.org/FAQ.html#q4.9
http://www.htdig.org/FAQ.html#q5.2
http://www.htdig.org/FAQ.html#q5.34
http://www.htdig.org/FAQ.html#q5.37

David Adams
Corporate Information Services
Information Systems Services
University of Southampton


----- Original Message ----- 
From: "Jon Stanley" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, January 18, 2004 8:34 AM
Subject: [htdig] PDF parsing question


> I know that this might have been asked before, but here goes:
>
> I've got a bunch of PDF documents that I would like to index.  Currently,
> I'm only trying to index one of them.  I have htdig working just fine for
> HTML documents, and it appears to work fine for the PDFs, but when I do a
> search, I can't find any of the content in the index.  Here's all of the
> relevant output from rundig -vvv.  I'm using the doc2html script that came
> with htdig to do a pdftotext conversion.  I've verified that content
> extraction is allowed with this PDF - it actually has no security on it.
> I've also modified the max_doc_size in htdig.conf to allow for this large
> document.
>
> Any suggestion as to what I'm doing wrong?
>
> +href: http://<hostname>/tsmdrm.pdf (TSM DRM Guide (IBM Redbook))
> resolving 'http://<hostname>/tsmdrm.pdf'
>
>    pushing http://<hostname>/tsmdrm.pdf
> + size = 667
> 12983:12983:1:http://<hostname>/tsmdrm.pdf: Retrieval command for
> http://<hostname>/tsmdrm.pdf: GET /tsmdrm.pdf HTTP/1.0
> User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> Referer: http://<hostname>/books/
> Host: <hostname>
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Wed, 14 Jan 2004 05:43:49 GMT
> Header line: Server: Apache/1.3.28 (Unix) PHP/4.3.3
> Header line: Last-Modified: Wed, 14 Jan 2004 02:42:49 GMT
> Converted Wed, 14 Jan 2004 02:42:49 GMT to Wed, 14 Jan 2004 02:42:49
> Header line: ETag: "1ae685-4bdaa1-4004aca9"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 4971169
> Header line: Connection: close
> Header line: Content-Type: application/pdf
> <snip a bunch of garbage>
> Read a total of 4971169 bytes
>  size = 4971169
>
>
> 12983/http://<hostname>/tsmdrm.pdf
>
>
>
> -------------------------------------------------------
> The SF.Net email is sponsored by EclipseCon 2004
> Premiere Conference on Open Tools Development and Integration
> See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
> http://www.eclipsecon.org/osdn
> _______________________________________________
> ht://Dig general mailing list: <[EMAIL PROTECTED]>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
>



