Jon,

Add the lines

    DOC2HTML_LOG=''
    export DOC2HTML_LOG

to your rundig script to get more diagnostics from doc2html. (That's a pair
of single quotes in the first line.)

The FAQ (http://www.htdig.org/FAQ.html) may help. In particular:

http://www.htdig.org/FAQ.html#q4.9
http://www.htdig.org/FAQ.html#q5.2
http://www.htdig.org/FAQ.html#q5.34
http://www.htdig.org/FAQ.html#q5.37

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Jon Stanley" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, January 18, 2004 8:34 AM
Subject: [htdig] PDF parsing question

> I know that this might have been asked before, but here goes:
>
> I've got a bunch of PDF documents that I would like to index. Currently,
> I'm only trying to index one of them. I have htdig working just fine for
> HTML documents, and it appears to work fine for the PDFs, but when I do a
> search, I can't find any of the content in the index. Here's all of the
> relevant output from rundig -vvv. I'm using the doc2html script that came
> with htdig to do a pdftotext conversion. I've verified that content
> extraction is allowed with this PDF - it actually has no security on it.
> I've also modified the max_doc_size in htdig.conf to allow for this large
> document.
>
> Any suggestion as to what I'm doing wrong?
>
> +href: http://<hostname>/tsmdrm.pdf (TSM DRM Guide (IBM Redbook))
> resolving 'http://<hostname>/tsmdrm.pdf'
>
> pushing http://<hostname>/tsmdrm.pdf
> + size = 667
> 12983:12983:1:http://<hostname>/tsmdrm.pdf: Retrieval command for
> http://<hostname>/tsmdrm.pdf: GET /tsmdrm.pdf HTTP/1.0
> User-Agent: htdig/3.1.6 ([EMAIL PROTECTED])
> Referer: http://<hostname>/books/
> Host: <hostname>
>
> Header line: HTTP/1.1 200 OK
> Header line: Date: Wed, 14 Jan 2004 05:43:49 GMT
> Header line: Server: Apache/1.3.28 (Unix) PHP/4.3.3
> Header line: Last-Modified: Wed, 14 Jan 2004 02:42:49 GMT
> Converted Wed, 14 Jan 2004 02:42:49 GMT to Wed, 14 Jan 2004 02:42:49
> Header line: ETag: "1ae685-4bdaa1-4004aca9"
> Header line: Accept-Ranges: bytes
> Header line: Content-Length: 4971169
> Header line: Connection: close
> Header line: Content-Type: application/pdf
> <snip a bunch of garbage>
> Read a total of 4971169 bytes
> size = 4971169
>
>
> 12983/http://<hostname>/tsmdrm.pdf
>
>
>
> -------------------------------------------------------
> The SF.Net email is sponsored by EclipseCon 2004
> Premiere Conference on Open Tools Development and Integration
> See the breadth of Eclipse activity. February 3-5 in Anaheim, CA.
> http://www.eclipsecon.org/osdn
> _______________________________________________
> ht://Dig general mailing list: <[EMAIL PROTECTED]>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
>

