According to shams khan:
> I am having difficulty in indexing pdf and word documents
>
> .doc word documents just do not index and .pdf documents are indexed but
> give me garbled info in the search results.
>
> I've checked over the setup of doc2html, pdf2html, CATDOC and XPDF
> carefully, and I can't see where I am going wrong.
>
> When I run rundig, I get the following error on word documents:
>
> 3:3:1:http://10.5.1.35/worddocument.doc: ! UNABLE to convert
> size = 8060
>
> Deleted, no excerpt: 0/http://10.5.1.35/worddocument.doc
I think the whole "UNABLE to convert" subject has been flogged to death
already, so I won't comment.

> PDF documents seem to be indexed okay, as I get the following message:
>
> 14:14:1:http://10.5.1.35/pdfdocuement/.pdf: size = 22350

That just means htdig read the document, but it doesn't mean it was
parsed properly.

> BUT, in the search results for a pdf document I get the following formatted
> results:
>
> [msag_3_1_theassesment.pdf]
> %PDF-1.3 %���� 15 0 obj << /Linearized 1 /O 17 /H [ 1120 227 ] /L 30732 /E
> 16677 /N 4 /T 30314 >> endobj xref 15 34 0000000016 00000 n 0000001027 00000
> n 0000001347 00000 n 0000001554 00000 n 0000001761 00000 n 0000001800 00000
> n 0000002309 00000 n 0000002511 00000 n 0000002699 00000 n 0000003100 00000
> n 0000003121 00000 n 0000003830 00000 n 0000004011 00000 n 0000004409 00000
> n 0000004430 00000 n 0000005141 00000 n 0000005162 00000 n 0000005917 00000
> n 0000005938 00000 n 0000006724 00000 n 0000007175 00000 n 0000007371 00000
> n 0000007392 00000 n 0000008115 00000 n 0000008136 00000 n 0000008883 00000
> n 0000008904 00000 n 0000009699 00000 n 0000013119 00000 n 0000013140 00000
> n 0000013771 00000 n 0000013849 00000 n 0000001120 00000 n 0000001326 00000
> n trailer << /Size 49 ...
> http://10.5.1.35/msag_3_1_theassesment.pdf 10/28/02, 30732 bytes

It seems to me that the raw PDF, or parts of it, went directly into
htdig rather than being converted to text. I'd guess that the setting
of $PDF2HTML in doc2html.pl is wrong, or, if it's correctly set to the
path to pdf2html.pl, that the setting of $PDFTOTEXT in pdf2html.pl is
wrong. I believe David pointed out that you set $PDF2HTML to pdf2html
instead of pdf2html.pl, but that raises the question of what pdf2html
is on your system, and what it would do when given a PDF file. You
should also test pdftotext manually on one of these PDF files to make
sure it's working correctly.
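For the manual pdftotext test, something along these lines might do. It
is only a sketch: the function names, the marker list, and the
4096-character window are my own assumptions and are not part of htdig,
doc2html.pl, or xpdf. The one thing it relies on from xpdf is that
"pdftotext FILE -" writes the extracted text to standard output.

```python
import shutil
import subprocess

# Tokens that appear in raw PDF source but rarely in extracted text.
# (Illustrative list, not taken from htdig or xpdf.)
RAW_PDF_MARKERS = ("%PDF-", "endobj", "/Linearized", "trailer <<")


def looks_like_raw_pdf(text):
    """Return True if text looks like unconverted PDF source."""
    head = text[:4096]
    if head.lstrip().startswith("%PDF-"):
        return True
    # Two or more distinct markers near the start is a strong hint.
    return sum(marker in head for marker in RAW_PDF_MARKERS) >= 2


def check_pdftotext(pdf_path, pdftotext="pdftotext"):
    """Run pdftotext on one file and return a one-line diagnosis."""
    exe = shutil.which(pdftotext)
    if exe is None:
        return f"{pdftotext}: not found or not executable"
    # "pdftotext FILE -" sends the extracted text to standard output.
    result = subprocess.run([exe, pdf_path, "-"], capture_output=True)
    if result.returncode != 0:
        return f"{exe}: exited with status {result.returncode}"
    # latin-1 decoding never fails, whatever bytes come back.
    text = result.stdout.decode("latin-1")
    if not text.strip():
        return f"{exe}: produced no text"
    if looks_like_raw_pdf(text):
        return f"{exe}: output still looks like raw PDF"
    return f"{exe}: OK, extracted {len(text)} characters"
```

Calling check_pdftotext() with one of the problem PDFs and the exact
path you put in $PDFTOTEXT should tell you whether that setting points
at a working converter. The same looks_like_raw_pdf() check could be
applied to whatever doc2html.pl prints when run by hand on that file.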
> Whereas, using conv_doc.pl for pdf documents, I get:the following formatted
> results (I dont know why it gives it the title of a Microsoft Word
> document!... it *is* a pdf document):
>
> Microsoft Word - msag_3_1_theassesment.doc
> Management Self-Assessment Guide MSAG 3.1 Originated by: Approved by: Page 1 ...

This has been asked and answered a few times already, most recently and
most clearly in...

http://www.geocrawler.com/archives/3/8822/2002/11/0/10093416/

--
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

