Re: using doc2html (was [htdig] using conv_doc.pl to index MS Word documents)

Gilles Detillieux Tue, 19 Nov 2002 13:43:08 -0800

According to shams khan:
> I am having difficulty in indexing pdf and word documents
> 
> .doc word documents just do not index and .pdf documents are indexed but
> give me garbled info in the search results.
> 
> I've checked over the setup of doc2html, pdf2html, CATDOC and XPDF
> carefully, and I can't see where I am going wrong.
> 
> When I run rundig, I get the following error on word documents:
> 
>         3:3:1:http://10.5.1.35/worddocument.doc:  !      UNABLE to convert
> size  =  8060
> 
>         Deleted, no excerpt:  0/http://10.5.1.35/worddocument.doc


I think the whole "UNABLE to convert" subject has been flogged to death
already, so I won't comment.

> PDF documents seem to be indexed okay, as I get the following message:
> 
>         14:14:1:http://10.5.1.35/pdfdocuement/.pdf:     size  =  22350

That just means htdig read the document, but it doesn't mean it was parsed
properly.

> BUT, in the search results for a pdf document I get the following formatted
> results:
> 
> [msag_3_1_theassesment.pdf]
> %PDF-1.3 %���� 15 0 obj << /Linearized 1 /O 17 /H [ 1120 227 ] /L 30732 /E
> 16677 /N 4 /T 30314 >> endobj xref 15 34 0000000016 00000 n 0000001027 00000
> n 0000001347 00000 n 0000001554 00000 n 0000001761 00000 n 0000001800 00000
> n 0000002309 00000 n 0000002511 00000 n 0000002699 00000 n 0000003100 00000
> n 0000003121 00000 n 0000003830 00000 n 0000004011 00000 n 0000004409 00000
> n 0000004430 00000 n 0000005141 00000 n 0000005162 00000 n 0000005917 00000
> n 0000005938 00000 n 0000006724 00000 n 0000007175 00000 n 0000007371 00000
> n 0000007392 00000 n 0000008115 00000 n 0000008136 00000 n 0000008883 00000
> n 0000008904 00000 n 0000009699 00000 n 0000013119 00000 n 0000013140 00000
> n 0000013771 00000 n 0000013849 00000 n 0000001120 00000 n 0000001326 00000
> n trailer << /Size 49 ...
> http://10.5.1.35/msag_3_1_theassesment.pdf 10/28/02, 30732 bytes

It seems to me that the raw pdf, or parts of it, went through directly
into htdig, rather than being converted to text.  I'd guess that the
setting of $PDF2HTML in doc2html.pl is wrong, or if it's correctly set
to the path of pdf2html.pl, then the setting of $PDFTOTEXT in pdf2html.pl
is wrong.  I believe David pointed out that you set $PDF2HTML to pdf2html
instead of pdf2html.pl, but that raises the question of what pdf2html is
on your system, and what it would do when given a PDF file.  You should
also test pdftotext manually on one of these PDF files to make sure it's
working correctly.
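
As a rough sketch of that manual test (this is not part of ht://Dig or
doc2html; the function names and the marker pattern are made up for
illustration), the script below runs pdftotext on one file, sending the
text to stdout with "-", and reports whether the result still looks like
raw PDF source such as the "%PDF-1.3 ... 15 0 obj" excerpt quoted above.
It assumes pdftotext is on your PATH; pass an absolute path otherwise.

```python
import re
import shutil
import subprocess
import sys

# Assumption: a "%PDF-x.y" header or an "N N obj" object definition near the
# start of the output means the converter passed the raw file through.
RAW_PDF_MARKERS = re.compile(r"%PDF-\d\.\d|\b\d+ \d+ obj\b")


def looks_like_raw_pdf(text: str) -> bool:
    """Return True if the first 2048 characters resemble unconverted PDF source."""
    return bool(RAW_PDF_MARKERS.search(text[:2048]))


def run_pdftotext(pdf_path: str, pdftotext: str = "pdftotext") -> str:
    """Run pdftotext on pdf_path and return whatever it writes to stdout."""
    exe = shutil.which(pdftotext)
    if exe is None:
        raise FileNotFoundError(f"{pdftotext} not found; check $PDFTOTEXT")
    # An output file name of "-" tells pdftotext to write to stdout.
    result = subprocess.run([exe, pdf_path, "-"], capture_output=True, check=True)
    return result.stdout.decode("latin-1", errors="replace")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: check_pdf.py file.pdf")
    text = run_pdftotext(sys.argv[1])
    if looks_like_raw_pdf(text):
        print("Output still looks like raw PDF; the converter is not working.")
    elif not text.strip():
        print("No text extracted; the PDF may contain only images.")
    else:
        print("pdftotext produced text:")
        print(text[:300])
```

If this prints readable text but the search results are still garbled,
the problem is more likely in the $PDF2HTML or $PDFTOTEXT settings than
in pdftotext itself.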

> Whereas, using conv_doc.pl for pdf documents, I get the following formatted
> results (I don't know why it gives it the title of a Microsoft Word
> document!... it *is* a pdf document):
> 
> Microsoft Word - msag_3_1_theassesment.doc
> Management Self-Assessment Guide MSAG 3.1 Originated by: Approved by: Page 1
...

This has been asked and answered a few times already, most recently and
most clearly in...

   http://www.geocrawler.com/archives/3/8822/2002/11/0/10093416/

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

