In the htdig -vvv output, the MIME type is given by lines such as:

Header line: Content-Type: application/msword

and it seems fine in your case.

Despite its name, delme_word97.doc appears to be an RTF file: your od output
shows it starting with {\rtf rather than the Word magic number.
You are successfully indexing RTF files, I think?
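
If it helps, the real type of a file can be checked from its first bytes
regardless of the extension. A minimal sketch, assuming a POSIX shell with od,
tr and sed; sniff is a made-up helper name, not part of doc2html.pl:

```shell
# Hypothetical helper: classify a file by its first four bytes, ignoring its name.
sniff() {
    # First four bytes as octal numbers, e.g. "320 317 021 340"
    head4=$(od -An -to1 -N4 "$1" | tr -s ' ' | sed 's/^ //; s/ $//')
    case "$head4" in
        "320 317 021 340") echo "ole2" ;;    # binary Word (OLE2 compound file)
        "173 134 162 164") echo "rtf" ;;     # the characters {\rt
        *)                 echo "unknown" ;;
    esac
}

# Example usage:
#   sniff delme_word97.doc
```

The octal values 173 134 162 164 are simply the characters { \ r t.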

I'm at a bit of a loss to explain what is happening with your Word documents.
You could try changing the line

  $magic = '^\320\317\021\340';

to

  $magic = '\320\317\021';

throughout (this drops both the ^ anchor and the fourth byte), but I'm not
too hopeful.
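
To see whether the anchor could matter for a given file, you can find where
the three-byte sequence first occurs. A rough sketch, assuming a POSIX shell
with od, tr and awk; magic_offset is a made-up helper name, not part of
doc2html.pl, and the 512-byte window is an arbitrary choice:

```shell
# Hypothetical helper: print the byte offset of the first 320 317 021 sequence
# within the first 512 bytes of a file, or -1 if it does not occur there.
magic_offset() {
    od -An -v -to1 -N512 "$1" | tr -s ' \n' ' ' | awk '
        {
            # Each field is one byte in octal; scan for three in a row.
            for (i = 1; i + 2 <= NF; i++)
                if ($i == "320" && $(i+1) == "317" && $(i+2) == "021") {
                    print i - 1
                    found = 1
                    exit
                }
        }
        END { if (!found) print -1 }'
}

# Example usage:
#   magic_offset delme_word2k.doc
```

An offset of 0 means the bytes sit at the very start of the file, so the
anchored pattern ought to match as well (provided the fourth byte is \340).
Any other offset would suggest extra data in front of the signature.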

Also double-check doc2html.pl wherever $CATDOC occurs, in case the catdoc
path is wrong in the same way the $PDF2HTML path was.
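
A quick sanity check on whatever path $CATDOC is set to, sketched for a POSIX
shell. check_converter is a hypothetical helper, and /usr/local/bin/catdoc in
the usage comment is only an example path, not necessarily yours:

```shell
# Hypothetical helper: report whether a converter path exists and is executable.
check_converter() {
    if [ -x "$1" ] && [ ! -d "$1" ]; then
        echo "ok: $1"
    else
        echo "missing or not executable: $1"
    fi
}

# Example usage (the path is an assumption; use the one from your doc2html.pl):
#   grep -n 'CATDOC' doc2html.pl
#   check_converter /usr/local/bin/catdoc
```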

--
David Adams
Information Systems Services
Southampton University


----- Original Message -----
From: "shams khan" <[EMAIL PROTECTED]>
To: "ht://Dig" <[EMAIL PROTECTED]>
Sent: Wednesday, November 20, 2002 3:27 PM
Subject: Re: using doc2html (was [htdig] using conv_doc.pl to index MS Word
documents)


> thanks David, Giles,
>
> after making this alteration:
>
>     my $PDF2HTML = '/usr/local/bin/pdf2html';
>
> to
>
>     my $PDF2HTML = '/usr/local/bin/pdf2html.pl';
>
> the PDF indexing has started working, and I am getting correctly parsed
> output in the search results.
>
> Thanks!
>
> However, I am still not having any luck with indexing word documents.
>
> > Either the MIME-types or the magic numbers, or both, are not matching
> > those expected by the script.
> >
> > Add:
> >
> > DOC2HTML_LOG=""
> > export DOC2HTML_LOG
> >
> > to your rundig script for a bit more diagnostics.  Check that the script
> > runs htdig with the -vvv option, which will give the MIME-type of each
> > file.
>
> I have run rundig (with the -vvv option), but I'm not sure what I'm
> looking for in terms of the MIME type. I've included the output from
> rundig at the end of this email.
>
> the files that i'm having trouble indexing are:
>
> delme_word2k.doc
> delme_word6_95.doc
>
> does the output (below) give any indication to what is going wrong ?
>
>
> > Use the unix od -c command to look at the first few characters of a
> > couple of the PDF and Word files that you are trying to index.  Do they
> > match the magic numbers programmed into doc2html.pl?  For example, do
> > your Word files begin with
> >
> >     \320    \317    \021    \340
> >
> > which is what you should have in your doc2html.pl?
> >
>
> the magic numbers do match with the word documents I have been testing
> with.
>
> delme_word2k.doc starts with      0000000 320 317 021....
> delme_word6_95.doc starts with    0000000 320 317 021....
> delme_word97.doc starts with      0000000 {   \   r   t   f   l   1 ......
>
> out of the three above documents, currently word97.doc IS being indexed
> and parsed correctly.
>
> It is the word2000.doc and word6_95.doc that aren't indexed, with the
> UNABLE TO CONVERT message.
>
> Any hints ?
>
> Many Thanks for your help !
>
> Shams
>
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
