Hi all,
While we were talking about parsing Word files with catdoc,
maybe we should look at the status of MSWordView. It reads
Word 97 files and prints out HTML. Now HTML we can index
with the HTML parser built into htdig.
This is the same scheme that is used for PDF. Catdoc prints out plain
text with no markup, so all the words get an equal score(?).
With HTML, you have different weighting factors, so it should help the score.
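
To make the scoring point concrete, here is a rough sketch in Python of what I mean. It is not htdig's actual code, and the weights and names (WEIGHTS, TEXT_WEIGHT, score_html, score_text) are made up for illustration. It only shows how words inside a title or heading could count for more than body text, while plain text gives every word the same weight.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights, for illustration only; not htdig's real factors.
WEIGHTS = {"title": 10, "h1": 5, "h2": 4, "h3": 3}
TEXT_WEIGHT = 1

WORD_RE = re.compile(r"[a-z0-9]+")


class WeightedWordCounter(HTMLParser):
    """Adds up a score per word, weighted by the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags that carry a weight
        self.scores = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        if tag in WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Use the innermost weighted tag, or the plain text weight.
        weight = WEIGHTS[self.stack[-1]] if self.stack else TEXT_WEIGHT
        for word in WORD_RE.findall(data.lower()):
            self.scores[word] += weight


def score_html(html):
    """Score words in an HTML document using the tag weights."""
    parser = WeightedWordCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)


def score_text(text):
    """Score words in plain text: every occurrence counts the same."""
    scores = defaultdict(int)
    for word in WORD_RE.findall(text.lower()):
        scores[word] += TEXT_WEIGHT
    return dict(scores)
```

With a converter that keeps the title and headings, a word like "budget" in the title would stand out from the same word in a paragraph. With plain text output, score_text can only count occurrences.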
Can someone shed some light on this?
--jesse