Re: [htdig3-dev] Parsing MS-Word Files

Geoff Hutchison Mon, 8 Feb 1999 09:14:45 -0500



>While we were talking about parsing Word files with catdoc,
>maybe we should look at the status of MSWordView. It reads
>Word 97 files and prints out HTML. Now HTML we can index
>with the HTML parser built into htdig.

Several people have pointed out the utility of having "pass-through"
ExternalParsers. So a class called something like "ExternalFilter" might be
a good idea. The filter would take the file, perform some action (say
gunzip or MSWordView) and pass it back for further parsing. The class would
look somewhat like the ExternalParser class, but a bit simpler since it
obviously doesn't actually do any parsing. :-)
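
A minimal sketch of the pass-through idea, written in Python for brevity
rather than in ht://Dig's C++. The class name comes from the suggestion
above; the constructor, the apply() method and the in-process gunzip action
are assumptions made for illustration, not existing ht://Dig code.

```python
import gzip
from typing import Callable


class ExternalFilter:
    """Hypothetical pass-through filter: transforms a document, parses nothing."""

    def __init__(self, name: str, action: Callable[[bytes], bytes]) -> None:
        self.name = name      # label for logging or configuration
        self.action = action  # the transformation, e.g. decompression

    def apply(self, raw: bytes) -> bytes:
        # Return the transformed bytes; the caller hands them on to a
        # real parser (for example the HTML parser) for indexing.
        return self.action(raw)


# Example filter: undo gzip compression in-process.
gunzip_filter = ExternalFilter("gunzip", gzip.decompress)

compressed = gzip.compress(b"<html><body>hello</body></html>")
filtered = gunzip_filter.apply(compressed)
```

A filter wrapping an external converter would have the same shape, with the
action running the converter and returning its output instead of calling
gzip.decompress.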

The only snag in this plan is figuring out the MIME type after filtering.
In particular, an uncompress filter would be fairly general and would have
a hard time knowing what it produced. However, if we add better MIME code
to the Retriever, this can be done internally.
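
One way to work out the type internally is to sniff the first bytes of the
filtered output. The sketch below is in Python and is only an illustration:
the function name sniff_mime, the short signature table and the fallback
rules are assumptions, not the Retriever's actual logic. Note that the OLE2
signature identifies the container format, so treating it as Word is a guess.

```python
# (magic prefix, MIME type) pairs, checked in order.
_MAGIC = [
    (b"\x1f\x8b", "application/gzip"),
    # OLE2 compound file signature; assumed here to be a Word document.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "application/msword"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime(data: bytes) -> str:
    """Guess a MIME type from the leading bytes of a document."""
    for magic, mime in _MAGIC:
        if data.startswith(magic):
            return mime

    # HTML has no fixed signature, so look for a typical opening tag.
    head = data[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"

    # Crude text-versus-binary heuristic: a NUL byte suggests binary data.
    if b"\x00" in data[:512]:
        return "application/octet-stream"
    return "text/plain"
```

With something like this, a general uncompress filter does not need to know
what it produced; its output can be sniffed again and routed to the matching
parser.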

Cheers,
-Geoff


