According to kimsg:
> I'm using the NT version of HTDig 3.1.2, and I am developing an external
> parser for HTDig under the Windows NT environment.
> The external parser I am developing is a Windows console application,
> but I have found that the interaction between the external parser and
> HTDig is not simple, so I have to modify ExternalParser.cc.
> 
> My proposal and question.
> 
> 1. How about changing the parse logic of ExternalParser.cc to use Plaintext.cc?
>     - Get the external parser from htdig.conf.
>     - Execute this program and get a temp text file.
>     - Go to the Plaintext parser.

Something like this has been suggested before, but the idea was more along
the lines of an external decoder than an external parser.  The decoders
could handle any sort of document conversion, decompression, decryption,
etc., and pass along HTML, plain text or PDF to htdig for subsequent
parsing.  This is still on the TO DO list, but it may take a while before
it's implemented.  There are still a few snags to clear up first.

Until then, if you do any decoding externally, you must also parse the
decoded document externally, and emit the records that htdig's external
parser interface expects.

> 2. How do I display an excerpt for an external document?

The external parser must emit an "h" record, which contains all the
parsed text for the excerpt on a single line.
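
To illustrate the single-line requirement, here is a minimal sketch in
Python (not part of htdig). The function name make_h_record is made up
for this example, and the record layout (the letter "h", a tab, then the
text) is an assumption based on my reading of the external parser
documentation, so check it against the docs for your htdig version.

```python
def make_h_record(text):
    """Build an "h" (excerpt) record from already-decoded plain text.

    Assumed layout: the letter "h", a tab, then the text on one line.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines and tabs, so joining with single spaces
    # guarantees the excerpt text occupies exactly one line.
    one_line = " ".join(text.split())
    return "h\t" + one_line


if __name__ == "__main__":
    import sys

    # Read the decoded document on stdin, write the record on stdout.
    print(make_h_record(sys.stdin.read()))
```

The external parser would print this record to its standard output along
with whatever other records it produces for the document.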

If you can understand Perl scripts, you may want to have a look at
contrib/parse_doc.pl to see how it does its work.  It uses any of a
number of document to text converters (i.e. decoders) and then parses
the plain text output of these decoders into the "h" and "w" records for
the external parser interface.  For MS-Word documents, it uses catdoc
as the document to text converter, which admittedly is far from perfect,
but does work reasonably well with a lot of documents.
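
As a rough sketch of the word-record side of that job, the Python below
turns converter output into "w" records. It is not a port of
parse_doc.pl. The function name words_to_records is made up, and the
field layout (word, a location scaled to 0-1000, and a heading level,
all tab-separated) is an assumption based on my reading of the external
parser documentation. The word pattern is also a simplification.

```python
import re


def words_to_records(text):
    """Turn plain text into a list of "w" records, one per word.

    Assumed layout: "w", word, location (0-1000), heading level,
    separated by tabs.
    """
    # Simplification: a word is any run of ASCII letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    total = len(words)
    records = []
    for index, word in enumerate(words):
        # Scale the word's position so 0 is the top of the document
        # and values stay below 1000.
        location = index * 1000 // total
        # Heading level 0 is assumed here to mean ordinary body text.
        records.append("w\t%s\t%d\t0" % (word, location))
    return records


if __name__ == "__main__":
    import sys

    # Read the converter's plain text on stdin, emit records on stdout.
    for record in words_to_records(sys.stdin.read()):
        print(record)
```

In practice the text on stdin would come from a converter such as
catdoc, run on the file that htdig hands to the external parser.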

> P.S. Please visit this
> URL ( http://210.105.193.131/board/read.asp?name=databank&page=1&no=12 ) if
> you want to get an MS-Office parser for HTDig. Hmm, maybe you have it by now.

I did take a look at your web site, but didn't download the software.
Right now, the parse_doc.pl script does the job for me, and I don't
really need to index MS-Office (Word, Excel, PPT) documents -- yet.
I'd much prefer to stick to products released under GPL or other open
source licenses, if at all possible.  However, if the need arises,
I'll keep your parser in mind.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
