According to kimsg:
> I'm using HTDig 3.1.2 for NT version and I develop external parser of HTDig
> under Windows NT environment.
> I develop that external parser for NT version is Windows console application
> but I have met interaction with external parser and HTDig is not simple. So I
> have to modify ExternalParser.cc.
>
> My proposal and question.
>
> 1. How about to change parse logic of ExternalParser.cc into Plaintext.cc.
>  - Get external parser in htdig.conf
>  - Execute this program and get temp text file.
>  - Goto Plaintext parser.

Something like this has been suggested before, but the idea was more along
the lines of an external decoder than an external parser. The decoders
could handle any sort of document conversion, decompression, decryption,
etc., and pass along HTML, plain text or PDF to htdig for subsequent
parsing. This is still on the TO DO list, but it may take a while before
it's implemented. There are still a few snags to clear up first.

Until then, if you do any decoding externally, you must also parse the
decoded document externally, and emit the records that htdig's external
parser interface expects.

> 2. How do display excerpt in case external document.

The external parser must emit an "h" record, which contains all the parsed
text for the excerpt on a single line. If you can understand Perl scripts,
you may want to have a look at contrib/parse_doc.pl to see how it does
its work. It uses any of a number of document-to-text converters
(i.e. decoders) and then parses the plain text output of these decoders
into the "h" and "w" records for the external parser interface.
For MS-Word documents, it uses catdoc as the document-to-text converter,
which admittedly is far from perfect, but does work reasonably well with
a lot of documents.

> ps: Please visit this
> url( http://210.105.193.131/board/read.asp?name=databank&page=1&no=12 ), if
> you want to get ms-office parser for HTdig. Hmm, may be you get it now.

I did take a look at your web site, but didn't download the software.
Right now, the parse_doc.pl script does the job for me, and I don't really
need to index MS-Office (Word, Excel, PPT) documents -- yet. I'd much
prefer to stick to products released under GPL or other open source
licenses, if at all possible. However, if the need arises, I'll keep
your parser in mind.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.
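
As a rough illustration of the "h" and "w" records discussed above, here is
a minimal sketch of an external parser written in Python rather than Perl.
It is not taken from parse_doc.pl. Several details are assumptions that
should be checked against the external_parsers documentation for your
htdig version: that fields are tab-separated, that a "w" record carries
the word, a location scaled to 0-1000 and a heading level (0 for normal
text), and that the document path arrives as the first command-line
argument. The names make_records and MIN_WORD_LENGTH are hypothetical, and
the input is assumed to be plain text already produced by a decoder such
as catdoc.

```python
import re
import sys

# Hypothetical cutoff; htdig has its own minimum word length setting.
MIN_WORD_LENGTH = 3


def make_records(text, min_word_length=MIN_WORD_LENGTH):
    """Build external-parser records for already-decoded plain text.

    Assumed layout (tab-separated, one record per line):
      w <word> <location 0-1000> <heading level>
      h <excerpt text on a single line>
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    total = max(len(words), 1)
    records = []
    for index, word in enumerate(words):
        if len(word) < min_word_length:
            continue
        # Scale the word position into the assumed 0-1000 range.
        location = index * 1000 // total
        records.append("w\t%s\t%d\t0" % (word.lower(), location))
    # Collapse all whitespace (newlines and tabs included) so the
    # excerpt fits on one line, as the "h" record requires.
    excerpt = " ".join(text.split())
    records.append("h\t" + excerpt)
    return records


def main(path):
    # Read the decoded text and write the records to standard output.
    with open(path, "r", errors="replace") as handle:
        text = handle.read()
    for record in make_records(text):
        print(record)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

A real parser would also need to run the decoder itself and honour the
word-length and valid-character settings from htdig.conf, which this
sketch ignores.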
