I'll go either way, but I still don't know how to implement the Word parser, as opposed to the PDF parser or HTML parser.
Eric Anderson
LanRx Network Solutions

Quoting Ryan Ackley <[EMAIL PROTECTED]>:

> Eric,
>
> The problem with antiword is that it is a native application. You must write
> a class that uses JNI to access the native code. If you link your java code
> with native code you have lost one of the biggest benefits of Java, platform
> independence. I would suggest you use the library at http://textmining.org.
> contrary to what David Spencer says, it should work on all documents created
> with Word 97 or above. I have literally indexed 100,000s of unique documents
> using my library.
>
> Ryan Ackley
>
> ----- Original Message -----
> From: "Eric Anderson" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Wednesday, March 05, 2003 7:14 PM
> Subject: Re: my experiences - Re: Parsing Word Docs
>
> > Ok. Thanks for the tip.
> >
> > I downloaded and compiled Antiword, and would like to now add it to my
> > indexing class. However, I'm not sure how the application would be called,
> > and from where it would be called.
> >
> > How will I have the class parse the document through Antiword to create the
> > keyword index, but leaving the DOC intact, as Mr. Litchfield did with PDFBox?
> >
> > Your assistance is greatly appreciated.
> >
> > Eric Anderson
> > 815-505-6132
> >
> > Quoting David Spencer <[EMAIL PROTECTED]>:
> >
> > > FYI I tried the textmining.org/poi combo and on a collection of 350 word
> > > docs people have developed here over the years, and it failed on 33% of
> > > them with exceptions being thrown about the formats being invalid.
> > >
> > > I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
> > > *.exe, and it worked great ( well it seemed to process all the files fine).
> > >
> > > I've had similar experiences with PDF - I tried the 3 or so freeware/java
> > > PDF text extractors and they were not as good as the exe, pdftotext,
> > > from foolabs (http://www.foolabs.com/xpdf/).
> > >
> > > Not satisfying to a java developer but these work better than anything
> > > else I can find.
> > >
> > > You get source and I use them on windows & linux, no prob.
> > >
> > > Eric Anderson wrote:
> > >
> > > >I'm interested in using the textmining/textextraction utilities using
> > > >Apache POI, that Ryan was discussing. However, I'm having some difficulty
> > > >determining what the insertion point would be to replace the default
> > > >parser with the word parser.
> > > >
> > > >Any assistance would be appreciated.

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
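
On the question raised in the thread of how an indexing class might call Antiword: one option besides JNI is to launch the executable as a separate process and read the text it prints to standard output. The following is only a minimal sketch under stated assumptions, not anyone's actual implementation. The class name ExternalTextExtractor and its methods are hypothetical, it assumes an "antiword" executable is on the PATH, and it assumes the output can be decoded as UTF-8 (the real encoding depends on how Antiword is configured). The .doc file is only read, so it stays intact.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: runs an external text extractor and captures its output. */
public class ExternalTextExtractor {

    /** Builds the command line "<executable> <document>", e.g. "antiword report.doc". */
    static List<String> buildCommand(String executable, Path document) {
        return Arrays.asList(executable, document.toString());
    }

    /** Runs the command and returns everything it wrote to standard output. */
    static String extract(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Send the child's stderr to our own stderr so an unread pipe cannot block it.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Extractor exited with code " + exitCode + ": " + command);
        }
        // Assumption: the extractor emits UTF-8 text.
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExternalTextExtractor <file.doc>");
            return;
        }
        // Assumption: "antiword" is installed and reachable through the PATH.
        String text = extract(buildCommand("antiword", Paths.get(args[0])));
        System.out.println(text);
    }
}
```

The returned string could then be handed to the indexer as the document's contents field, in the same place where extracted PDF or HTML text would go. This keeps the Java code free of native linking, at the cost of depending on an executable being installed on each platform.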