Eric,

The problem with antiword is that it is a native application. To call it in-process you must write a class that uses JNI to access the native code. Once you link your Java code with native code, you lose one of the biggest benefits of Java: platform independence.

I would suggest you use the library at http://textmining.org. Contrary to what David Spencer says, it should work on all documents created with Word 97 or above. I have indexed hundreds of thousands of unique documents using my library.
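For completeness, a native tool such as antiword can also be reached without JNI by starting it as a separate process and reading its standard output. The platform-independence concern still applies, since the executable has to be installed on every machine that runs the indexer. What follows is only a minimal sketch, assuming an `antiword` binary on the PATH that prints the extracted text to stdout when given a file name. The class `ExternalExtractor` and its methods are hypothetical names, and the charset of the tool's output is an assumption you would need to check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: runs an external text extractor and captures its stdout.
final class ExternalExtractor {

    // Builds the command line, e.g. ["antiword", "report.doc"].
    // Assumption: the tool takes the document path as its only argument.
    static List<String> buildCommand(String executable, Path document) {
        return List.of(executable, document.toString());
    }

    // Starts the command, reads everything it writes to stdout, and
    // fails if the process exits with a non-zero status.
    static String run(List<String> command, Charset charset)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard stderr so diagnostics neither block the child nor end up in the index.
        builder.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();

        String text;
        try (InputStream stdout = process.getInputStream()) {
            text = new String(stdout.readAllBytes(), charset);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Extractor exited with status " + exitCode + ": " + command);
        }
        return text;
    }
}
```

The string returned by `run(buildCommand("antiword", path), charset)` could then be handed to the indexer as the document's text. The DOC file itself is only read by the external tool, so it stays intact.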

Ryan Ackley

----- Original Message -----
From: "Eric Anderson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 05, 2003 7:14 PM
Subject: Re: my experiences - Re: Parsing Word Docs

> Ok. Thanks for the tip.
>
> I downloaded and compiled Antiword, and would like to now add it to my
> indexing class. However, I'm not sure how the application would be called,
> and from where it would be called.
>
> How will I have the class parse the document through Antiword to create the
> keyword index, but leaving the DOC intact, as Mr. Litchfield did with PDFBox?
>
> Your assistance is greatly appreciated.
>
> Eric Anderson
> 815-505-6132
>
> Quoting David Spencer <[EMAIL PROTECTED]>:
>
> > FYI I tried the textmining.org/poi combo and on a collection of 350 word
> > docs people have developed here over the years, and it failed on 33% of
> > them with exceptions being thrown about the formats being invalid.
> >
> > I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
> > *.exe, and it worked great ( well it seemed to process all the files fine).
> >
> > I've had similar experiences with PDF - I tried the 3 or so freeware/java
> > PDF text extractors and they were not as good as the exe, pdftotext,
> > from foolabs (http://www.foolabs.com/xpdf/).
> >
> > Not satisfying to a java developer but these work better than anything
> > else I can find.
> >
> > You get source and I use them on windows & linux, no prob.
> >
> > Eric Anderson wrote:
> >
> > >I'm interested in using the textmining/textextraction utilities using
> > >Apache POI, that Ryan was discussing. However, I'm having some difficulty
> > >determining what the insertion point would be to replace the default
> > >parser with the word parser.
> > >
> > >Any assistance would be appreciated.
> > >
> > >LanRx Network Solutions, Inc.
> > >Providing Enterprise Level Solutions...On A Small Business Budget

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]