Re: my experiences - Re: Parsing Word Docs

David Spencer Thu, 06 Mar 2003 10:30:41 -0800

Ryan Ackley wrote:

Eric,

The problem with antiword is that it is a native application. You must write a class that uses JNI to access the native code.

No you don't. Just use Runtime.exec - no JNI :)
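
A minimal sketch of what I mean, not a tested recipe. It assumes an "antiword" executable is on the PATH and that it prints the document text to standard output when given a file name. The class and method names are made up for illustration, and it uses ProcessBuilder (the newer wrapper around the same mechanism as Runtime.exec) because that makes handling stderr easier:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.util.List;

public class ExternalTextExtractor {

    // Assumption: "antiword" is on the PATH and writes the extracted text
    // to standard output when called with a single file name.
    static List<String> antiwordCommand(String docPath) {
        return List.of("antiword", docPath);
    }

    // Runs the command, returns everything it wrote to standard output,
    // and throws if the process exits with a non-zero status.
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard stderr so diagnostics cannot mix into the text or fill
        // an unread pipe and block the child process.
        builder.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();
        String text;
        try (InputStream stdout = process.getInputStream()) {
            // Assumption: the tool writes in the platform default charset.
            text = new String(stdout.readAllBytes(), Charset.defaultCharset());
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command " + command + " exited with status " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ExternalTextExtractor file.doc");
            return;
        }
        System.out.print(run(antiwordCommand(args[0])));
    }
}
```

The returned String is what you would hand to your indexer as the document's content. The .doc file is only passed by name to the external program, so the indexing code itself never writes to it.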

If you link your java code with native code you have lost one of the
biggest benefits of Java, platform independence.

Yeah, but the source for antiword is available, it runs on all the
platforms I use (windows/linux/sun), and it seems to accept older formats
than POI/textmining, so it gets the job done better than anything else.

I would suggest you use the library at http://textmining.org.
Contrary to what David Spencer says, it should work on all documents created
with Word 97 or above. I have literally indexed 100,000s of unique documents
using my library.

Ryan Ackley

----- Original Message -----
From: "Eric Anderson" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, March 05, 2003 7:14 PM
Subject: Re: my experiences - Re: Parsing Word Docs

Ok. Thanks for the tip.

I downloaded and compiled Antiword, and would like to now add it to my
indexing class. However, I'm not sure how the application would be called,
and from where it would be called.
How will I have the class parse the document through Antiword to create
the keyword index, but leaving the DOC intact, as Mr. Litchfield did with
PDFBox?

Your assistance is greatly appreciated.

Eric Anderson
815-505-6132

Quoting David Spencer <[EMAIL PROTECTED]>:

FYI, I tried the textmining.org/poi combo on a collection of 350 Word
docs people have developed here over the years, and it failed on 33% of
them, with exceptions being thrown about the formats being invalid.

I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
*.exe, and it worked great (well, it seemed to process all the files fine).

I've had similar experiences with PDF - I tried the 3 or so freeware/java
PDF text extractors and they were not as good as the exe, pdftotext,
from foolabs (http://www.foolabs.com/xpdf/).

Not satisfying to a java developer but these work better than anything
else I can find.

You get source and I use them on windows & linux, no prob.

Eric Anderson wrote:

I'm interested in using the textmining/textextraction utilities using
Apache POI, that Ryan was discussing. However, I'm having some difficulty
determining what the insertion point would be to replace the default
parser with the word parser.

Any assistance would be appreciated.

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


