Ryan Ackley wrote:
As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.

I wish the two would collaborate though. It's true that POI contains code for writing, which isn't necessary for indexing. But it's also true that POI contains code for extracting images, which for many projects *is* necessary.

Also, people recommend using POI for text extraction, but the only
place I've seen an actual how-to on this is in the "Lucene in Action"
book.

It's not too difficult though. Given an HWPFDocument called doc:

  doc.getTextTable().getTextPieces();

The downside of that approach is that some of the text you get back isn't "text" in the sense you might expect. (I consider it an upside myself, because sometimes it's good to find all this otherwise hidden text.)
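
To give an idea of what that hidden text can look like: one kind, as far as I know, is field codes, such as the HYPERLINK instruction sitting behind a link. The sketch below is only an illustration and rests on an assumption: that the raw piece text marks fields with the control characters 0x13 (field begin), 0x14 (separator between the field code and its displayed result) and 0x15 (field end). FieldCodeStripper and stripFieldCodes are hypothetical names, not part of POI or textmining.org, and the code works on a plain String so that it runs without either library.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI or textmining.org.
class FieldCodeStripper {
    // Assumed field markers in the raw text of a Word binary document.
    static final char FIELD_BEGIN = 0x13;
    static final char FIELD_SEPARATOR = 0x14;
    static final char FIELD_END = 0x15;

    // Drops field codes and keeps the displayed field results.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: true once its separator has been seen.
        Deque<Boolean> inResult = new ArrayDeque<>();
        // Number of open fields that are still in their code part.
        int hidden = 0;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                inResult.push(false);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from code part to result part.
                if (!inResult.isEmpty() && !inResult.peek()) {
                    inResult.pop();
                    inResult.push(true);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field with no separator ends while still in its code part.
                if (!inResult.isEmpty() && !inResult.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Ordinary text, or result text of fields that are all visible.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See " + FIELD_BEGIN + " HYPERLINK \"http://example.com/\" "
                + FIELD_SEPARATOR + "the site" + FIELD_END + " now";
        System.out.println(stripFieldCodes(raw)); // prints "See the site now"
    }
}
```

If you want the hidden text indexed as well, skip this step and keep the pieces as they come; a filter like this is only useful when you want roughly what a reader would see on the page.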

Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902
