Several Java libraries provide PDF text extraction. A fairly comprehensive list is available at < http://www.geocities.com/marcoschmidt.geo/java-libraries-pdf.html >. I'm obviously biased toward our own solution, PDFTextStream < http://snowtide.com/home/PDFTextStream/ >: it's the fastest thing out there for Java, and its Lucene integration module is easy to use and will have you up and running in no time < http://snowtide.com/home/PDFTextStream/techtips/easy_lucene_integration >.

For office documents, just about the only game in town that I know of is the Jakarta POI project < http://jakarta.apache.org/poi/ >. It's been quite a while since I've touched it, but it's definitely the best place to start.

Chas Emerick   |   [EMAIL PROTECTED]

PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/

On Sep 9, 2004, at 9:47 AM, <[EMAIL PROTECTED]> wrote:

Anyone know of any reliable parsers out there for PDF, Word,
Excel, or PowerPoint?

