Hi, I have written a PDF library that can be used to extract text from PDF documents. It is released under the LGPL, so have fun.
There is one class which can be used to easily index PDF documents: pdfparser.searchengine.lucene.LucenePDFDocument has a getDocument method which takes a PDF file and returns a Lucene Document that you can add to an index.

If you would like to see the quality of the text extraction, you can run pdfparser.Main from the command line; it takes a PDF document and writes a txt file.

I am looking for any input that you might have. Please mail me if you have any bug reports or feature requests.

The library can be retrieved from http://www.csh.rit.edu/~ben/projects/pdfparser/

-Ben Litchfield