"Moturu,Praveen" wrote: > > Good Morning to you all. Can I assume none of the poeple on the lucene user > group had implemented indexing a pdf document using lucene. If some one > has.. Please help me by providing the solution.
You can try using Eytemon's PJ library (www.eytemon.com). But be aware that the code as provided does not support some features of PDF and has some bugs that prevent it from reading some PDFs. Note also that there are some inherent problems with full-text indexing of PDFs, namely that the word order in the PDF does not necessarily reflect its reading order (for example, in two-column layouts), so if your tokenizer is doing phrase analysis it may produce incorrect results. You can see this by doing a multi-word search in Acrobat Reader on a two-column document. It can also be difficult to accurately determine word boundaries because of the way that PDF can represent text strings as sequences of characters and placement instructions. The Adobe-provided C libraries have largely solved this problem but the PJ library does not--you will have to write your own algorithms to reduce text sequences with explicit kerning instructions into meaningful tokens. Not impossible but takes a little doing. If you have money to spend you could license the Adobe PDF libraries and create a Java binding for them. It does not appear that Adobe has any plans to provide a Java library for accessing PDFs, free or otherwise. However, implementing a Java PDF reader would not be too hard--I started trying to implement one just to see how hard it would be and got as a far as being able to get page objects by page number after an intense weekend's work [unfortunately my employment contract prevents me from creating open-source software without explicit approval and I didn't want to create a PDF library that wasn't open source, so I haven't done any more work on it yet]. The PDF spec (www.pdfzone.com) is pretty clear, although the PDF format is pretty convoluted (lots of byte offsets and such). But once you get the basic infrastructure in place for parsing out specific objects, the rest of it is just tedious parser implementation--there are scads of different field types once you get down to text streams. Adding the business logic to figure out where things are on the page would be more involved--you'd have to implement Adobe's layout logic. However, you need this functionality in order to correlate PDF annotations (links, bookmarks, notes) to the page objects they relate to--it's all done with bounding boxes. Cheers, Eliot -- W. Eliot Kimber, [EMAIL PROTECTED] Consultant, ISOGEN International 1016 La Posada Dr., Suite 240 Austin, TX 78752 Phone: 512.656.4139 -- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>