Andrew Libby wrote:
> and the text needs to be retrieved for indexing. An extreme example is
> a PDF which has a considerably complicated document format.

The PJ library from www.etymon.com provides a pretty complete and
easy-to-use API for getting info from PDF docs. It wouldn't be too hard
to write a PDF indexer for Lucene using this library. The main challenge
would be guessing word boundaries in strings where spaces have been
replaced with explicit shift values by the formatter.

Cheers,

Eliot
--
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX 78752 Phone: 512.656.4139
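
To illustrate the word-boundary problem mentioned above, here is a minimal sketch of one possible heuristic. It does not use the PJ API. It assumes the text has already been pulled out as a flat list in which String elements are glyph runs and Number elements are the formatter's shift values, as in the array operand of a PDF TJ operator, where a negative number widens the gap in horizontal writing. The class name ShiftedTextJoiner, the method join, and the threshold of 200 (thousandths of a text-space unit) are all hypothetical choices for illustration, not anything taken from PJ or Lucene.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of PJ or Lucene. */
final class ShiftedTextJoiner {

    // Assumed default: a gap of 200/1000 of a text-space unit counts as a space.
    static final double DEFAULT_GAP_THRESHOLD = 200.0;

    /**
     * Rebuilds plain text from a mixed list of String runs and numeric shifts.
     * A shift that opens a gap of at least gapThreshold is treated as a word break.
     */
    static String join(List<?> elements, double gapThreshold) {
        StringBuilder out = new StringBuilder();
        for (Object e : elements) {
            if (e instanceof String) {
                out.append((String) e);
            } else if (e instanceof Number) {
                double shift = ((Number) e).doubleValue();
                // Negative shift = wider gap, so compare the negated value.
                boolean wideGap = -shift >= gapThreshold;
                boolean needsSpace = out.length() > 0
                        && out.charAt(out.length() - 1) != ' ';
                if (wideGap && needsSpace) {
                    out.append(' ');
                }
            }
            // Any other element type is ignored in this sketch.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Object> run = Arrays.<Object>asList("Index", -300, "PDF", 40, "s");
        System.out.println(join(run, DEFAULT_GAP_THRESHOLD));
    }
}
```

A fixed threshold is only a starting point. In practice the cutoff would probably need to scale with the width of the space glyph in the current font, and small kerning adjustments between letters should stay well below it. The resulting string could then be handed to Lucene as the text of a field.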