Andrew Libby wrote:

> and the text needs to be retrieved for indexing.  An extreeme example is
> a PDF which has a considerably complicated document format.

The PJ library from www.etymon.com provides a pretty complete and
easy-to-use API for getting info from PDF docs. It wouldn't be too hard
to write a PDF indexer for Lucene using this library. The main challenge
would be guessing word boundaries in strings where spaces have been
replaced with explicit shift values by the formatter.

Cheers,

Eliot
-- 
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to