Known limitations here: http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html
HTH.

Regards,
Kelvin

PS: Pj library is GPL'ed. Commercial licenses go for $5,000 per 100 copies (1 CPU per copy).

----- Original Message -----
From: "Kelvin Tan" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2002 9:09 AM
Subject: Re: indexing and searching different file formats

> Uhmmm, I can contribute something which does a pretty decent job if anyone's
> interested...
>
> Just have to clean it up a little...
>
> Regards,
> Kelvin
> ----- Original Message -----
> From: "W. Eliot Kimber" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Friday, February 15, 2002 1:10 AM
> Subject: Re: indexing and searching different file formats
>
> > Andrew Libby wrote:
> >
> > > and the text needs to be retrieved for indexing. An extreme example is
> > > a PDF which has a considerably complicated document format.
> >
> > The PJ library from www.etymon.com provides a pretty complete and
> > easy-to-use API for getting info from PDF docs. It wouldn't be too hard
> > to write a PDF indexer for Lucene using this library. The main challenge
> > would be guessing word boundaries in strings where spaces have been
> > replaced with explicit shift values by the formatter.
> >
> > Cheers,
> >
> > Eliot
> > --
> > W. Eliot Kimber, [EMAIL PROTECTED]
> > Consultant, ISOGEN International
> >
> > 1016 La Posada Dr., Suite 240
> > Austin, TX 78752 Phone: 512.656.4139
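
For anyone wondering what the word-boundary problem Eliot mentions looks like in practice, here is a rough sketch. It is not the attached extractor and does not use the Pj API. It assumes the text-showing array has already been pulled out of the PDF as a list of strings and numeric shifts (as in a PDF TJ array, where each number is in thousandths of a text-space unit and is subtracted from the current position, so a large negative value opens a gap). The class name ShiftWordJoiner and the -200 threshold are made up for illustration; a real threshold would likely depend on the font and its space width.

```java
import java.util.List;

/** Hypothetical helper: rebuilds text from a TJ-style array of strings and shifts. */
public class ShiftWordJoiner {

    // Assumed cutoff, not taken from any library: shifts at or below this
    // value (in thousandths of a text-space unit) are treated as a word gap.
    static final double SPACE_THRESHOLD = -200.0;

    /**
     * Concatenates the string elements and inserts a single space wherever a
     * numeric shift is at or below SPACE_THRESHOLD. Smaller shifts are
     * treated as kerning and ignored.
     */
    public static String join(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object item : tjArray) {
            if (item instanceof String) {
                sb.append((String) item);
            } else if (item instanceof Number) {
                double shift = ((Number) item).doubleValue();
                boolean atStart = sb.length() == 0;
                boolean afterSpace = !atStart && sb.charAt(sb.length() - 1) == ' ';
                // Avoid leading spaces and doubled spaces.
                if (shift <= SPACE_THRESHOLD && !atStart && !afterSpace) {
                    sb.append(' ');
                }
            }
        }
        return sb.toString();
    }
}
```

With the input ["Hel", 15, "lo", -300, "wor", 20, "ld"], join returns "Hello world": the small shifts of 15 and 20 are treated as kerning, and only the -300 becomes a space. The resulting string is what you would then hand to Lucene as the field text to index.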
PdfTextExtractor.java
Description: Binary data
--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>