Known limitations here:
http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html

HTH.

Regards,
Kelvin

PS: The PJ library is GPL'ed. Commercial licenses go for $5,000 per 100 copies
(1 CPU per copy).
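
PPS: On the word-boundary problem Eliot mentions in the quoted message
below, here is a minimal sketch of one possible heuristic. It is not the
attached PdfTextExtractor.java and it does not use the PJ API. It assumes the
text-showing array has already been pulled out of the PDF as a list of strings
and numeric shift values (as in a PDF TJ array, where a negative number moves
the following text further right). The class name, the method name and the
-200 threshold are all hypothetical, and the threshold would need tuning per
font and formatter.

```java
import java.util.Arrays;
import java.util.List;

public class WordBoundarySketch {

    // Hypothetical cut-off, in thousandths of a text-space unit.
    // Shifts at or below this value are treated as a word gap.
    static final double WORD_GAP_THRESHOLD = -200.0;

    /**
     * Joins the string pieces of a TJ-style array, inserting a space
     * wherever the numeric shift between two pieces looks like a word gap.
     */
    static String joinTextArray(List<Object> elements) {
        StringBuilder out = new StringBuilder();
        for (Object element : elements) {
            if (element instanceof String) {
                out.append((String) element);
            } else if (element instanceof Number) {
                double shift = ((Number) element).doubleValue();
                boolean endsWithSpace =
                    out.length() > 0 && out.charAt(out.length() - 1) == ' ';
                // Only add a space for a large rightward shift, and never
                // at the very start or directly after an existing space.
                if (shift <= WORD_GAP_THRESHOLD && out.length() > 0 && !endsWithSpace) {
                    out.append(' ');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Small shifts (-15, 20) are kerning; -310 is treated as a word gap.
        List<Object> tj = Arrays.<Object>asList("Hel", -15, "lo", -310, "Wor", 20, "ld");
        System.out.println(joinTextArray(tj)); // prints "Hello World"
    }
}
```

A fixed threshold is the simplest choice. It will misjudge documents with
very tight or very loose spacing, so treat it as a starting point only.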

----- Original Message -----
From: "Kelvin Tan" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2002 9:09 AM
Subject: Re: indexing and searching different file formats


> Uhmmm, I can contribute something which does a pretty decent job if
> anyone's interested...
>
> Just have to clean it up a little...
>
> Regards,
> Kelvin
> ----- Original Message -----
> From: "W. Eliot Kimber" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Friday, February 15, 2002 1:10 AM
> Subject: Re: indexing and searching different file formats
>
>
> > Andrew Libby wrote:
> >
> > > and the text needs to be retrieved for indexing.  An extreme example is
> > > a PDF which has a considerably complicated document format.
> >
> > The PJ library from www.etymon.com provides a pretty complete and
> > easy-to-use API for getting info from PDF docs. It wouldn't be too hard
> > to write a PDF indexer for Lucene using this library. The main challenge
> > would be guessing word boundaries in strings where spaces have been
> > replaced with explicit shift values by the formatter.
> >
> > Cheers,
> >
> > Eliot
> > --
> > W. Eliot Kimber, [EMAIL PROTECTED]
> > Consultant, ISOGEN International
> >
> > 1016 La Posada Dr., Suite 240
> > Austin, TX  78752 Phone: 512.656.4139
> >
> > --
> > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> >
> >
>
>

Attachment: PdfTextExtractor.java
Description: Binary data

