I think most of the PDF creation knowledge in Java resides in the iText and
FOP projects, both of which are open source. It would seem that Java
PDF-writing code would be a good place to start on Java PDF-reading code.
Just a thought.

----- Original Message -----
From: Kelvin Tan <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Sent: Saturday, May 04, 2002 1:28 AM
Subject: Re: indexing PDF files

> You might want to take a look at WebSearch http://www.i2a.com/websearch/. It
> has an _ok_ system going with respect to PDFs. PDFGo supports viewing of PDF
> but a guy I contacted there says there's no current support for text
> extraction but that he's "planning to do it".
>
> Definitely agreed on the PJ resources bit. Doesn't really scale well in
> terms of PDF file size.
>
> If you haven't already seen the post, I once did a cursory examination of
> the options for extracting text from PDF files via Java and the limitations
> of the approaches.
> http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg00280.html
>
> The Etymon lib is GPL'ed, so I guess that's a nice place to start. As far as
> the libs I've seen so far, most of them are really concerned with the
> display and manipulation of PDF pages. Since we're looking for something
> less complex (i.e. text extraction), maybe it's not so bad. I've spent a bit
> of time in this area before so feel free to email me offline about this. Not
> sure how much help I can be though.
>
> ----- Original Message -----
> From: "petite_abeille" <[EMAIL PROTECTED]>
> To: "Lucene Users List" <[EMAIL PROTECTED]>
> Sent: Friday, May 03, 2002 10:57 PM
> Subject: Re: indexing PDF files
>
>
> > On Friday, May 3, 2002, at 03:16 PM, Moturu,Praveen wrote:
> >
> > > Can I assume none of the people on the lucene user group had
> > > implemented indexing a pdf document using lucene.
> >
> > Who knows...?!? In any case, it's not public knowledge...
> >
> > > If some one has.. Please help me by providing the solution.
> >
> > I used to believe in Santa Claus also... ;-)
> >
> > All that said, there seems to be a real demand to do something about pdf
> > to text conversion (in java preferably). I'm willing to invest some time
> > and brain cell to nail it down, but I'm not sure where to start...
> >
> > I'm aware of the PJ library, but it's really a pig as far as resources
> > goes. Anything else?
> >
> > Any (concrete) pointer appreciated.
> >
> > Thanks.
> >
> > PA.
> >
> > --
> > To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>