Robert,

If you supply your code, I'll add it to the contributions area. It would be great to have some code that already converts the PDF directly to a Lucene Document.
--Peter

On 2/16/02 8:36 PM, "Robert MacMillan" <[EMAIL PROTECTED]> wrote:

>
> I found that you can use the Etymon PJ classes
> (http://www.etymon.com/pj/) and extract the text from PDF documents with
> very little effort. The advantage of the Etymon classes is that there is no
> need for COM objects. It worked extremely well for the majority of
> documents; at the very worst, some documents would extract all the text with
> some of it out of order. (That has more to do with the layout of the document
> than anything else.)
>
> On that note, I have started working on a more effective (and efficient)
> set of classes to extract text from PDF docs. The plan was to contribute the
> classes to this community and build on the functionality over time. The
> process seems to be pretty straightforward and I hope to complete the first
> version in the near future.
>
> In the interim, if anyone would like my Etymon "implementation" I'll be
> happy to send off the code, provided whoever requests it is aware it was
> slapped together quickly as a concept test and could/should be tightened up
> a LOT. The set of classes I'm currently working on addresses a lot of the
> limitations that are visible in that implementation. (It would probably
> suffice to say it's an example of how to use the PJ classes to extract the
> text from a PDF doc.)
>
> Cheers
>
> Robert MacMillan
>
> On 2/16/02 9:59 PM, "Ivaylo Zlatev" <[EMAIL PROTECTED]> wrote:
>
>>
>> If you want to parse PDF documents, the best approach would be to use
>> the Adobe IFilter for PDF, which is a COM component. You will need to
>> write a Java client that interacts with that COM component.
>> I believe it is easily doable, but I have never done anything like
>> this.
>> It's a very interesting project, though.
>> Also, you will have to perform the PDF-to-text conversion on a Windows
>> machine.
>>
>> http://www.adobe.com/support/downloads/detail.jsp?ftpID=1276
>>
>> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/ixrefint_9sfm.asp
>>
>>
>> Regards,
>> Ivaylo Zlatev
>>
>> -----Original Message-----
>> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
>> Sent: Saturday, February 16, 2002 7:15 AM
>> To: Lucene Developers List
>> Subject: RE: HTMLParser
>>
>> Hm, I thought this place would have a PDF parser, but it does not.
>> It does seem to have an RTF parser:
>> http://cobase-www.cs.ucla.edu/pub/javacc/
>>
>> Perhaps some of these things can be adopted by Lucene, people could
>> contribute Java classes for interacting with specific parsers, and all
>> that could then be included in Lucene to work together with those
>> DocumentHandlers mentioned a few days ago.
>>
>> Otis
>
>

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
