Text extraction from PDF documents is a fairly complex problem and is a delicate balance between speed/memory/accuracy/...
How are you measuring your memory usage? In my opinion your two viable options are PDFBox(directly or via slides PDFExtractor) and PDFTextStream. They both integrate with lucene fairly easily. I highly suggest you do some tests against your own set of PDF documents. A new version of PDFBox was released this weekend and does have some improvements in terms of speed and memory. Ben Litchfield PDFBox http://www.pdfbox.org/ On Mon, 12 Sep 2005 [EMAIL PROTECTED] wrote: > Thanks for reply Jeroen ...does anyone have any > experience / comments regarding the use of PDFTextStream > versus PDFExtractor for working with PDF files ...the > issue for us is that there appears to be very high > memory usage when we work with PDF's using PDFExtractor. > > I have heard that PDFTextStream may be a better solution. > > Rod > > -----Original Message----- > From: Jeroen Reijn [mailto:[EMAIL PROTECTED] > Sent: Monday, September 12, 2005 11:58 AM > To: java-user@lucene.apache.org > Subject: Re: PDFBox PDFExtractor > > Hi Rod, > > PDFBox is a seperate project. The PDFExtractor in Jakarta Slide uses > PDFBox's > functionality to extract the information from the .pdf file. > > Hope this answers your question. > > Jeroen > > > [EMAIL PROTECTED] wrote: > > Hi, > > > > > > > > I am new to Lucene and looking at some existing Lucene code.... > > > > > > > > I am confused about the relationship ( if any ) between > > > > org.apache.slide.extractor.PDFExtractor methods and org.PDFBox.cos > > methods > > > > for the purposes of working with PDF files. > > > > > > > > I have found info on the web regarding PDFBox, however, I have found > > little > > > > regarding .PDFExtractor. > > > > > > > > I am curious since we are having some issues with indexing PDF files > and > > > > I am wondering if PDFExtractor implements PDFBox or if it is a > separate > > > > utility set. > > > > > > > > Rod. > > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]