Hi Doug,

Although this is not a direct answer to your question, could you try the PDFBox command line tool ExtractText with your PDF and see whether it gives a similar result? Please also try it with the -nonSeq option. Ideally, test with PDFBox 1.8.0 in addition to 1.7.0 to see whether the issue has already been fixed.
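
On the question of trapping the problem before it turns into an OutOfMemoryError: one generic approach is to cap how far a Flate stream is allowed to expand and to fail with an ordinary IOException once the cap is hit. The sketch below uses only java.util.zip from the JDK. BoundedInflate and its maxBytes parameter are hypothetical names for illustration, not part of PDFBox or Tika, and this is not how FlateFilter is implemented. It only limits Java-side buffer growth, so it may not help if the error is raised by native memory exhaustion inside zlib itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical helper (not a PDFBox class): inflates data but refuses to exceed a size cap. */
final class BoundedInflate {

    /**
     * Inflates zlib-wrapped data, throwing IOException if the output would
     * exceed maxBytes or if the stream is corrupt or truncated.
     */
    static byte[] inflate(byte[] compressed, int maxBytes) throws IOException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()) {
                    // No progress and not done: input ran out or a preset dictionary is required.
                    throw new IOException("Truncated or unsupported Flate stream");
                }
                if (n > maxBytes - out.size()) {
                    // Stop before buffering more than the caller allows.
                    throw new IOException("Inflated data exceeds limit of " + maxBytes + " bytes");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IOException("Corrupt Flate stream", e);
        } finally {
            inflater.end(); // release native zlib resources promptly
        }
        return out.toByteArray();
    }
}
```

A caller could then catch IOException per document and skip the offending file instead of losing the whole JVM. The right value for the cap depends on the heap available and the largest legitimate streams in the corpus.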

BR
Maruan Sahyoun

On 09.04.2013 at 14:52, Doug Sackin <dsac...@gmail.com> wrote:

> Has anyone else encountered recent problems with FlateFilter and
> OutOfMemory errors? Is there any way to trap it before it results in an
> OutOfMemory exception?
>
> Thanks
>
> Doug
>
>
> On Mon, Apr 1, 2013 at 2:13 PM, Doug Sackin <dsac...@gmail.com> wrote:
>
>> I appear to have something similar to the bug identified and fixed in
>> PDFBOX-453 - FlateFilter.decode() throwing OutOfMemoryError.
>>
>> I'm doing text extraction through Twister Data Framework using Tika 1.2,
>> which calls PDFBox. I have PDFBox 1.7. My OS is Scientific Linux 5.8. Java
>> is JDK 1.6.0_37.
>>
>> The offending exception is below:
>>
>> Caused by: java.lang.OutOfMemoryError
>>     at java.util.zip.Inflater.inflateBytes(Native Method)
>>     at java.util.zip.Inflater.inflate(Inflater.java:238)
>>     at java.util.zip.Inflater.inflate(Inflater.java:256)
>>     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>>     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
>>     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>>     at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>>     at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:196)
>>     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
>>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
>>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
>>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
>>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>     at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>
>> Before that, I have a long string of exceptions from PDFBox attempts on
>> PDF files, interspersed with "FlateFilter: stop reading corrupt stream due to
>> a DataFormatException". These are in the attached log file.
>>
>> The other exceptions are IndexOutOfBounds, ClassCastException,
>> NegativeArraySizeException, NullPointerException, IOException (regarding
>> font(COSName}F2}) in map{}), IllegalArgumentException. These may or may not
>> be related (the exceptions are appearing on different files), but I wonder
>> if they served to corrupt the stream sufficiently that PDFBox attempted
>> to inflate corrupt data.
>>
>> If it is the same issue, it was reported to be fixed in 0.8. If it is a
>> new issue, is it possible to fix it? I cannot provide any of the source PDF
>> files (client data), but I am attaching the log output containing all of
>> the exception traces, including the final OutOfMemoryError.
>>
>> Thanks for any insights.
>>
>> Doug