Re: OutOfMemoryError from FlateFilter (could be PDFBOX-453 again)

Maruan Sahyoun Tue, 09 Apr 2013 06:13:04 -0700

Hi Doug,

Although this is not a direct answer to your question, could you try the
PDFBox command-line tool ExtractText with your PDF and see whether it gives a
similar result? Please also try it with the -nonSeq option. Ideally, test with
PDFBox 1.8.0 in addition to 1.7.0 to see whether the issue is already
fixed.
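
On trapping the error itself: OutOfMemoryError is an Error, not an Exception,
so a catch (Exception e) block will not see it. Below is a minimal sketch,
assuming each document is extracted in its own call, of a wrapper that catches
it per document so a batch run can continue. SafeExtract and attempt are
hypothetical names, not part of PDFBox or Tika. Since your trace shows the
error coming out of the native Inflater.inflateBytes call, this only contains
the failure; it does not fix the cause, and the JVM may still be short of
memory afterwards.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of PDFBox or Tika.
final class SafeExtract {

    private SafeExtract() {
    }

    /**
     * Runs one extraction task and returns its text.
     * Returns null if the task failed, including with OutOfMemoryError.
     */
    static String attempt(String name, Callable<String> task) {
        try {
            return task.call();
        } catch (OutOfMemoryError e) {
            // Error is not a subclass of Exception, so it needs its own clause.
            System.err.println("Skipping " + name + ": out of memory during extraction");
            return null;
        } catch (Exception e) {
            // Ordinary parse failures (IOException, ClassCastException, ...).
            System.err.println("Skipping " + name + ": " + e);
            return null;
        }
    }
}
```

The Callable passed in would wrap whatever extraction call you already make
(for example the Tika parse call), and a null result marks a document to log
and skip. Anonymous classes are used instead of lambdas so the sketch also
compiles on JDK 1.6.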


BR
Maruan Sahyoun

On 09.04.2013 at 14:52, Doug Sackin <dsac...@gmail.com> wrote:

> Has anyone else encountered recent problems with FlateFilter and
> OutOfMemoryError? Is there any way to trap it before it results in an
> OutOfMemoryError?
> 
> Thanks
> 
> Doug
> 
> 
> On Mon, Apr 1, 2013 at 2:13 PM, Doug Sackin <dsac...@gmail.com> wrote:
> 
>> I appear to have something similar to the bug identified and fixed in
>> PDFBOX-453 - FlateFilter.decode() throwing OutOfMemoryError.
>> 
>> I'm doing text extraction through Twister Data Framework using Tika 1.2
>> which calls PDFBox. I have PDFBox 1.7. My OS is Scientific Linux 5.8. Java
>> is JDK 1.6.0_37.
>> 
>> The offending exception is below:
>> 
>> Caused by: java.lang.OutOfMemoryError
>>    at java.util.zip.Inflater.inflateBytes(Native Method)
>>    at java.util.zip.Inflater.inflate(Inflater.java:238)
>>    at java.util.zip.Inflater.inflate(Inflater.java:256)
>>    at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
>>    at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
>>    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:279)
>>    at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
>>    at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
>>    at org.apache.pdfbox.pdmodel.common.COSStreamArray.getUnfilteredStream(COSStreamArray.java:196)
>>    at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:108)
>>    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:253)
>>    at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:237)
>>    at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:217)
>>    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:448)
>>    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:372)
>>    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:328)
>>    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>>    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>>    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>    at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>> 
>> Before that, I have a long string of exceptions from PDFBox attempts on
>> PDF files, interspersed by "FlateFilter: stop reading corrupt stream due to
>> a DataFormatException". These are in the attached log file.
>> 
>> The other exceptions are IndexOutOfBoundsException, ClassCastException,
>> NegativeArraySizeException, NullPointerException, IOException (regarding
>> font(COSName{F2}) in map{}), and IllegalArgumentException. These may or may
>> not be related (the exceptions appear on different files), but I wonder
>> if they corrupted the stream enough that PDFBox attempted to inflate
>> corrupt data.
>> 
>> If it is the same issue, it was reported to be fixed in 0.8. If it is a
>> new issue, is it possible to fix it? I cannot provide any of the source PDF
>> files (client data), but I am attaching the log output containing all of
>> the exception traces including the final OutOfMemoryError.
>> 
>> Thanks for any insights.
>> 
>> Doug
>> 
>> 
>> 
>>
