Alan Burlison created PDFBOX-2493:
-------------------------------------

             Summary: OOM with corrupt PDF file
                 Key: PDFBOX-2493
                 URL: https://issues.apache.org/jira/browse/PDFBOX-2493
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.8.6
         Environment: Linux, JVM 1.8.0_25 (64-bit)
            Reporter: Alan Burlison
            Priority: Blocker


I have a large archive of PDF files, some of which are unfortunately corrupt. 
I'm scanning them using a webapp and Tika, which in turn uses PDFBox. One file 
produces errors with Tika 1.4 and 1.5. With Tika 1.6 (which uses PDFBox 1.8.6) 
it produces errors and also causes PDFBox to consume ~4GB of heap before 
descending into a GC death-spiral. Unfortunately I can't provide the PDF file 
that causes this, as the contents are confidential. Because Tika/PDFBox are 
being used from inside a webapp, I can cope with errors being thrown, but the 
OOM caused by 1.8.6 is a blocker and I've had to revert to Tika 1.5, which in 
turn uses PDFBox 1.8.4.
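
For anyone hitting the same failure from inside a webapp, one possible way to 
contain it is to run the extraction in a child JVM with a capped heap and a 
timeout, so that a runaway parse kills only the child. The sketch below is an 
assumption and not part of the setup described in this report: the class name 
IsolatedExtractor, the jar path, the heap size and the timeout are all 
hypothetical, and it assumes the Tika app jar is started with its --text 
option.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs Tika text extraction in a separate, heap-capped JVM.
public class IsolatedExtractor {

    // Builds the child JVM command line. tikaJar and pdfPath are caller-supplied paths.
    static List<String> buildCommand(String tikaJar, String pdfPath, int maxHeapMb) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Xmx" + maxHeapMb + "m"); // cap the child's heap, not the webapp's
        cmd.add("-jar");
        cmd.add(tikaJar);
        cmd.add("--text");                 // assumed: plain-text output from the Tika CLI
        cmd.add(pdfPath);
        return cmd;
    }

    // Returns true only if the child exits with status 0 within the timeout.
    static boolean extract(String tikaJar, String pdfPath, File out,
                           int maxHeapMb, long timeoutSec)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tikaJar, pdfPath, maxHeapMb));
        pb.redirectOutput(out);                            // extracted text goes to a file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // parser log goes to our stderr
        Process p = pb.start();
        if (!p.waitFor(timeoutSec, TimeUnit.SECONDS)) {
            p.destroyForcibly();                           // stuck or thrashing in GC: kill it
            return false;
        }
        return p.exitValue() == 0;
    }
}
```

A call such as IsolatedExtractor.extract("tika-app.jar", "in.pdf", new 
File("out.txt"), 512, 60) would then report a failed file instead of taking 
the whole webapp down with it, at the cost of a JVM start per document.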

ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
INFO - unsupported/disabled operation: >
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
INFO - unsupported/disabled operation: B110EBE04050412
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
INFO - unsupported/disabled operation: B0F0F07100D05050603140D10093E0903DB06050E3C0405D
INFO - unsupported/disabled operation: E
INFO - unsupported/disabled operation: C
INFO - unsupported/disabled operation: B051A0E0C0E130B060B0C0D050640750D020E0D050DE506400C13010B050271
INFO - unsupported/disabled operation: A
INFO - unsupported/disabled operation: D
INFO - unsupported/disabled operation: B100
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
INFO - unsupported/disabled operation: B5
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.MoveText.process(MoveText.java:41)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
INFO - unsupported/disabled operation: B020903110B06050E0F051E0C67A31C05340D000E0D03070C05074
INFO - unsupported/disabled operation: B11160B1005
INFO - unsupported/disabled operation: FB
INFO - unsupported/disabled operation: B230C0B12
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
INFO - unsupported/disabled operation: DE
INFO - unsupported/disabled operation: B10050E1F0506AE080B230C0B1419C50E3C0
INFO - unsupported/disabled operation: B05650A09010D0B3F1103B
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
INFO - unsupported/disabled operation: B0C0D2419
INFO - unsupported/disabled operation: B040503020B0
WARN - java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:44)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - error: array index out of bounds
java.lang.ArrayIndexOutOfBoundsException: 3337
        at org.apache.fontbox.ttf.GlyfSimpleDescript.readFlags(GlyfSimpleDescript.java:199)
        at org.apache.fontbox.ttf.GlyfSimpleDescript.<init>(GlyfSimpleDescript.java:78)
        at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:57)
        at org.apache.fontbox.ttf.GlyphTable.initData(GlyphTable.java:69)
        at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
        at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:233)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
        at org.apache.fontbox.ttf.GlyfCompositeDescript.<init>(GlyfCompositeDescript.java:58)
        at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:62)
        at org.apache.fontbox.ttf.GlyphTable.initData(GlyphTable.java:69)
        at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:128)
        at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:80)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:109)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:84)
        at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:25)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getTTFFont(PDTrueTypeFont.java:632)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:673)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:233)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:411)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:143)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:422)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:113)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
