StreamCorruptedException on bad PDF with -force
-----------------------------------------------

                 Key: PDFBOX-1151
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1151
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.6.0
         Environment: Windows Vista, Sun JDK 1.6.0_26
            Reporter: Stas Shaposhnikov


I get a StreamCorruptedException when extracting text from a possibly 
invalid PDF document, even when the -force option is specified.

Stack trace:

java.io.StreamCorruptedException: Error: data is null
        at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:301)
        at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:221)
        at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:156)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:105)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:264)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
        at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:256)
        at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
        at org.apache.pdfbox.PDFBox.main(PDFBox.java:42)

My suggestion: when forceParsing is true, PDFStreamEngine.processSubStream() 
should skip bad sub-streams instead of throwing an exception.
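
A minimal sketch of the suggested behaviour follows. It is not PDFBox code: 
ForceParsingSketch, SubStream, processSubStreams and getSkipped are 
hypothetical names used only for illustration, and the real change would 
have to be made inside PDFStreamEngine.processSubStream(). The idea is 
simply to catch the decoding error per sub-stream and rethrow it only when 
forceParsing is false.

```java
import java.io.IOException;
import java.io.StreamCorruptedException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names are PDFBox classes.
class ForceParsingSketch {

    /** Stand-in for a content sub-stream whose decoding may fail. */
    interface SubStream {
        String decode() throws IOException;
    }

    private final boolean forceParsing;
    private final List<String> skipped = new ArrayList<String>();

    ForceParsingSketch(boolean forceParsing) {
        this.forceParsing = forceParsing;
    }

    /** Decodes every sub-stream; skips failing ones only if forceParsing is set. */
    String processSubStreams(List<SubStream> streams) throws IOException {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < streams.size(); i++) {
            try {
                text.append(streams.get(i).decode());
            } catch (IOException e) {
                if (!forceParsing) {
                    throw e; // strict mode: keep the current behaviour
                }
                // forced mode: remember the failure and continue
                skipped.add("sub-stream " + i + ": " + e.getMessage());
            }
        }
        return text.toString();
    }

    List<String> getSkipped() {
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        SubStream first = () -> "Hello ";
        SubStream bad = () -> {
            throw new StreamCorruptedException("Error: data is null");
        };
        SubStream last = () -> "world";

        ForceParsingSketch engine = new ForceParsingSketch(true);
        System.out.println(engine.processSubStreams(Arrays.asList(first, bad, last)));
        System.out.println("skipped: " + engine.getSkipped());
    }
}
```

With forceParsing set to true the corrupted sub-stream is recorded and the 
remaining text is still returned; with forceParsing set to false the same 
input still fails with the StreamCorruptedException, as it does today.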
