Michael McCandless created PDFBOX-1299:
------------------------------------------

             Summary: BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs
                 Key: PDFBOX-1299
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1299
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 1.6.0
            Reporter: Michael McCandless
         Attachments: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf

The purpose of BaseParser.readUntilEndOfStream is to scan ahead, copying bytes to the output and stopping once it sees "endstream".

The problem with this approach is that the stream data itself sometimes contains "endstream", which causes readUntilEndOfStream to stop too early. This can legitimately happen when the stream is an embedded PDF; I'll attach a test PDF showing this.

However, the stream dict declares the stream length (in bytes), so it seems like we should respect that length (if present) and simply copy over that many bytes, instead of scanning the stream bytes for "endstream". This should be a lot faster too.

I imagine we always scan so that we are more robust if the length is missing or invalid? Is that why this method was used? (I don't know the history here.) If so, maybe we can have an option to use the declared stream length if present.

I have a patch that uses the declared stream length (if present), and it enables at least this test PDF to parse correctly.
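
To make the idea concrete, here is a minimal sketch of "use the declared length if it looks valid, otherwise fall back to scanning". It is not the attached patch and not PDFBox code: the class StreamLengthSketch and the method readStreamBody are hypothetical names, and the sketch works on a plain byte array rather than on the parser's input source. It assumes the declared length can be trusted when the "endstream" keyword appears right after that many bytes, optionally preceded by end-of-line bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class StreamLengthSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the stream body that begins at offset start.
     * declaredLength is the /Length value from the stream dict, or -1 if absent.
     */
    static byte[] readStreamBody(byte[] pdf, int start, int declaredLength) {
        if (declaredLength >= 0 && (long) start + declaredLength <= pdf.length) {
            int end = start + declaredLength;
            // Trust the declared length only if "endstream" follows it,
            // allowing for end-of-line bytes that may precede the keyword.
            if (matchesAt(pdf, skipEol(pdf, end), ENDSTREAM)) {
                return Arrays.copyOfRange(pdf, start, end);
            }
        }
        // Length missing or implausible: fall back to scanning for the keyword.
        // This is the behaviour that stops too early on embedded PDFs, and it
        // keeps any end-of-line bytes that precede the keyword.
        int idx = indexOf(pdf, ENDSTREAM, start);
        if (idx < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(pdf, start, idx);
    }

    // Advances past any CR or LF bytes starting at pos.
    private static int skipEol(byte[] data, int pos) {
        while (pos < data.length && (data[pos] == '\r' || data[pos] == '\n')) {
            pos++;
        }
        return pos;
    }

    // True if keyword occurs in data exactly at pos.
    private static boolean matchesAt(byte[] data, int pos, byte[] keyword) {
        if (pos + keyword.length > data.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (data[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }

    // Naive search for keyword in data, starting at from; returns -1 if absent.
    private static int indexOf(byte[] data, byte[] keyword, int from) {
        for (int i = from; i + keyword.length <= data.length; i++) {
            if (matchesAt(data, i, keyword)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A 15-byte stream body that itself contains the keyword.
        byte[] pdf = "AAAendstreamBBB\nendstream\nendobj".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(readStreamBody(pdf, 0, 15), StandardCharsets.US_ASCII));
        System.out.println(new String(readStreamBody(pdf, 0, -1), StandardCharsets.US_ASCII));
    }
}
```

With the declared length of 15 the whole body "AAAendstreamBBB" is returned; with no length (-1) the scan stops at the first embedded keyword and returns only "AAA", which is the failure described above. A declared length that does not land on "endstream" also falls back to scanning, so a bad /Length behaves no worse than today.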