[ https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated PDFBOX-1299:
---------------------------------------

    Attachment: Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf

Sorry, wrong attachment: this one is the right one.

> BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1299
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1299
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Michael McCandless
>         Attachments: PDFBOX-1299.patch, Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, stopping once it sees "endstream".
> The problem with this approach is sometimes the stream data itself
> contains endstream causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
>
> However, the stream dict declares the stream length (in bytes)... so
> it seems like we should be respecting that length (if present) and
> simply copy over that many bytes, instead of scanning the stream bytes
> for endstream? This should be a lot faster too...
>
> I imagine we always scan so that we are more robust if the length is
> missing/invalid? Is that why this method was used? (I don't know the
> history here...). If so, maybe we can have an option to use
> the declared stream length if present.
>
> I have a patch to use the declared stream length (if present), and it enables
> at least this test PDF to correctly parse.
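
The idea described in the issue (trust the declared length when it checks out, otherwise fall back to scanning) can be sketched roughly as follows. This is only an illustration under stated assumptions: it is not the attached PDFBOX-1299.patch and not PDFBox's BaseParser code. The class StreamLengthSketch and its methods are hypothetical names, and the sketch works on an in-memory byte array rather than PDFBox's input stream types.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not PDFBox code: read a stream body using the declared
// length when it is plausible, otherwise scan for the "endstream" keyword.
public class StreamLengthSketch {
    private static final byte[] END_STREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    // True if pattern occurs in data starting exactly at pos.
    static boolean matchesAt(byte[] data, int pos, byte[] pattern) {
        if (pos < 0 || pos + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Scanning approach: stop at the first "endstream", even if that
    // keyword is really part of the stream data.
    static byte[] readByScanning(byte[] data, int start) {
        for (int i = start; i + END_STREAM.length <= data.length; i++) {
            if (matchesAt(data, i, END_STREAM)) {
                return Arrays.copyOfRange(data, start, i);
            }
        }
        throw new IllegalArgumentException("no endstream keyword found");
    }

    // Length-first approach: copy declaredLength bytes, but only if
    // "endstream" follows them (after optional whitespace). A missing or
    // implausible length falls back to scanning.
    static byte[] readStreamBody(byte[] data, int start, Integer declaredLength) {
        if (declaredLength != null && declaredLength >= 0
                && declaredLength <= data.length - start) {
            int pos = start + declaredLength;
            while (pos < data.length
                    && (data[pos] == '\r' || data[pos] == '\n' || data[pos] == ' ')) {
                pos++;
            }
            if (matchesAt(data, pos, END_STREAM)) {
                return Arrays.copyOfRange(data, start, start + declaredLength);
            }
        }
        return readByScanning(data, start);
    }

    public static void main(String[] args) {
        // Stream data that itself contains "endstream", as an embedded PDF would.
        byte[] payload = "inner stream\nABC\nendstream\nmore inner bytes"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] trailer = "\nendstream\nendobj\n".getBytes(StandardCharsets.US_ASCII);
        byte[] data = Arrays.copyOf(payload, payload.length + trailer.length);
        System.arraycopy(trailer, 0, data, payload.length, trailer.length);

        System.out.println("declared length: " + payload.length);
        System.out.println("scan only:       " + readByScanning(data, 0).length);
        System.out.println("length first:    "
                + readStreamBody(data, 0, payload.length).length);
    }
}
```

Checking that "endstream" really follows the declared length is one way to stay robust against a missing or invalid length entry, which is the concern raised in the description. Whether the actual patch validates the length this way is not stated here.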