[
https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280238#comment-13280238
]
Michael McCandless commented on PDFBOX-1299:
--------------------------------------------
Thanks Timo!
> BaseParser.readUntilEndOfStream can stop too early, causing IOException on
> valid PDFs
> -------------------------------------------------------------------------------------
>
> Key: PDFBOX-1299
> URL: https://issues.apache.org/jira/browse/PDFBOX-1299
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Michael McCandless
> Assignee: Timo Boehme
> Fix For: 1.7.0
>
> Attachments: PDFBOX-1299.patch,
> Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, stopping once it sees "endstream".
> The problem with this approach is that the stream data itself sometimes
> contains "endstream", causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
> However, the stream dict declares the stream length (in bytes)... so
> it seems like we should be respecting that length (if present) and
> simply copy over that many bytes, instead of scanning the stream bytes
> for endstream? This should be a lot faster too...
> I imagine we always scan so that we are more robust if the length is
> missing/invalid? Is that why this method was used? (I don't know the
> history here...). If so, maybe we can have an option to use
> the declared stream length if present.
> I have a patch to use the declared stream length (if present), and it enables
> at least this test PDF to parse correctly.
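
The idea in the description above can be sketched roughly as follows. This is
only an illustration, not the attached patch and not PDFBox's actual API: the
class StreamLengthSketch and the method readStreamBody are hypothetical names,
and the sketch works on an in-memory byte array rather than on the parser's
input stream. It trusts the declared length only when "endstream" really
follows at that offset, and otherwise falls back to scanning.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; none of these names come from PDFBox.
public class StreamLengthSketch {
    static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.US_ASCII);

    // Returns the stream body beginning at 'start'. A negative declaredLength
    // means the stream dictionary gave no usable length.
    static byte[] readStreamBody(byte[] data, int start, long declaredLength) {
        if (declaredLength >= 0 && start + declaredLength <= data.length) {
            int end = (int) (start + declaredLength);
            // Sanity check: only trust the length if "endstream" follows it.
            if (followedByEndstream(data, end)) {
                return Arrays.copyOfRange(data, start, end);
            }
        }
        // Fallback: scan for the first "endstream", as the current approach does.
        // This can stop too early when the body itself contains the keyword,
        // and the result keeps any end-of-line bytes that precede the keyword.
        int idx = indexOf(data, ENDSTREAM, start);
        if (idx < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(data, start, idx);
    }

    // True if, after optional CR/LF/space bytes, "endstream" starts at 'pos'.
    static boolean followedByEndstream(byte[] data, int pos) {
        while (pos < data.length
                && (data[pos] == '\r' || data[pos] == '\n' || data[pos] == ' ')) {
            pos++;
        }
        return matchesAt(data, ENDSTREAM, pos);
    }

    // True if 'needle' occurs in 'data' starting exactly at 'pos'.
    static boolean matchesAt(byte[] data, byte[] needle, int pos) {
        if (pos < 0 || pos + needle.length > data.length) {
            return false;
        }
        for (int i = 0; i < needle.length; i++) {
            if (data[pos + i] != needle[i]) {
                return false;
            }
        }
        return true;
    }

    // Index of the first occurrence of 'needle' at or after 'from', or -1.
    static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length <= data.length; i++) {
            if (matchesAt(data, needle, i)) {
                return i;
            }
        }
        return -1;
    }
}
```

With a body such as "AAendstreamBB" followed by a newline and "endstream",
passing the declared length of 13 returns the whole body, while passing a
negative length (no declared length) falls back to scanning and stops after
"AA", which is the early stop described in this issue.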