[jira] [Commented] (PDFBOX-1299) BaseParser.readUntilEndOfStream can stop too early, causing IOException on valid PDFs

Michael McCandless (JIRA) Mon, 21 May 2012 08:52:42 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13280238#comment-13280238 ]


Michael McCandless commented on PDFBOX-1299:
--------------------------------------------

Thanks Timo!
                
> BaseParser.readUntilEndOfStream can stop too early, causing IOException on 
> valid PDFs
> -------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1299
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1299
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Michael McCandless
>            Assignee: Timo Boehme
>             Fix For: 1.7.0
>
>         Attachments: PDFBOX-1299.patch, 
> Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, stopping once it sees "endstream".
> The problem with this approach is that the stream data itself sometimes
> contains "endstream", causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
> However, the stream dict declares the stream length (in bytes), so
> it seems like we should respect that length (if present) and
> simply copy over that many bytes instead of scanning the stream bytes
> for "endstream". This should be a lot faster too.
> I imagine we always scan so that we are more robust if the length is
> missing or invalid? Is that why this method was used? (I don't know the
> history here.) If so, maybe we can have an option to use
> the declared stream length if present.
> I have a patch that uses the declared stream length (if present), and it
> enables at least this test PDF to parse correctly.
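
The approach described above can be sketched roughly as follows. This is not the attached patch and not PDFBox's actual code; the class StreamBodySketch and its methods are hypothetical names, and the sketch works on an in-memory byte array rather than PDFBox's input streams. It assumes the declared length is only trusted when the "endstream" keyword (after an optional end-of-line) actually follows it; otherwise it falls back to scanning.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of PDFBox.
final class StreamBodySketch {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the stream body that begins at 'start'.
     * A negative declaredLength means the stream dict had no usable length.
     */
    static byte[] readStreamBody(byte[] pdf, int start, long declaredLength) {
        // Preferred path: trust the declared length, but only if the
        // "endstream" keyword really follows the declared number of bytes.
        if (declaredLength >= 0 && start + declaredLength <= pdf.length) {
            int end = (int) (start + declaredLength);
            if (endstreamFollows(pdf, end)) {
                return Arrays.copyOfRange(pdf, start, end);
            }
        }
        // Fallback path: scan for the first "endstream". This is the behavior
        // that can stop too early when the data itself contains the keyword.
        // The returned bytes may include an end-of-line before the keyword.
        int idx = indexOf(pdf, ENDSTREAM, start);
        if (idx < 0) {
            throw new IllegalArgumentException("no endstream keyword found");
        }
        return Arrays.copyOfRange(pdf, start, idx);
    }

    /** True if "endstream" appears at pos, after skipping CR/LF bytes. */
    static boolean endstreamFollows(byte[] pdf, int pos) {
        while (pos < pdf.length && (pdf[pos] == '\r' || pdf[pos] == '\n')) {
            pos++;
        }
        return matchesAt(pdf, ENDSTREAM, pos);
    }

    /** Index of the first occurrence of needle at or after 'from', or -1. */
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length <= haystack.length; i++) {
            if (matchesAt(haystack, needle, i)) {
                return i;
            }
        }
        return -1;
    }

    /** True if needle occurs in haystack starting exactly at pos. */
    static boolean matchesAt(byte[] haystack, byte[] needle, int pos) {
        if (pos < 0 || pos + needle.length > haystack.length) {
            return false;
        }
        for (int j = 0; j < needle.length; j++) {
            if (haystack[pos + j] != needle[j]) {
                return false;
            }
        }
        return true;
    }
}
```

For example, given the 13 data bytes AAendstreamBB followed by a newline and the real "endstream", a declared length of 13 returns the full body, while a missing length makes the sketch scan and stop after AA.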

