[jira] [Commented] (PDFBOX-3933) PDFParser swallows a CR at the end of a stream

Tilman Hausherr (JIRA) Tue, 19 Sep 2017 12:33:21 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172221#comment-16172221
 ]


Tilman Hausherr commented on PDFBOX-3933:
-----------------------------------------

I realize I have to retest for the sequential parser too... will do that. I 
also need to think whether to do that change in 2.*, despite that it doesn't 
fix any bug.

> PDFParser swallows a CR at the end of a stream
> ----------------------------------------------
>
>                 Key: PDFBOX-3933
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3933
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.13
>            Reporter: Petr Slaby
>         Attachments: Beispiel2.pdf, EndlinePrediction2.patch, 
> EndlinePrediction.patch
>
>
> I have a PDF which I cannot share at the moment, maybe later if I get a 
> permission from the customer. 
> The PDF is protected by an empty password, all streams are encrypted using 
> AES. The PDF consistently uses the LF character for line endings. One of the 
> streams looks like this:
> {code}
> 10 0 obj
> <</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>
> stream
> ....<0x0D><0x0A>
> endstream
> {code}
> i.e. Length field is a reference to an object, in the content, the length 
> object is stored immediately after the stream as
> {code}
> 9 0 obj
> 2624
> endobj
> {code}
> The byte <0x0D> belongs to the stream and is not to be treated as line 
> separator in this case. The parser is not able to read the length field so it 
> manually searches for the stream end in the class EndstreamOutputStream. This 
> class searches both for the pair <0x0D><0x0A> and the single <0x0A>, so it 
> strips off the <0x0D> from this particular stream content. Since the stream 
> is encrypted, PDFBox runs into a BadPaddingException later on when trying to 
> decrypt the stream.
> The problem is reproducible using org.apache.pdfbox.PDFToImage in current 
> 1.8.14-SNAPSHOT. The same works fine in current PDFBox 2.0.x, presumably 
> because it uses the non-sequential parser by default.
> The proposed fix is to analyze the PDF content while reading it and search 
> for the CR character only if it was ever encountered as a line separator 
> prior to the stream being parsed.
> Note: I do not exactly know or understand the usage of the other classes 
> inherited from BaseParser, like PDFObjectStreamParser. Maybe the line ending 
> heuristic should be kept "as before" in these classes, by setting the new 
> field BaseParser.hasCR to true already in the constructor.
> A patch is attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (PDFBOX-3933) PDFParser swallows a CR at the end of a stream

Reply via email to