[jira] [Commented] (PDFBOX-3933) PDFParser swallows a CR at the end of a stream

Tilman Hausherr (JIRA) Wed, 20 Sep 2017 08:59:37 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173397#comment-16173397
 ]


Tilman Hausherr commented on PDFBOX-3933:
-----------------------------------------

Tests for sequential parser went fine.

I decided not to make the change for 2.0.*. The code would be different: 
skipWhiteSpaces() is done in the base class thanks to refactoring, so we would 
have to either set some global variable (and break the "do one thing" rule) or 
copy the code into parseCOSStream(), which would be a step backwards. It is 
possible to create a file that would trigger the effect (e.g. by replacing the 
"L" in "/Length 9 0 R" in your file with "X"), but there is no such file in the 
wild.
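
A minimal sketch of how such a test file could be produced, assuming the 
stream dictionary is stored unencrypted and the key appears literally as 
"/Length 9 0 R". The class and method names are hypothetical and not part of 
PDFBox. Only one byte is overwritten, so the file size and all xref offsets 
stay the same.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not PDFBox code.
final class LengthKeySketch {

    private static final String KEY = "/Length 9 0 R";

    // Returns a copy of the PDF bytes in which the first "/Length 9 0 R"
    // reads "/Xength 9 0 R", so a parser can no longer find the length.
    static byte[] breakLengthKey(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char, so string
        // indices are identical to byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int idx = text.indexOf(KEY);
        byte[] out = pdf.clone();
        if (idx >= 0) {
            out[idx + 1] = (byte) 'X'; // idx points at '/', idx + 1 at 'L'
        }
        return out;
    }
}
```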

However, I'll add the test from Tim Allison to 2.0+trunk in a separate issue. 
(Adding it here would result in a misleading release note entry.)

> PDFParser swallows a CR at the end of a stream
> ----------------------------------------------
>
>                 Key: PDFBOX-3933
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3933
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.13
>            Reporter: Petr Slaby
>         Attachments: Beispiel2.pdf, EndlinePrediction2.patch, 
> EndlinePrediction.patch
>
>
> I have a PDF which I cannot share at the moment, maybe later if I get 
> permission from the customer. 
> The PDF is protected by an empty password, all streams are encrypted using 
> AES. The PDF consistently uses the LF character for line endings. One of the 
> streams looks like this:
> {code}
> 10 0 obj
> <</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>
> stream
> ....<0x0D><0x0A>
> endstream
> {code}
> i.e. the Length field is a reference to an object; in the file, that length 
> object is stored immediately after the stream as
> {code}
> 9 0 obj
> 2624
> endobj
> {code}
> The byte <0x0D> belongs to the stream and is not to be treated as a line 
> separator in this case. The parser is not able to read the length field, so 
> it manually searches for the stream end in the class EndstreamOutputStream. 
> This class searches both for the pair <0x0D><0x0A> and the single <0x0A>, so 
> it strips off the <0x0D> from this particular stream content. Since the 
> stream is encrypted, PDFBox runs into a BadPaddingException later on when 
> trying to decrypt the stream.
> The problem is reproducible using org.apache.pdfbox.PDFToImage in current 
> 1.8.14-SNAPSHOT. The same works fine in current PDFBox 2.0.x, presumably 
> because it uses the non-sequential parser by default.
> The proposed fix is to analyze the PDF content while reading it and search 
> for the CR character only if it was ever encountered as a line separator 
> prior to the stream being parsed.
> Note: I do not fully understand how the other classes that inherit from 
> BaseParser, like PDFObjectStreamParser, are used. Maybe the line ending 
> heuristic should be kept "as before" in these classes, by setting the new 
> field BaseParser.hasCR to true already in the constructor.
> A patch is attached.
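
The heuristic proposed in the quoted description can be sketched roughly as 
follows. This is an illustration only, not the attached patch and not the 
actual EndstreamOutputStream code: the class and method names are 
hypothetical, and only the flag name hasCR is taken from the description. The 
idea is that a CR in front of the final LF is stripped only if CR has already 
been seen as a line separator earlier in the file.

```java
import java.util.Arrays;

// Hypothetical illustration, not PDFBox code.
final class EolTrimSketch {

    // Strips the end-of-line marker that precedes "endstream".
    // hasCR: true if a CR was encountered as a line separator before
    // this stream was parsed.
    static byte[] trimTrailingEol(byte[] data, boolean hasCR) {
        int end = data.length;
        if (end > 0 && data[end - 1] == '\n') {
            end--; // the LF before "endstream" is a separator
            if (hasCR && end > 0 && data[end - 1] == '\r') {
                end--; // CR LF counts as one separator only if the file uses CR
            }
        } else if (hasCR && end > 0 && data[end - 1] == '\r') {
            end--; // lone CR used as separator
        }
        return Arrays.copyOf(data, end);
    }
}
```

For stream bytes ending in <0x0D><0x0A>, a call with hasCR set to false keeps 
the <0x0D> as part of the content, which is the behaviour the encrypted 
stream in this report needs. A call with hasCR set to true removes both 
bytes, which matches the old behaviour.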



