[ https://issues.apache.org/jira/browse/PDFBOX-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr updated PDFBOX-3933:
------------------------------------
    Component/s: Parsing

> PDFParser swallows a CR at the end of a stream
> ----------------------------------------------
>
>                 Key: PDFBOX-3933
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3933
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.13
>            Reporter: Petr Slaby
>         Attachments: Beispiel2.pdf, EndlinePrediction2.patch,
> EndlinePrediction.patch
>
>
> I have a PDF that I cannot share at the moment; maybe later, if I get
> permission from the customer.
> The PDF is protected by an empty password, and all streams are encrypted
> using AES. The PDF consistently uses the LF character for line endings.
> One of the streams looks like this:
> {code}
> 10 0 obj
> <</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>
> stream
> ....<0x0D><0x0A>
> endstream
> {code}
> i.e. the Length field is a reference to an object. In the file, the length
> object is stored immediately after the stream as
> {code}
> 9 0 obj
> 2624
> endobj
> {code}
> The byte <0x0D> belongs to the stream and must not be treated as a line
> separator in this case. The parser is not able to read the length field, so
> it searches for the end of the stream manually in the class
> EndstreamOutputStream. This class looks for both the pair <0x0D><0x0A> and
> the single <0x0A>, so it strips the <0x0D> from this particular stream's
> content. Since the stream is encrypted, PDFBox runs into a
> BadPaddingException later on when trying to decrypt the stream.
> The problem is reproducible using org.apache.pdfbox.PDFToImage in the current
> 1.8.14-SNAPSHOT. The same file works fine in the current PDFBox 2.0.x,
> presumably because it uses the non-sequential parser by default.
> The proposed fix is to analyze the PDF content while reading it and to treat
> CR as part of a line separator only if it has already been encountered as a
> line separator before the stream is parsed.
> Note: I do not know exactly how the other classes that inherit from
> BaseParser, such as PDFObjectStreamParser, are used. Maybe the line-ending
> heuristic should be kept "as before" in these classes by setting the new
> field BaseParser.hasCR to true in the constructor.
> A patch is attached.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
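
The heuristic proposed in the report can be illustrated with a minimal Java sketch. This is not the attached patch and not PDFBox code: the class EolTrimSketch and the method stripTrailingEol are hypothetical names, and the hasCR parameter only mirrors the BaseParser.hasCR field mentioned above. The sketch assumes the bytes between "stream" and "endstream" have already been collected, and it handles only the two cases the report names (CR LF and a single LF).

```java
import java.util.Arrays;

/**
 * Hypothetical sketch, not PDFBox code: decides how many trailing bytes to
 * drop from stream data that was delimited by scanning for "endstream".
 */
final class EolTrimSketch {

    private static final byte CR = 0x0D;
    private static final byte LF = 0x0A;

    /**
     * Returns the stream data without the end-of-line marker that precedes
     * "endstream".
     *
     * @param data  bytes found between the "stream" line and "endstream"
     * @param hasCR true if CR was seen as a line separator earlier in the
     *              file (assumed to be tracked by the caller)
     */
    static byte[] stripTrailingEol(byte[] data, boolean hasCR) {
        int len = data.length;
        if (len > 0 && data[len - 1] == LF) {
            len--; // the LF is the line ending before "endstream"
            // Treat a preceding CR as part of the line ending only if the
            // file is known to use CR; otherwise it is stream data.
            if (hasCR && len > 0 && data[len - 1] == CR) {
                len--;
            }
        }
        return Arrays.copyOf(data, len);
    }

    public static void main(String[] args) {
        byte[] raw = {0x41, 0x42, CR, LF};
        // Behaviour described in the report: CR is always stripped.
        System.out.println(stripTrailingEol(raw, true).length);  // prints 2
        // Proposed behaviour for an LF-only file: CR stays in the data.
        System.out.println(stripTrailingEol(raw, false).length); // prints 3
    }
}
```

Passing hasCR = true reproduces the behaviour described for EndstreamOutputStream, which is what the note about PDFObjectStreamParser suggests for keeping the other parsers "as before".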