[jira] [Commented] (PDFBOX-3933) PDFParser swallows a CR at the end of a stream

Tilman Hausherr (JIRA) Mon, 18 Sep 2017 22:58:18 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171152#comment-16171152
 ]


Tilman Hausherr commented on PDFBOX-3933:
-----------------------------------------

Please try this code with the file from PDFBOX-2079:
{code}
    public static void testEmbeddedZipFile() throws IOException
    {
        // There should be 17660 bytes in the zip file.
        // In PDFBox 1.8.5, a Windows newline is appended to the byte stream,
        // yielding 17662 bytes, which causes a problem for ZipFile in Java 1.6.
        PDDocument doc = PDDocument.load(new File("embedded_zip.pdf"));

        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        PDDocumentNameDictionary names = catalog.getNames();
        PDEmbeddedFilesNameTreeNode node = names.getEmbeddedFiles();
        int numAttach = 0;
        for (Map.Entry<String, COSObjectable> ent : node.getNames().entrySet())
        {
            PDComplexFileSpecification spec = (PDComplexFileSpecification) ent.getValue();
            PDEmbeddedFile file = spec.getEmbeddedFile();
            // Copy the embedded file to disk so that ZipFile can open it.
            InputStream input = file.createInputStream();
            File testZip = new File("testzip.zip");
            OutputStream os = new FileOutputStream(testZip);
            IOUtils.copy(input, os);
            os.flush();
            os.close();
            input.close();
            ZipFile zipFile = new ZipFile(testZip);
            System.out.println(testZip.length()); // expected: 17660
            System.out.println(zipFile.getName());
            zipFile.close();
            numAttach++;
        }
        doc.close();
    }
{code}
This never made it into the official tests for some reason, but I remember the
case because it was traumatizing at the time (LOL). It should print 17660, but
it prints 17661.

> PDFParser swallows a CR at the end of a stream
> ----------------------------------------------
>
>                 Key: PDFBOX-3933
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3933
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.14
>            Reporter: Petr Slaby
>         Attachments: EndlinePrediction.patch
>
>
> I have a PDF which I cannot share at the moment, maybe later if I get
> permission from the customer.
> The PDF is protected by an empty password, all streams are encrypted using 
> AES. The PDF consistently uses the LF character for line endings. One of the 
> streams looks like this:
> {code}
> 10 0 obj
> <</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>
> stream
> ....<0x0D><0x0A>
> endstream
> {code}
> i.e. the Length field is a reference to an object. In the file, the length
> object is stored immediately after the stream as
> {code}
> 9 0 obj
> 2624
> endobj
> {code}
> The byte <0x0D> belongs to the stream and is not to be treated as a line
> separator in this case. The parser is not able to read the length field, so it
> manually searches for the stream end in the class EndstreamOutputStream. This
> class searches both for the pair <0x0D><0x0A> and the single <0x0A>, so it 
> strips off the <0x0D> from this particular stream content. Since the stream 
> is encrypted, PDFBox runs into a BadPaddingException later on when trying to 
> decrypt the stream.
> The problem is reproducible using org.apache.pdfbox.PDFToImage in the current
> 1.8.14-SNAPSHOT. The same file works fine in the current PDFBox 2.0.x, presumably
> because it uses the non-sequential parser by default.
> The proposed fix is to analyze the PDF content while reading it and search 
> for the CR character only if it was ever encountered as a line separator 
> prior to the stream being parsed.
> Note: I do not exactly know or understand the usage of the other classes 
> inherited from BaseParser, like PDFObjectStreamParser. Maybe the line ending 
> heuristic should be kept "as before" in these classes, by setting the new 
> field BaseParser.hasCR to true already in the constructor.
> A patch is attached.
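
The description quoted above says that losing the final <0x0D> leads to a
decryption failure. The sketch below illustrates why a single missing byte
matters for AES, using only the JDK and no PDFBox code. The class name
TruncatedAesSketch, the all-zero key and IV, and the 2600-byte plaintext are
assumptions made for the demo; the plaintext size was picked so that the
16-byte IV plus the padded ciphertext adds up to 2624 bytes, the Length value
in the report. The exact exception PDFBox raises depends on how it feeds the
cipher, so this only shows that a stream shortened by one byte no longer
decrypts.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class TruncatedAesSketch
{
    // Demo key and IV only (all zeros). Real PDF encryption derives the key
    // from the password and stores a random IV in front of the stream data.
    static final byte[] KEY = new byte[16];
    static final byte[] IV = new byte[16];

    /** Encrypts with AES/CBC and PKCS#5 padding, as a stand-in for an AES-encrypted stream. */
    static byte[] encrypt(byte[] plain)
    {
        try
        {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
            return cipher.doFinal(plain);
        }
        catch (GeneralSecurityException e)
        {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the ciphertext decrypts without a security exception. */
    static boolean decrypts(byte[] cipherText)
    {
        try
        {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(KEY, "AES"), new IvParameterSpec(IV));
            cipher.doFinal(cipherText);
            return true;
        }
        catch (GeneralSecurityException e)
        {
            return false;
        }
    }

    public static void main(String[] args)
    {
        byte[] cipherText = encrypt(new byte[2600]); // padded to 2608 bytes
        byte[] truncated = Arrays.copyOf(cipherText, cipherText.length - 1);
        System.out.println(IV.length + cipherText.length); // 2624
        System.out.println(decrypts(cipherText));          // true
        System.out.println(decrypts(truncated));           // false
    }
}
```

With the last byte removed, the data is no longer a whole number of 16-byte
blocks, so the JDK cipher rejects it before any padding can be checked.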

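The heuristic proposed in the description can be sketched in isolation as
follows. This is a minimal illustration and not the attached patch: the class
EolStripSketch and the method stripTrailingEol are hypothetical names, and the
hasCR parameter only mirrors the idea behind the BaseParser.hasCR field
mentioned above. It assumes the end-of-line marker in front of "endstream" has
been captured together with the stream bytes.

```java
import java.util.Arrays;

final class EolStripSketch
{
    private static final byte CR = 0x0D;
    private static final byte LF = 0x0A;

    /**
     * Removes the end-of-line marker that precedes "endstream".
     * hasCR (hypothetical flag) is true if a CR was seen as a line separator
     * earlier in the file; only then is a trailing CR treated as part of the
     * marker. Otherwise a CR in front of the LF is kept as stream data.
     */
    static byte[] stripTrailingEol(byte[] data, boolean hasCR)
    {
        int end = data.length;
        if (end > 0 && data[end - 1] == LF)
        {
            end--;
            if (hasCR && end > 0 && data[end - 1] == CR)
            {
                end--;
            }
        }
        else if (hasCR && end > 0 && data[end - 1] == CR)
        {
            end--;
        }
        return Arrays.copyOf(data, end);
    }

    public static void main(String[] args)
    {
        byte[] raw = { 0x41, 0x42, CR, LF };
        System.out.println(stripTrailingEol(raw, false).length); // 3: CR kept as data
        System.out.println(stripTrailingEol(raw, true).length);  // 2: CR LF removed
    }
}
```

For a file that only ever uses LF, as in the report, hasCR stays false and the
<0x0D> at the end of the stream survives. The trade-off is that a file mixing
both line endings would still lose a genuine trailing CR.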


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

