[
https://issues.apache.org/jira/browse/PDFBOX-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16171152#comment-16171152
]
Tilman Hausherr commented on PDFBOX-3933:
-----------------------------------------
Please try this code with the file from PDFBOX-2079:
{code}
public static void testEmbeddedZipFile() throws IOException
{
    // There should be 17660 bytes in the zip file.
    // In PDFBox 1.8.5, a Windows newline is appended to the byte stream,
    // yielding 17662 bytes, which causes a problem for ZipFile in Java 1.6.
    PDDocument doc = PDDocument.load(new File("embedded_zip.pdf"));
    try
    {
        PDDocumentCatalog catalog = doc.getDocumentCatalog();
        PDDocumentNameDictionary names = catalog.getNames();
        PDEmbeddedFilesNameTreeNode node = names.getEmbeddedFiles();
        for (Map.Entry<String, COSObjectable> ent : node.getNames().entrySet())
        {
            PDComplexFileSpecification spec = (PDComplexFileSpecification) ent.getValue();
            PDEmbeddedFile file = spec.getEmbeddedFile();
            // Copy the embedded file to disk so that ZipFile can open it.
            File testZip = new File("testzip.zip");
            InputStream input = file.createInputStream();
            OutputStream os = new FileOutputStream(testZip);
            IOUtils.copy(input, os);
            os.close();
            input.close();
            // Print the extracted size (expected: 17660) and the zip file name.
            ZipFile zipFile = new ZipFile(testZip);
            System.out.println(testZip.length());
            System.out.println(zipFile.getName());
            zipFile.close();
        }
    }
    finally
    {
        doc.close();
    }
}
{code}
This never made it to the official tests for some reason, but I remember the
case because it was traumatizing at the time (LOL). It should print 17660, but
it prints 17661.
> PDFParser swallows a CR at the end of a stream
> ----------------------------------------------
>
> Key: PDFBOX-3933
> URL: https://issues.apache.org/jira/browse/PDFBOX-3933
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.14
> Reporter: Petr Slaby
> Attachments: EndlinePrediction.patch
>
>
> I have a PDF that I cannot share at the moment; I may be able to share it
> later if I get permission from the customer.
> The PDF is protected by an empty password, all streams are encrypted using
> AES. The PDF consistently uses the LF character for line endings. One of the
> streams looks like this:
> {code}
> 10 0 obj
> <</Length 9 0 R/Filter/FlateDecode/N 3/Range[0 1 0 1 0 1 ]>>
> stream
> ....<0x0D><0x0A>
> endstream
> {code}
> i.e. the Length field is a reference to an object. In the file, the length
> object is stored immediately after the stream as
> {code}
> 9 0 obj
> 2624
> endobj
> {code}
> The byte <0x0D> belongs to the stream and must not be treated as a line
> separator in this case. The parser is not able to read the length field, so it
> manually searches for the stream end in the class EndstreamOutputStream. This
> class searches both for the pair <0x0D><0x0A> and the single <0x0A>, so it
> strips off the <0x0D> from this particular stream content. Since the stream
> is encrypted, PDFBox runs into a BadPaddingException later on when trying to
> decrypt the stream.
> The problem is reproducible using org.apache.pdfbox.PDFToImage in current
> 1.8.14-SNAPSHOT. The same works fine in current PDFBox 2.0.x, presumably
> because it uses the non-sequential parser by default.
> The proposed fix is to analyze the PDF content while reading it and search
> for the CR character only if it was ever encountered as a line separator
> prior to the stream being parsed.
> Note: I do not fully understand how the other classes that inherit from
> BaseParser, like PDFObjectStreamParser, are used. Maybe the line ending
> heuristic should be kept "as before" in these classes, by setting the new
> field BaseParser.hasCR to true already in the constructor.
> A patch is attached.
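
To make the heuristic described in the issue concrete, here is a minimal,
self-contained sketch of the two trimming strategies. It is not the code of
EndstreamOutputStream and not the attached patch; the class and method names
(EolTrimSketch, trimAlways, trimWithHint) are hypothetical, and only the flag
name hasCR is taken from the description. It assumes the bytes read up to the
"endstream" keyword are already available as an array.

```java
import java.util.Arrays;

class EolTrimSketch
{
    // Behaviour as described in the issue: a trailing CR LF pair or a
    // single trailing LF is always treated as the end-of-line marker.
    static byte[] trimAlways(byte[] data)
    {
        int end = data.length;
        if (end >= 2 && data[end - 2] == 0x0D && data[end - 1] == 0x0A)
        {
            end -= 2;
        }
        else if (end >= 1 && data[end - 1] == 0x0A)
        {
            end -= 1;
        }
        return Arrays.copyOf(data, end);
    }

    // Proposed idea: strip a CR only if CR was seen as a line separator
    // earlier in the file (hasCR); otherwise it is kept as stream content.
    static byte[] trimWithHint(byte[] data, boolean hasCR)
    {
        int end = data.length;
        if (end >= 1 && data[end - 1] == 0x0A)
        {
            end -= 1;
            if (hasCR && end >= 1 && data[end - 1] == 0x0D)
            {
                end -= 1;
            }
        }
        return Arrays.copyOf(data, end);
    }

    public static void main(String[] args)
    {
        // Two payload bytes, a CR that belongs to the payload, then the LF line ending.
        byte[] raw = { 0x01, 0x02, 0x0D, 0x0A };
        System.out.println(trimAlways(raw).length);          // CR is stripped too
        System.out.println(trimWithHint(raw, false).length); // CR is kept
        System.out.println(trimWithHint(raw, true).length);  // CR is stripped
    }
}
```

With hasCR set to false, the trailing <0x0D> survives, so the stream keeps the
length its length object declares. Setting hasCR to true reproduces the
previous behaviour, which is what the note about PDFObjectStreamParser
suggests for the other parser classes.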