IO exception parsing PDF with attachment

Carlos.Fernandez Thu, 21 Apr 2011 13:34:30 -0700

Using:
- PDFBox 1.5
- Java 1.6.0_20
- Linux RHEL 5 and Windows XP

I am not intimately familiar with the PDF spec and am unsure if my current 
problem lies with PDFBox or Acrobat 9 Pro.


I have two PDFs, PDF1 and PDF2.  PDFBox can successfully parse both PDFs.
- PDF1 is created directly with Acrobat 9 Pro and uses PDF Version 1.6.
- PDF2 was created with Acrobat Distiller 7.0 on Windows and uses PDF Version 
1.4.

I get an IOException when I parse a copy of PDF1 that contains PDF2 as an 
attachment.  The stack trace is:

Caused by: java.io.IOException: expected='>' actual='C'
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:327)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:996)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:521)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:859)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:826)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:797)
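
One reading of the message is that the parser was closing a dictionary, which ends with the two characters ">>", and found 'C' where it expected a '>'. The following is a minimal sketch of that kind of check, not PDFBox's actual code; the class and method names are hypothetical and only the message format is taken from the trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

class DictionaryEndCheck {
    /**
     * Hypothetical helper: consumes the two-byte dictionary terminator ">>"
     * and fails with a message in the same format as the trace above when
     * either byte is something else.
     */
    public static void expectDictionaryEnd(InputStream in) throws IOException {
        for (int i = 0; i < 2; i++) {
            int b = in.read();
            if (b != '>') {
                throw new IOException("expected='>' actual='" + (char) b + "'");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A well-formed terminator passes silently.
        expectDictionaryEnd(new ByteArrayInputStream(">>".getBytes("ISO-8859-1")));

        // A single '>' followed by other data fails on the second byte.
        try {
            expectDictionaryEnd(new ByteArrayInputStream(">C".getBytes("ISO-8859-1")));
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```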

As far as I can tell, when PDF2 is attached to PDF1 a dictionary entry (that 
appears to be related to PDF2) is mutated.  ANSI characters are replaced with 
what appears to be a hodgepodge of UTF-16 characters.

A standalone PDF2 contains the following entry:

37 0 obj
<</BitsPerComponent 8/ColorSpace 41 0 R/Filter/DCTDecode/Height 73/Length 
8875/Subtype/Image/Type/XObject/Width 570>>stream
...a ton of binary data...
endstream
endobj

Once PDF2 is attached to PDF1, PDF1 contains the following entry:

41 0 obj<</Subtype/Image/Length 8875/Filter/DCTDecode/BitsPerComponent 
8/ColorSpace 25 0 R/Width 57t–wTSÙ‡oBè1´`n Q"’|"ÒI@”€ô¦¡#¨tD° (-æ
QiÒÄ™ Š
]ª”‚À“ªÄAæ†™7¾õÖzûsöÚû®½¿uÎÝ¿uhšyúzû„ì9DÖ´á7°?qÆÏ“òã™ºõnk@éC@ž€G,
·°aà*€AŠŠ"¥ÄÅÄ¥d%H Y9Y9Œ42iFA«¨€Auu¤ÉdÊ‘¿
...additional lines of what appears to be binary data...
>C
...many more lines of binary data...
endstream
endobj
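
Entries like this can be located without a full parser by scanning the raw bytes from a dictionary's opening "<<" and reporting the first byte that is neither printable ASCII nor whitespace before the closing ">>" is reached. This is only a rough sketch under stated assumptions: it treats any such byte as suspect, it does not handle nested dictionaries or strings that contain ">>", and the class and method names are hypothetical.

```java
class DictionaryScan {
    /**
     * Looks for the first "<<" at or after 'from' and walks forward.
     * Returns -1 if no "<<" is found or if ">>" is reached before any
     * suspect byte, the offset of the first suspect byte otherwise, or
     * data.length if the data ends before the dictionary is closed.
     */
    public static int firstSuspectByte(byte[] data, int from) {
        int start = -1;
        for (int i = Math.max(from, 0); i + 1 < data.length; i++) {
            if (data[i] == '<' && data[i + 1] == '<') {
                start = i;
                break;
            }
        }
        if (start < 0) {
            return -1;
        }
        for (int i = start + 2; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b == '>' && i + 1 < data.length && data[i + 1] == '>') {
                return -1; // dictionary closed before anything suspect
            }
            boolean whitespace = b == ' ' || b == '\t' || b == '\r'
                    || b == '\n' || b == '\f' || b == 0;
            boolean printable = b >= 0x21 && b <= 0x7E;
            if (!whitespace && !printable) {
                return i;
            }
        }
        return data.length; // ran out of data inside the dictionary
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-ins for the two entries quoted above.
        byte[] good = "37 0 obj\n<</Width 570>>stream".getBytes("ISO-8859-1");
        byte[] bad = "41 0 obj<</Width 57t\u00D9wTS".getBytes("ISO-8859-1");

        System.out.println(firstSuspectByte(good, 0));
        System.out.println(firstSuspectByte(bad, 0));
    }
}
```

Running the same scan from each "obj" keyword in the failing file would show whether other dictionaries are affected or only this one.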

I can still open the copy of PDF1 that contains an attached PDF2 in Acrobat and 
analyze it with a few other tools, such as Appligent APGetInfo.  However, when 
I try to parse that file with PDFBox I receive the IOException.

This does not happen with every PDF that contains another PDF as an 
attachment.  However, I have identified two PDFs that seem to cause this 
problem when attached to any PDF I have tried.

Can anyone shed any more light on what might be causing this issue?

Best regards,

Carlos
