Using:
- PDFBox 1.5
- Java 1.6.0_20
- Linux RHEL 5 and Windows XP

I am not intimately familiar with the PDF spec and am unsure whether my current problem lies with PDFBox or Acrobat 9 Pro.

I have two PDFs, PDF1 and PDF2. PDFBox can successfully parse both PDFs.
- PDF1 was created directly with Acrobat 9 Pro and uses PDF version 1.6.
- PDF2 was created with Acrobat Distiller 7.0 on Windows and uses PDF version 1.4.

I get an IOException when I parse a copy of PDF1 that contains PDF2 as an attachment. The stack trace is:

    Caused by: java.io.IOException: expected='>' actual='C'
        at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:327)
        at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:996)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:521)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:859)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:826)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:797)

As far as I can tell, when PDF2 is attached to PDF1, a dictionary entry (that appears to be related to PDF2) is mutated. ANSI characters are replaced with what appears to be a hodgepodge of UTF-16 characters.

A standalone PDF2 contains the following entry:

    37 0 obj
    <</BitsPerComponent 8/ColorSpace 41 0 R/Filter/DCTDecode/Height 73/Length 8875/Subtype/Image/Type/XObject/Width 570>>stream
    ...a ton of binary data...
    endstream
    endobj

Once PDF2 is attached to PDF1, PDF1 contains the following entry:

    41 0 obj<</Subtype/Image/Length 8875/Filter/DCTDecode/BitsPerComponent 8/ColorSpace 25 0 R/Width 57t–wTSÙ‡oBè1´`n Q"’|"ÒI@”€ô¦¡#¨tD° (-æ QiÒÄ™ Š ]ª”‚À“ªÄA憙7¾õÖzûsöÚû®½¿uÎÝ¿uhšyúzû„ì9DÖ´á7°?qÆÏ“ò㙺õnk@éC@ž€G, ·°aà*€AŠŠ"¥ÄÅÄ¥d%H Y9Y9Œ42iFA«¨€Auu¤ÉdÊ‘¿
    ...additional lines of what appears to be binary data...
    >C
    ...many more lines of binary data...
    endstream
    endobj

Note that the dictionary is cut off partway through the /Width value (57 instead of 570) and is never closed with >>, and that the ">C" further down matches the characters reported in the exception message.

I can still open the copy of PDF1 that contains the attached PDF2 in Acrobat, and I can analyze it with a few other tools, such as Appligent APGetInfo. However, when I try to parse that file with PDFBox, I receive the IOException.

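To check whether the bytes after the object header are really altered on disk, rather than just rendered oddly by a text editor, the raw bytes can be dumped in hex. Below is a minimal sketch that uses only the JDK and no PDFBox calls. The class name ObjHeaderDump and its methods are hypothetical, and it assumes the object header (for example "41 0 obj") is visible uncompressed in the file, as the excerpt above suggests. If the attachment is stored in a compressed stream, the search will simply find nothing.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

// Hypothetical diagnostic helper, not part of PDFBox.
public class ObjHeaderDump {

    // Index of the first occurrence of needle in haystack at or after 'from', or -1.
    public static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // Hex plus printable-ASCII dump of up to 'count' bytes starting at 'offset'.
    public static String dump(byte[] data, int offset, int count) {
        StringBuilder sb = new StringBuilder();
        int end = Math.min(data.length, offset + count);
        for (int i = offset; i < end; i += 16) {
            StringBuilder hex = new StringBuilder();
            StringBuilder text = new StringBuilder();
            for (int j = i; j < Math.min(end, i + 16); j++) {
                int b = data[j] & 0xFF;
                hex.append(String.format("%02x ", b));
                text.append(b >= 0x20 && b < 0x7F ? (char) b : '.');
            }
            sb.append(String.format("%08x  %-48s %s%n", i, hex, text));
        }
        return sb.toString();
    }

    // Reads the whole file into memory (fine for a diagnostic on modest files).
    static byte[] readAll(String path) throws IOException {
        RandomAccessFile f = new RandomAccessFile(path, "r");
        try {
            byte[] bytes = new byte[(int) f.length()];
            f.readFully(bytes);
            return bytes;
        } finally {
            f.close();
        }
    }

    // Usage: java ObjHeaderDump <file.pdf> "<objNum> <genNum>"
    public static void main(String[] args) throws IOException {
        byte[] pdf = readAll(args[0]);
        byte[] marker = (args[1] + " obj").getBytes(Charset.forName("ISO-8859-1"));
        int pos = indexOf(pdf, marker, 0);
        while (pos >= 0) {
            // Skip matches that are the tail of a larger number, e.g. "141 0 obj".
            boolean insideNumber = pos > 0 && pdf[pos - 1] >= '0' && pdf[pos - 1] <= '9';
            if (!insideNumber) {
                System.out.println("Found at byte offset " + pos);
                System.out.print(dump(pdf, pos, 128));
            }
            pos = indexOf(pdf, marker, pos + 1);
        }
    }
}
```

Running it as java ObjHeaderDump PDF1-with-attachment.pdf "41 0" (the file name is a placeholder) prints every place the header occurs. Comparing that with the dump for "37 0" in the standalone PDF2 should show whether the dictionary really ends without a closing >> in the file itself.
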
This is certainly not the case with every PDF that contains another PDF as an attachment. However, I have identified two PDFs that seem to cause this problem when attached to any PDF I have tried. Can anyone shed any more light on what might be causing this issue?

Best regards,
Carlos