[
https://issues.apache.org/jira/browse/PDFBOX-813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915503#action_12915503
]
Adam Nichols commented on PDFBOX-813:
-------------------------------------
Well, I can tell you why it can't be parsed: it's not a valid PDF. If you open
it and look at the bottom, you'll find that the trailer looks like this:
trailer
<<
/Size 41
/Root 2
There isn't even a newline or carriage return after that last "2". Since this
does not conform to Adobe's PDF specification, the way it should be handled
is undefined, so throwing an exception is not unreasonable.
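For comparison, a conforming classic trailer would normally continue with an
indirect reference for /Root (for example "2 0 R"), a closing ">>", a
"startxref" line with a byte offset, and a final "%%EOF" marker. As a rough
illustration, here is a minimal sketch of a pre-check for that shape. The
class TrailerCheck and its method names are hypothetical and not part of
PDFBox, and the check is only a heuristic: files that use cross-reference
streams have no "trailer" keyword and would be reported as incomplete.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class TrailerCheck {
    // Assumption: look for the end-of-file marker in the last 1024 bytes.
    private static final int TAIL_WINDOW = 1024;

    /** Returns true if "%%EOF" occurs within the last TAIL_WINDOW bytes. */
    static boolean hasEofMarker(byte[] pdfBytes) {
        int start = Math.max(0, pdfBytes.length - TAIL_WINDOW);
        String tail = new String(pdfBytes, start, pdfBytes.length - start,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    /**
     * Returns true if the last "trailer" keyword is followed by a ">>",
     * then a "startxref" keyword, and the file has an end-of-file marker.
     * This is a plain text scan, not a real parse of the dictionary.
     */
    static boolean hasCompleteTrailer(byte[] pdfBytes) {
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int trailer = text.lastIndexOf("trailer");
        if (trailer < 0) {
            return false;
        }
        int close = text.indexOf(">>", trailer);
        int startxref = text.indexOf("startxref", trailer);
        return close >= 0 && startxref > close && hasEofMarker(pdfBytes);
    }
}
```

A trailer that stops after "/Root 2", as in the attached file, fails both
checks because neither ">>" nor "%%EOF" is present.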
However, what is interesting is that if you replace PDDocument.load(inputpath,
true); with PDDocument.load(inputpath); or PDDocument.load(inputpath, false);
the exception is not thrown! I find this most interesting because force is
only passed into the parser object, where it is used just once, and there it
seems to be used to prevent an exception from being thrown.
I looked into this a little further and found that if forceParsing is false,
the exception your PDF throws is an IOException, and it's caught and basically
ignored by the code which handles invalid PDFs that have random data after the
EOF marker. If you are blindly loading a document (aka forcing the loading),
and that document is corrupt, you can't expect that enough information was
read to use it properly.
My suggestion would be to load documents without the force option, accept
that some non-conforming PDFs cannot be parsed, and have your code handle that
case accordingly. This message will hit the
developers mailing list and we will discuss the possibility of deprecating the
force option on the load() method. While it may have been accurate when it was
first introduced, I feel that it's misleading now that we handle so many
different things which are out-of-spec.
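One way to "handle that accordingly" might look like the sketch below. It
assumes Java 8 or later, and SafeLoad, Loader and tryLoad are hypothetical
names rather than PDFBox API. The loader is passed in as a callback so that
the load call and any follow-up reads (such as getDocumentCatalog()) sit
inside the same try block, since the ClassCastException reported in this
issue was thrown after load() had already returned.

```java
import java.io.IOException;
import java.util.Optional;

// Hypothetical wrapper for illustration only; not part of PDFBox.
final class SafeLoad {

    /** Hypothetical callback shape: loads a file and returns what was read. */
    interface Loader<T> {
        T load(String path) throws IOException;
    }

    /**
     * Runs the loader and returns its result, or Optional.empty() if the
     * file could not be parsed or turned out to be unusable afterwards.
     */
    static <T> Optional<T> tryLoad(String path, Loader<T> loader) {
        try {
            return Optional.ofNullable(loader.load(path));
        } catch (IOException e) {
            // The parser rejected the file outright.
            System.err.println("Could not parse " + path + ": " + e.getMessage());
            return Optional.empty();
        } catch (RuntimeException e) {
            // The file was read but left in an unusable state,
            // for example a ClassCastException on a malformed /Root entry.
            System.err.println("Unusable document " + path + ": " + e);
            return Optional.empty();
        }
    }
}
```

A caller could pass a lambda that calls PDDocument.load(path) without the
force flag, reads the page list, and closes the document, then treat an empty
result as "skip this file".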
> ClassCastException: COSInteger cannot be cast to COSDictionary
> --------------------------------------------------------------
>
> Key: PDFBOX-813
> URL: https://issues.apache.org/jira/browse/PDFBOX-813
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 1.2.1, 1.3.0
> Environment: Windows XP
> java version "1.6.0_12"
> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
> Java HotSpot(TM) Client VM (build 11.2-b01, mixed mode, sharing)
> Reporter: CP
> Priority: Critical
> Attachments: CancerSummReport_34914.pdf, PDFBoxBug.java
>
>
> I get the below exceptions when calling
> pdfDoc.getDocumentCatalog().getAllPages(). The code continues after the first
> exception because I've called
> PDDocument.load("C:/CancerSummReport_34914.pdf", true) setting the load
> "force" param to true. The second exception causes the code to abort.
> (I will try uploading the PDF that causes this problem)
> 2010-09-02 16:47:47,521 [main] WARN (PDFParser.java:189) - Parsing Error,
> Skipping Object
> java.io.IOException: Error: Expected an integer type, actual='bj'
> at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1310)
> at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:497)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:179)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:768)
> at com.xyz.framework.functionalTests.PDFBoxBug.main(PDFBoxBug.java:16)
> 2010-09-02 16:47:47,552 [main] WARN (BaseParser.java:215) - Invalid
> dictionary, found:? but expected:''
> Exception in thread "main" java.lang.ClassCastException:
> org.apache.pdfbox.cos.COSInteger cannot be cast to
> org.apache.pdfbox.cos.COSDictionary
> at
> org.apache.pdfbox.pdmodel.PDDocument.getDocumentCatalog(PDDocument.java:414)
> at com.xyz.framework.functionalTests.PDFBoxBug.main(PDFBoxBug.java:18)
--
This message is automatically generated by JIRA.