PDFBox can't parse PDF documents from jstor.org
-----------------------------------------------
Key: PDFBOX-506
URL: https://issues.apache.org/jira/browse/PDFBOX-506
Project: PDFBox
Issue Type: Bug
Reporter: Dave Engberg
Attachments: siegel.pdf
The academic repository JSTOR makes papers available as PDF files. The PDFs
report this origin information:
Content creator: JstorPdfGenerator v1.0
PDF producer: iText 2.0.6 (by lowagie.com)
These PDFs open fine in Acrobat, Preview, Foxit, etc., but PDFBox throws an
exception when loading them:
Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)
I traced through the code, and it appears that PDFBox rejects these files
because they contain a 'startxref' whose offset line is not followed by
'%%EOF'. Another object starts there instead:
...
startxref
613364
1 0 obj
...
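To check whether a given file has this layout, something like the following could be used. It is only a rough diagnostic sketch and not part of PDFBox: the class name StartXrefCheck and the method eofFollowsStartXref are made up for illustration, and the scan is a plain text search rather than a real PDF parse.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a PDFBox class.
public class StartXrefCheck {

    /**
     * For every "startxref" keyword found in the data, reports whether
     * "%%EOF" follows the numeric offset that comes after the keyword.
     */
    static List<Boolean> eofFollowsStartXref(byte[] pdfBytes) {
        // ISO-8859-1 maps each byte to exactly one char, so binary data is safe to scan.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<Boolean> results = new ArrayList<>();
        int pos = text.indexOf("startxref");
        while (pos >= 0) {
            int i = pos + "startxref".length();
            // Skip whitespace, then the offset digits, then whitespace again.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            while (i < text.length() && Character.isDigit(text.charAt(i))) i++;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            results.add(text.startsWith("%%EOF", i));
            pos = text.indexOf("startxref", pos + 1);
        }
        return results;
    }

    public static void main(String[] args) {
        // First startxref mimics the JSTOR layout; the second is the usual one.
        String sample = "startxref\n613364\n1 0 obj\n<<>>\nendobj\nstartxref\n42\n%%EOF\n";
        System.out.println(eofFollowsStartXref(sample.getBytes(StandardCharsets.ISO_8859_1)));
        // prints [false, true]
    }
}
```

A false entry means that particular startxref is followed by something other than %%EOF, which is the case that triggers the exception above.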
Here's a small patch that accepts files that are missing the %%EOF after the
startxref:
Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
===================================================================
--- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (revision 802578)
+++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java (working copy)
@@ -453,11 +453,9 @@
{
parseStartXref();
//verify that EOF exists
- String eof = readExpectedString( "%%EOF" );
- if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
- {
- throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
- " next=" +readString() );
+ int c = pdfSource.peek();
+ if (c == '%') {
+ readExpectedString("%%EOF");
}
isEndOfFile = true;
}
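The idea behind the patch (peek at the next byte and only insist on %%EOF when it is '%') can be tried outside PDFBox with a small standalone sketch. Everything here is an assumption for illustration: the class LenientEofCheck and its methods are hypothetical, a java.io.PushbackInputStream stands in for pdfSource, and the input is assumed to start right after the startxref offset with whitespace already skipped.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the patched parser logic, not PDFBox code.
public class LenientEofCheck {

    /**
     * Consumes "%%EOF" only if the next byte is '%'.
     * Returns true if the marker was consumed, false if it was absent.
     * Throws if a '%' starts something other than "%%EOF".
     */
    static boolean consumeOptionalEof(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c == -1) {
            return false;            // end of input, nothing to verify
        }
        in.unread(c);                // emulate peek(): put the byte back
        if (c != '%') {
            return false;            // e.g. "1 0 obj" follows, so keep parsing
        }
        for (byte expected : "%%EOF".getBytes(StandardCharsets.ISO_8859_1)) {
            int actual = in.read();
            if (actual != expected) {
                throw new IOException("Expected '%%EOF' but found unexpected byte " + actual);
            }
        }
        return true;
    }

    /** Convenience wrapper that turns the three outcomes into a label. */
    static String classify(String tail) {
        byte[] bytes = tail.getBytes(StandardCharsets.ISO_8859_1);
        try (PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(bytes))) {
            return consumeOptionalEof(in) ? "consumed" : "absent";
        } catch (IOException e) {
            return "malformed";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("%%EOF\n"));   // prints consumed
        System.out.println(classify("1 0 obj\n")); // prints absent
        System.out.println(classify("% comment")); // prints malformed
    }
}
```

Note that in this sketch a comment beginning with '%' at that position is still reported as an error. Whether that case needs separate handling in the real parser is a question for whoever reviews the patch.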