[jira] Created: (PDFBOX-506) PDFBox can't parse PDF documents from jstor.org

Dave Engberg (JIRA) Wed, 19 Aug 2009 12:30:47 -0700

PDFBox can't parse PDF documents from jstor.org
-----------------------------------------------


                 Key: PDFBOX-506
                 URL: https://issues.apache.org/jira/browse/PDFBOX-506
             Project: PDFBox
          Issue Type: Bug
            Reporter: Dave Engberg
         Attachments: siegel.pdf

The academic repository JSTOR makes papers available in PDF format.  The PDFs 
report the following origin metadata:
  Content creator:  JstorPdfGenerator v1.0
  PDF producer:  iText 2.0.6 (by lowagie.com)

These PDFs open fine in Acrobat, Preview, Foxit, etc., but PDFBox throws an 
exception when loading them:

Exception in thread "main" java.io.IOException: Error: Expected to read '%%EOF' instead started reading '1'
        at org.apache.pdfbox.pdfparser.BaseParser.readExpectedString(BaseParser.java:1005)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:456)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:739)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:706)
        at org.apache.pdfbox.PDFDebugger.parseDocument(PDFDebugger.java:393)
        at org.apache.pdfbox.PDFDebugger.readPDFFile(PDFDebugger.java:369)
        at org.apache.pdfbox.PDFDebugger.main(PDFDebugger.java:355)


I traced through the code, and it appears that PDFBox rejects these files because 
they contain a 'startxref' whose offset line is followed by an object 
definition instead of a '%%EOF' marker:

...
startxref
613364
1 0 obj
...
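
To check whether a given file has this layout, something like the following 
standalone sketch could be used.  It does not use PDFBox at all; the class and 
method names (StartXrefCheck, findStartXrefWithoutEof) are made up for 
illustration, and it assumes the file is small enough to read into memory.  It 
reports the byte offset of every 'startxref' keyword whose offset line is not 
followed by '%%EOF':

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class StartXrefCheck {
    private static final String START_XREF = "startxref";
    private static final String EOF_MARKER = "%%EOF";

    /** Returns the offsets of every 'startxref' that is not followed by '%%EOF'. */
    static List<Integer> findStartXrefWithoutEof(byte[] pdfBytes) {
        // ISO-8859-1 maps one byte to one char, so string indexes equal byte offsets.
        String text = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        List<Integer> offsets = new ArrayList<>();
        int pos = text.indexOf(START_XREF);
        while (pos >= 0) {
            int i = skipWhitespace(text, pos + START_XREF.length());
            // Skip the decimal byte offset that follows the keyword.
            while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
                i++;
            }
            i = skipWhitespace(text, i);
            if (!text.startsWith(EOF_MARKER, i)) {
                offsets.add(pos);
            }
            pos = text.indexOf(START_XREF, pos + 1);
        }
        return offsets;
    }

    // Skips the PDF whitespace characters: NUL, TAB, LF, FF, CR and SPACE.
    private static int skipWhitespace(String text, int i) {
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c != 0 && c != '\t' && c != '\n' && c != '\f' && c != '\r' && c != ' ') {
                break;
            }
            i++;
        }
        return i;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        for (int offset : findStartXrefWithoutEof(bytes)) {
            System.out.println("startxref at byte " + offset + " is not followed by %%EOF");
        }
    }
}
```

Running it against the attached siegel.pdf (java StartXrefCheck siegel.pdf) 
should list the affected trailers, if the analysis above is right.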


Here's a small patch that will accept files that are missing the '%%EOF' 
marker after the startxref:


Index: src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java
===================================================================
--- src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java    (revision 802578)
+++ src/main/java/org/apache/pdfbox/pdfparser/PDFParser.java    (working copy)
@@ -453,11 +453,9 @@
             {  
                 parseStartXref();
                 //verify that EOF exists 
-                String eof = readExpectedString( "%%EOF" );
-                if( eof.indexOf( "%%EOF" )== -1 && !pdfSource.isEOF() )
-                {
-                    throw new IOException( "expected='%%EOF' actual='" + eof + "' next=" + readString() +
-                            " next=" +readString() );
+                int c = pdfSource.peek();
+                if (c == '%') {
+                    readExpectedString("%%EOF");
                 }
                 isEndOfFile = true; 
             }
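
For anyone who wants to try the idea outside the parser, here is a minimal 
standalone sketch of the same peek-then-read approach.  It uses 
java.io.PushbackInputStream in place of PDFBox's pdfSource, and the class and 
method names (LenientEofSketch, consumeOptionalEofMarker) are hypothetical.  It 
assumes the caller has already skipped any whitespace after the startxref 
offset:

```java
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the patch's logic, not PDFBox code.
class LenientEofSketch {
    /**
     * Consumes a "%%EOF" marker only if the next byte is '%'.
     * Returns true if the marker was read, false if the stream was left untouched.
     */
    static boolean consumeOptionalEofMarker(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c == -1) {
            return false; // end of input, nothing to consume
        }
        in.unread(c); // emulate peek(): put the byte back
        if (c != '%') {
            return false; // e.g. "1 0 obj" follows; let the caller keep parsing
        }
        byte[] expected = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        for (byte b : expected) {
            int actual = in.read();
            if (actual != b) {
                throw new IOException("expected '%%EOF' but read byte " + actual);
            }
        }
        return true;
    }
}
```

As in the patch, a missing marker is tolerated, while a '%' that starts 
anything other than '%%EOF' is still reported as an error.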


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
