[jira] Created: (PDFBOX-504) Can't Parse any PDF using IBM JDK

Chris Bowditch (JIRA) Thu, 13 Aug 2009 03:39:40 -0700

Can't Parse any PDF using IBM JDK
---------------------------------

                 Key: PDFBOX-504
                 URL: https://issues.apache.org/jira/browse/PDFBOX-504
             Project: PDFBox
          Issue Type: Bug
          Components: Parsing
    Affects Versions: 0.8.0-incubator
         Environment: RedHat Linux IBM JDK
            Reporter: Chris Bowditch
            Priority: Critical



All PDFs that I have tried fail to parse using IBM JDK 1.5 on RedHat Linux. 
The error reported is:

Exception in thread "main" java.io.IOException: Error: Expected an integer 
type, actual='Ã£ÃÃ'
        at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:493)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
        at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
        at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
        at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)

However, debugging shows that the actual error is hidden:

java.io.IOException: Error: Expected an integer type, actual='ãÏÓ'
        at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
        at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:483)
        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
        at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
        at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
        at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)

The characters shown in the hidden message occur at the start of most PDF files 
that I have checked:

%PDF-1.4
%âãÏÓ
6 0 obj
<</Filter /FlateDecode
/Length 489
>>
stream

Tracing the code, I can see the problem is down to the skipToNextObject() 
method in the PDFParser class. This method is new since v0.7.4.

The code converts an array of 16 bytes to a String. The characters âãÏÓ are 
read as negative byte values on both the Sun and IBM JDKs. On the Sun JDK the 
String created from the byte array contains these characters, but on the IBM 
JDK they are missing from the String. The String is therefore shorter than the 
byte array, so when you read back 16 characters the stream offset is incorrect.
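
The mismatch described above can be illustrated with a small sketch. It is not 
PDFBox code: the class and method names are hypothetical, and it assumes 
(the report does not confirm this) that the byte array is converted with 
new String(byte[]), which decodes using the platform default charset and whose 
behaviour for bytes that are invalid in that charset is unspecified. Decoding 
with ISO-8859-1 instead maps every byte to exactly one char, so the String 
length always matches the byte count. StandardCharsets needs Java 7 or later; 
on JDK 1.5 the charset name "ISO-8859-1" would have to be passed instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only; not part of PDFBox.
final class ByteDecodingSketch {

    // The four bytes that follow '%' in the binary-marker comment "%âãÏÓ".
    // As Java bytes they are negative: -30, -29, -49, -45.
    static final byte[] MARKER = {
        (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3
    };

    // ISO-8859-1 maps each byte value 0x00-0xFF to exactly one char,
    // so the resulting String has the same length as the input array.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encoding back with the same charset recovers the original bytes.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Uses the platform default charset; the result depends on the JVM
        // and locale, which is the assumed source of the offset error.
        String viaDefault = new String(MARKER);
        String viaLatin1 = decodeLatin1(MARKER);

        System.out.println("bytes:                 " + MARKER.length);
        System.out.println("default charset chars: " + viaDefault.length());
        System.out.println("ISO-8859-1 chars:      " + viaLatin1.length());
    }
}
```

If the first two numbers printed differ on a given JVM, any code that uses the 
String length to rewind a byte stream will land at the wrong offset there.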

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
