[jira] Commented: (PDFBOX-504) Can't Parse any PDF using IBM JDK

Chris Bowditch (JIRA) Thu, 20 Aug 2009 06:09:40 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745437#action_12745437
 ]


Chris Bowditch commented on PDFBOX-504:
---------------------------------------

Thanks Jeremias. Your explanation makes sense. I knew there must have been a 
better way to fix this. I tried just about every encoding other than US-ASCII :)

> Can't Parse any PDF using IBM JDK
> ---------------------------------
>
>                 Key: PDFBOX-504
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-504
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: RedHat Linux IBM JDK
>            Reporter: Chris Bowditch
>            Priority: Critical
>         Attachments: ibm-parse-bug.patch, IBMJDKParseFix.diff, readable.pdf
>
>
> All PDFs that I have tried fail to parse with IBM JDK 1.5 on RedHat Linux. 
> The error is:
> Exception in thread "main" java.io.IOException: Error: Expected an integer 
> type, actual='Ã£ÃÃ'
>         at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:493)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
>         at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
>         at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
>         at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)
> Debugging shows that the actual error, which is otherwise hidden, is:
> java.io.IOException: Error: Expected an integer type, actual='ãÏÓ'
>         at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:483)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
>         at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
>         at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
>         at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)
> The characters shown in the hidden message occur at the start of most PDF 
> files that I have checked:
> %PDF-1.4
> %âãÏÓ
> 6 0 obj
> <</Filter /FlateDecode
> /Length 489
> >>
> stream
> Tracing the code, I can see that the problem lies in the skipToNextObject() 
> method of the PDFParser class, which is new since v0.7.4.
> The code converts an array of 16 bytes to a String. The bytes for the 
> characters âãÏÓ are read as negative numbers on both the Sun and IBM JDKs. 
> On the Sun JDK, the String created from the byte array contains these 
> characters; on the IBM JDK, they are missing from the String. So when you 
> read back 16 characters, the stream offset is incorrect.
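
The length mismatch described above can be illustrated with a minimal sketch. It is not the PDFBox code and not necessarily the fix that was applied: the class name ByteDecodingSketch, its methods, and the 16-byte sample are hypothetical, and ISO-8859-1 is used only because it maps every byte to exactly one char. It also relies on StandardCharsets, which is newer than the JDK 1.5 in the report.

```java
import java.nio.charset.StandardCharsets;

public class ByteDecodingSketch {

    // Hypothetical 16-byte sample modelled on the header quoted above:
    // "%âãÏÓ\n6 0 obj\n<<", with âãÏÓ as the bytes E2 E3 CF D3.
    static byte[] headerSample() {
        return new byte[] {
            '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n',
            '6', ' ', '0', ' ', 'o', 'b', 'j', '\n', '<', '<'
        };
    }

    // Decodes with the platform default charset. What happens to the high
    // bytes depends on the JVM and its default encoding, so the resulting
    // String length is not guaranteed to equal the byte count.
    static String decodePlatformDefault(byte[] raw) {
        return new String(raw);
    }

    // Decodes with ISO-8859-1, where each byte becomes exactly one char,
    // so String length and byte count always agree.
    static String decodeLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = headerSample();
        // In Java, bytes are signed, which is why 0xE2 shows up as a negative number.
        System.out.println("second byte as int: " + raw[1]);
        System.out.println("bytes read:         " + raw.length);
        System.out.println("default charset:    " + decodePlatformDefault(raw).length() + " chars");
        System.out.println("ISO-8859-1:         " + decodeLatin1(raw).length() + " chars");
    }
}
```

If the two char counts differ on a given JVM, any stream offset computed from the String length rather than from the byte count will be off by the difference.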

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
