[jira] Updated: (PDFBOX-504) Can't Parse any PDF using IBM JDK

Jeremias Maerki (JIRA) Thu, 20 Aug 2009 04:53:42 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jeremias Maerki updated PDFBOX-504:
-----------------------------------

    Attachment: IBMJDKParseFix.diff

I've been able to reproduce the problem. Interestingly, it did not occur with 
an IBM JDK 1.5 on Windows; I could only reproduce it on Linux. On Linux, 
"new String(byte[])" omits non-mappable characters, whereas on Windows they 
are mapped to U+FFFD (Unicode REPLACEMENT CHARACTER). If I remember correctly, 
Sun JDKs replace them with question marks.
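
The three behaviours described above can be imitated on any JDK with an 
explicit CharsetDecoder. This is only a sketch for illustration: the class 
and method names are made up, and it simulates the platform differences 
rather than reproducing what the IBM or Sun default decoders actually do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class UnmappableByteDemo {

    // Second header line of a typical PDF: '%', four bytes >= 0x80, '\n'.
    static final byte[] HEADER = {
        '%', (byte) 0xE2, (byte) 0xE3, (byte) 0xCF, (byte) 0xD3, '\n'
    };

    /**
     * Decodes bytes as US-ASCII. Non-ASCII bytes are handled according to
     * the given action; a null replacement keeps the decoder default (U+FFFD).
     */
    static String decode(byte[] bytes, CodingErrorAction action, String replacement) {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        if (replacement != null) {
            decoder.replaceWith(replacement);
        }
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Only reachable with CodingErrorAction.REPORT.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String dropped = decode(HEADER, CodingErrorAction.IGNORE, null);   // bytes omitted
        String fffd = decode(HEADER, CodingErrorAction.REPLACE, null);     // U+FFFD
        String question = decode(HEADER, CodingErrorAction.REPLACE, "?"); // question marks
        // 6 input bytes yield 2, 6 and 6 chars respectively.
        System.out.println(dropped.length() + " " + fffd.length() + " " + question.length());
    }
}
```

Only the first variant changes the length of the result, which is why code 
that assumes one char per byte breaks in that case.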

However, the proposed patch doesn't look quite right to me. It unreads bytes 
that have been converted to a String and back, instead of the original bytes. 
Since we're only interested in characters found in the US-ASCII (7-bit) 
character set, I tried new String(byte[], "US-ASCII"), and that fixed the 
issue, too. Using the default encoding (new String(byte[])) is usually a bad 
idea because it can differ from system to system. The patch that I'm 
attaching here solves the problem in a better way IMO.
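
As a rough illustration of the approach described above (this is not the 
content of IBMJDKParseFix.diff): decode with an explicit US-ASCII charset for 
inspection, and unread the original bytes rather than bytes re-encoded from 
the String. The class name, the peek() helper and the sample data are 
assumptions made for this sketch; StandardCharsets needs Java 7 or later, 
on older JDKs the charset name "US-ASCII" would be passed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
final class AsciiPeek {

    // Sample data: PDF header, binary comment line, start of an object.
    static final byte[] SAMPLE =
        "%PDF-1.4\n%\u00E2\u00E3\u00CF\u00D3\n6 0 obj\n"
            .getBytes(StandardCharsets.ISO_8859_1); // one byte per char

    /**
     * Reads up to max bytes, decodes them as US-ASCII for inspection, and
     * pushes the ORIGINAL bytes back so the stream position is unchanged.
     * The stream's pushback buffer must hold at least max bytes.
     */
    static String peek(PushbackInputStream in, int max) throws IOException {
        byte[] buf = new byte[max];
        int n = in.read(buf, 0, max);
        if (n <= 0) {
            return "";
        }
        // With an explicit charset, each byte >= 0x80 becomes one U+FFFD
        // char, so the String length equals the byte count on any platform.
        String text = new String(buf, 0, n, StandardCharsets.US_ASCII);
        // Unread buf, not text.getBytes(): re-encoding would alter the data.
        in.unread(buf, 0, n);
        return text;
    }

    public static void main(String[] args) throws IOException {
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(SAMPLE), 16);
        String text = peek(in, 16);
        System.out.println(text.length());     // 16 chars for 16 bytes
        System.out.println((char) in.read());  // still '%', the first byte
    }
}
```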

If no one beats me to it, I'll commit my patch as shown after a 72h grace period.

> Can't Parse any PDF using IBM JDK
> ---------------------------------
>
>                 Key: PDFBOX-504
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-504
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: RedHat Linux IBM JDK
>            Reporter: Chris Bowditch
>            Priority: Critical
>         Attachments: ibm-parse-bug.patch, IBMJDKParseFix.diff, readable.pdf
>
>
> All PDFs (that I have tried) fail to parse using IBM JDK 1.5 on RedHat Linux. 
> The error you receive is:
> Exception in thread "main" java.io.IOException: Error: Expected an integer 
> type, actual='Ã£ÃÃ'
>         at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:493)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
>         at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
>         at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
>         at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)
> However, debugging shows that the actual error is hidden:
> java.io.IOException: Error: Expected an integer type, actual='ãÏÓ'
>         at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1220)
>         at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:483)
>         at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:736)
>         at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:704)
>         at org.apache.pdfbox.PDFReader.parseDocument(PDFReader.java:323)
>         at org.apache.pdfbox.PDFReader.openPDFFile(PDFReader.java:286)
>         at org.apache.pdfbox.PDFReader.main(PDFReader.java:271)
> The characters shown in the hidden message occur at the start of most PDF 
> Files that I have checked:
> %PDF-1.4
> %âãÏÓ
> 6 0 obj
> <</Filter /FlateDecode
> /Length 489
> >>
> stream
> Tracing the code, I can see the problem is down to the skipToNextObject() 
> method in the PDFParser class. This method is new since v0.7.4.
> The code converts an array of 16 bytes to a String. The bytes for âãÏÓ are 
> read as negative numbers on both Sun and IBM JDKs, but whereas on Sun the 
> String created from the byte array contains the characters, on the IBM JDK 
> they are missing from the String. So when you read back 16 characters, 
> the stream offset is incorrect.
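
The offset problem in the report can be illustrated with a small simulation. 
It is a sketch under assumptions: lossyDecode() is a hypothetical stand-in 
for a decoder that drops non-ASCII bytes (as reported for the IBM JDK on 
Linux), and the code is not taken from PDFParser.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class OffsetMismatch {

    // Start of a typical PDF, as in the report above.
    static final byte[] HEADER =
        "%PDF-1.4\n%\u00E2\u00E3\u00CF\u00D3\n6 0 obj\n"
            .getBytes(StandardCharsets.ISO_8859_1); // one byte per char

    /** Simulates a decoder that silently drops bytes outside 7-bit ASCII. */
    static String lossyDecode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            // Java bytes are signed, so 0xE2 is stored as -30.
            if (b >= 0) {
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    /** Position of the marker in the lossy String. */
    static int lossyIndex(String marker) {
        return lossyDecode(HEADER).indexOf(marker);
    }

    /** True byte offset of the marker (ISO-8859-1 keeps one char per byte). */
    static int byteIndex(String marker) {
        return new String(HEADER, StandardCharsets.ISO_8859_1).indexOf(marker);
    }

    public static void main(String[] args) {
        System.out.println(HEADER[10]);              // -30
        System.out.println(lossyIndex("6 0 obj"));   // 11
        System.out.println(byteIndex("6 0 obj"));    // 15
    }
}
```

The two indexes differ by four, the number of dropped bytes, so any stream 
position derived from the String is off by that amount.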

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
