[jira] [Comment Edited] (PDFBOX-3347) COSName parsing doesn't handle ISO-8859-1 encoded bytes

John Hewson (JIRA) Tue, 10 May 2016 12:04:08 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278628#comment-15278628
 ]


John Hewson edited comment on PDFBOX-3347 at 5/10/16 7:02 PM:
--------------------------------------------------------------

Note that parsing COSName as UTF-8, which we already do, is correct. mkl makes 
good points about how we compare and write out COSName (these should indeed be 
byte-based, like COSString is in 2.0). But that's not the issue here. The SO 
user is not correct in saying that the dictionary keys are supposed to be 
ISO-8859-1 encoded (though it looks like they are in this file). ISO-8859-1 is 
not used anywhere in the PDF spec.

Looking at the {{Krematorier}} field (26th in the Fields array) we see an 
appearance stream (AP > N > 1) with a raw name of {{/R#E5cksta}}. The # is a 
hex escape character and PDFBox should be parsing #E5 as U+00E5 which is {{å}}. 
However that's not happening.
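
To make the escape handling concrete, here is a minimal sketch of decoding {{#XX}} escapes in a raw name into bytes. It is not PDFBox code: the class {{NameEscapeSketch}} and its methods are hypothetical names chosen for illustration, and it assumes the raw name has already had its leading slash removed. It shows that {{#E5}} yields the single byte 0xE5, which maps to U+00E5 when read as ISO-8859-1 but is rejected by a strict UTF-8 decoder because no continuation bytes follow it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
public class NameEscapeSketch {

    // Turns "#XX" hex escapes in a raw name (leading '/' already removed) into bytes.
    // Assumes every other character of the raw name fits in a single byte.
    public static byte[] decodeRawName(String raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '#' && i + 2 < raw.length()) {
                int hi = Character.digit(raw.charAt(i + 1), 16);
                int lo = Character.digit(raw.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // one escaped byte
                    i += 2;                    // skip the two hex digits
                    continue;
                }
            }
            out.write(c); // literal character, written as its low 8 bits
        }
        return out.toByteArray();
    }

    // Reads the bytes as ISO-8859-1, where byte 0xE5 is U+00E5.
    public static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Strict check: a fresh decoder reports malformed input by throwing.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

With this sketch, {{decodeRawName("R#E5cksta")}} produces seven bytes, {{asLatin1}} on them gives {{Råcksta}}, and {{isValidUtf8}} returns false for the same bytes.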




was (Author: jahewson):
Note that parsing COSName as UTF-8, which we already do, is correct. mkl makes 
good points about how we compare and write out COSName (these should indeed be 
byte-based, like COSString is in 2.0). But that's not the issue here. The SO 
user is not correct in saying that the dictionary keys are ISO-8859-1 encoded. 
ISO-8859-1 is not used anywhere in the PDF spec.

Looking at the {{Krematorier}} field (26th in the Fields array) we see an 
appearance stream (AP > N > 1) with a raw name of {{/R#E5cksta}}. The # is a 
hex escape character and PDFBox should be parsing #E5 as U+00E5 which is {{å}}. 
However that's not happening.



> COSName parsing doesn't handle ISO-8859-1 encoded bytes
> -------------------------------------------------------
>
>                 Key: PDFBOX-3347
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3347
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Writing
>    Affects Versions: 1.8.12, 2.0.1, 2.0.2
>            Reporter: Maruan Sahyoun
>            Assignee: John Hewson
>            Priority: Minor
>             Fix For: 2.0.2
>
>
> As discussed here 
> http://stackoverflow.com/questions/36964496/pdfbox-2-0-overcoming-dictionary-key-encoding/
>  a byte sequence making up a COSName is interpreted during parsing and 
> writing where it shouldn't be. Details are given in mkl's excellent analysis.



