[ https://issues.apache.org/jira/browse/PDFBOX-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278628#comment-15278628 ]
John Hewson edited comment on PDFBOX-3347 at 5/10/16 7:02 PM:
--------------------------------------------------------------

Note that parsing COSName as UTF-8, which we already do, is correct. mkl makes good points about how we compare and write out COSName (these should indeed be byte-based, like COSString is in 2.0). But that's not the issue here.

The SO user is not correct in saying that the dictionary keys are supposed to be ISO-8859-1 encoded (though it looks like they are in this file). ISO-8859-1 is not used anywhere in the PDF spec.

Looking at the {{Krematorier}} field (26th in the Fields array) we see an appearance stream (AP > N > 1) with a raw name of {{/R#E5cksta}}. The # is a hex escape character and PDFBox should be parsing #E5 as U+00E5 which is {{å}}. However that's not happening.

was (Author: jahewson):
    Note that parsing COSName as UTF-8, which we already do, is correct. mkl makes good points about how we compare and write out COSName (these should indeed be byte-based, like COSString is in 2.0). But that's not the issue here.

    The SO user is not correct in saying that the dictionary keys are ISO-8859-1 encoded. ISO-8859-1 is not used anywhere in the PDF spec.

    Looking at the {{Krematorier}} field (26th in the Fields array) we see an appearance stream (AP > N > 1) with a raw name of {{/R#E5cksta}}. The # is a hex escape character and PDFBox should be parsing #E5 as U+00E5 which is {{å}}. However that's not happening.
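To make the {{/R#E5cksta}} example concrete, here is a minimal Java sketch of resolving the {{#xx}} escape to a byte and then viewing that byte under two charsets. The class {{NameEscapeSketch}} and the method {{decodeNameBytes}} are hypothetical names invented for this illustration; they are not PDFBox API, and the sketch makes no claim about where PDFBox's parser actually goes wrong. It also assumes the raw name (without the leading slash) contains only ASCII characters apart from the escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class NameEscapeSketch {

    // Converts a raw PDF name (leading '/' already removed) into bytes,
    // resolving each #xx hex escape to the single byte it denotes.
    // A '#' that is not followed by two hex digits is copied through unchanged.
    static byte[] decodeNameBytes(String rawName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < rawName.length(); i++) {
            char c = rawName.charAt(i);
            if (c == '#' && i + 2 < rawName.length()) {
                int hi = Character.digit(rawName.charAt(i + 1), 16);
                int lo = Character.digit(rawName.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(c); // assumption: non-escaped characters are ASCII
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = decodeNameBytes("R#E5cksta");

        // Mapping each byte to the code point of the same value turns 0xE5 into U+00E5.
        String byteAsCodePoint = new String(bytes, StandardCharsets.ISO_8859_1);
        System.out.println(byteAsCodePoint.equals("R\u00E5cksta"));

        // A lone 0xE5 is not a complete UTF-8 sequence, so a strict UTF-8
        // decode of the same bytes cannot produce U+00E5.
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(asUtf8.indexOf('\u00E5') >= 0);
    }
}
```

The two decodes disagree because the escape yields the single byte 0xE5, while the UTF-8 encoding of U+00E5 is the two bytes 0xC3 0xA5. Keeping the name as raw bytes, as suggested above for comparison and writing, avoids having to choose a charset at that stage.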
> COSName parsing doesn't handle ISO-8859-1 encoded bytes
> -------------------------------------------------------
>
>                 Key: PDFBOX-3347
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3347
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Writing
>    Affects Versions: 1.8.12, 2.0.1, 2.0.2
>            Reporter: Maruan Sahyoun
>            Assignee: John Hewson
>            Priority: Minor
>             Fix For: 2.0.2
>
>
> As discussed here
> http://stackoverflow.com/questions/36964496/pdfbox-2-0-overcoming-dictionary-key-encoding/
> a byte sequence making up a COSName is interpreted during parsing and
> writing where it shouldn't. Details are given in mkl's excellent analysis.