[
https://issues.apache.org/jira/browse/PDFBOX-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279670#comment-15279670
]
Michael Klink commented on PDFBOX-3347:
---------------------------------------
{quote}
which must be how iText is detecting that the encoding of this name is wrong
{quote}
Actually, iText is less sophisticated in this regard: it essentially assumes
each byte of the name (after resolution of the #-sequences) to be the lower
byte of U+00xx:
{code}
while (true) {
    ch = file.read();
    if (delims[ch + 1])
        break;
    if (ch == '#') {
        ch = (getHex(file.read()) << 4) + getHex(file.read());
    }
    outBuf.append((char)ch);
}
{code}
i.e. essentially it treats it as ISO-8859-1 encoded from the start.
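For illustration, the same behaviour can be sketched as a standalone method. This is only a minimal sketch, not iText or PDFBox code: the class name NameLatin1Sketch and both method names are hypothetical, and the input is assumed to be the raw name bytes already cut at the delimiter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of iText or PDFBox.
class NameLatin1Sketch {

    // Resolves #xx sequences in the raw bytes of a name (assumption: the
    // bytes have already been cut at the delimiter that ends the name).
    static byte[] resolveHexEscapes(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < raw.length; i++) {
            int b = raw[i] & 0xFF;
            if (b == '#' && i + 2 < raw.length) {
                int hi = Character.digit(raw[i + 1] & 0xFF, 16);
                int lo = Character.digit(raw[i + 2] & 0xFF, 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.write(b); // ordinary byte, or a '#' without two hex digits
        }
        return out.toByteArray();
    }

    // Maps every resolved byte 0xNN to U+00NN, i.e. decodes as ISO-8859-1.
    static String decodeAsLatin1(byte[] raw) {
        return new String(resolveHexEscapes(raw), StandardCharsets.ISO_8859_1);
    }
}
```

Unlike the loop quoted above, this sketch copies a '#' that is not followed by two hex digits through unchanged instead of interpreting whatever follows.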
----
That being said, falling back to ISO-8859-1 is one option which kind of
corresponds with PDF history ("In Acrobat 4.0 and earlier versions, a name
object being treated as text will typically be interpreted in a host platform
encoding"). Alternatively a fallback to *PDFDocEncoding* would also make sense,
especially considering
{panel:title=ISO 32000-1 section _12.7.4.2.3 Check Boxes_}
name objects in the appearance dictionary are limited to *PDFDocEncoding*
{panel}
(Yes, this bit of the specification is at odds with the piece of it quoted on
stackoverflow which recommended a UTF-8 interpretation...)
Considering, though, that earlier PDF references on that issue wrote "limited
to the Latin character set", ISO-8859-1 might be more apropos.
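As a rough sketch of the ISO-8859-1 fallback option, the bytes of a name (after #-resolution) could first be decoded strictly as UTF-8 and only reinterpreted if that fails. Trying UTF-8 first is an assumption taken from the UTF-8 recommendation mentioned above; the class and method names are hypothetical, and this is not presented as the actual PDFBox fix. A PDFDocEncoding fallback would need its own mapping table and is not shown.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of iText or PDFBox.
class NameFallbackSketch {

    // Decodes name bytes as UTF-8 if they are well-formed UTF-8,
    // otherwise falls back to ISO-8859-1 (byte 0xNN becomes U+00NN).
    static String decodeName(byte[] bytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: reinterpret every byte as Latin-1.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

A byte sequence that happens to be both valid UTF-8 and intended as ISO-8859-1 would still be read as UTF-8 here, so the fallback only catches names that cannot be UTF-8 at all.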
> COSName parsing doesn't handle ISO-8859-1 encoded bytes
> -------------------------------------------------------
>
> Key: PDFBOX-3347
> URL: https://issues.apache.org/jira/browse/PDFBOX-3347
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing, Writing
> Affects Versions: 1.8.12, 2.0.1, 2.0.2
> Reporter: Maruan Sahyoun
> Assignee: John Hewson
> Priority: Minor
> Fix For: 2.0.2
>
>
> As discussed here
> http://stackoverflow.com/questions/36964496/pdfbox-2-0-overcoming-dictionary-key-encoding/
> a byte sequence making up a COSName is interpreted during parsing and
> writing where it shouldn't. Details are given in mkl's excellent analysis.