[jira] [Commented] (PDFBOX-3347) COSName parsing doesn't handle ISO-8859-1 encoded bytes

Michael Klink (JIRA) Tue, 10 May 2016 23:46:26 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15279670#comment-15279670
 ]


Michael Klink commented on PDFBOX-3347:
---------------------------------------

{quote}
which must be how iText is detecting that the encoding of this name is wrong
{quote}

Actually, iText is less sophisticated in this regard: it essentially assumes 
each byte of the name (after resolution of the #-sequences) to be the low 
byte of a code point U+00xx:

{code}
                while (true) {
                    ch = file.read();
                    if (delims[ch + 1])
                        break;
                    if (ch == '#') {
                        ch = (getHex(file.read()) << 4) + getHex(file.read());
                    }
                    outBuf.append((char)ch);
                }
{code}

i.e. it essentially treats the name as ISO-8859-1 encoded from the start.
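
To illustrate that equivalence, here is a minimal sketch (not iText or PDFBox code; the class and method names are hypothetical). It assumes the #-sequences have already been resolved into raw bytes and shows that appending each byte as a char gives the same string as decoding with the JDK's ISO-8859-1 charset:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of iText or PDFBox.
final class Latin1NameSketch {

    // Same effect as the quoted loop: every byte 0xNN becomes the char U+00NN.
    static String decodeByCast(byte[] nameBytes) {
        StringBuilder out = new StringBuilder(nameBytes.length);
        for (byte b : nameBytes) {
            // Java bytes are signed, so mask before widening to char.
            out.append((char) (b & 0xFF));
        }
        return out.toString();
    }

    // The same mapping expressed through the JDK's ISO-8859-1 decoder.
    static String decodeAsLatin1(byte[] nameBytes) {
        return new String(nameBytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = { 'N', 'a', 'm', (byte) 0xE9 }; // 0xE9 is U+00E9 in ISO-8859-1
        System.out.println(decodeByCast(raw).equals(decodeAsLatin1(raw)));
    }
}
```

The masking step is only needed here because the sketch starts from a signed byte[]; the quoted iText loop works on the int returned by read(), which is already in the range 0..255.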

----

That being said, falling back to ISO-8859-1 is one option, which roughly 
corresponds with PDF history ("In Acrobat 4.0 and earlier versions, a name 
object being treated as text will typically be interpreted in a host platform 
encoding"). Alternatively, a fallback to *PDFDocEncoding* would also make sense, 
especially considering 

{panel:title=ISO 32000-1 section _12.7.4.2.3 Check Boxes_}
name objects in the appearance dictionary are limited to *PDFDocEncoding*
{panel}

(Yes, this bit of the specification is at odds with the piece of it quoted on 
Stack Overflow, which recommended a UTF-8 interpretation...)

Considering, though, that earlier PDF references worded that restriction as 
"limited to the Latin character set", ISO-8859-1 might be more apropos.
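
A fallback of this kind could look roughly like the following sketch. It is not the PDFBox implementation: the class and method names are hypothetical, and it assumes the parser first attempts a strict UTF-8 interpretation of the name bytes (after #-sequence resolution) and only falls back when that fails. A PDFDocEncoding fallback would replace the ISO-8859-1 step with a PDFDocEncoding lookup table, which is not shown here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
final class NameFallbackSketch {

    // Decode strictly as UTF-8; on malformed input fall back to ISO-8859-1.
    static String decodeName(byte[] nameBytes) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                // REPORT makes the decoder throw instead of inserting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return utf8.decode(ByteBuffer.wrap(nameBytes)).toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to U+00xx, so this step cannot fail.
            return new String(nameBytes, StandardCharsets.ISO_8859_1);
        }
    }
}
```

With this sketch, the bytes 4E C3 A9 decode as UTF-8 to "N" followed by U+00E9, while the bytes 4E E9 are not valid UTF-8 and fall back to the same string via ISO-8859-1. The design weakness is that a byte sequence which happens to be valid UTF-8 is never read as ISO-8859-1, so the decoded text alone cannot reproduce the original bytes on writing.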

> COSName parsing doesn't handle ISO-8859-1 encoded bytes
> -------------------------------------------------------
>
>                 Key: PDFBOX-3347
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3347
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Writing
>    Affects Versions: 1.8.12, 2.0.1, 2.0.2
>            Reporter: Maruan Sahyoun
>            Assignee: John Hewson
>            Priority: Minor
>             Fix For: 2.0.2
>
>
> As discussed here 
> http://stackoverflow.com/questions/36964496/pdfbox-2-0-overcoming-dictionary-key-encoding/
>  a byte sequence making up a COSName is interpreted during parsing and 
> writing where it shouldn't be. Details are given in mkl's excellent analysis.



