[jira] [Commented] (PDFBOX-3757) TTFSubsetter scrambles PostScript names and unicode codepoints when subset contains diaeresis

Tilman Hausherr (JIRA) Tue, 18 Apr 2017 09:27:02 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15973003#comment-15973003
 ]


Tilman Hausherr commented on PDFBOX-3757:
-----------------------------------------

The subsetter creates a bad "post" table. This doesn't occur in production 
because PDFBox doesn't use the "post" table, see TTFSubsetterTest.java and 
TrueTypeEmbedder.java.

This code reproduces your problem:
{code}
File file = new File("DejaVuSans.ttf");
TrueTypeFont ttf = new TTFParser().parse(file);
TTFSubsetter ttfSubsetter = new TTFSubsetter(ttf);
ttfSubsetter.add('Ö');
ttfSubsetter.add('\u200A');
ByteArrayOutputStream baos = new ByteArrayOutputStream();
ttfSubsetter.writeToStream(baos);
// look at the written file with DTL OTMaster Light 3.7
ttfSubsetter.writeToStream(new FileOutputStream("subset.ttf"));

try (TrueTypeFont subset = new TTFParser(true).parse(new ByteArrayInputStream(baos.toByteArray())))
{
    PostScriptTable pst = subset.getPostScript();
    System.out.println(Arrays.toString(pst.getGlyphNames()));
    for (int i = 0; i < pst.getGlyphNames().length; ++i)
    {
        System.out.println(i + ": " + pst.getName(i) + ", gid: " + subset.nameToGID(pst.getName(i)));
        System.out.println("  " + subset.getPath(pst.getName(i)).getBounds2D());
        if (subset.getGlyph().getGlyph(i) == null)
        {
            System.out.println("  none");
        }
        else
        {
            System.out.println("  " + subset.getGlyph().getGlyph(i).getPath().getBounds2D());
        }
    }
}
{code}
The output is:
{code}
[.notdef, O, Odieresis, Dieresis, uni200A]
0: .notdef, gid: 0
  java.awt.geom.Rectangle2D$Float[x=102.0,y=-362.0,w=1024.0,h=1806.0]
  java.awt.geom.Rectangle2D$Float[x=102.0,y=-362.0,w=1024.0,h=1806.0]
1: O, gid: 1
  java.awt.geom.Rectangle2D$Float[x=115.0,y=-29.0,w=1382.0,h=1549.0]
  java.awt.geom.Rectangle2D$Float[x=115.0,y=-29.0,w=1382.0,h=1549.0]
2: Odieresis, gid: 2
  java.awt.geom.Rectangle2D$Float[x=115.0,y=-29.0,w=1382.0,h=1899.0]
  java.awt.geom.Rectangle2D$Float[x=115.0,y=-29.0,w=1382.0,h=1899.0]
3: Dieresis, gid: 3
  java.awt.geom.Rectangle2D$Float[x=0.0,y=0.0,w=0.0,h=0.0]
  none
4: uni200A, gid: 4
  java.awt.geom.Rectangle2D$Float[x=-809.0,y=1294.0,w=594.0,h=203.0]
  java.awt.geom.Rectangle2D$Float[x=-809.0,y=1294.0,w=594.0,h=203.0] 
{code}
So Dieresis has no glyph :-(

My theory is that the bug is in buildPostTable(), because the rest (glyf, cmap) 
is correct. I tried generating PDFs and there are no problems.

The buildPostTable() method works with the GIDs (only those where the names are 
not part of WGL4Names) in GID order. Here uni200A comes first, and then 
Dieresis. But when writing the names, it uses the name order and now Dieresis 
comes first and then uni200A.
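
To illustrate the mismatch described above, here is a minimal sketch. It is not the actual buildPostTable() code: the class name, the roundTrip() helper and the three-entry list of predefined names are assumptions made up for this example (the real predefined list is much longer). It assigns name indices in GID order, writes the custom name strings in the iteration order of the given collection, and then reads the names back per GID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

public class PostNameOrderSketch
{
    // Hypothetical, shortened stand-in for the predefined glyph names.
    static final List<String> STANDARD_NAMES = Arrays.asList(".notdef", "O", "Odieresis");

    /**
     * Simulates writing and re-reading the glyph names.
     * namesByGid holds the correct name for each GID; pool is the (empty)
     * collection that will hold the custom names before they are written.
     */
    static List<String> roundTrip(List<String> namesByGid, Collection<String> pool)
    {
        // Step 1: one index per GID; custom names are numbered in GID order.
        List<Integer> indexByGid = new ArrayList<>();
        int nextCustom = STANDARD_NAMES.size();
        for (String name : namesByGid)
        {
            int standardIndex = STANDARD_NAMES.indexOf(name);
            if (standardIndex >= 0)
            {
                indexByGid.add(standardIndex);
            }
            else
            {
                indexByGid.add(nextCustom++);
                pool.add(name);
            }
        }

        // Step 2: the name strings are written in the pool's iteration order.
        List<String> written = new ArrayList<>(pool);

        // Step 3: a reader resolves each index against the written strings.
        List<String> resolved = new ArrayList<>();
        for (int index : indexByGid)
        {
            resolved.add(index < STANDARD_NAMES.size()
                    ? STANDARD_NAMES.get(index)
                    : written.get(index - STANDARD_NAMES.size()));
        }
        return resolved;
    }

    public static void main(String[] args)
    {
        // GID order in the subset: uni200A (GID 3) comes before Dieresis (GID 4).
        List<String> namesByGid =
                Arrays.asList(".notdef", "O", "Odieresis", "uni200A", "Dieresis");

        // A sorted set puts "Dieresis" before "uni200A", so the two names swap.
        System.out.println(roundTrip(namesByGid, new TreeSet<String>()));
        // prints [.notdef, O, Odieresis, Dieresis, uni200A]
    }
}
```

The printed list has the same order as the glyph names in the output above: GID 3 ends up labelled Dieresis although it is the empty hair-space glyph.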

I suspect this was introduced in rev 1645796 in PDFBOX-2565 as a refactoring 
within a bigger commit, to simplify the code. The old code used a list for the 
names and a map for the GIDs; the new code uses only a set of names, sorted by 
name.

So I'm using a LinkedHashMap, where the iteration order is the insertion order.
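
As a small illustration of that ordering guarantee (not the committed fix; the class name, the helper and the GID values 3 and 4 are assumptions for this example), the following compares a sorted map with a LinkedHashMap when the two names are inserted in GID order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InsertionOrderSketch
{
    // Inserts the two custom names in GID order and returns the key iteration order.
    static List<String> keysAfterInsert(Map<String, Integer> nameToGid)
    {
        nameToGid.put("uni200A", 3);   // hypothetical GID values
        nameToGid.put("Dieresis", 4);
        return new ArrayList<>(nameToGid.keySet());
    }

    public static void main(String[] args)
    {
        // TreeMap iterates in sorted key order.
        System.out.println(keysAfterInsert(new TreeMap<String, Integer>()));
        // prints [Dieresis, uni200A]

        // LinkedHashMap iterates in insertion order, which here is GID order.
        System.out.println(keysAfterInsert(new LinkedHashMap<String, Integer>()));
        // prints [uni200A, Dieresis]
    }
}
```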

> TTFSubsetter scrambles PostScript names and unicode codepoints when subset 
> contains diaeresis
> ---------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-3757
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3757
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 2.0.5
>            Reporter: Tobias Fischer
>            Assignee: Tilman Hausherr
>         Attachments: fontbox-2.0.5-ttfsubsetter_dieresis-scrambled-names.png, 
> fontbox-2.0.5-ttfsubsetter_scrambled-codepoints.png, 
> Subset-DejaVuSans__dieresis-scrambled-names.ttf, 
> Subset-DejaVuSans__scrambled-codepoints.ttf
>
>
> I tried to build a standalone FontSubsetter with the great fontbox tools. It 
> works so far for OpenType/TrueType fonts, but when the glyph subset contains 
> characters with diaeresis (like German umlauts äöü), the TTFSubsetter class 
> scrambles PostScript names and unicode codepoints.
> When creating a subset from DejaVuSans.ttf for example, with only those two 
> characters "Ö " (O umlaut and a hair space \u200A), the resulting font subset 
> is recognized as a valid font, but the unicode codepoint 200A in the 
> resulting font file has the PostScript name "Dieresis" and the standalone 
> dieresis glyph is named "uni200A". See screenshot 
> "fontbox-2.0.5-ttfsubsetter_dieresis-scrambled-names.png" and the subsetted 
> Font "Subset-DejaVuSans__dieresis-scrambled-names.ttf".
> When there are more glyphs in the subset, more whitespace, special chars and 
> umlauts, the scrambling goes even further and also scrambles unicode 
> codepoints and not only postscript names:
> glyphs in subset: "RabenköigKrmloEyGfthsTjHdAu cvFüD. w,äUp:IzWVZSN-ßLC 
> PB5M«»O2013Q©/;x978-()64XJ'!Ä?‹› ...ÜqY &Öé|_•{}[]>#*$^\\+"
> Resulting font: "Subset-DejaVuSans__scrambled-codepoints.ttf"
> Screenshot: "fontbox-2.0.5-ttfsubsetter_scrambled-codepoints.png"
> I consider this a bug, as it does not appear when there are no umlauts or 
> diaeresis in the subset.



