[ 
https://issues.apache.org/jira/browse/PDFBOX-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Ziegler updated PDFBOX-6197:
-----------------------------------
    Description: 
*Summary*

{{TTFSubsetter}} currently writes only a single Windows Unicode BMP cmap 
subtable (platform 3, encoding 1), and only when {{addAll()}} has been called. 
There is no API to inject additional cmap subtables. This makes it impossible 
to correctly re-subset TrueType fonts that rely on non-Unicode cmaps for 
rendering, in particular fonts produced by Ghostscript using its 
{{TT_BIAS=0xF000}} strategy.

*Problem*

Ghostscript embeds TrueType subsets in PDFs using a character code bias of 
{{0xF000}}. The resulting TTF contains two cmap subtables:
 * *Mac Roman (platform 1, encoding 0, format 6):* maps {{code N → glyph}} 
directly; this is the subtable viewers use for rendering
 * *Windows Symbol (platform 3, encoding 0, format 4):* maps {{0xF000+N → 
glyph}}; used by Windows-based viewers
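
The relationship between the two subtables can be illustrated with a small, self-contained sketch. It uses plain maps rather than FontBox classes; the names {{BiasedCmapSketch}}, {{toWinSymbol}} and {{lookup}} are hypothetical, and the lookup order is a simplification, since viewers differ in which subtable they consult first.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class BiasedCmapSketch {
    /** Character code bias used for the Windows Symbol subtable. */
    static final int TT_BIAS = 0xF000;

    /** Derives the Windows Symbol (3,0) mapping from the Mac Roman (1,0) mapping. */
    static Map<Integer, Integer> toWinSymbol(Map<Integer, Integer> macRoman) {
        Map<Integer, Integer> winSymbol = new LinkedHashMap<>();
        macRoman.forEach((code, gid) -> winSymbol.put(TT_BIAS + code, gid));
        return winSymbol;
    }

    /**
     * Simplified lookup (an assumption, not any specific viewer's logic):
     * try the direct code first, then the biased code.
     * Returns 0 (.notdef) when neither subtable has an entry.
     */
    static int lookup(int code, Map<Integer, Integer> macRoman,
                      Map<Integer, Integer> winSymbol) {
        Integer gid = macRoman.get(code);
        if (gid == null) {
            gid = winSymbol.get(TT_BIAS + code);
        }
        return gid == null ? 0 : gid;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> macRoman = new LinkedHashMap<>();
        macRoman.put(1, 5);
        macRoman.put(2, 9);
        Map<Integer, Integer> winSymbol = toWinSymbol(macRoman);

        System.out.println(winSymbol);                       // {61441=5, 61442=9}
        System.out.println(lookup(2, macRoman, winSymbol));  // 9
        // With both subtables missing, every code resolves to .notdef
        Map<Integer, Integer> none = Collections.emptyMap();
        System.out.println(lookup(2, none, none));           // 0
    }
}
```

The last call mirrors the failure described in this issue: once both mappings are gone, no character code can be resolved to a glyph.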

When {{TTFSubsetter}} is used to re-subset such a font, both subtables are lost:
 # {{buildCmapTable()}} returns {{null}} when {{uniToGID}} is empty (i.e. when 
only {{addGlyphIds()}} was used)
 # Even when {{addAll()}} is called with PUA codepoints ({{U+F001}} etc.), 
only the Windows Unicode BMP subtable is written, not the Mac Roman subtable 
that viewers depend on for rendering
 # There is no API to inject additional cmap subtables with arbitrary 
platform/encoding combinations
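
An API for arbitrary platform/encoding combinations has to hold several subtables at once and emit their encoding records in a defined order: the OpenType specification requires the records in the cmap header to be sorted by platform ID, then by encoding ID. The following is a minimal sketch of one way to keep that state, not the code from the attached patch; the class {{CmapRegistrySketch}} and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CmapRegistrySketch {
    // Key = (platformId << 16) | encodingId, held as a long so that the
    // natural ordering sorts by platform ID first and encoding ID second.
    private final TreeMap<Long, Map<Integer, Integer>> subtables = new TreeMap<>();

    /** Registers (or replaces) the code-to-GID mapping for one subtable. */
    void add(int platformId, int encodingId, Map<Integer, Integer> codeToGid) {
        if (platformId < 0 || platformId > 0xFFFF
                || encodingId < 0 || encodingId > 0xFFFF) {
            throw new IllegalArgumentException("platform and encoding IDs are uint16");
        }
        long key = ((long) platformId << 16) | encodingId;
        subtables.put(key, new TreeMap<>(codeToGid)); // defensive, sorted copy
    }

    /** Returns "platform/encoding" labels in the order the records would be written. */
    List<String> recordOrder() {
        List<String> order = new ArrayList<>();
        for (long key : subtables.keySet()) {
            order.add((key >>> 16) + "/" + (key & 0xFFFF));
        }
        return order;
    }

    public static void main(String[] args) {
        CmapRegistrySketch registry = new CmapRegistrySketch();
        registry.add(3, 1, Collections.singletonMap(0x41, 3));   // Windows Unicode BMP
        registry.add(3, 0, Collections.singletonMap(0xF001, 5)); // Windows Symbol
        registry.add(1, 0, Collections.singletonMap(1, 5));      // Mac Roman
        System.out.println(registry.recordOrder());              // [1/0, 3/0, 3/1]
    }
}
```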

The result is a re-subsetted TTF where viewers cannot find glyphs for the 
character codes present in the PDF content stream, producing blank/missing 
glyph rendering for Thai, CJK, and other scripts that go through this 
Ghostscript encoding path.

The patched {{TTFSubsetter.java}} is attached, based on the current development 
branch.

*Usage Example*

 
{code:java}
// Requires java.util.* and org.apache.fontbox.ttf.{CmapSubtable, CmapTable, TTFSubsetter}.
// ttf, usedCodes and keepTables are supplied by the caller.
// addCustomCmap() is the method proposed in the attached patch.
CmapSubtable macCmap = ttf.getCmap()
        .getSubtable(CmapTable.PLATFORM_MACINTOSH, CmapTable.ENCODING_MAC_ROMAN);
if (macCmap == null) {
    throw new IllegalStateException("font has no Mac Roman cmap subtable");
}
// Collect the glyph IDs for the character codes actually used (GID 0 is .notdef)
Map<Integer, Integer> codeToGid = new LinkedHashMap<>();
for (int code : usedCodes) {
    int gid = macCmap.getGlyphId(code);
    if (gid > 0) {
        codeToGid.put(code, gid);
    }
}
TTFSubsetter subsetter = new TTFSubsetter(ttf, keepTables);
subsetter.addGlyphIds(new HashSet<>(codeToGid.values()));
// Preserve Mac Roman cmap (primary rendering cmap)
subsetter.addCustomCmap(CmapTable.PLATFORM_MACINTOSH,
                        CmapTable.ENCODING_MAC_ROMAN,
                        codeToGid);
// Preserve Windows Symbol cmap (0xF000 bias)
Map<Integer, Integer> winSymbolMap = new LinkedHashMap<>();
codeToGid.forEach((code, gid) -> winSymbolMap.put(0xF000 + code, gid));
subsetter.addCustomCmap(CmapTable.PLATFORM_WINDOWS,
                        CmapTable.ENCODING_WIN_SYMBOL,
                        winSymbolMap);
{code}
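
For context on what writing the Mac Roman subtable involves, here is a hedged sketch of serializing a code-to-GID map as a format 6 (trimmed table) cmap subtable. The field layout follows the OpenType specification; the class {{CmapFormat6Sketch}} is hypothetical and is not taken from the attached patch, which may build the subtable differently.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class CmapFormat6Sketch {
    /** Serializes the mapping as a format 6 subtable (all fields are big-endian uint16). */
    static byte[] write(Map<Integer, Integer> codeToGid) {
        if (codeToGid.isEmpty()) {
            throw new IllegalArgumentException("no mappings");
        }
        TreeMap<Integer, Integer> sorted = new TreeMap<>(codeToGid);
        int firstCode = sorted.firstKey();
        int lastCode = sorted.lastKey();
        if (firstCode < 0 || lastCode > 0xFFFF) {
            throw new IllegalArgumentException("codes must fit in uint16");
        }
        int entryCount = lastCode - firstCode + 1;
        int length = 10 + 2 * entryCount;      // 5 header fields + glyphIdArray
        if (length > 0xFFFF) {
            throw new IllegalArgumentException("code range too wide for format 6");
        }
        ByteBuffer buf = ByteBuffer.allocate(length); // big-endian by default
        buf.putShort((short) 6);               // format
        buf.putShort((short) length);          // length in bytes
        buf.putShort((short) 0);               // language (0 = not language-specific)
        buf.putShort((short) firstCode);       // first character code covered
        buf.putShort((short) entryCount);      // number of codes covered
        for (int code = firstCode; code <= lastCode; code++) {
            int gid = sorted.getOrDefault(code, 0); // gaps map to .notdef
            buf.putShort((short) gid);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        Map<Integer, Integer> codeToGid = new LinkedHashMap<>();
        codeToGid.put(1, 5);
        codeToGid.put(3, 9);
        System.out.println(Arrays.toString(write(codeToGid)));
        // [0, 6, 0, 16, 0, 0, 0, 1, 0, 3, 0, 5, 0, 0, 0, 9]
    }
}
```

Format 6 stores one entry for every code between the first and last mapped code, so it suits the dense, low-numbered codes of the Mac Roman subtable but not the sparse {{0xF000}}-biased range, for which the original font uses format 4.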
 


> TTFSubsetter: add support for custom cmap subtables (addCustomCmapEntry / 
> addCustomCmap)
> ----------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-6197
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6197
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: FontBox
>    Affects Versions: 3.0.7 PDFBox
>            Reporter: Stefan Ziegler
>            Priority: Major
>         Attachments: TTFSubsetter.java
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
