[
https://issues.apache.org/jira/browse/PDFBOX-6197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stefan Ziegler updated PDFBOX-6197:
-----------------------------------
Attachment: TTFSubsetter.java
> TTFSubsetter: add support for custom cmap subtables (addCustomCmapEntry /
> addCustomCmap)
> ----------------------------------------------------------------------------------------
>
> Key: PDFBOX-6197
> URL: https://issues.apache.org/jira/browse/PDFBOX-6197
> Project: PDFBox
> Issue Type: Improvement
> Components: FontBox
> Affects Versions: 3.0.7 PDFBox
> Reporter: Stefan Ziegler
> Priority: Major
> Attachments: TTFSubsetter.java
>
>
> *Summary*
> {{TTFSubsetter}} currently writes only a single Windows Unicode BMP cmap
> subtable (platform 3, encoding 1), and only when {{addAll()}} has been
> called. There is no API to inject additional cmap subtables. This makes it
> impossible to correctly re-subset TrueType fonts that rely on non-Unicode
> cmaps for rendering — in particular fonts produced by Ghostscript using its
> {{TT_BIAS=0xF000}} strategy.
> *Problem*
> Ghostscript embeds TrueType subsets in PDFs using a character code bias of
> {{0xF000}}. The resulting TTF contains two cmap subtables:
> * *Mac Roman (platform 1, encoding 0, format 6):* maps {{code N → glyph}}
> directly — this is the subtable viewers use for rendering
> * *Windows Symbol (platform 3, encoding 0, format 4):* maps {{0xF000+N →
> glyph}} — used by Windows-based viewers
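
The relationship between the two subtables described above can be modelled with plain maps. This is only an illustrative sketch, not PDFBox or Ghostscript code: the class name BiasedCmapModel and both method names are hypothetical, and the lookup order in resolve() is an assumption (real viewers differ in which subtable they consult first; when both subtables are consistent, the result is the same either way).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the two cmap views; not part of FontBox.
class BiasedCmapModel {
    // Character code bias described in the issue text
    static final int TT_BIAS = 0xF000;

    /** Derives the (3,0)-style map (0xF000+N -> glyph) from a (1,0)-style map (N -> glyph). */
    static Map<Integer, Integer> toWindowsSymbol(Map<Integer, Integer> macRoman) {
        Map<Integer, Integer> result = new LinkedHashMap<>();
        macRoman.forEach((code, gid) -> result.put(TT_BIAS + code, gid));
        return result;
    }

    /**
     * Resolves a content-stream code N to a glyph ID.
     * Tries the direct map first, then the biased map (assumed order).
     * Returns 0 (.notdef) when neither map has an entry, which is the
     * situation after both subtables have been lost.
     */
    static int resolve(int code, Map<Integer, Integer> macRoman,
                       Map<Integer, Integer> winSymbol) {
        Integer gid = macRoman.get(code);
        if (gid == null) {
            gid = winSymbol.get(TT_BIAS + code);
        }
        return gid == null ? 0 : gid;
    }
}
```

With both maps empty, resolve() returns 0 for every code, which corresponds to the blank rendering reported in this issue.
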
> When {{TTFSubsetter}} is used to re-subset such a font, both subtables are
> lost:
> # {{buildCmapTable()}} returns {{null}} when {{uniToGID}} is empty (i.e.
> when only {{addGlyphIds()}} was used)
> # Even when {{addAll()}} is called with PUA codepoints ({{U+F001}}
> etc.), only the Windows Unicode BMP subtable is written — not the Mac Roman
> subtable that viewers depend on for rendering
> # There is no API to inject additional cmap subtables with arbitrary
> platform/encoding combinations
> The result is a re-subsetted TTF where viewers cannot find glyphs for the
> character codes present in the PDF content stream, producing blank/missing
> glyph rendering for Thai, CJK, and other scripts that go through this
> Ghostscript encoding path.
> The patched {{TTFSubsetter.java}} is attached, based on the current
> development branch.
> *Usage Example*
>
> {code:java}
> // Assumes ttf (TrueTypeFont), usedCodes (the character codes used in the
> // content stream) and keepTables (List<String>) are defined by the caller.
> CmapSubtable macCmap = ttf.getCmap()
>         .getSubtable(CmapTable.PLATFORM_MACINTOSH,
>                      CmapTable.ENCODING_MAC_ROMAN);
> // Keep only codes that map to a real glyph (GID 0 is .notdef)
> Map<Integer, Integer> codeToGid = new LinkedHashMap<>();
> for (int code : usedCodes) {
>     int gid = macCmap.getGlyphId(code);
>     if (gid > 0) {
>         codeToGid.put(code, gid);
>     }
> }
> TTFSubsetter subsetter = new TTFSubsetter(ttf, keepTables);
> subsetter.addGlyphIds(new HashSet<>(codeToGid.values()));
> // Preserve Mac Roman cmap (primary rendering cmap)
> subsetter.addCustomCmap(CmapTable.PLATFORM_MACINTOSH,
>                         CmapTable.ENCODING_MAC_ROMAN,
>                         codeToGid);
> // Preserve Windows Symbol cmap (0xF000 bias)
> Map<Integer, Integer> winSymbolMap = new LinkedHashMap<>();
> codeToGid.forEach((code, gid) -> winSymbolMap.put(0xF000 + code, gid));
> subsetter.addCustomCmap(CmapTable.PLATFORM_WINDOWS,
>                         CmapTable.ENCODING_WIN_SYMBOL,
>                         winSymbolMap);
> {code}
>
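
One way to check which cmap subtables a re-subsetted font actually contains is to read the table directory directly. The following is a minimal sketch using only the JDK; the class name CmapLister and its method are hypothetical and not part of FontBox. It assumes a plain single-font TrueType file (not a TTC collection) and performs no bounds or checksum validation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper; not part of FontBox.
class CmapLister {
    /** Returns one "platform/encoding/formatN" string per cmap encoding record. */
    static List<String> listCmapSubtables(byte[] font) {
        ByteBuffer buf = ByteBuffer.wrap(font); // big-endian by default, as in TrueType
        int numTables = buf.getShort(4) & 0xFFFF; // offset table: numTables at byte 4
        List<String> result = new ArrayList<>();
        for (int i = 0; i < numTables; i++) {
            int rec = 12 + 16 * i; // 12-byte offset table, then 16-byte table records
            String tag = new String(font, rec, 4, StandardCharsets.US_ASCII);
            if (!tag.equals("cmap")) {
                continue;
            }
            int cmapOffset = buf.getInt(rec + 8); // table offset follows tag and checksum
            int numSubtables = buf.getShort(cmapOffset + 2) & 0xFFFF;
            for (int j = 0; j < numSubtables; j++) {
                int enc = cmapOffset + 4 + 8 * j; // 8-byte encoding records
                int platformId = buf.getShort(enc) & 0xFFFF;
                int encodingId = buf.getShort(enc + 2) & 0xFFFF;
                // Subtable offsets are relative to the start of the cmap table
                int subOffset = cmapOffset + buf.getInt(enc + 4);
                int format = buf.getShort(subOffset) & 0xFFFF;
                result.add(platformId + "/" + encodingId + "/format" + format);
            }
        }
        return result;
    }
}
```

Passing the bytes of the subsetter output (for example from Files.readAllBytes) should then list a 1/0 and a 3/0 entry if both subtables described in this issue were preserved.
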