[jira] [Created] (PDFBOX-6023) Japanese fonts don't display correctly

Robert Amidon (Jira) Wed, 18 Jun 2025 08:36:40 -0700

Robert Amidon created PDFBOX-6023:
-------------------------------------

             Summary: Japanese fonts don't display correctly
                 Key: PDFBOX-6023
                 URL: https://issues.apache.org/jira/browse/PDFBOX-6023
             Project: PDFBox
          Issue Type: Bug
    Affects Versions: 3.0.5 PDFBox
            Reporter: Robert Amidon



This issue is related to 
https://issues.apache.org/jira/browse/PDFBOX-4572.

In some PDF files, there are non-embedded fonts that are defined using names 
that contain Japanese characters. The above linked issue provides an example.

One of these fonts is called "MS Mincho" in English, but in this PDF file it 
is specified by its Japanese name, "ＭＳ明朝".

The COSName object is encoded (I think, anyway) using Shift-JIS, but the 
existing PDFBox code decodes names as UTF-8. In this instance the bytes are 
not valid UTF-8, so the code falls back to decoding with windows-1252 instead.

(See 
[BaseParser::decodeBuffer|https://github.com/apache/pdfbox/blob/28d43126f0162fb56e30cd8134cc0c3246f2f23a/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java#L926])

That windows-1252 fallback produces gibberish for the font name: 
{code:java}
‚l‚r–¾’©{code}
Because this is a nonsense font name, PDFBox cannot identify the font when 
the text is drawn, so a different font is substituted. Most of the time it 
seems to use MS Gothic, which looks completely different from MS Mincho.
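
The decoding chain described above can be reproduced outside PDFBox. The 
following is a minimal sketch, not the PDFBox implementation: the class and 
method names are hypothetical, and the byte array is the Shift-JIS encoding of 
ＭＳ明朝 written out by hand on the assumption that this is what the PDF name 
contains.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only mirrors the fallback order described in the report.
class NameDecodingDemo {
    // Assumed raw name bytes: the Japanese name of MS Mincho encoded as Shift-JIS.
    static final byte[] RAW = {
        (byte) 0x82, 0x6C, (byte) 0x82, 0x72,
        (byte) 0x96, (byte) 0xBE, (byte) 0x92, (byte) 0xA9
    };

    // Strict decode: throws on malformed input instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Try UTF-8 first, then fall back to windows-1252.
    static String decodeWithFallback(byte[] bytes) {
        try {
            return decodeStrict(bytes, StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        // The fallback result: the same bytes read as windows-1252.
        System.out.println(decodeWithFallback(RAW));
        // The same bytes read as Shift-JIS.
        System.out.println(new String(RAW, Charset.forName("Shift_JIS")));
    }
}
```

The first line printed should match the garbled name shown above, and the 
second should be the Japanese font name.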

If decoded properly, the font name should be ＭＳ明朝. It may be difficult to 
determine under what circumstances it would be appropriate to re-decode this 
name, but in this case the {{PDCIDSystemInfo}} class provides a value from 
{{getOrdering()}} that contains {{Japan}}, which is a useful hint.
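
One way to use that hint is sketched below. This is an illustration only, not 
the change in the pull request: {{OrderingHintDecoder}} and {{decodeName}} are 
hypothetical names, the ordering is passed in as a plain string rather than 
read from {{PDCIDSystemInfo}}, and {{Japan1}} is an assumed example value. 
Whether {{Shift_JIS}} or a Windows variant of it is the better charset is also 
left open here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: picks a decoding for raw name bytes using the CID ordering as a hint.
class OrderingHintDecoder {
    // Assumed sample: the Japanese name of MS Mincho encoded as Shift-JIS.
    static final byte[] SAMPLE = {
        (byte) 0x82, 0x6C, (byte) 0x82, 0x72,
        (byte) 0x96, (byte) 0xBE, (byte) 0x92, (byte) 0xA9
    };

    // Returns the decoded string, or null if the bytes are not valid in the charset.
    static String tryDecode(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // ordering stands in for the value the report obtains from getOrdering(); it may be null.
    static String decodeName(byte[] raw, String ordering) {
        String utf8 = tryDecode(raw, StandardCharsets.UTF_8);
        if (utf8 != null) {
            return utf8;
        }
        // Only try Shift-JIS when the ordering suggests a Japanese font.
        if (ordering != null && ordering.contains("Japan")) {
            String sjis = tryDecode(raw, Charset.forName("Shift_JIS"));
            if (sjis != null) {
                return sjis;
            }
        }
        // Last resort, as in the existing fallback.
        return new String(raw, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(decodeName(SAMPLE, "Japan1")); // hint present
        System.out.println(decodeName(SAMPLE, null));     // no hint, falls back to windows-1252
    }
}
```

Keeping the UTF-8 attempt first means names that already decode cleanly are 
not affected by the hint.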

However, there is a second problem: PDFBox/FontBox caches information about 
the fonts available on the system and stores it in a file in the user's home 
directory called {{.pdfbox.cache}}. Each line of this file contains a 
|-separated list of font properties, such as the font name and font file path, 
like this:
{code:java}
AgencyFB-Bold|TTF||2bc|800|20000001|0|1|020b0804020202020204|C:\WINDOWS\FONTS\AGENCYB.TTF|1f332ce9|1707412592418
AgencyFB-Reg|TTF||190|800|20000001|0|0|020b0503020202020204|C:\WINDOWS\FONTS\AGENCYR.TTF|8f3c800d|1707412592418
Alef-Bold|TTF||2bc|0|200000b3|0|1|00000800000000000000|C:\WINDOWS\FONTS\Alef-Bold.ttf|65c72227|1734127038000
...
{code}
Note that the first item on each line in the snippet above, the name of the 
font, always uses Latin script. This is true for all the fonts cached in this 
file, including MS Mincho, which is the transliteration of ＭＳ明朝. PDFBox is 
actually capable of reading both names from the font file's naming table, but 
it discards all but the Latin name. This means that even if the encoding 
problem described above is solved, PDFBox still won't be able to resolve the 
correct font, because the cached name contains only Latin characters, not the 
Japanese characters used in the PDF file.
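
The lookup problem can be illustrated with a small sketch. It is not FontBox 
code: {{FontCacheSketch}} and its methods are hypothetical, the field positions 
(name first, file path tenth) are read off the sample lines above, and the MS 
Mincho cache line is a made-up placeholder, since the real entry is not shown 
in this report.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a name-to-path index built from cache lines.
class FontCacheSketch {
    private final Map<String, String> pathByName = new HashMap<>();

    // Splits on '|'; limit -1 keeps empty fields such as the third column.
    // Field 0 is the font name and field 9 the file path in the sample lines.
    void addCacheLine(String line) {
        String[] f = line.split("\\|", -1);
        pathByName.put(f[0], f[9]);
    }

    // Registers an additional name (for example, a localized one) for a cached font.
    void addAlias(String alias, String cachedName) {
        String path = pathByName.get(cachedName);
        if (path != null) {
            pathByName.put(alias, path);
        }
    }

    // Returns the file path for the name, or null if the name is unknown.
    String resolve(String name) {
        return pathByName.get(name);
    }

    public static void main(String[] args) {
        FontCacheSketch cache = new FontCacheSketch();
        // Placeholder entry: name spelling, path, and every other value are assumptions.
        cache.addCacheLine("MS-Mincho|TTF||0|0|0|0|0|0|C:\\WINDOWS\\FONTS\\msmincho.ttc|0|0");

        String japaneseName = "\uFF2D\uFF33\u660E\u671D"; // Japanese name of MS Mincho
        System.out.println(cache.resolve(japaneseName));  // only the Latin name is indexed

        cache.addAlias(japaneseName, "MS-Mincho");
        System.out.println(cache.resolve(japaneseName));  // now maps to the same file
    }
}
```

With only the Latin name indexed, the first lookup finds nothing. Once the 
Japanese name from the naming table is kept as an alias, the same lookup 
returns the font file.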

I have drafted some changes that address these problems in a Pull Request on 
GitHub. Using the PDFDebuggerApp and the {{sample_ja.pdf}} file, I can confirm 
that with these proposed changes PDFBox correctly resolves the font ＭＳ明朝 / 
MS Mincho, which is installed on my Windows computer via the Microsoft Japanese 
fonts pack.


