Robert Amidon created PDFBOX-6023:
-------------------------------------
Summary: Japanese fonts don't display correctly
Key: PDFBOX-6023
URL: https://issues.apache.org/jira/browse/PDFBOX-6023
Project: PDFBox
Issue Type: Bug
Affects Versions: 3.0.5 PDFBox
Reporter: Robert Amidon
This issue is similar/related to
https://issues.apache.org/jira/browse/PDFBOX-4572.
Some PDF files reference non-embedded fonts whose names contain Japanese
characters. The issue linked above provides an example. One of these fonts is
known as "MS Mincho" in English ("Mincho" being the transliteration), but in
this PDF file it is specified by its Japanese name, "MS明朝".
The COSName object is encoded (I think, anyway) using Shift-JIS, but existing
PDFBox code decodes names as UTF-8 (see
[BaseParser::decodeBuffer|https://github.com/apache/pdfbox/blob/28d43126f0162fb56e30cd8134cc0c3246f2f23a/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java#L926]).
In this instance the bytes are not valid UTF-8, so decoding fails with an
exception and the code falls back to windows-1252, which produces gibberish
for the font name:
{code:java}
‚l‚r–¾’©{code}
Because this is a nonsense font name, PDFBox cannot identify the font when the
text is drawn, so a different font is substituted. Most of the time it seems
to use MS Gothic, which looks completely different from MS Mincho.
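The mis-decoding described above can be reproduced with the plain JDK. The following is a minimal sketch, not PDFBox code: the class and method names are hypothetical, and the byte values are an assumption obtained by re-encoding the gibberish string as windows-1252 rather than taken from the PDF itself. Decoded as Shift-JIS, those bytes yield the font name with full-width "ＭＳ" letters (U+FF2D, U+FF33) followed by 明朝.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FontNameMojibakeDemo {
    // Assumed raw name bytes (recovered from the gibberish, not read from the PDF).
    static final byte[] NAME_BYTES = {
        (byte) 0x82, 0x6C, (byte) 0x82, 0x72,
        (byte) 0x96, (byte) 0xBE, (byte) 0x92, (byte) 0xA9
    };

    /** Strict UTF-8 first, then windows-1252, mirroring the fallback described above. */
    static String decodeWithFallback(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // 0x82 cannot start a UTF-8 sequence, so this branch is taken here.
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    /** Decodes the same bytes as Shift-JIS, the encoding the name appears to use. */
    static String decodeAsShiftJis(byte[] bytes) {
        return new String(bytes, Charset.forName("Shift_JIS"));
    }

    public static void main(String[] args) {
        System.out.println(decodeWithFallback(NAME_BYTES)); // the gibberish name
        System.out.println(decodeAsShiftJis(NAME_BYTES));   // the Japanese name
    }
}
```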
If decoded properly, the font name should be MS明朝. It may be difficult to
determine in general when it is appropriate to re-decode a name this way, but
in this case the {{PDCIDSystemInfo}} class provides a value from
{{getOrdering()}} that contains {{Japan}}, which is a useful hint.
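One way such a hint could be used is sketched below. This is only an illustration under assumptions, not the change in the pull request: {{OrderingHintedNameDecoder}} and {{decodeFontName}} are hypothetical names, the ordering value is passed in as a plain string (for example "Japan1"), and the decoder works on the raw name bytes instead of re-decoding an already decoded string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class OrderingHintedNameDecoder {
    /** Decodes strictly: malformed or unmappable input raises an exception. */
    private static String strictDecode(byte[] bytes, Charset charset)
            throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /**
     * Hypothetical helper: tries UTF-8, then Shift-JIS when the CID ordering
     * hints at Japanese, and finally falls back to windows-1252.
     */
    static String decodeFontName(byte[] rawName, String ordering) {
        try {
            return strictDecode(rawName, StandardCharsets.UTF_8);
        } catch (CharacterCodingException notUtf8) {
            // fall through to the hinted decoding
        }
        if (ordering != null && ordering.contains("Japan")) {
            try {
                return strictDecode(rawName, Charset.forName("Shift_JIS"));
            } catch (CharacterCodingException notShiftJis) {
                // fall through to the last-resort decoding
            }
        }
        return new String(rawName, Charset.forName("windows-1252"));
    }
}
```

Trying Shift-JIS only after strict UTF-8 fails, and only when the ordering mentions Japan, keeps existing behavior unchanged for every name that already decodes today.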
However, there is a second problem: PDFBox/FontBox code caches information
about the fonts available on the system in a file called {{.pdfbox.cache}} in
the user's home directory. Each line of this file is a |-separated list of
font properties, such as the font name and the font file path:
{code:java}
AgencyFB-Bold|TTF||2bc|800|20000001|0|1|020b0804020202020204|C:\WINDOWS\FONTS\AGENCYB.TTF|1f332ce9|1707412592418
AgencyFB-Reg|TTF||190|800|20000001|0|0|020b0503020202020204|C:\WINDOWS\FONTS\AGENCYR.TTF|8f3c800d|1707412592418
Alef-Bold|TTF||2bc|0|200000b3|0|1|00000800000000000000|C:\WINDOWS\FONTS\Alef-Bold.ttf|65c72227|1734127038000
...
{code}
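To make the layout of these lines concrete, here is a small sketch that splits one cache line into fields. The class name is hypothetical and the field positions (name at index 0, file path at index 9) are inferred from the sample lines above, not from a documented format.

```java
public class FontCacheLineDemo {
    /** Splits on '|' and keeps empty fields such as the third one in the samples. */
    static String[] splitFields(String line) {
        return line.split("\\|", -1);
    }

    /** First field: the font name (position inferred from the samples). */
    static String fontName(String line) {
        return splitFields(line)[0];
    }

    /** Tenth field: the font file path (position inferred from the samples). */
    static String fontPath(String line) {
        return splitFields(line)[9];
    }

    public static void main(String[] args) {
        String line = "AgencyFB-Bold|TTF||2bc|800|20000001|0|1|020b0804020202020204|"
                + "C:\\WINDOWS\\FONTS\\AGENCYB.TTF|1f332ce9|1707412592418";
        System.out.println(fontName(line));
        System.out.println(fontPath(line));
    }
}
```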
Note that the first item on each line in the snippet above, the name of the
font, uses Latin script. This is true for every font cached in this file,
including MS Mincho, which is the transliteration of MS明朝. PDFBox is actually
capable of reading both names from the font file's naming table, but it
discards all but the Latin name. This means that even if the encoding problem
described above is solved, PDFBox still won't be able to resolve the correct
font, because the cached name uses only Latin characters, not the Japanese
characters used in the PDF file.
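The idea of keeping every name instead of only the Latin one can be illustrated with a simple lookup table. This is a hypothetical sketch, not how FontBox stores its cache: {{FontAliasIndex}}, its methods, and the font file path are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FontAliasIndex {
    // Maps every known name of a font (Latin or localized) to its file path.
    private final Map<String, String> pathByName = new HashMap<>();

    /** Registers one font file under all names found for it. */
    void register(String fontPath, List<String> names) {
        for (String name : names) {
            pathByName.put(name, fontPath);
        }
    }

    /** Returns the font file path for a name, or null if the name is unknown. */
    String lookup(String name) {
        return pathByName.get(name);
    }

    public static void main(String[] args) {
        FontAliasIndex index = new FontAliasIndex();
        // "MS\u660E\u671D" is MS明朝; the path is a made-up example.
        index.register("C:\\fonts\\example-mincho.ttc",
                Arrays.asList("MS Mincho", "MS\u660E\u671D"));
        System.out.println(index.lookup("MS Mincho"));
        System.out.println(index.lookup("MS\u660E\u671D"));
    }
}
```

With such an index, a lookup by either the Latin or the Japanese name resolves to the same font file.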
I have drafted some changes that address both problems in a pull request on
GitHub. Using the PDFDebuggerApp and the {{sample_ja.pdf}} file, I can confirm
that with these proposed changes PDFBox correctly resolves the font MS明朝 /
MS Mincho, which is installed on my Windows computer via the Microsoft Japanese
fonts pack.