[ 
https://issues.apache.org/jira/browse/PDFBOX-6023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Amidon updated PDFBOX-6023:
----------------------------------
    Description: 
This issue is similar/related to 
https://issues.apache.org/jira/browse/PDFBOX-4572.

Some PDF files reference non-embedded fonts whose names contain Japanese 
characters. The issue linked above provides an example.

One of these fonts is named "MS Mincho" in English, but in this PDF file it 
is specified by its Japanese name, "MS明朝".

The bytes of the COSName are encoded (I think, anyway) using Shift-JIS, but 
the existing PdfBox code decodes names as UTF-8. In this instance the bytes 
are not valid UTF-8, so the code falls back to decoding them as windows-1252 
instead.

(See 
[BaseParser::decodeBuffer|https://github.com/apache/pdfbox/blob/28d43126f0162fb56e30cd8134cc0c3246f2f23a/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/BaseParser.java#L926])

The UTF-8 attempt throws an exception on the illegal byte sequences, and the 
windows-1252 fallback then produces gibberish for the font name: 
{code:java}
‚l‚r–¾’©
{code}
Because this name is nonsense, PdfBox cannot identify the font when the text 
is drawn and substitutes a different one. Most of the time it seems to pick 
MS Gothic, which looks completely different from MS Mincho.
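
The fallback can be reproduced outside PdfBox. The sketch below is only an 
illustration, not the PdfBox code path. It assumes the name bytes are the 
Shift-JIS encoding of the name written with full-width Latin letters 
(ＭＳ明朝), since that byte sequence yields exactly the gibberish shown above. 
The class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: mimics a "strict UTF-8, then windows-1252" fallback.
class NameDecodingDemo {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes strictly as UTF-8 and falls back to windows-1252 on invalid input. */
    static String decodeWithFallback(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // windows-1252 maps these bytes to characters, so no error is raised here.
            return new String(bytes, WINDOWS_1252);
        }
    }

    public static void main(String[] args) {
        // Assumed original name: full-width "MS" plus the two kanji, as escapes.
        String original = "\uFF2D\uFF33\u660E\u671D";
        byte[] raw = original.getBytes(SHIFT_JIS);
        // Prints the mis-decoded name produced by the fallback.
        System.out.println(decodeWithFallback(raw));
    }
}
```

A name that is plain ASCII or valid UTF-8 passes through the first branch 
unchanged, which is why only names like this one are affected.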

If decoded properly, the font name would be MS明朝. It may be difficult to 
determine in general when it is appropriate to re-decode a name, but in this 
case {{getOrdering()}} on the {{PDCIDSystemInfo}} class returns a value that 
contains {{Japan}}, which is a useful hint.
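
One way to act on that hint is sketched below. This is an assumption about 
how such a check could look, not the change proposed for PdfBox. The class 
{{FontNameRedecoder}} and its method are hypothetical, the ordering is passed 
in as a plain String (in PdfBox it would come from {{getOrdering()}}), and the 
sketch assumes the name was mis-decoded as windows-1252 and was originally 
Shift-JIS.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of PdfBox.
class FontNameRedecoder {
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /**
     * Re-decodes a name as Shift-JIS when the CID ordering hints at Japanese.
     * Returns the name unchanged if the hint is absent or re-decoding fails.
     */
    static String redecodeIfJapanese(String name, String ordering) {
        if (ordering == null || !ordering.contains("Japan")) {
            return name;
        }
        try {
            // Undo the windows-1252 decoding to recover the raw bytes.
            ByteBuffer raw = WINDOWS_1252.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(name));
            // Decode the same bytes strictly as Shift-JIS.
            return SHIFT_JIS.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(raw)
                    .toString();
        } catch (CharacterCodingException e) {
            // Not representable in windows-1252 or not valid Shift-JIS: keep as is.
            return name;
        }
    }
}
```

With the gibberish name above and an ordering such as {{Japan1}}, this 
returns the Japanese name with full-width letters (ＭＳ明朝). A name that was 
already decoded correctly is left alone, because its kanji cannot be encoded 
as windows-1252.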

However, there is a second problem. PdfBox/FontBox caches information about 
the fonts available on the system in a file called {{.pdfbox.cache}} in the 
user's home directory. Each line of this file is a |-separated list of font 
properties, such as the font name and the font file path:
{code:java}
AgencyFB-Bold|TTF||2bc|800|20000001|0|1|020b0804020202020204|C:\WINDOWS\FONTS\AGENCYB.TTF|1f332ce9|1707412592418
AgencyFB-Reg|TTF||190|800|20000001|0|0|020b0503020202020204|C:\WINDOWS\FONTS\AGENCYR.TTF|8f3c800d|1707412592418
Alef-Bold|TTF||2bc|0|200000b3|0|1|00000800000000000000|C:\WINDOWS\FONTS\Alef-Bold.ttf|65c72227|1734127038000
...
{code}
Note that the first item on each line in the snippet above, the font name, 
uses Latin script. This is true for every font cached in this file, including 
MS Mincho, which is the transliteration of MS明朝. PdfBox can actually read 
both names from the font file's naming table, but it discards all but the 
Latin name. So even if the encoding problem described above is solved, PdfBox 
still cannot resolve the correct font: the cache records the name only in 
Latin characters, not in the Japanese characters used in the PDF file.
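
The effect of keeping only one name per font can be illustrated with a small 
lookup table. The sketch below is hypothetical and is not FontBox's cache 
implementation: {{FontNameIndex}} and its methods are invented names, the 
cache-line layout (name in the first field, file path in the tenth) is only 
read off the sample lines above, and the MS Mincho path in the usage note is 
a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical index from font name to font file path.
class FontNameIndex {
    private final Map<String, String> pathByName = new HashMap<>();

    /** Registers one font file under every name it is known by. */
    void register(String path, String... names) {
        for (String name : names) {
            pathByName.put(name, path);
        }
    }

    /** Registers a cache line, assuming field 0 is the name and field 9 the path. */
    void registerCacheLine(String line) {
        String[] fields = line.split("\\|", -1); // -1 keeps empty fields
        if (fields.length < 10) {
            throw new IllegalArgumentException("unexpected cache line: " + line);
        }
        register(fields[9], fields[0]);
    }

    /** Returns the path for an exact name match, or null if the name is unknown. */
    String lookup(String name) {
        return pathByName.get(name);
    }
}
```

If a font is registered only as {{register("msmincho.ttc", "MS Mincho")}}, a 
lookup for MS明朝 returns null. Registering the same file under both names, 
{{register("msmincho.ttc", "MS Mincho", "MS明朝")}}, lets either name resolve.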

  was:
I have drafted some changes that address these problems in a Pull Request on 
GitHub. Using the PDFDebuggerApp and the {{sample_ja.pdf}} file I can confirm 
that with these proposed changes, PdfBox now correctly resolves the font MS明朝 / 
MS Mincho, which is installed on my Windows computer via the Microsoft Japanese 
fonts pack.


> Japanese fonts don't display correctly
> --------------------------------------
>
>                 Key: PDFBOX-6023
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-6023
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 3.0.5 PDFBox
>            Reporter: Robert Amidon
>            Priority: Major


