[ 
https://issues.apache.org/jira/browse/PDFBOX-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223545#comment-14223545
 ] 

John Hewson edited comment on PDFBOX-2509 at 11/24/14 9:56 PM:
---------------------------------------------------------------

{quote}
To add a mapping PDCIDFontType0 would need to allow ttf font
{quote}

Yes, that's right, but the current design of PDFBox fonts isn't general enough
to allow this. The code assumes that a CIDFontType0 is CFF, and the handling
of the encodings and glyphs is tightly coupled to that format. Changing this
would be a lot of work, so it is something to think about for the future. One
major issue might be that CJK fonts aren't interchangeable across languages:
the same Unicode character might look very different in a Chinese vs a Korean
font, and it would be wrong to substitute one for the other. Many CJK fonts
use CFF, so it should be possible to install usable fonts.

{quote}
If i have TT CID not embedded and try to map to TTF used by evince, it doesnt 
fix issue.
{quote}

I think you might be wrong about that. The message "Missing CID-keyed font" 
relates to a missing CFF font, not a missing TrueType font.

{quote}
Adding /opt/Adobe fixes CFF fonts but TT CID fonts need some other fix
{quote}

That makes sense as "Missing CID-keyed font" relates to missing CFF fonts (i.e. 
OTF).
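
For anyone who wants to check whether a font file they installed is CFF-based
or TrueType, the first four bytes of an sfnt file say which: "OTTO" means
OpenType with CFF outlines, 0x00010000 (or Apple's "true") means TrueType
outlines, and "ttcf" is a collection. Below is a minimal sketch; the class
SfntSniffer is made up for illustration and is not part of PDFBox, and it says
nothing about whether PDFBox will actually pick the font up.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
class SfntSniffer {
    // Maps the 4-byte sfnt version tag to a human-readable description.
    static String describe(int tag) {
        switch (tag) {
            case 0x4F54544F: return "OpenType with CFF outlines (OTTO)";
            case 0x00010000: return "TrueType outlines";
            case 0x74727565: return "TrueType outlines (Apple 'true')";
            case 0x74746366: return "font collection (ttcf), check each member";
            default:         return "not an sfnt font, or unknown";
        }
    }

    // Reads the first four bytes of the file as a big-endian int.
    static int readTag(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            return in.readInt();
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more font file paths on the command line.
        for (String arg : args) {
            System.out.println(arg + ": " + describe(readTag(Paths.get(arg))));
        }
    }
}
```

A file reported as OTTO is the kind of font the "Missing CID-keyed font"
message is about; a TrueType result would explain why installing it doesn't
make the message go away.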

---

I did some research and found that all of the fonts which are missing in
the files above ship with Adobe Reader. Here's the list I found online:

- HeiseiKakuGo-W5 (Japanese)
- HeiseiMin-W3 (Japanese)
- MHei-Medium (Traditional Chinese)
- MSung-Light (Traditional Chinese)
- STSong-Light (Simplified Chinese)
- HYGoThic-Medium (Korean)
- HYSMyeongJo-Medium (Korean)

It also looks like there is more than one weight available for some of these 
fonts, but I couldn't find a definitive list.

So I'm wondering if PDFBox could ship with some built-in mappings for these
well-known fonts. Adobe doesn't actually ship all of these fonts with Acrobat:
it ships some substitutes and also relies on system fonts. For example, on my
Mac Acrobat substitutes HiraMinPro-W3 for HeiseiMin-W3; that is the
"Hiragino Mincho ProN W3" font, which [ships with OS
X|http://support.apple.com/en-us/HT202408].
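
To make the idea concrete, here is a rough sketch of what such a built-in
table might look like. Everything in it is hypothetical: CjkFontSubstitutes
and its methods are not existing PDFBox API. The language tags come from the
list above, and the only substitute entry is the one pairing I actually
observed in Acrobat on OS X; the other platforms' entries would still need to
be researched. The language check is there because substituting across CJK
languages would be wrong.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of PDFBox.
class CjkFontSubstitutes {
    // Language of each well-known Adobe CJK font, taken from the list above.
    static final Map<String, String> LANGUAGE = new LinkedHashMap<>();
    // Adobe font name -> substitute font name (only one known pairing so far).
    static final Map<String, String> SUBSTITUTE = new LinkedHashMap<>();

    static {
        LANGUAGE.put("HeiseiKakuGo-W5", "Japanese");
        LANGUAGE.put("HeiseiMin-W3", "Japanese");
        LANGUAGE.put("MHei-Medium", "Traditional Chinese");
        LANGUAGE.put("MSung-Light", "Traditional Chinese");
        LANGUAGE.put("STSong-Light", "Simplified Chinese");
        LANGUAGE.put("HYGoThic-Medium", "Korean");
        LANGUAGE.put("HYSMyeongJo-Medium", "Korean");

        // Observed in Acrobat on OS X; other entries are still to be found.
        SUBSTITUTE.put("HeiseiMin-W3", "HiraMinPro-W3");
    }

    // Returns the built-in substitute for an Adobe font name, if one is known.
    static Optional<String> substituteFor(String adobeName) {
        return Optional.ofNullable(SUBSTITUTE.get(adobeName));
    }

    // True only if the candidate font's language matches the missing font's.
    static boolean sameLanguage(String adobeName, String candidateLanguage) {
        return candidateLanguage.equals(LANGUAGE.get(adobeName));
    }

    public static void main(String[] args) {
        System.out.println(substituteFor("HeiseiMin-W3").orElse("(no mapping yet)"));
        System.out.println(substituteFor("MSung-Light").orElse("(no mapping yet)"));
        System.out.println(sameLanguage("HYGoThic-Medium", "Japanese"));
    }
}
```

A real implementation would presumably keep a separate table per platform
and try several candidates in order, since the fonts available differ
between OS X, Windows and Linux.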

For Linux we'll want to use free fonts, but we'd need a list of high-quality
common mappings. Perhaps [this
list|http://en.wikipedia.org/wiki/List_of_CJK_fonts] and [this other
list|https://wiki.archlinux.org/index.php/fonts#Japanese] could help? A good
starting point would be to see what Acrobat on Linux does: File > Properties >
Fonts shows which substitutes were made.


was (Author: jahewson):
{quote}
To add a mapping PDCIDFontType0 would need to allow ttf font
{quote}

Yes, that's right but the current design of PDFBox fonts isn't general enough 
to allow this. The code depends on a CIDFontType0 being CFF, and the handling 
of the encodings and glyphs is tightly coupled to that format. Any changes 
would be a lot of work, which is something to think about in the future. One 
major issue might be that CJK fonts aren't interchangeable across languages, so 
the same Unicode character might look very different in a Chinese vs a Korean 
font, and it would be wrong to substitute these. Most CJK fonts use CFF anyway, 
so there probably isn't much to gain from using TTF files.


> Korean Text wrong
> -----------------
>
>                 Key: PDFBOX-2509
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2509
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering
>    Affects Versions: 2.0.0
>            Reporter: simon steiner
>            Assignee: John Hewson
>             Fix For: 2.1.0
>
>         Attachments: pdfbox147.png, pdfbox238.png, pdfbox238_2.png, 
> pdfbox328.png
>
>
> http://acroeng.adobe.com/Test_Files/fonts/asian%20font%20files/Korean/nonembedded/K4SystemFontsNotEmbeded218.PDF
> and
> http://acroeng.adobe.com/Test_Files/fonts/asian%20font%20files/Korean/nonembedded/KGulimcheNotembeded218.PDF
> and
> http://acroeng.adobe.com/Test_Files/fonts/asian%20font%20files/Korean/nonembedded/VariousKFontsNotembeded218.PDF
> and
> http://acroeng.adobe.com/Test_Files/fonts//EmbeddedCmap.pdf
> and
> http://acroeng.adobe.com/Test_Files/fonts/asian%20font%20files/Japanese/nonembedded/Jun101.pdf
> and
> http://acroeng.adobe.com/Test_Files/fonts/asian%20font%20files/Japanese/nonembedded/ACPTJ_WIN_MSGothic.DOC.pdf
> java -jar ~/pdf-box-svn/app/target/pdfbox-app-2.0.0-SNAPSHOT.jar PDFToImage 
> K4SystemFontsNotEmbeded218.PDF


