Hi Paulo,

Thank you for your attention.

Yes, we know this is NOT easy, but it should not be THAT difficult.

We've dug into PDFBox and found that the flow is clear:

1. The encoding "GBK-EUC-H" (specified by the font) has a corresponding CMap
(found in iTextAsianCmaps.jar);
2. Look up the raw char code "0xCEC4" in that CMap, which gives
"0x0ED3";
3. Look up the result of step 2 as a key in the
"Adobe-GB1-UCS2" CMap, which gives the Unicode value "0x6587".
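
To make the flow concrete, here is a rough sketch of the two lookups in Java.
The class and method names are made up for illustration (this is not PDFBox or
iText API), and the two maps hold only the single entry from our example
rather than the real CMap contents:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the two-stage lookup; not PDFBox or iText API.
final class TwoStageCMapLookup {
    // Stage 1: raw char code -> intermediate value, as "GBK-EUC-H" would provide.
    // Assumption: only the one entry from this thread is filled in.
    private final Map<Integer, Integer> codeToCid = new HashMap<>();
    // Stage 2: intermediate value -> Unicode, as "Adobe-GB1-UCS2" would provide.
    private final Map<Integer, Integer> cidToUnicode = new HashMap<>();

    TwoStageCMapLookup() {
        codeToCid.put(0xCEC4, 0x0ED3);
        cidToUnicode.put(0x0ED3, 0x6587);
    }

    /** Returns the Unicode value, or -1 if either lookup misses. */
    int decode(int rawCode) {
        Integer cid = codeToCid.get(rawCode);
        if (cid == null) {
            return -1;
        }
        Integer unicode = cidToUnicode.get(cid);
        return unicode == null ? -1 : unicode;
    }

    public static void main(String[] args) {
        int unicode = new TwoStageCMapLookup().decode(0xCEC4);
        System.out.printf("U+%04X%n", unicode); // prints U+6587
    }
}
```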

So here is what WE think iText can do to improve this (since we are not
experts on this, it's just a hint):

1. When creating a DocumentFont, take the font's encoding into account when
creating the cjkMirror, instead of dropping it (with a simple mapping). After
that, we can find the correct non-Unicode CMap;
2. Add these non-Unicode encoding entries to the cjkfonts.properties and
cjkencodings.properties files, so the cjkMirror can be created as a CJKFont;
3. Create another property file to map "GBK-EUC-H" to
"Adobe-GB1-UCS2", and likewise for the other encodings;
4. The CJKFont needs an extra translationMap to do the indirect mapping
(according to the property file created in step 3). Then the CJKFont has all
the information it needs to translate the raw char code to Unicode;
5. Pull up the CMapAwareDocumentFont.decode() method to DocumentFont, so fonts
have a chance to delegate this request to the cjkMirror (if any).
CMapAwareDocumentFont overrides this method and applies its own strategy only
when its parent can't handle the request.
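
The delegation in step 5 could look roughly like the sketch below. Everything
here is a simplified stand-in that we made up to show the idea: the Sketch*
types are not the real iText classes, the real decode() signature differs, and
returning null to mean "not handled" is only our assumption:

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical stand-in for the CJK mirror font; not the real iText CJKFont.
interface SketchCjkMirror {
    /** Returns the decoded text, or null if the raw code is unknown. */
    String decode(int rawCode);
}

// Hypothetical stand-in for DocumentFont with decode() pulled up into it.
class SketchDocumentFont {
    private final SketchCjkMirror cjkMirror; // may be null when there is no mirror

    SketchDocumentFont(SketchCjkMirror cjkMirror) {
        this.cjkMirror = cjkMirror;
    }

    /** Delegates to the cjkMirror if there is one; null means "not handled". */
    String decode(int rawCode) {
        return cjkMirror == null ? null : cjkMirror.decode(rawCode);
    }
}

// Hypothetical stand-in for CMapAwareDocumentFont.
class SketchCMapAwareDocumentFont extends SketchDocumentFont {
    // Stands in for the subclass's own decoding strategy.
    private final Map<Integer, String> ownMapping;

    SketchCMapAwareDocumentFont(SketchCjkMirror cjkMirror, Map<Integer, String> ownMapping) {
        super(cjkMirror);
        this.ownMapping = ownMapping;
    }

    @Override
    String decode(int rawCode) {
        // Give the parent (and its cjkMirror) the first chance.
        String fromParent = super.decode(rawCode);
        if (fromParent != null) {
            return fromParent;
        }
        // Fall back to our own strategy only when the parent can't handle it.
        return ownMapping.get(rawCode);
    }

    public static void main(String[] args) {
        SketchCjkMirror mirror = code -> code == 0xCEC4 ? "\u6587" : null;
        SketchCMapAwareDocumentFont font =
                new SketchCMapAwareDocumentFont(mirror, Collections.singletonMap(0x41, "A"));
        System.out.println(font.decode(0x41)); // handled by the fallback mapping
    }
}
```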

After that, we can decode those CJK chars (of non-unicode CMap) correctly.

Regards,
Mophy

--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/Problem-when-extracting-CJK-chars-from-PDF-files-tp3757883p3762061.html
Sent from the iText - General mailing list archive at Nabble.com.