Hi Paulo, Thank you for your attention.
Yes, we know this is NOT easy, but it should not be THAT difficult. We've dug into PDFBox and found that the flow is clear:

1. The encoding "GBK-EUC-H" (reported by the font) has a corresponding CMap (it lives in iTextAsianCmaps.jar);
2. Look up the raw char code "0xCEC4" in that CMap, and we get "0x0ED3";
3. Look up the result of step 2 as a key in the "Adobe-GB1-UCS2" CMap, and we get the Unicode value "0x6587".

So here is what WE think iText can do to improve this (we are not experts on this, so it's just a hint):

1. When creating a DocumentFont, use the font's actual encoding to create the cjkMirror, instead of dropping it in favor of a simple mapping. That way we can find the correct non-Unicode CMap;
2. Add these non-Unicode encoding entries to the cjkfonts.properties and cjkencodings.properties files, so the cjkMirror can be created as a CJKFont;
3. Create another property file that maps "GBK-EUC-H" to "Adobe-GB1-UCS2", and likewise for the other encodings;
4. The CJKFont needs an extra translationMap to do the indirect mapping (according to the property file created in step 3). Then the CJKFont has all the information it needs to translate the raw char code to Unicode;
5. Pull the CMapAwareDocumentFont.decode() method up into DocumentFont, so fonts get a chance to delegate the request to the cjkMirror (if any). CMapAwareDocumentFont overrides this method and applies its own strategy only when its parent can't handle the request.

After that, we can decode those CJK chars (of non-Unicode CMaps) correctly.

Regards,
Mophy

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Problem-when-extracting-CJK-chars-from-PDF-files-tp3757883p3762061.html
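
P.S. To make the lookup chain above concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: the class name TwoStepCmapSketch and the two in-memory maps are hypothetical stand-ins for the parsed "GBK-EUC-H" and "Adobe-GB1-UCS2" CMaps, and nothing here is actual iText or PDFBox API. The only data in it is the single example pair from the steps above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical sketch: none of these names exist in iText or PDFBox.
class TwoStepCmapSketch {
    // Stand-in for the "GBK-EUC-H" CMap: raw char code -> intermediate code (CID).
    private final Map<Integer, Integer> codeToCid = new HashMap<>();
    // Stand-in for the "Adobe-GB1-UCS2" CMap: CID -> Unicode value.
    private final Map<Integer, Integer> cidToUnicode = new HashMap<>();

    TwoStepCmapSketch() {
        // The one example pair from the message; a real table would be
        // loaded from the CMap resources instead of hard-coded.
        codeToCid.put(0xCEC4, 0x0ED3);
        cidToUnicode.put(0x0ED3, 0x6587);
    }

    // Chains the two lookups; returns empty if either table has no entry.
    OptionalInt decode(int rawCode) {
        Integer cid = codeToCid.get(rawCode);
        if (cid == null) {
            return OptionalInt.empty();
        }
        Integer unicode = cidToUnicode.get(cid);
        return unicode == null ? OptionalInt.empty() : OptionalInt.of(unicode);
    }

    public static void main(String[] args) {
        OptionalInt result = new TwoStepCmapSketch().decode(0xCEC4);
        // Prints U+6587 for the example code.
        System.out.println(result.isPresent()
                ? String.format("U+%04X", result.getAsInt())
                : "unmapped");
    }
}
```

Returning an empty result when either table misses is what would let a caller fall back to another strategy, in the spirit of step 5.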
iText-questions mailing list
https://lists.sourceforge.net/lists/listinfo/itext-questions
