Re: [iText-questions] Some questions about double byte characters and asian text

Y Fang Thu, 08 Oct 2009 17:50:36 -0700

Okay, thanks both of you for your replies. 

Makes me feel a bit safer about using IDENTITY_H for everything.

Cheers!

Date: Thu, 8 Oct 2009 09:15:04 -0700
From: [email protected]
To: [email protected]
Subject: Re: [iText-questions] Some questions about double byte characters and asian text

The only exception I can think of is Form Fields... but Acrobat will
automagically (automatic + magically, an English joke) change the
font/encoding as needed.

To further encourage Identity_H: if you have the same font in a PDF with two
different encodings, then you have two different copies of that font in your
PDF. Inefficient, and worth avoiding.

Another problem I have with non-Identity encodings is that they are
language-specific. There are GB encodings and JP encodings and Hebrew, and
Thai, and so on... but if you try to use a Thai character in a GB-based
encoding (even one of the Unicode versions), you simply won't get the
character[s] you want. They don't exist in the underlying encoding.
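
As a rough illustration of that point (a sketch only, not iText code): Java's
standard "GBK" charset stands in here for a GB-based encoding, which is an
analogy rather than the actual PDF CMap iText would use. `GbCoverage` and
`missing` are hypothetical names made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Illustration only: Java's GBK charset stands in for a GB-based encoding.
final class GbCoverage {

    // Counts the characters of text that have no mapping in the named charset.
    static int missing(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String chinese = "\u4F60\u597D"; // two Chinese characters
        String thai = "\u0E01\u0E02";    // two Thai characters
        System.out.println("Chinese missing from GBK: " + missing(chinese, "GBK"));
        System.out.println("Thai missing from GBK: " + missing(thai, "GBK"));
    }
}
```

The Chinese sample maps cleanly while the Thai sample has no slots at all in
the code page, which is the same kind of gap described above.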

As far as I can tell, this behavior is a vestigial organ (like your appendix)
left over from The Days Before Unicode (also known as: The Dark Ages).

I suppose we could generate a Unicode->Identity cmap for each font... a bit of
extra work (both CPU and dev), but the foundations are already laid in iText
hither and yon. It would certainly simplify the "toUnicode" map.

--Mark Storer 
  Senior Software Engineer 
  Cardiff.com

#include <disclaimer> 
typedef std::Disclaimer<Cardiff> DisCard; 

  -----Original Message-----
  From: Leonard Rosenthol [mailto:[email protected]]
  Sent: Thursday, October 08, 2009 7:16 AM
  To: Post all your questions about iText here
  Subject: Re: [iText-questions] Some questions about double byte characters and asian text

  Yes, use Identity_H for everything.

  Leonard

  From: Y Fang [mailto:[email protected]]
  Sent: Thursday, October 08, 2009 7:01 AM
  To: [email protected]
  Subject: [iText-questions] Some questions about double byte characters and asian text

  I've been looking at some of the font pages in the iText Tutorial here:
  http://itextdocs.lowagie.com/tutorial/ but there are two things which are
  confusing me regarding the writing of Asian characters.

  Firstly there is the explanation of the IDENTITY_H and IDENTITY_V encodings:

  "In the next example, we are passing the value IDENTITY_H as encoding.
  BaseFont.IDENTITY_H and BaseFont.IDENTITY_V are not really encodings. They
  indicate that the unicode character will be looked up in the font and stored
  as-is, taking two bytes of space. It's the only way to have Asian fonts and
  some encodings left out by Adobe such as Thai. For Europe or the
  Middle-East, it is better to use an available encoding that will store a
  single byte per character. Fonts with BaseFont.IDENTITY_H or
  BaseFont.IDENTITY_V will always be embedded no matter what you enter as third
  parameter."

  So working through the examples there, it seems I'd be using BaseFont.CP1252
  as the encoding for regular English text, and BaseFont.IDENTITY_H when I need
  to use double byte characters. So I tried to make a test PDF with some text,
  using the font "Microsoft YaHei" (which according to the Windows character
  map contains Chinese characters).

  First try:

  ---------------

  Font font = new Font(BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.CP1252, BaseFont.EMBEDDED), 12);

  document.Add(new Paragraph("Hello", font));

  document.Add(new Paragraph("你好", font));

  ---------------

  This gave a 133 KB PDF file where only the English "Hello" displayed (no
  Chinese text showed up under it).

  Second try:

  ---------------

  Font font = new Font(BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED), 12);

  document.Add(new Paragraph("Hello", font));

  document.Add(new Paragraph("你好", font));

  ---------------

  This gave a 34 KB PDF file where the English "Hello" displayed, and below it
  were two correct Chinese characters.

  In both cases I've asked for the font to be embedded, so why is the PDF
  created by the first try larger?

  In the second try, using IDENTITY_H caused both the English and Chinese text
  to show up. So is it fine to specify IDENTITY_H as the encoding even for
  normal English text? That is, as opposed to something like this:

  ---------------

  Font font1 = new Font(BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.CP1252, BaseFont.EMBEDDED), 12);

  document.Add(new Paragraph("Hello", font1));

  Font font2 = new Font(BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED), 12);

  document.Add(new Paragraph("你好", font2));

  ---------------

  The problem I'm trying to solve here is that I need to create something which
  can accept text input from a user. The input can be either English
  characters, or it could be in an Asian script such as Chinese. And that text
  needs to be written to a PDF. Is it okay to always use BaseFont.IDENTITY_H?
  If not, and I should use BaseFont.CP1252 for English text, is there any way
  to tell what kind of text input I'm receiving?

  For example, in the first try above, the Chinese text simply did not show up.
  Is there any way to check whether printing out a certain string with a
  certain font (in this case the two Chinese characters with msyh.ttf using
  CP1252) is going to work, and if not redo it using IDENTITY_H instead?
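
One way such a check could be sketched, using only the Java standard library
(Java being iText's original language; the C# code above would need the
equivalent .NET encoding calls). `EncodingProbe` and `fits` are hypothetical
names, not iText API, and the sketch only tests whether the characters exist
in the CP1252 code page. It says nothing about whether the font file itself
contains the glyphs.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of iText: reports whether every character
// of a string has a mapping in a given code page.
final class EncodingProbe {

    // True when each character in text can be encoded in the named charset.
    static boolean fits(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // The second sample is the two Chinese characters from the post.
        String[] samples = { "Hello", "\u4F60\u597D" };
        for (String s : samples) {
            // "windows-1252" is Java's name for the CP1252 code page.
            String choice = fits(s, "windows-1252") ? "CP1252" : "IDENTITY_H";
            System.out.println(s + " -> " + choice);
        }
    }
}
```

A check like this would only be needed if CP1252 were being kept for English
text; using IDENTITY_H throughout avoids the branch entirely.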

  It seems to me I should just use IDENTITY_H regardless of whether the input
  text I'm receiving is English writing or something like Chinese.

  The second thing is that this tutorial page:
  http://itextdocs.lowagie.com/tutorial/fonts/getting/index.php makes mention
  of using iTextAsian for CJK writing. When and why would you use that, as
  opposed to simply writing Asian text using IDENTITY_H and a font which
  contains Chinese (or Japanese, Korean, etc.) characters like Microsoft YaHei?

  If anyone could give some insight here, or just point me to some relevant
  documentation/information, it would be much appreciated.

  Thanks in advance.


