Re: [iText-questions] Some questions about double byte characters and asian text

Mark Storer Thu, 08 Oct 2009 09:13:46 -0700

The only exception I can think of is Form Fields... but Acrobat will 
automagically (automatic + magically, an English joke) change the font/encoding 
as needed.
 
To further encourage Identity_H: if you have the same font in a PDF with two 
different encodings, then you have two different copies of that font in your 
PDF.  Inefficient, and worth avoiding.
 
Another problem I have with non-Identity encodings is that they are 
language-specific.  There are GB encodings and JP encodings and Hebrew, and 
Thai, and so on... but if you try to use a Thai character in a GB-based 
encoding (even one of the Unicode versions), you simply won't get the 
character[s] you want.  They don't exist in the underlying encoding.
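
As a rough analogy for that coverage gap, here is a minimal Java sketch that asks the JDK's GBK charset whether it can represent a Chinese string and a Thai character. It assumes the JDK's charset tables are a fair stand-in for a GB-based encoding; they are not the PDF CMaps that iText actually uses, and the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of iText: checks whether a named JDK charset
// can represent every character of a string.
class GbkCoverage {
    static boolean canEncode(String charsetName, String text) {
        // canEncode returns false as soon as one character has no mapping.
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String chinese = "\u4F60\u597D"; // U+4F60 U+597D, the two Chinese characters from the question
        String thai = "\u0E01";          // U+0E01, THAI CHARACTER KO KAI

        System.out.println("GBK covers the Chinese text: " + canEncode("GBK", chinese));
        System.out.println("GBK covers the Thai text:    " + canEncode("GBK", thai));
    }
}
```

The Thai character has no slot in the GB-based table, so no amount of font embedding would rescue it under that encoding.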
 
As far as I can tell, this behavior is a vestigial organ (like your appendix) 
left over from The Days Before Unicode (also known as: The Dark Ages).
 
I suppose we could generate a Unicode->Identity cmap for each font... a bit of 
extra work (both CPU and dev), but the foundations are already laid in iText 
hither and yon.  It would certainly simplify the "toUnicode" map.


--Mark Storer 
  Senior Software Engineer 
  Cardiff.com

#include <disclaimer> 
typedef std::Disclaimer<Cardiff> DisCard; 

-----Original Message-----
From: Leonard Rosenthol [mailto:[email protected]]
Sent: Thursday, October 08, 2009 7:16 AM
To: Post all your questions about iText here
Subject: Re: [iText-questions] Some questions about double byte characters and 
asian text



Yes, use Identity_H for everything.

 

Leonard

 

From: Y Fang [mailto:[email protected]] 
Sent: Thursday, October 08, 2009 7:01 AM
To: [email protected]
Subject: [iText-questions] Some questions about double byte characters and 
asian text

 

I've been looking at some of the font pages in the iText Tutorial here:  
http://itextdocs.lowagie.com/tutorial/ but there are two things which are 
confusing me regarding the writing of Asian characters. 

 

Firstly there is the explanation of the IDENTITY_H and IDENTITY_V encodings:

"In the next example, we are passing the value   
<http://www.1t3xt.info/api/com/lowagie/text/pdf/BaseFont.html#IDENTITY_H> 
IDENTITY_H as encoding. BaseFont.IDENTITY_H and BaseFont.IDENTITY_V are not 
really encodings. They indicate that the unicode character will be looked up in 
the font and stored as-is, taking two bytes of space. It's the only way to have 
Asian fonts and some encodings left out by Adobe such as Thai. For Europe or 
the Middle-East, it is better to use an available encoding that will store a 
single byte per character. Fonts with BaseFont.IDENTITY_H or 
BaseFont.IDENTITY_V will always be embedded no matter what you enter as third 
parameter."

 

So working through the examples there, it seems I'd be using BaseFont.CP1252 as 
the encoding for regular English text, and BaseFont.IDENTITY_H when I need to 
use double byte characters. So I tried to make a test pdf with some text, using 
the font "Microsoft YaHei" (which according to the Windows Character Map 
contains Chinese characters).

First try:

---------------

            Font font = new Font(
                BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.CP1252, BaseFont.EMBEDDED),
                12);

            document.Add(new Paragraph("Hello", font));

            document.Add(new Paragraph("你好", font));

---------------

This gave a 133kb pdf file where only the English "Hello" displayed (no Chinese 
text showed up under it)

 

Second try:

---------------

            Font font = new Font(
                BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED),
                12);

            document.Add(new Paragraph("Hello", font));

            document.Add(new Paragraph("你好", font));

---------------

This gave a 34kb pdf file where the English "Hello" displayed, and below it 
were two correct Chinese characters. 

 

In both cases I've asked for the font to be embedded, so why is the pdf created 
by the first try larger?

 

In the second try, using IDENTITY_H caused both the English and Chinese text to 
show up. So is it fine to specify IDENTITY_H as the encoding even for normal 
English text? i.e. as opposed to something like this:

---------------

            Font font1 = new Font(
                BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.CP1252, BaseFont.EMBEDDED),
                12);

            document.Add(new Paragraph("Hello", font1));

 

            Font font2 = new Font(
                BaseFont.CreateFont(@"C:\Windows\Fonts\msyh.ttf", BaseFont.IDENTITY_H, BaseFont.EMBEDDED),
                12);

            document.Add(new Paragraph("你好", font2));

---------------

 

The problem I'm trying to solve here, is that I need to create something which 
can accept text input from a user. The input can be either English characters, 
or it could be in Asian fonts...such as Chinese writing. And that text needs to 
be written to a PDF. Is it okay to always use BaseFont.IDENTITY_H? If not, and 
I should use BaseFont.CP1252 for English text, is there any way to tell what 
kind of text input I'm receiving?

 

For example, in the first try above, the Chinese text simply did not show up. 
Is there any way to check whether printing out a certain string with a certain 
font (in this case the two Chinese characters with msyh.ttf using CP1252) is 
going to work, and if not redo it using IDENTITY_H instead?
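
One way such a pre-check might look, sketched in Java (iText's original language) with only the standard library: ask the windows-1252 charset whether it can represent the whole string, and fall back to IDENTITY_H when it cannot. The class, the method names, and the returned labels are hypothetical, and the check only covers the encoding; it says nothing about whether msyh.ttf actually contains a glyph for each character. In C# the same idea could presumably be built on System.Text.Encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical pre-check, not an iText API.
class Cp1252Check {
    // "windows-1252" is the JDK's name for the Windows Latin-1 code page.
    static boolean fitsCp1252(String text) {
        CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
        // False if any character of the string has no single-byte mapping.
        return encoder.canEncode(text);
    }

    // Returns a plain label; the caller would map it to BaseFont.CP1252
    // or BaseFont.IDENTITY_H when creating the font.
    static String chooseEncoding(String text) {
        return fitsCp1252(text) ? "CP1252" : "IDENTITY_H";
    }

    public static void main(String[] args) {
        String english = "Hello";
        String chinese = "\u4F60\u597D"; // U+4F60 U+597D, the two characters from the test above

        System.out.println(english + " -> " + chooseEncoding(english));
        System.out.println("Chinese sample -> " + chooseEncoding(chinese));
    }
}
```

A check like this is mainly useful as a diagnostic; if IDENTITY_H is used for everything, no per-string decision is needed at all.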

 

It seems to me I should just use IDENTITY_H regardless of whether the input 
text I'm receiving is English writing or something like Chinese. 

 

 

 

The second thing is that this tutorial page:  
http://itextdocs.lowagie.com/tutorial/fonts/getting/index.php makes mention of 
using iTextAsian for CJK writing. When and why would you use that, as opposed 
to simply writing Asian text using IDENTITY_H and a font which contains Chinese 
(or Japanese, Korean, etc.) characters like Microsoft YaHei?

 

 

If anyone could give some insight here, or just point me to some relevant 
documentation/information, it would be much appreciated. 

 

Thanks in advance.

 



