RE: number of bytes for simplified chinese

2004-06-28 Thread Addison Phillips [wM]


Hi Duraivel,
 
Your question is incomplete. There are several Unicode encodings to 
choose from and the "number of bytes" question is influenced by your choice of 
encoding, as well as by the data you choose.
 
For example, UTF-8 is a multibyte encoding of Unicode, where each 
character is 1, 2, 3, or 4 bytes long, depending on the character. The 
majority of characters written in Simplified Chinese will be three bytes long in 
this encoding.
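
A quick way to check the UTF-8 sizes described above is to encode a few sample strings and count the bytes. This is only a minimal sketch in Python; the helper name utf8_len and the sample strings are illustrative assumptions, not part of the original question.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to store text in UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical sample data: ASCII, Latin-1 range, and common Chinese characters.
samples = ["A", "é", "中", "简体中文"]

for s in samples:
    # Bytes per character are averaged over the characters in the string.
    print(repr(s), utf8_len(s), "bytes,", utf8_len(s) / len(s), "per character")
```

With these samples, the ASCII letter takes one byte, the accented letter two, and each of the common Chinese characters three.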
 
UTF-16 encodes characters using two bytes per character for the vast 
majority of characters in most sets of data. Some Chinese characters are encoded 
on higher (or "supplementary") planes of Unicode and will require two 
two-byte code units (a "surrogate pair") to represent them in UTF-16. These 
characters are generally considered to be quite rare in "average" data and it is 
unlikely that your data will contain more than a few of these characters in any 
event.
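
The difference between a common character and a supplementary-plane character can be seen by comparing encoded sizes directly. The following is a rough sketch in Python; the function name encoded_sizes and the two sample code points (U+4E2D and U+20000) are chosen here purely for illustration. The "-le" codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(text.encode(enc)) for enc in encodings}


# U+4E2D is a common character on the Basic Multilingual Plane.
bmp_char = "\u4e2d"
# U+20000 is a rare character on a supplementary plane (CJK Extension B).
supplementary_char = "\U00020000"

print(encoded_sizes(bmp_char))            # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
print(encoded_sizes(supplementary_char))  # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The supplementary character needs four bytes in every encoding; in UTF-16 those four bytes are the surrogate pair.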
 
Probably, though, you are not starting your question in the right place. 
Why do you care about the number of bytes in a character? The reasons you give 
will determine whether a specific encoding is more (or less) suited for use than 
another encoding (or even character set, such as a legacy, non-Unicode, 
character set/encoding). For example, if you are trying to determine whether 
Unicode is more (or less) efficient than a legacy solution, then I think you'll 
find that the performance issues are somewhere other than the average byte count 
per character. If you are worried about storage (disk, database, etc.), then the 
specifics of your situation will determine what the "right answer" may be for 
you.
 
Best Regards,
 
Addison
Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International
Internationalization is an architecture.
It is not a feature.

  -----Original Message-----
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Duraivel
  Sent: June 27, 2004 23:38
  To: [EMAIL PROTECTED]
  Subject: number of bytes for simplified chinese
  
  hi,
   
  I would like to know the number of bytes 
  required for the Simplified Chinese language. Can we represent all the characters 
  of Simplified Chinese in Unicode using just two bytes?
   
  regards
  duraivel


Re: number of bytes for simplified chinese

2004-06-28 Thread John H. Jenkins
On Jun 27, 2004, at 11:37 PM, Duraivel wrote:
hi,
 
I would like to know the number of bytes required for the Simplified 
Chinese language. Can we represent all the characters of Simplified 
Chinese in Unicode using just two bytes?

No.  It will take up to four bytes per character, whether you're using 
UTF-8, UTF-16, or UTF-32.


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



number of bytes for simplified chinese

2004-06-28 Thread Duraivel




hi,
 
I would like to know the number of bytes required 
for the Simplified Chinese language. Can we represent all the characters of 
Simplified Chinese in Unicode using just two bytes?
 
regards
duraivel