RE: UTF8 vs. Unicode (UTF16) in code

Ayers, Mike Thu, 08 Mar 2001 18:01:19 -0800

        If you really want to finish the job, there's always UTF-32, which
should do rather nicely until we meet the space aliens with the
4,293,853,186 character alphabet!
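
For anyone who wants to see the trade-off in numbers, here is a minimal
Python sketch (the helper name code_units and the sample string are made up
for illustration) that counts how many code units each encoding form needs
for the same four code points. Only UTF-32 comes out at one unit per code
point.

```python
def code_units(text):
    """Count the code units each encoding form needs for `text`."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }

# Four code points: ASCII, Latin-1 range, a BMP ideograph, and one
# supplementary-plane ideograph (U+20000, outside the BMP).
sample = "A\u00e9\u4e2d\U00020000"
counts = code_units(sample)
print(counts)
```

The "-le" codec variants are used so that no byte order mark is counted.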


/|/|ike

P.S.  No, they're not Klingons!

> From: Ienup Sung [mailto:[EMAIL PROTECTED]]
> 
> I think we shouldn't advocate that, since there will be only 43K
> CJK characters in the SIP, about 1.6K characters in the SMP, and 97 tag
> characters in the SSP, we can ignore such characters and the additional
> planes of UTF-16/32 in Unicode 3.1. Furthermore, when you're doing the
> first i18n pass on existing programs, you can do the whole thing at once
> at minor additional cost if you choose to support UTF-16 while you're at
> it, rather than doing it only for BMP/UCS-2 now and making a second round
> of changes later, even though, in my opinion, that is for each team or
> company doing the i18n to decide.
> 
> And, as we all know, we can no longer claim that UTF-16 is fixed
> width, since it is now variable width like UTF-8; in my opinion, we
> will just have to deal with it.
> 
> With regards,
> 
> Ienup
> 
> 
> ] Date: Fri, 09 Mar 2001 10:48:52 -0800 (PST)
> ] From: [EMAIL PROTECTED]
> ] Subject: Re: UTF8 vs. Unicode (UTF16) in code
> ] X-Sender: [EMAIL PROTECTED]
> ] To: Ienup Sung <[EMAIL PROTECTED]>
> ] Cc: Unicode List <[EMAIL PROTECTED]>
> ] MIME-version: 1.0
> ] 
> ] Well....
> ] 
> ] Actually, there is a significant difference between being "UTF-8
> ] ignorant" and "UTF-16 ignorant". A "UTF-16 ignorant" program thinks
> ] that surrogate pairs are just two characters with undefined
> ] properties. Since currently there are no characters "up there" this
> ] isn't really a big deal. Shortly, when Unicode 3.1 is official, there
> ] will be 40K or so characters in the supplemental planes... but
> ] they'll be relatively rare.
> ] 
> ] In most cases where one has a "character pointer", one is not
> ] performing casing, line breaking, or other text interpretation that
> ] requires significant awareness of the meaning of the text. Of course,
> ] it depends on the instance and the application how true that is ;-).
> ] But in many cases you *can* ignore the fact that a high- or
> ] low-surrogate character is really part of something else.
> ] 
> ] With UTF-8, however, it is impossible to ignore the multi-byte
> ] sequences and they can never really be treated as separate
> ] characters. So I guess all I'm saying is that, depending on what you
> ] need to do and what level of awareness your application needs to
> ] achieve, a pure "UCS-2 port" might be a better choice than UTF-8,
> ] since the specific details overlooked are of a different quality.
> ] 
> ] Best Regards,
> ] 
> ] Addison
> ] 
> ] ===============================================================
> ] Addison P. Phillips                     Globalization Architect
> ] webMethods, Inc                       http://www.webmethods.com
> ] Sunnyvale, CA, USA              mailto:[EMAIL PROTECTED]
> ] 
> ] +1 408.210.3569 (mobile)                  +1 408.962.5487 (ofc)
> ] ===============================================================
> ] "Internationalization is not a feature. It is an architecture."
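
To make the "UTF-16 ignorant" versus "UTF-8 ignorant" distinction quoted
above a little more concrete, here is a rough Python sketch. The helper
names (utf16_units, is_surrogate, combine) are made up for illustration and
are not from any library. It shows what each kind of program would see for
one supplementary character: in UTF-16, two units that both fall in the
reserved surrogate range, so they cannot be mistaken for any other
character; in UTF-8, four bytes that a byte-at-a-time program would read as
four unrelated characters.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units a UCS-2-style program would see."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def is_surrogate(unit):
    """True if the 16-bit unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF

def combine(high, low):
    """Combine a high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

text = "a\U00020000b"            # U+20000 is outside the BMP
units = utf16_units(text)        # 4 units for 3 code points
flags = [is_surrogate(u) for u in units]
utf8 = text.encode("utf-8")      # 6 bytes for 3 code points

print([hex(u) for u in units])
print(hex(combine(units[1], units[2])))
print(utf8.decode("latin-1"))    # what a byte-per-character program sees
```

A program that never looks inside the text can pass the two surrogate units
along untouched, whereas the Latin-1 view of the UTF-8 bytes already shows
the wrong characters.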