Re: UTF-16 inside UTF-8

YTang0648 Wed, 05 Nov 2003 18:43:21 -0800

If the string a is "b" + a surrogate pair + "c" and I call a.indexOf("c"), what should it return: 2 or 3? If the caller then calls a.charAt(2), what should I return: the low surrogate, or the "c"?

How can I return the whole surrogate pair if someone calls a.charAt(1)? Or should I just return the high surrogate?

What should we return if someone calls a.substring(2)? The low surrogate and the "c"? The high surrogate plus the low surrogate plus the "c"? An error? What happens if the software originally did not return an error code for substring and there is no exception model to invoke?
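
To make these questions concrete, here is a small sketch of how Java's own String ended up answering them. It assumes a JDK of version 5 or later (newer than this message), where the code point methods exist, and it picks U+10400 as the supplementary character. The class name is made up for illustration.

```java
final class SurrogateIndexDemo {
    // "b" + U+10400 (stored as the surrogate pair D801 DC00) + "c"
    static final String A = "b\uD801\uDC00c";

    public static void main(String[] args) {
        // indexOf counts UTF-16 code units, so "c" is at index 3, not 2.
        System.out.println(A.indexOf("c"));

        // charAt(1) and charAt(2) return the two halves of the pair.
        System.out.println(Character.isHighSurrogate(A.charAt(1)));
        System.out.println(Character.isLowSurrogate(A.charAt(2)));

        // substring(2) silently splits the pair: low surrogate + "c".
        System.out.println(A.substring(2).length());

        // The whole character is only available through a separate,
        // int-returning method that was added later.
        System.out.println(Integer.toHexString(A.codePointAt(1)));

        // Converting between character counts and code unit indexes.
        System.out.println(A.codePointCount(0, A.length()));
        System.out.println(A.offsetByCodePoints(0, 2));
    }
}
```

In other words, the old methods kept their code unit meaning and new methods were added beside them, which is one possible answer to the API question, not the only one.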

4. Memory and performance trade-off.

You can probably get a sense of the difficulty if you look at how many specification changes MS needed to make to add surrogate support to OpenType fonts. That is just the specification changes, not including code changes or API changes.

'cmap' http://www.microsoft.com/typography/otspec/cmap.htm

It is easy to add surrogate support to your application if your application does nothing. It is difficult (though not impossible) to add surrogate support if your application does some data processing. It is hard to add surrogate support if your software is a library with a previously defined API.

Look at

Format 4: Segment mapping to delta values

Supporting 4-byte character codes

I am not saying software should not support surrogates. I am saying don't underestimate the effort. And when software does support surrogates correctly, give it praise instead of taking it for granted. It is hard work.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

==================================
Frank Yung-Fong Tang
System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta mailto:[EMAIL PROTECTED] Tel:650-937-2913
Yahoo! Msg: frankyungfongtan

John 3:16 "For God so loved the world that he gave his one and only Son, that whoever believes in him shall not perish but have eternal life.

Does your software display Thai language text correctly for Thailand users?
-> Basic Concept of Thai Language linked from Frank Tang's Iñtërnâtiônàlizætiøn Secrets
Want to translate your English text to something Thailand users can understand ?
-> Try English-to-Thai machine translation at http://c3po.links.nectec.or.th/parsit/

In a message dated 11/5/2003 3:42:42 PM Pacific Standard Time, [EMAIL PROTECTED] writes:

Topic-change alert! I'm not talking about glyph support in fonts, or
bidi support, or collation, or contextual shaping, or any other aspect
of Unicode support. I'm talking about completely denying the existence
of non-BMP characters.

There are tons of applications -- Notepad is a basic example -- that
allow the entry of any arbitrary BMP character. They don't allow some
BMP characters and disallow others. That's all I'm talking about. Now,
if such an application allows BMP characters but disallows supplementary
characters, as MySQL (e.g.) does, I think that is an unnecessary
restriction.

Surrogates were defined in Unicode 2.0, which was published in 1996. Did Notepad in Windows 98 support them two years after Unicode 2.0 was published? No. MS did not even support surrogates in the Notepad that came with WinME. In fact, you need to install a special package on Win2K to enable surrogate support. Why did it take that long? Very simple: because it is not as simple as you thought. If you calculate how long it took MS to add surrogate support to Windows from the time surrogates were defined in Unicode 2.0, you can probably estimate how long it will take for software that is just starting to add Unicode support to add surrogate support as well.

One of these days I'm going to implement a "Unicode" front end that
supports Basic Latin and U+A068 YI SYLLABLE BBOP, but *no other
characters*, just to show how silly such a restriction would be.
(Remember, it's conformant as long as I don't lie about it. That
doesn't mean it's not silly.)

There is a huge gap between "not silly" and "make it work". It is not that simple to make the whole software support surrogates correctly in every aspect.

> For back end software which do pure data process without keyboard
> input or text rendering, it is eaiser to implement the whole Unicode
> BMP range or even with the surrogate.

(1) "Surrogates" are only about UTF-16, not any other aspect of
Unicode.
(2) Supporting surrogates in UTF-16 is not tremendously difficult.

ok.

Examples to show you how difficult it is to support surrogates:

Example 1: I have this API

UniChar is defined to be two bytes holding 16 bits.

UniChar ToLower(UniChar aChar)

Tell me how this can support surrogates.
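
A sketch of why the signature itself is the problem, written in Java and assuming a JDK of version 5 or later: a 16-bit value can only ever carry half of a supplementary character, so the function has to be replaced by one that takes a whole code point or a whole string. The class and method names below are hypothetical, and DESERET CAPITAL LETTER LONG I (U+10400) is used only as a convenient cased character outside the BMP.

```java
final class ToLowerSketch {
    // Old-style API: one 16-bit unit in, one 16-bit unit out.
    // A lone surrogate has no case mapping, so it comes back unchanged.
    static char toLowerUnit(char unit) {
        return Character.toLowerCase(unit);
    }

    // Widened API: the argument is a full code point held in an int.
    static int toLowerCodePoint(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    // String-level variant that walks code points, not code units.
    // This is simple per-character mapping only, not full case mapping.
    static String toLowerString(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(toLowerUnit('\uD801')));
        System.out.println(Integer.toHexString(toLowerCodePoint(0x10400)));
        System.out.println(toLowerString("A\uD801\uDC00").equals("a\uD801\uDC28"));
    }
}
```

Every caller of the old function has to be found and moved to one of the new ones, which is where the real cost is.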

Example 2:

I have this API

int FindCharInString( String, UniChar)

Tell me what the return value should mean. Should it mean the count of UniChars from the beginning of the String, or should it mean the count of CHARACTERs from the beginning of the String? What should I do when I start to add surrogate support?

Example 3:

I have this API

int LengthOfString(String)

Should this API return the number of UniChars or the number of CHARACTERs?

Example 4:

I have this API

String Left(String, int a)

What should a mean: the index of the UniChar or the index of the CHARACTER?
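
Examples 2 to 4 can be put side by side in a short Java sketch (again assuming JDK 5 or later). Each API gets two hypothetical implementations, one counting UniChars and one counting CHARACTERs, and they give different answers for the same string. All names here are invented for illustration.

```java
final class UnitVersusCharacter {
    // Example 3: length as a count of 16-bit units or of characters.
    static int lengthInUnits(String s) {
        return s.length();
    }

    static int lengthInCharacters(String s) {
        return s.codePointCount(0, s.length());
    }

    // Example 2: position of a character, in units or in characters.
    static int findUnitIndex(String s, int codePoint) {
        return s.indexOf(codePoint);
    }

    static int findCharacterIndex(String s, int codePoint) {
        int unitIndex = s.indexOf(codePoint);
        return unitIndex < 0 ? -1 : s.codePointCount(0, unitIndex);
    }

    // Example 4: leftmost n units or leftmost n characters.
    static String leftUnits(String s, int n) {
        return s.substring(0, n); // may cut a surrogate pair in half
    }

    static String leftCharacters(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String s = "b\uD801\uDC00c"; // "b" + U+10400 + "c"
        System.out.println(lengthInUnits(s) + " vs " + lengthInCharacters(s));
        System.out.println(findUnitIndex(s, 'c') + " vs " + findCharacterIndex(s, 'c'));
        System.out.println(leftUnits(s, 2).length() + " vs " + leftCharacters(s, 2).length());
    }
}
```

Whichever meaning the library picks, existing callers that assumed the other one break silently, and the character-counting versions are no longer constant-time.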

>> and implementing UTF-8 support for the entire Unicode code space is
>> about 0.1% harder than artificially crippling it by restricting it to
>> the BMP.
>
> Disagree about what you said "about 0.1 % harder".
>
> For many developers, adding 4 bytes UTF-8 to surrogate support simply
> mean open a can of worm.

See point (1) above.

> After that, they need to worry about how to
> support surrogate, which is quite complex in the api design/change.

See points (1) and (2) above.

> The work to make the converter convert UTF-8 to a surrogate pair and
> back is probably as you said "0.1 harder". But work AFTER they open
> such door is much harder to manage. As the famouse saying "Unicode is
> not the answer for Internationalization, Unicode is the question for
> the Internationalization". Thanks for all the job opportunity Unicode
> standard created (and keep creating) of us :)

See point (1) above. Other than UTF-16 surrogates -- and remember, this
is not 1993; the world of Unicode no longer revolves around the 16-bit
encoding form -- what aspect of supplementary character support is so
much more complicated than BMP support?

1. Dependent technology: for example, your software depends on Tcl, but Tcl 8.4.4 does not support surrogates.

2. Dependent protocols: for example, GSM 03.38 defines only the default alphabet and UCS-2, not UTF-8. What is the point of a GSM gateway accepting surrogates? Why bother, when they will not be shown on people's cell phones because of the GSM protocol anyway?

3. The definition of the API, for example:

you have, on String:

int indexOf(int ch)
    Returns the index within this string of the first occurrence of the specified character.

char charAt(int index)
    Returns the character at the specified index.

String substring(int beginIndex)
    Returns a new string that is a substring of this string.
