John Tisdale wrote:

Unicode Fundamentals

Early character sets were very limited in scope. ASCII required only 7 bits
to represent its repertoire of 128 characters. ANSI pushed this scope 8 bits
which represented 256 characters while providing backward compatibility with
ASCII. Countless other character sets emerged that represented the

As is often the case, Unicode experts are not necessarily experts on 'legacy' character sets and encodings. The 'official' name of 'ASCII' is ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode, I'm afraid you're spreading misinformation about what came before it.
The sentence that 'ANSI pushed this scope ... represents 256 characters' is misleading. ANSI has nothing to do with the various single-, double- and triple-byte character sets that make up single-byte and multibyte character encodings. They're devised and published by national and international standards organizations as well as by various vendors. Perhaps you'd better just get rid of the sentence 'ANSI pushed ... providing backward compatibility with ASCII'.



characters needed by various languages and language groups. The growing
complexities of managing numerous international character sets escalated the

numerous national and vendor character sets that are specific to a small subset of scripts/characters in use (or that can cover only a small subset of ....)



Two standards emerged about the same time to address this demand. The
Unicode Consortium published the Unicode Standard and the International
Organization for Standardization (ISO) offered the ISO/IEF 10646 standard.

A typo: It's ISO/IEC not ISO/IEF. Perhaps, it's not a typo. You consistently used ISO/IEF in place of ISO/IEC ;-)


Fortunately, these two standards bodies synchronized their character sets
some years ago and continue to do so as new characters are added.
Yet, although the character sets are mapped identically, the standards for
encoding them vary in many ways (which are beyond the scope of this
article).


I'm afraid that 'Yet, ...' can give the false impression that the Unicode Consortium and
ISO/IEC have some differences in their encoding standards, especially considering that the sentence begins with 'although ... identically'.



Coded Character Sets

A coded character set (sometimes called a character repertoire) is a mapping
from a set of abstract characters to a set of nonnegative, noncontiguous
integers (between 0 and 1,114,111, called code points).

A 'character repertoire' is different from a coded character set in that it's more like a set of abstract characters **without** numbers associated with them. (needless to say, 'a coded character set' is a set of character-integer pairs)
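
If it helps make the distinction concrete, here's a minimal Python sketch (my own illustration, not from your article) of the character-to-code-point mapping that a coded character set defines:

    # A coded character set pairs each abstract character with a code point,
    # a nonnegative integer in the range 0..0x10FFFF.
    for ch in ('A', '\u00e9', '\uae00', '\U0001d11e'):
        print(ch, hex(ord(ch)))     # e.g. 'A' -> 0x41, U+1D11E -> 0x1d11e

    # chr() goes the other way, from code point back to character.
    assert chr(0x41) == 'A'

A character repertoire, by contrast, is just the left-hand side of that pairing.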



Character Encoding Forms
The second component in Unicode is character encoding forms. Their purpose

I'm not sure whether 'component' is the best word to use here.


The Unicode Standard provides three forms for encoding its repertoire
(UTF-8, UTF-16 and UTF-32).

Note that ISO 10646:2003 also defines all three of them exactly the same way as Unicode does.
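
For instance (a quick Python sketch of my own, not something from the article), the same character in each of the three encoding forms:

    ch = '\uae00'                  # U+AE00, a Hangul syllable
    print(ch.encode('utf-8'))      # b'\xea\xb8\x80'       -- three 8-bit code units
    print(ch.encode('utf-16-be'))  # b'\xae\x00'           -- one 16-bit code unit
    print(ch.encode('utf-32-be'))  # b'\x00\x00\xae\x00'   -- one 32-bit code unit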


You will often find references to USC-2 and
USC-4. These are competing encoding forms offered by ISO/IEF 10646 (USC-2 is
equivalent to UTF-16 and USC-4 to UTF-32). I will not discuss the

UCS-2 IS different from UTF-16. UCS-2 can only represent a subset of the characters in Unicode/ISO 10646 (namely, those in the BMP). BTW, it's not USC but UCS. Also note that UTF in UTF-16/UTF-32/UTF-8 stands for either 'UCS Transformation Format' (UCS stands for Universal Character Set, ISO 10646) or 'Unicode Transformation Format'.
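
To see where UCS-2 and UTF-16 part ways, here's a small Python sketch (mine, not the article's). A character outside the BMP takes a surrogate pair in UTF-16, something UCS-2 simply cannot express:

    clef = '\U0001d11e'                 # MUSICAL SYMBOL G CLEF, U+1D11E, outside the BMP
    data = clef.encode('utf-16-be')
    print(data.hex())                   # 'd834dd1e' -- two 16-bit code units (a surrogate pair)
    print(len(data) // 2)               # 2; UCS-2 has no way to represent this character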



significant enough to limit its implementation (as at least half of the 32
bits will contain zeros in the majority of applications). Except in some
UNIX operating systems and specialized applications with specific needs,

Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when __STDC_ISO_10646__ is defined. Recent versions of Python also use UTF-32 internally.
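
As a rough check on the Python side, sys.maxunicode tells you what a given build can represent internally (a tiny sketch of mine):

    import sys
    # 0x10FFFF on builds that cover the whole Unicode range internally
    # (so-called "wide"/UCS-4 builds); 0xFFFF on narrow builds limited to the BMP.
    print(hex(sys.maxunicode))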


UTF-32 is seldom implemented as an end-to-end solution (yet it does have its
strengths in certain applications).
UTF-16 is the default means of encoding the Unicode character repertoire
(which has perhaps played a role in the misnomer that Unicode is a 16-bit
character set).

I would not say UTF-16 is the default means of encoding ..... It's probably the most widely used, but that's different from being the default ...unless you're talking specifically about Win32 APIs (you're not in this paragraph, right?)



UTF-8 is a variable-width encoding form based on byte-sized code units
(ranging between 1 and 4 bytes per code unit).

The code unit of UTF-8 is an 8-bit byte, just as the code unit of UTF-16 is a 16-bit 'half-word' and that of UTF-32 a 32-bit 'word'. A single Unicode character is represented with 1 to 4 code units (bytes) depending on what code point it's assigned in Unicode. Please see p. 73 of the Unicode Standard 4.0.
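
Here's a rough Python illustration (mine) of 'one to four code units per character', counting code units rather than raw bytes:

    def code_units(ch):
        # Number of code units needed for one character in each encoding form.
        return {'UTF-8':  len(ch.encode('utf-8')),             # 8-bit code units
                'UTF-16': len(ch.encode('utf-16-be')) // 2,     # 16-bit code units
                'UTF-32': len(ch.encode('utf-32-be')) // 4}     # 32-bit code units

    for ch in ('A', '\u00e9', '\uae00', '\U0001d11e'):
        print('U+%04X' % ord(ch), code_units(ch))
    # U+0041  -> 1 / 1 / 1
    # U+00E9  -> 2 / 1 / 1
    # U+AE00  -> 3 / 1 / 1
    # U+1D11E -> 4 / 2 / 1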


In UTF-8, the high bits of each byte are reserved to indicate where in the
unit code sequence that byte belongs. A range of 8-bit code unit values are

where in the code unit sequence that byte belongs.


reserved to indicate the leading byte and the trailing byte in the sequence.
By sequencing four bytes to represent a code unit, UTF-8 is able to
represent the entire Unicode character repertoire.

By using one to four code units (bytes) to represent a character
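
To make the 'high bits' point visible, here's a quick Python sketch (my own): the lead byte of a multi-byte sequence starts with 110, 1110 or 11110, and every trailing byte starts with 10:

    for b in '\U0001d11e'.encode('utf-8'):   # U+1D11E encodes as four bytes
        print(format(b, '08b'))
    # 11110000   lead byte of a four-byte sequence
    # 10011101   trailing byte
    # 10000100   trailing byte
    # 10011110   trailing byte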

Character Encoding Schemes

method. This issue is not relevant with UTF-8 because it utilizes individual
bytes that are encapsulated with the sequencing data (with bounded look
ahead).

'because ....' reads too cryptically. Why don't you just say 'the code unit in UTF-8 is a byte, so there's no need for serialization'?
(i.e., sequences of code units in UTF-8 are identical to sequences of bytes in UTF-8)
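
A small Python sketch (mine) of the byte-order question that encoding schemes address: UTF-16 serializes differently depending on byte order, while UTF-8 has exactly one byte sequence:

    ch = '\uae00'                        # U+AE00
    print(ch.encode('utf-16-be').hex())  # 'ae00'     big-endian scheme
    print(ch.encode('utf-16-le').hex())  # '00ae'     little-endian scheme
    print(ch.encode('utf-16').hex())     # BOM plus native byte order
    print(ch.encode('utf-8').hex())      # 'eab880'   one order, no BOM needed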



Choosing an Encoding Solution

high value in the multi-platform world of the Web. As such, HTML and current
versions of Internet Explorer running on Windows 2000 or later use the UTF-8
encoding form. If you try to force UTF-16 encoding on IE, you will encounter
an error or it will default to UTF-8 anyway.

I'm not sure what you're trying to say here, although I can't agree with you more that UTF-8 is the most sensible choice for transmitting information (**serving** documents) over 'mostly' byte-oriented protocols/media such as Internet mail and HTML/XML (HTML/XML can be in UTF-16/UTF-32 as well). As a web user agent/**client**, MS IE can (and must) render documents in UTF-16 just as well as documents in UTF-8 and many other character encodings. It even supports UTF-7.



valuable asset. The extra code and processor bandwidth required to
accommodate variable-width code units can outweigh the cost of using 32-bits
to represent each code unit.

You keep misusing 'code unit'. Code units cannot be of variable width; the width is fixed in each encoding form: 8 bits in UTF-8, 16 bits in UTF-16 and 32 bits in UTF-32. The last sentence should end with 'to represent each character'.



In such cases, the internal processing can be done using UTF-32 encoding and
the results can be transmitted or stored in UTF-16

can be transmitted or stored in UTF-16 or UTF-8.
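
For instance, here's a rough Python sketch of that division of labor (mine; the file names are made up): decode on input, work with code points internally, re-encode as UTF-8 (or UTF-16) on output:

    # Hypothetical file names, purely for illustration.
    with open('input.txt', 'rb') as f:
        text = f.read().decode('utf-16')       # data arrives serialized as UTF-16

    code_points = [ord(ch) for ch in text]     # fixed-width, UTF-32-like view for processing

    with open('output.txt', 'wb') as f:
        f.write(text.encode('utf-8'))          # store/transmit the result as UTF-8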


Hope this helps,

  Jungshik


