Eric Hacker wrote:
> Unicode is a superset of ASCII and thus all ASCII characters are Unicode.
> UTF8 is a way of encoding unicode code points for transport over the
> internet in a restricted character set. Conveniently, UTF8 uses the same
> values as ASCII for ASCII representation. Above the standard ASCII 127
> character representation, UTF8 uses multi-byte strings beginning with 0xC1.
No; the sequences for codes 128 to 255 begin with 0xC2 and 0xC3
(128-191 and 192-255 respectively). 0xC0 and 0xC1 indicate (illegal)
overlong encodings of 0-63 and 64-127 respectively.
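
This is easy to check with any conforming encoder. A minimal sketch using
Python's built-in UTF-8 codec (the helper name lead_byte is just for
illustration) collects the first byte produced for every code in each range:

```python
def lead_byte(cp):
    """Return the first byte of the UTF-8 encoding of code point cp."""
    return chr(cp).encode("utf-8")[0]

# Distinct lead bytes for codes 128-191 and for codes 192-255.
lead_bytes_low = {lead_byte(cp) for cp in range(128, 192)}
lead_bytes_high = {lead_byte(cp) for cp in range(192, 256)}

print([hex(b) for b in sorted(lead_bytes_low)])   # ['0xc2']
print([hex(b) for b in sorted(lead_bytes_high)])  # ['0xc3']
```
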
In general, the two-byte sequences have the (binary) form:
110xxxxx 10xxxxxx
Packing a code in the range 0-127 into this form (which is not allowed;
those codes must use the single-byte form instead) would give:
1100000x 10xxxxxx
Hence, any sequence beginning with 11000000 (0xC0) or 11000001 (0xC1)
is illegal.
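
To make the bit layout concrete, here is a rough sketch in Python. The
function names pack_two_bytes and strict_decode_ok are made up for this
example; pack_two_bytes deliberately skips the validity check so that it can
produce an overlong form, which Python's own UTF-8 decoder then rejects:

```python
def pack_two_bytes(cp):
    """Pack an 11-bit value into 110xxxxx 10xxxxxx, with no overlong check."""
    if not 0 <= cp < 0x800:
        raise ValueError("value does not fit in 11 bits")
    # Top 5 bits go in the lead byte, low 6 bits in the continuation byte.
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

def strict_decode_ok(seq):
    """Return True if Python's UTF-8 decoder accepts the byte sequence."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

overlong_a = pack_two_bytes(0x41)   # 'A' forced into two bytes
legal_e9 = pack_two_bytes(0xE9)     # code 233, a legitimate two-byte case

print(overlong_a.hex(), strict_decode_ok(overlong_a))  # c181 False
print(legal_e9.hex(), strict_decode_ok(legal_e9))      # c3a9 True
```

Any value below 128 has zeros in all but the lowest of the five lead-byte
payload bits, which is why only 0xC0 and 0xC1 can appear as the lead byte of
such a sequence.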
--
Glynn Clements <[EMAIL PROTECTED]>