Carl R. Witty wrote:

> 1) I assume that layout processing occurs after Unicode preprocessing;
> otherwise, you can't even find the lexemes.  If so, are all Unicode
> characters assumed to be the same width?

Unicode characters ***cannot in any way*** be considered to be of
the same display width.  Many characters have intrinsic width properties,
like "halfwidth Katakana", "fullwidth ASCII", "ideographic space",
"thin space", "zero width space", and so on (most of which are
compatibility characters, i.e. present only for conversion reasons).
But more importantly there are combining characters which "modify"
a "base character".  For instance Å (A with ring above) can be given
as an A followed by a combining ring above, i.e. two Unicode characters.
(For this and many others there is also a 'precomposed' character.)
For many scripts, vowels are combining characters.  And each
non-combining character may be followed by an indefinitely long
sequence of combining characters (in principle; in practice three
is already a lot).
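The precomposed/decomposed distinction above can be made concrete; here is a
small sketch (in Python, purely for illustration, using the standard-library
`unicodedata` module):

```python
import unicodedata

# "Å" precomposed (U+00C5) vs decomposed: "A" (U+0041) + combining ring above (U+030A)
precomposed = "\u00c5"
decomposed = "A\u030a"

print(len(precomposed))            # 1 code point
print(len(decomposed))             # 2 code points
print(precomposed == decomposed)   # False: different code point sequences
# Normalization (here NFC) maps the decomposed form to the precomposed one:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

So naive code-point comparison treats the two spellings of the same
"user perceived character" as different strings.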

What about bidirectional scripts?  Especially the Arabic script,
which is a cursive (joined) script, and in which, in addition,
vowels are combining characters.

Furthermore, Unicode characters in the "extended range" (in which no
characters have been allocated yet) are encoded using two *non-character*
16-bit surrogate codes (when using UTF-16, which is the preferred
encoding for Unicode).
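To make the surrogate mechanism concrete, here is a sketch (Python, for
illustration only) showing a character above U+FFFF turning into two 16-bit
codes under UTF-16:

```python
# U+10400 lies above U+FFFF, so UTF-16 represents it as a surrogate
# pair: two 16-bit codes that are not characters themselves.
ch = "\U00010400"
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])  # ['0xd801', '0xdc00']
```

Both 0xD801 and 0xDC00 fall in the surrogate range D800-DFFF, which is
reserved and never assigned to characters.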

What would "Unicode preprocessing" be?  UTF-16 decoding?
Java-ish escape sequence decoding?

...
> 3) What does it mean that Char can include any Unicode character?

I think it *does not* mean that a Char can hold any Unicode
character.  I think it *does* mean that it can hold any single
(UTF-16) 16-bit value, which is something quite different.  To store
an arbitrary Unicode character 'straight off', one would need at
least 21 bits to cover the UTF-16 range.  ISO/IEC 10646-1 allows
for up to 31 bits, but nobody(?) is planning to need all that.
Some use 32-bit values to store Unicode characters.  That is perfectly
allowed by 10646, though not by Unicode proper.  Following Unicode
proper, one would always use a sequence of UTF-16 codes, in order to
be able to treat a "user perceived character" as a single entity,
both for UTF-16 reasons and for combining-sequence reasons,
independently of how the "user perceived character" was given as
Unicode characters.
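The gap between "16-bit codes", "code points", and "user perceived
characters" can be sketched as follows (Python, for illustration; the
base-character count is only an approximation of user-perceived characters,
counting characters with combining class 0):

```python
import unicodedata

def utf16_units(s):
    # number of 16-bit codes in the UTF-16 encoding
    return len(s.encode("utf-16-be")) // 2

def base_characters(s):
    # rough "user perceived character" count: non-combining characters only
    return sum(1 for c in s if unicodedata.combining(c) == 0)

s = "A\u030a\U00010400"   # A + combining ring above, then one character above U+FFFF
print(len(s))              # 3 code points
print(utf16_units(s))      # 4 sixteen-bit UTF-16 codes
print(base_characters(s))  # 2 user-perceived characters (roughly)
```

A type holding a single 16-bit value can therefore represent neither the
second "user perceived character" here nor the first one as a unit.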

                        /kent k

PS
Java gets some Unicode things wrong too, including that Java's
UTF-8 encoding is non-conforming (to both Unicode 2.0 and ISO/IEC
10646-1 Amd. 2).
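For the curious: Java's "modified UTF-8" (as used by e.g.
DataOutputStream.writeUTF) encodes U+0000 as the overlong pair C0 80 and
encodes characters above U+FFFF as the 3-byte UTF-8 of each UTF-16 surrogate,
giving six bytes; conforming UTF-8 forbids both.  A sketch of the difference
(Python, for illustration, hand-rolling the Java-style surrogate encoding):

```python
ch = "\U00010400"
conforming = ch.encode("utf-8")  # 4 bytes: f0 90 90 80

# Java-style: encode each 16-bit UTF-16 surrogate separately as 3-byte UTF-8
units = [int.from_bytes(ch.encode("utf-16-be")[i:i + 2], "big") for i in (0, 2)]
java_style = b"".join(
    bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])
    for u in units
)
print(conforming.hex())  # 'f0909080'
print(java_style.hex())  # 'eda081edb080' -- six bytes, non-conforming
```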
