I am pretty sure that this whole discussion does more harm than good for most
people's understanding of Unicode.
It is best and (mostly) correct to think of a Unicode string as a sequence of
Unicode characters, each defined/identified by a code point (out of the tens of
thousands covering all languages). That is what we have today in Pharo (with
the distinction between ByteString and WideString being a mostly invisible
implementation detail).
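For illustration (in Python rather than Pharo, but the idea carries over): a string is a sequence of characters, and each character has a numeric code point.

```python
# A Unicode string is a sequence of characters,
# each identified by a code point.
s = "héllo"

print(len(s))                    # 5 characters
print([hex(ord(c)) for c in s])  # the code points; é is U+00E9
```

In Pharo the equivalent would be sending #asInteger to each Character in the string.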
To encode Unicode for external representation as bytes, we use UTF-8 like the
rest of the modern world.
So far, so good.
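A minimal sketch of that external representation (again in Python for illustration): encoding turns characters into bytes, decoding turns them back, and non-ASCII characters take more than one byte in UTF-8.

```python
s = "héllo"

# Encode the 5-character string for external representation as bytes.
encoded = s.encode("utf-8")
print(encoded)       # é (U+00E9) becomes the two bytes C3 A9
print(len(encoded))  # 6 bytes for 5 characters

# Decoding the bytes gives back the original string.
assert encoded.decode("utf-8") == s
```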
Why all the confusion? Because the world is a complex place and the Unicode
standard tries to cover all possible cases. Citing all these exceptions and
special cases will drive people crazy and make them give up. I am sure that
most stopped reading this thread.
like me ;)
I will wait for a conclusion with code :)
Stef
Why, then, is there confusion about the seemingly simple concept of a character?
Because Unicode allows different ways to say the same thing. The simplest
example in a common language is the French letter é, which is
LATIN SMALL LETTER E WITH ACUTE [U+00E9]
but can also be written as
LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
The former is the composed normal form (NFC), the latter the decomposed normal
form (NFD).
(And yes, it is even much more complicated than that; the standard goes on for
thousands of pages).
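The é example above can be demonstrated directly (a Python sketch, using the standard-library unicodedata module; Pharo offers similar normalization support):

```python
import unicodedata

composed = "\u00E9"         # é as one code point
decomposed = "e\u0301"      # e + combining acute accent

# Both render as é, but as code-point sequences they differ.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 1 vs 2

# Normalization converts between the two forms.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

This is why naive string equality and string length are fuzzy at the character level: two strings that look identical can compare as different unless both are normalized first.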
In the above example, the concept of character/string is indeed fuzzy.
HTH,
Sven