I am pretty sure that this whole discussion does more harm than good for most 
people's understanding of Unicode.

It is best and (mostly) correct to think of a Unicode string as a sequence of 
Unicode characters, each defined/identified by a code point (out of the more 
than 100,000 defined, covering all the world's writing systems). That is what 
we have today in Pharo (with the distinction between ByteString and WideString 
as a mostly invisible implementation detail).

To encode Unicode for external representation as bytes, we use UTF-8 like the 
rest of the modern world.
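As a quick illustration (in Python rather than Pharo, purely for brevity): a single code point above U+007F becomes multiple bytes under UTF-8, which is exactly the internal-characters versus external-bytes distinction made above.

```python
# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is one code point,
# but UTF-8 encodes it as the two bytes 0xC3 0xA9.
s = "\u00e9"           # 'é'
encoded = s.encode("utf-8")

print(len(s))          # 1 -> one character/code point internally
print(encoded.hex())   # c3a9 -> two bytes in the external representation
print(encoded.decode("utf-8") == s)   # True -> round-trips cleanly
```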

So far, so good.

Why all the confusion? Because the world is a complex place and the Unicode 
standard tries to cover all possible things. Citing all these exceptions and 
special cases will drive people crazy and make them give up. I am sure that 
most have stopped reading this thread.


like me ;)
I will wait for a conclusion with code :)

Stef


Why then is there confusion about the seemingly simple concept of a character? 
Because Unicode allows different ways to say the same thing. The simplest 
example in a common language is the French letter é, which is

LATIN SMALL LETTER E WITH ACUTE [U+00E9]

which can also be written as

LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]

The former is a composed normal form, the latter a decomposed normal form. 
(And yes, it is even much more complicated than that; the standard goes on 
for thousands of pages.)
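A minimal sketch of this (again in Python, whose standard `unicodedata` module exposes the Unicode normalization forms) shows that the two spellings of é are different code point sequences until you normalize them:

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# They render identically but are different sequences of code points:
print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# Normalizing to NFC (composed form) makes them compare equal:
print(unicodedata.normalize("NFC", decomposed) == composed)     # True
# Normalizing to NFD (decomposed form) works the other way:
print(unicodedata.normalize("NFD", composed) == decomposed)     # True
```

This is why naive string comparison is not enough once combining characters enter the picture: equality only holds after both sides are brought to the same normal form.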

In the above example, the concept of character/string is indeed fuzzy.

HTH,

Sven




