Re: [Pharo-dev] Unicode Support

Sven Van Caekenberghe Wed, 09 Dec 2015 05:34:59 -0800

> On 09 Dec 2015, at 14:16, EuanM <euan...@gmail.com> wrote:
> 
> "To encode Unicode for external representation as bytes, we use UTF-8
> like the rest of the modern world.
> 
> So far, so good.
> 
> Why all the confusion ?"


That was a rhetorical question.

I know that we lack normalization, but we don't need another encoding or 
representation for that.

Sorting/collation can also be done regardless of encoding or representation.

These are orthogonal concerns to the working situation that we have today.
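
To make the point concrete, here is a minimal sketch in Python (not Pharo, 
which lacks normalization today) using the standard unicodedata module. The 
helper name same_text is my own, purely for illustration. It shows that 
normalization works on code points, so the outcome is the same whichever 
encoding the bytes went through.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Two different code point sequences, so two different (both valid) UTF-8
# byte sequences for what a reader sees as the same letter.
assert composed != decomposed
assert composed.encode("utf-8") == b"\xc3\xa9"
assert decomposed.encode("utf-8") == b"e\xcc\x81"


def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Normalization is defined on code points, not bytes: a round trip through
# any encoding leaves the comparison result unchanged.
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    a = composed.encode(codec).decode(codec)
    b = decomposed.encode(codec).decode(codec)
    assert same_text(a, b)
```

Note that this only covers equivalence testing. Language-aware sorting needs 
a collation algorithm on top, which is again independent of the encoding.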

> The confusion arises because simply providing *a* valid UTF-8 encoding
> does not ensure sortability, nor equivalence testability.
> 
> It might provide sortable strings. It might not.
> 
> It might provide a string that can be compared to another string
> successfully.  It might not.
> 
> So being able to perform valid UTF-8 encoding is *necessary*, but *not
> sufficient*.
> 
> i.e. the confusion arises because UTF-8 can provide for several
> competing, non-sortable encodings of even a single character.  This
> means that *valid* UTF-8 cannot be relied upon to provide these
> facilities *unless* all the UTF-8 strings can be relied upon to have
> been encoded to UTF-8 by the same specification of process.  i.e.
> *unless* it has gone through a process of being converted by *a
> specific* valid method of encoding to UTF-8.
> 
> Understanding the concept of abstract character is, imo, key to
> understanding the differences between the various valid UTF-8 forms of
> a given abstract character.
> 
> 
> Cheers,
>    Euan
> 
> On 9 December 2015 at 10:45, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>> 
>>> On 09 Dec 2015, at 10:35, Guillermo Polito <guillermopol...@gmail.com> 
>>> wrote:
>>> 
>>> 
>>>> On 8 dic 2015, at 10:07 p.m., EuanM <euan...@gmail.com> wrote:
>>>> 
>>>> "No. a codepoint is the numerical value assigned to a character. An
>>>> "encoded character" is the way a codepoint is represented in bytes
>>>> using a given encoding."
>>>> 
>>>> No.
>>>> 
>>>> A codepoint may represent a component part of an abstract character,
>>>> or may represent an abstract character, or it may do both (but not
>>>> always at the same time).
>>>> 
>>>> Codepoints represent a single encoding of a single concept.
>>>> 
>>>> Sometimes that concept represents a whole abstract character.
>>>> Sometimes it represents part of an abstract character.
>>> 
>>> Well. I do not agree with this. I agree with the quote.
>>> 
>>> Can you explain a bit more about what you mean by abstract character and 
>>> concept?
>> 
>> I am pretty sure that this whole discussion does more harm than good for 
>> most people's understanding of Unicode.
>> 
>> It is best and (mostly) correct to think of a Unicode string as a sequence 
>> of Unicode characters, each defined/identified by a code point (out of 
>> tens of thousands covering all languages). That is what we have today in Pharo (with 
>> the distinction between ByteString and WideString as mostly invisible 
>> implementation details).
>> 
>> To encode Unicode for external representation as bytes, we use UTF-8 like 
>> the rest of the modern world.
>> 
>> So far, so good.
>> 
>> Why all the confusion ? Because the world is a complex place and the Unicode 
>> standard tries to cover all possible things. Citing all these exceptions and 
>> special cases will make people crazy and give up. I am sure that most 
>> stopped reading this thread.
>> 
>> Why then is there confusion about the seemingly simple concept of a 
>> character ? Because Unicode allows different ways to say the same thing. The 
>> simplest example in a common language (the French letter é) is
>> 
>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>> 
>> which can also be written as
>> 
>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>> 
>> The former is the composed normal form (NFC), the latter the decomposed 
>> normal form (NFD). (And yes, it is even much more complicated than that, it 
>> goes on for 1000s of pages).
>> 
>> In the above example, the concept of character/string is indeed fuzzy.
>> 
>> HTH,
>> 
>> Sven
>> 
>> 
>
