> On 09 Dec 2015, at 14:16, EuanM <euan...@gmail.com> wrote:
>
> "To encode Unicode for external representation as bytes, we use UTF-8
> like the rest of the modern world.
>
> So far, so good.
>
> Why all the confusion ?"

That was a rhetorical question.

I know that we lack normalization; we don't need another encoding or
representation. Sorting/collation can also be done regardless of
encoding or representation. These concerns are orthogonal to the
working situation that we have today.

> The confusion arises because simply providing *a* valid UTF-8 encoding
> does not ensure sortability, nor equivalence testability.
>
> It might provide sortable strings. It might not.
>
> It might provide a string that can be compared to another string
> successfully. It might not.
>
> So being able to perform valid UTF-8 encoding is *necessary*, but *not
> sufficient*.
>
> i.e. the confusion arises because UTF-8 can provide for several
> competing, non-sortable encodings of even a single character. This
> means that *valid* UTF-8 cannot be relied upon to provide these
> facilities *unless* all the UTF-8 strings can be relied upon to have
> been encoded to UTF-8 by the same specification of process. i.e.
> *unless* it has gone through a process of being converted by *a
> specific* valid method of encoding to UTF-8.
>
> Understanding the concept of abstract character is, imo, key to
> understanding the differences between the various valid UTF-8 forms of
> a given abstract character.
>
>
> Cheers,
> Euan
>
> On 9 December 2015 at 10:45, Sven Van Caekenberghe <s...@stfx.eu> wrote:
>>
>>> On 09 Dec 2015, at 10:35, Guillermo Polito <guillermopol...@gmail.com>
>>> wrote:
>>>
>>>
>>>> On 8 Dec 2015, at 10:07 p.m., EuanM <euan...@gmail.com> wrote:
>>>>
>>>> "No. a codepoint is the numerical value assigned to a character. An
>>>> "encoded character" is the way a codepoint is represented in bytes
>>>> using a given encoding."
>>>>
>>>> No.
>>>>
>>>> A codepoint may represent a component part of an abstract character,
>>>> or may represent an abstract character, or it may do both (but not
>>>> always at the same time).
>>>>
>>>> Codepoints represent a single encoding of a single concept.
>>>>
>>>> Sometimes that concept represents a whole abstract character.
>>>> Sometimes it represents part of an abstract character.
>>>
>>> Well. I do not agree with this. I agree with the quote.
>>>
>>> Can you explain a bit more about what you mean by abstract character and
>>> concept?
>>
>> I am pretty sure that this whole discussion does more harm than good for
>> most people's understanding of Unicode.
>>
>> It is best and (mostly) correct to think of a Unicode string as a sequence
>> of Unicode characters, each defined/identified by a code point (out of
>> 10,000s covering all languages). That is what we have today in Pharo (with
>> the distinction between ByteString and WideString as mostly invisible
>> implementation details).
>>
>> To encode Unicode for external representation as bytes, we use UTF-8 like
>> the rest of the modern world.
>>
>> So far, so good.
>>
>> Why all the confusion ? Because the world is a complex place and the Unicode
>> standard tries to cover all possible things. Citing all these exceptions and
>> special cases will make people crazy and give up. I am sure that most
>> stopped reading this thread.
>>
>> Why then is there confusion about the seemingly simple concept of a
>> character ? Because Unicode allows different ways to say the same thing. The
>> simplest example in a common language (the French letter é) is
>>
>> LATIN SMALL LETTER E WITH ACUTE [U+00E9]
>>
>> which can also be written as
>>
>> LATIN SMALL LETTER E [U+0065] followed by COMBINING ACUTE ACCENT [U+0301]
>>
>> The former being a composed normal form, the latter a decomposed normal
>> form. (And yes, it is even much more complicated than that, it goes on for
>> 1000s of pages).
>>
>> In the above example, the concept of character/string is indeed fuzzy.
>>
>> HTH,
>>
>> Sven
>>
>>
>
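
For anyone who wants to see the é example from the quoted message in
concrete terms, here is a minimal sketch. It is written in Python purely
for illustration (an assumption on my part, it is not Pharo code), using
the standard unicodedata module; the helper name utf8_hex is made up for
this example. It shows that the two spellings are both valid UTF-8 yet
differ byte for byte, and that they only compare equal once both have
been brought to the same normalization form.

```python
import unicodedata

# The two spellings of the French letter é from the example above.
composed = "\u00e9"          # LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def utf8_hex(s):
    """Hypothetical helper: the UTF-8 bytes of s as space-separated hex."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))


# Both are valid UTF-8, but the byte sequences differ.
print(utf8_hex(composed))     # c3 a9
print(utf8_hex(decomposed))   # 65 cc 81

# A plain comparison sees two different strings.
print(composed == decomposed)  # False

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
print(nfc_equal)               # True
```

So the UTF-8 encoding step is the same in both cases; the difference is
already there at the code point level, which is why normalization is a
separate concern from encoding.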