On Wed, Dec 9, 2015 at 5:35 PM, Guillermo Polito
<guillermopol...@gmail.com> wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM <euan...@gmail.com> wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and 
> concept?

This seems to be what Swift is doing, where Strings are composed not
of codepoints but of graphemes.

>>> "Every instance of Swift’s Character type represents a single extended 
>>> grapheme cluster. An extended grapheme cluster is a sequence** of one or 
>>> more Unicode scalars that (when combined) produce a single human-readable 
>>> character. [1]

** i.e. not an array

>>> Here’s an example. The letter é can be represented as the single Unicode 
>>> scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same 
>>> letter can also be represented as a pair of scalars—a standard letter e 
>>> (LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE ACCENT 
>>> scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically applied to 
>>> the scalar that precedes it, turning an e into an é when it is rendered by a 
>>> Unicode-aware text-rendering system. [1]

>>> In both cases, the letter é is represented as a single Swift Character 
>>> value that represents an extended grapheme cluster. In the first case, the 
>>> cluster contains a single scalar; in the second case, it is a cluster of 
>>> two scalars:" [1]
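To make that concrete outside Swift, here's a minimal sketch in Python
(my own illustration, not from the Swift docs; Python chosen only
because its unicodedata module is handy) showing that the two
representations of é differ at the codepoint level yet are canonically
equivalent:

```python
import unicodedata

precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE (U+00E9)
decomposed = "e\u0301"    # LATIN SMALL LETTER E (U+0065) + COMBINING ACUTE ACCENT (U+0301)

print(len(precomposed))   # 1 codepoint
print(len(decomposed))    # 2 codepoints
print(precomposed == decomposed)  # False: the codepoint sequences differ
# NFC normalization composes the pair back into the single scalar:
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Both render identically, but a codepoint-based String sees them as
different lengths and unequal -- exactly the trap a grapheme-based
Character avoids.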

>>> Swiftʼs string implementation makes working with Unicode easier and 
>>> significantly less error-prone. As a programmer, you still have to be aware 
>>> of possible edge cases, but this probably cannot be avoided completely 
>>> considering the characteristics of Unicode. [2]

Indeed, I've searched for what problems it causes and got a null
result.  So far I've read  *all* good  things about Swift's Unicode
implementation reducing common errors when dealing with Unicode.  Can
anyone point to complaints about Swift's Unicode implementation?
Maybe this...

>>> An argument could be made that the implementation of String as a sequence 
>>> that requires iterating over characters from the beginning of the string 
>>> for many operations poses a significant performance problem but I do not 
>>> think so. My guess is that Appleʼs engineers have considered the 
>>> implications of their implementation and apps that do not deal with 
>>> enormous amounts of text will be fine. Moreover, the idea that you could 
>>> get away with an implementation that supports random access of characters 
>>> is an illusion given the complexity of Unicode. [2]

Considering our common pattern: Make it work, Make it right, Make it
fast  -- maybe Strings as arrays are a premature optimisation that was
the right choice in the past, prior to Unicode, but considering
Moore's Law versus programmer time, is not the best choice now.
Should we at least start with a UnicodeString and UnicodeCharacter
that operate like Swift's, and over time *maybe* move the tools to use
them?
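If such a UnicodeString iterated by grapheme, a first approximation
(a toy sketch in Python, NOT the full UAX #29 segmentation rules;
the function name is hypothetical) might just group each base
character with the combining marks that follow it:

```python
import unicodedata

def simple_graphemes(s):
    """Toy grapheme grouping: attach combining marks to the preceding
    base character. Real segmentation needs the full UAX #29 rules
    (ZWJ sequences, regional indicators, Hangul jamo, ...)."""
    clusters = []
    for ch in s:
        # combining() > 0 means ch is a combining mark
        if clusters and unicodedata.combining(ch) > 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

print(simple_graphemes("e\u0301clair"))  # 6 clusters, first is "e\u0301"
```

Even this naive version shows the shape of the API: the user counts
and indexes human-readable characters, and the cost of full
correctness stays hidden inside the String implementation.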

[1] 
https://developer.apple.com/library/ios/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html
[2] http://oleb.net/blog/2014/07/swift-strings/

cheers -ben

>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen
