Re: [Pharo-dev] Unicode Support

H. Hirzel Wed, 09 Dec 2015 04:06:09 -0800

See the example with the ANGSTROM SIGN on this page:

Abstract Characters (Unicode)
http://wiki.squeak.org/squeak/6256
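
For anyone who wants to try the ANGSTROM case without a Smalltalk image at hand, here is a minimal sketch in Python (my choice of language for illustration, not something used in this thread). It relies only on the standard `unicodedata` module; the variable names and the helper `hex_points` are made up for this example. It shows that the three spellings discussed below are distinct code point sequences, and that Unicode normalization maps them onto one sequence.

```python
import unicodedata

# Three spellings of the same abstract character (A with ring above).
decomposed = "\u0041\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"           # ANGSTROM SIGN

forms = [decomposed, precomposed, angstrom]


def hex_points(text):
    """Return the code points of text as four-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


# As raw code point sequences the three spellings are all different.
distinct_sequences = {tuple(hex_points(s)) for s in forms}

# After normalization, all three collapse to a single sequence per form.
nfc_forms = [unicodedata.normalize("NFC", s) for s in forms]
nfd_forms = [unicodedata.normalize("NFD", s) for s in forms]

for s, c, d in zip(forms, nfc_forms, nfd_forms):
    print(hex_points(s), "NFC:", hex_points(c), "NFD:", hex_points(d))
```

Which spelling counts as the canonical one depends on the normalization form chosen: NFC yields the single code point 00C5, while NFD yields the sequence 0041 030A.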




On 12/9/15, Guillermo Polito <guillermopol...@gmail.com> wrote:
>
>> On 8 dic 2015, at 10:07 p.m., EuanM <euan...@gmail.com> wrote:
>>
>> "No. A codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and
> concept?
>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen
>> <henrik.s.johan...@veloxit.no> wrote:
>>>
>>> On 07 Dec 2015, at 1:05 , EuanM <euan...@gmail.com> wrote:
>>>
>>> Hi Henry,
>>>
>>> To be honest, at some point I'm going to long for the much
>>> more succinct semantics of healthcare systems and sports scoring and
>>> administration systems again.  :-)
>>>
>>> codepoints are any of *either*
>>> - the representation of a component of an abstract character, *or*
>>> eg. "A" #(0041) as a component of
>>> - the sole representation of the whole of an abstract character, *or*
>>> - a representation of an abstract character provided for backwards
>>>   compatibility which is more properly represented by a series of
>>>   codepoints representing a composed character
>>>
>>> e.g.
>>>
>>> The "A" #(0041) as a codepoint can be:
>>> the sole representation of the whole of an abstract character "A"
>>> #(0041)
>>>
>>> The representation of a component of the composed (i.e. preferred)
>>> version of the abstract character Å #(0041 030a)
>>>
>>> Å #(00c5) represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> Å #(212b) also represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> With any luck, this satisfies both our semantic understandings of the
>>> concept of "codepoint"
>>>
>>> Would you agree with that?
>>>
>>> In Unicode, codepoints are *NOT* an abstract numerical representation
>>> of a text character.
>>>
>>> At least not as we generally understand the term "text character" from
>>> our experience of non-Unicode character mappings.
>>>
>>>
>>> I agree, they are numerical representations of what Unicode refers to as
>>> characters.
>>>
>>>
>>> codepoints represent "*encoded characters*"
>>>
>>>
>>> No. A codepoint is the numerical value assigned to a character. An
>>> "encoded character" is the way a codepoint is represented in bytes
>>> using a given encoding.
>>>
>>> and "a *text element* ...
>>> is represented by a sequence of one or more codepoints".  (And the
>>> term "text element" is deliberately left undefined in the Unicode
>>> standard)
>>>
>>> Individual codepoints are very often *not* the encoded form of an
>>> abstract character that we are interested in, unless we are
>>> communicating to or from another system (which in some cases is the
>>> Smalltalk ByteString class).
>>>
>>>
>>>
>>>
>>> In other words:
>>>
>>> *Some* individual codepoints *may* be a representation of a specific
>>> *abstract character*, but only in special cases.
>>>
>>> The general case in Unicode is that Unicode defines (a)
>>> representation(s) of a Unicode *abstract character*.
>>>
>>> The Unicode standard representation of an abstract character is a
>>> composed sequence of codepoints, where in some cases that sequence is
>>> as short as 1 codepoint.
>>>
>>> In other cases, Unicode has a compatibility alias of a single
>>> codepoint which is *also* a representation of an abstract character
>>>
>>> There are some cases where an abstract character can be represented by
>>> more than one single-codepoint compatibility codepoint.
>>>
>>> Cheers,
>>> Euan
>>>
>>>
>>> I agree you have a good grasp of the distinction between an abstract
>>> character (characters and character sequences which should be treated
>>> as equivalent wrt. equality / sorting / display, etc.) and a character
>>> (each of which has a code point assigned).
>>> That is beside the point both Sven and I tried to get across, which is
>>> the difference between a code point and the encoded form(s) of said
>>> code point.
>>> When you write:
>>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>>> and as the composed character #(0065 00b4) (all in hex) and as the
>>> same composed character as both
>>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>>> included"
>>>
>>> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
>>> character used as BOM.
>>> When you state that it can be written ffef (I assume you meant FFFE),
>>> you are again confusing the code point and its encoded value (an
>>> encoded value which only occurs in UTF16/32, no less).
>>>
>>> When this distinction is clear, it might be easier to see the value in
>>> Strings being kept as arrays of Unicode code points, and converted to
>>> encoded forms when entering/exiting the system.
>>>
>>> Cheers,
>>> Henry
>>>
>>
>
>
>
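
To make Henry's distinction between a code point and its encoded forms concrete, here is a small sketch, again in Python and again only an illustration under my own assumptions (the helper name `encoded_forms` is made up). It encodes a single code point with three standard codecs and shows that the code point stays the same while the byte sequences differ. In particular, FFFE never appears as a code point here, only as the byte order of U+FEFF in little-endian UTF-16.

```python
def encoded_forms(text):
    """Return the bytes of text, as hex strings, under a few encodings."""
    return {
        "utf-8": text.encode("utf-8").hex(),
        "utf-16-be": text.encode("utf-16-be").hex(),
        "utf-16-le": text.encode("utf-16-le").hex(),
    }


bom = "\ufeff"      # the character used as byte order mark, code point U+FEFF
e_acute = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

for ch in (bom, e_acute):
    # The code point is one number; each encoding gives different bytes.
    print(f"U+{ord(ch):04X}", encoded_forms(ch))
```

Note that the UTF-8 form of U+00E9 is the two bytes c3 a9; the single value e9 is the code point (and the Latin-1 byte), not the UTF-8 encoding.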
