Re: [Pharo-dev] Unicode Support

Guillermo Polito Wed, 09 Dec 2015 01:36:48 -0800

> On 8 Dec 2015, at 10:07 p.m., EuanM <euan...@gmail.com> wrote:
> 
> "No. a codepoint is the numerical value assigned to a character. An
> "encoded character" is the way a codepoint is represented in bytes
> using a given encoding."
> 
> No.
> 
> A codepoint may represent a component part of an abstract character,
> or may represent an abstract character, or it may do both (but not
> always at the same time).
> 
> Codepoints represent a single encoding of a single concept.
> 
> Sometimes that concept represents a whole abstract character.
> Sometimes it represents part of an abstract character.


Well. I do not agree with this. I agree with the quote.
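
To make the distinction in the quote concrete, here is a minimal Python
sketch (Python rather than Pharo only because its codecs are easy to check;
the helper name encoded_forms is made up for illustration). The code point is
a single number, while the bytes differ per encoding. The same holds for
U+FEFF: its bytes only swap order once it is encoded as UTF-16.

```python
def encoded_forms(ch):
    """Return the code point of ch and its bytes (as hex) in a few encodings.

    encoded_forms is a hypothetical helper for illustration only.
    """
    return {
        "code_point": ord(ch),                       # the abstract number
        "utf-8": ch.encode("utf-8").hex(),           # encoded form, UTF-8
        "utf-16-be": ch.encode("utf-16-be").hex(),   # big-endian, no BOM added
        "utf-16-le": ch.encode("utf-16-le").hex(),   # little-endian, no BOM added
    }


e_acute = encoded_forms("\u00e9")   # LATIN SMALL LETTER E WITH ACUTE
bom = encoded_forms("\ufeff")       # the character used as byte order mark

for name, forms in (("U+00E9", e_acute), ("U+FEFF", bom)):
    print(name, forms)
```

In both cases the code point never changes; only the encoded bytes do.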

Can you explain a bit more about what you mean by abstract character and 
concept?

> 
> This is the key difference between Unicode and most character encodings.
> 
> A codepoint does not always represent a whole character.
> 
> On 7 December 2015 at 13:06, Henrik Johansen
> <henrik.s.johan...@veloxit.no> wrote:
>> 
>> On 07 Dec 2015, at 1:05 , EuanM <euan...@gmail.com> wrote:
>> 
>> Hi Henry,
>> 
>> To be honest, at some point I'm going to long for the much
>> more succinct semantics of healthcare systems and sports scoring and
>> administration systems again.  :-)
>> 
>> codepoints are any of *either*
>> - the representation of a component of an abstract character, *or*
>> eg. "A" #(0041) as a component of
>> - the sole representation of the whole of an abstract character *or* of
>> -  a representation of an abstract character provided for backwards
>> compatibility which is more properly represented by a series of
>> codepoints representing a composed character
>> 
>> e.g.
>> 
>> The "A" #(0041) as a codepoint can be:
>> the sole representation of the whole of an abstract character "A" #(0041)
>> 
>> The representation of a component of the composed (i.e. preferred)
>> version of the abstract character Å #(0041 030a)
>> 
>> Å (#00C5) represents one valid compatibility form of the abstract
>> character Å which is most properly represented by #(0041 030a).
>> 
>> Å (#212b) also represents one valid compatibility form of the abstract
>> character Å which is most properly represented by #(0041 030a).
>> 
>> With any luck, this satisfies both our semantic understandings of the
>> concept of "codepoint"
>> 
>> Would you agree with that?
>> 
>> In Unicode, codepoints are *NOT* an abstract numerical representation
>> of a text character.
>> 
>> At least not as we generally understand the term "text character" from
>> our experience of non-Unicode character mappings.
>> 
>> 
>> I agree, they are numerical representations of what Unicode refers to as
>> characters.
>> 
>> 
>> codepoints represent "*encoded characters*"
>> 
>> 
>> No. a codepoint is the numerical value assigned to a character. An "encoded
>> character" is the way a codepoint is represented in bytes using a given
>> encoding.
>> 
>> and "a *text element* ...
>> is represented by a sequence of one or more codepoints".  (And the
>> term "text element" is deliberately left undefined in the Unicode
>> standard)
>> 
>> Individual codepoints are very often *not* the encoded form of an
>> abstract character that we are interested in.  Unless we are
>> communicating to or from another system  (Which in some cases is the
>> Smalltalk ByteString class)
>> 
>> 
>> 
>> 
>> In other words:
>> 
>> *Some* individual codepoints *may* be a representation of a specific
>> *abstract character*, but only in special cases.
>> 
>> The general case in Unicode is that Unicode defines (a)
>> representation(s) of a Unicode *abstract character*.
>> 
>> The Unicode standard representation of an abstract character is a
>> composed sequence of codepoints, where in some cases that sequence is
>> as short as 1 codepoint.
>> 
>> In other cases, Unicode has a compatibility alias of a single
>> codepoint which is *also* a representation of an abstract character
>> 
>> There are some cases where an abstract character can be represented by
>> more than one single-codepoint compatibility codepoint.
>> 
>> Cheers,
>> Euan
>> 
>> 
>> I agree you have a good grasp of the distinction between an abstract
>> character (characters and character sequences which should be treated as
>> equivalent wrt. equality / sorting / display, etc.) and a character (each
>> of which has a code point assigned).
>> That is beside the point both Sven and I tried to get across, which is the
>> difference between a code point and the encoded form(s) of said code point.
>> When you write:
>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>> and as the composed character #(0065 00b4) (all in hex) and as the
>> same composed character as both
>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>> included"
>> 
>> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
>> character used as BOM.
>> When you state that it can be written ffef (I assume you meant FFFE), you
>> are again confusing the code point and its encoded value (an encoded value
>> which only occurs in UTF-16/32, no less).
>> 
>> When this distinction is clear, it might be easier to see the value in
>> keeping Strings as arrays of Unicode code points, converted to encoded
>> forms when entering/exiting the system.
>> 
>> Cheers,
>> Henry
>> 
>
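
For reference, a small Python sketch of the three spellings of Å discussed
above, using the standard unicodedata module. The helper names code_points
and canonically_equivalent are made up for illustration. It only shows what
normalization reports: NFC maps all three to U+00C5 and NFD maps all three to
U+0041 U+030A. It does not settle which form a String class should store.

```python
import unicodedata

# The three spellings of "Å" from the thread, written as code point sequences.
FORMS = {
    "A + combining ring above": "\u0041\u030a",
    "precomposed U+00C5": "\u00c5",
    "angstrom sign U+212B": "\u212b",
}


def code_points(s):
    """Hypothetical helper: list the code points of s as hex strings."""
    return [f"{ord(c):04X}" for c in s]


def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


for label, text in FORMS.items():
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    print(label, code_points(text), "NFC:", code_points(nfc), "NFD:", code_points(nfd))
```

Comparing the raw sequences with == gives False for any two of them, while
canonically_equivalent gives True, which is the practical reason to normalize
before comparing or sorting.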
