Re: [Pharo-dev] Unicode Support

EuanM Tue, 08 Dec 2015 13:09:12 -0800

"No. a codepoint is the numerical value assigned to a character. An
"encoded character" is the way a codepoint is represented in bytes
using a given encoding."


No.

A codepoint may represent a component part of an abstract character,
or may represent an abstract character, or it may do both (but not
always at the same time).

Codepoints represent a single encoding of a single concept.

Sometimes that concept represents a whole abstract character.
Sometimes it represents part of an abstract character.

This is the key difference between Unicode and most character encodings.

A codepoint does not always represent a whole character.
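
To make that concrete, here is a minimal Python sketch (Python rather
than Pharo only because its standard unicodedata module is convenient;
the variable and function names are my own). It takes the three
representations of Å discussed in the quoted text below and shows that
they differ as codepoint sequences, yet normalize to the same sequence.

```python
import unicodedata

# Three representations of the abstract character Å from this thread.
decomposed = "\u0041\u030a"   # "A" followed by COMBINING RING ABOVE
precomposed = "\u00c5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"           # ANGSTROM SIGN

forms = [decomposed, precomposed, angstrom]


def codepoints(text):
    """Return the codepoints of text as 4-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


# Number of codepoints in each representation.
lengths = [len(s) for s in forms]

# The raw codepoint sequences are all different from one another.
all_distinct = len(set(forms)) == 3

# Canonical normalization maps all three onto one sequence.
nfd_forms = {unicodedata.normalize("NFD", s) for s in forms}
nfc_forms = {unicodedata.normalize("NFC", s) for s in forms}

# U+030A is a combining mark: on its own it is only part of a character.
ring_is_combining = unicodedata.combining("\u030a") != 0

for s in forms:
    print(codepoints(s), "->",
          codepoints(unicodedata.normalize("NFD", s)))
```

Under NFD all three come out as #(0041 030A); under NFC all three come
out as the single codepoint #(00C5).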

On 7 December 2015 at 13:06, Henrik Johansen
<henrik.s.johan...@veloxit.no> wrote:
>
> On 07 Dec 2015, at 1:05 , EuanM <euan...@gmail.com> wrote:
>
> Hi Henry,
>
> To be honest, at some point I'm going to long for the much
> more succinct semantics of healthcare systems and sports scoring and
> administration systems again.  :-)
>
> codepoints are any of *either*
>  - the representation of a component of an abstract character
>    (e.g. "A" #(0041) as a component of Å #(0041 030a)), *or*
>  - the sole representation of the whole of an abstract character, *or*
>  - a representation of an abstract character provided for backwards
>    compatibility, which is more properly represented by a series of
>    codepoints representing a composed character
>
> e.g.
>
> The "A" #(0041) as a codepoint can be:
> the sole representation of the whole of an abstract character "A" #(0041)
>
> The representation of a component of the composed (i.e. preferred)
> version of the abstract character Å #(0041 030a)
>
> Å (#00C5) represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
>
> Å (#212b) also represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
>
> With any luck, this satisfies both our semantic understandings of the
> concept of "codepoint"
>
> Would you agree with that?
>
> In Unicode, codepoints are *NOT* an abstract numerical representation
> of a text character.
>
> At least not as we generally understand the term "text character" from
> our experience of non-Unicode character mappings.
>
>
> I agree, they are numerical representations of what Unicode refers to as
> characters.
>
>
> codepoints represent "*encoded characters*"
>
>
> No. a codepoint is the numerical value assigned to a character. An "encoded
> character" is the way a codepoint is represented in bytes using a given
> encoding.
>
> and "a *text element* ...
> is represented by a sequence of one or more codepoints".  (And the
> term "text element" is deliberately left undefined in the Unicode
> standard)
>
> Individual codepoints are very often *not* the encoded form of an
> abstract character that we are interested in.  Unless we are
> communicating to or from another system  (Which in some cases is the
> Smalltalk ByteString class)
>
>
>
>
> In other words:
>
> *Some* individual codepoints *may* be a representation of a specific
> *abstract character*, but only in special cases.
>
> The general case in Unicode is that Unicode defines (a)
> representation(s) of a Unicode *abstract character*.
>
> The Unicode standard representation of an abstract character is a
> composed sequence of codepoints, where in some cases that sequence is
> as short as 1 codepoint.
>
> In other cases, Unicode has a compatibility alias of a single
> codepoint which is *also* a representation of an abstract character
>
> There are some cases where an abstract character can be represented by
> more than one single-codepoint compatibility codepoint.
>
> Cheers,
>  Euan
>
>
> I agree you have a good grasp of the distinction between an abstract
> character (characters and character sequences which should be treated
> as equivalent wrt. equality / sorting / display, etc.) and a character
> (each of which has a code point assigned).
> That is beside the point both Sven and I tried to get across, which is the
> difference between a code point and the encoded form(s) of said code point.
> When you write:
> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
> and as the composed character #(0065 00b4) (all in hex) and as the
> same composed character as both
> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
> included"
>
> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
> character used as BOM.
> When you state that it can be written ffef (I assume you meant FFFE), you
> are again confusing the code point and its encoded value (an encoded value
> which only occurs in UTF16/32, no less).
>
> When this distinction is clear, it might be easier to see the value in
> keeping Strings as arrays of Unicode code points, and converting them to
> encoded forms when entering/exiting the system.
>
> Cheers,
> Henry
>
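
For reference, the distinction Henry draws above between a code point
and its encoded bytes can be sketched in a few lines of Python. This is
only an illustration using Python's built-in codecs; the variable names
are my own. It shows the single code point FEFF serialized under several
encodings, including the FFFE byte order he mentions.

```python
# The character used as byte order mark has exactly one code point.
bom = "\ufeff"
code_point = ord(bom)

# Its byte serialization depends on the encoding chosen.
utf16_be = bom.encode("utf-16-be")   # big-endian 16-bit code unit
utf16_le = bom.encode("utf-16-le")   # same code unit, bytes swapped
utf32_be = bom.encode("utf-32-be")   # big-endian 32-bit code unit
utf8 = bom.encode("utf-8")           # three-byte UTF-8 sequence

# Decoding any of them gives back the same code point.
round_trip = [
    ord(utf16_be.decode("utf-16-be")),
    ord(utf16_le.decode("utf-16-le")),
    ord(utf32_be.decode("utf-32-be")),
    ord(utf8.decode("utf-8")),
]

print(f"code point: {code_point:04X}")
for name, data in [("UTF-16BE", utf16_be), ("UTF-16LE", utf16_le),
                   ("UTF-32BE", utf32_be), ("UTF-8", utf8)]:
    print(name, data.hex(" "))
```

The bytes FF FE only appear as the little-endian UTF-16 serialization of
FEFF; the code point itself never changes.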
