See example with ANGSTROM Abstract Characters (Unicode) http://wiki.squeak.org/squeak/6256
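
For readers without access to the wiki page, here is a minimal Python sketch of the ANGSTROM case discussed in the quoted thread. It relies only on the standard `unicodedata` module. The helper names `code_points` and `same_abstract_character` are made up for illustration, and comparing NFC forms is just one reasonable way to test canonical equivalence, not something prescribed in this thread.

```python
import unicodedata

# Three code point sequences that a reader sees as the same letter.
forms = {
    "composed": "\u00C5",          # LATIN CAPITAL LETTER A WITH RING ABOVE
    "angstrom_sign": "\u212B",     # ANGSTROM SIGN
    "decomposed": "\u0041\u030A",  # A followed by COMBINING RING ABOVE
}


def code_points(text):
    """Return the code points of text as four-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


def same_abstract_character(a, b):
    """Hypothetical helper: treat two strings as equal if their NFC forms match."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for name, text in forms.items():
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    print(name, code_points(text), "NFC:", code_points(nfc), "NFD:", code_points(nfd))

# The raw code point sequences differ, but they are canonically equivalent.
print(forms["composed"] == forms["angstrom_sign"])                         # False
print(same_abstract_character(forms["composed"], forms["angstrom_sign"]))  # True
```

All three inputs normalize to 00C5 under NFC and to 0041 030A under NFD, which is why a plain code point comparison is not enough to decide whether two strings show the same text.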
On 12/9/15, Guillermo Polito <guillermopol...@gmail.com> wrote:
>
>> On 8 Dec 2015, at 10:07 p.m., EuanM <euan...@gmail.com> wrote:
>>
>> "No. a codepoint is the numerical value assigned to a character. An
>> "encoded character" is the way a codepoint is represented in bytes
>> using a given encoding."
>>
>> No.
>>
>> A codepoint may represent a component part of an abstract character,
>> or may represent an abstract character, or it may do both (but not
>> always at the same time).
>>
>> Codepoints represent a single encoding of a single concept.
>>
>> Sometimes that concept represents a whole abstract character.
>> Sometimes it represents part of an abstract character.
>
> Well. I do not agree with this. I agree with the quote.
>
> Can you explain a bit more about what you mean by abstract character and
> concept?
>
>>
>> This is the key difference between Unicode and most character encodings.
>>
>> A codepoint does not always represent a whole character.
>>
>> On 7 December 2015 at 13:06, Henrik Johansen
>> <henrik.s.johan...@veloxit.no> wrote:
>>>
>>> On 07 Dec 2015, at 1:05 , EuanM <euan...@gmail.com> wrote:
>>>
>>> Hi Henry,
>>>
>>> To be honest, at some point I'm going to long for the much
>>> more succinct semantics of healthcare systems and sports scoring and
>>> administration systems again. :-)
>>>
>>> codepoints are any of *either*
>>> - the representation of a component of an abstract character, *or*
>>>   eg. "A" #(0041) as a component of
>>> - the sole representation of the whole of an abstract character *or* of
>>> - a representation of an abstract character provided for backwards
>>>   compatibility which is more properly represented by a series of
>>>   codepoints representing a composed character
>>>
>>> e.g.
>>>
>>> The "A" #(0041) as a codepoint can be:
>>> the sole representation of the whole of an abstract character "A"
>>> #(0041)
>>>
>>> The representation of a component of the composed (i.e. preferred)
>>> version of the abstract character Å #(0041 030a)
>>>
>>> Å (#00C5) represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> Å (#212b) also represents one valid compatibility form of the abstract
>>> character Å which is most properly represented by #(0041 030a).
>>>
>>> With any luck, this satisfies both our semantic understandings of the
>>> concept of "codepoint"
>>>
>>> Would you agree with that?
>>>
>>> In Unicode, codepoints are *NOT* an abstract numerical representation
>>> of a text character.
>>>
>>> At least not as we generally understand the term "text character" from
>>> our experience of non-Unicode character mappings.
>>>
>>>
>>> I agree, they are numerical representations of what Unicode refers to as
>>> characters.
>>>
>>>
>>> codepoints represent "*encoded characters*"
>>>
>>>
>>> No. a codepoint is the numerical value assigned to a character. An
>>> "encoded character" is the way a codepoint is represented in bytes
>>> using a given encoding.
>>>
>>> and "a *text element* ...
>>> is represented by a sequence of one or more codepoints". (And the
>>> term "text element" is deliberately left undefined in the Unicode
>>> standard)
>>>
>>> Individual codepoints are very often *not* the encoded form of an
>>> abstract character that we are interested in. Unless we are
>>> communicating to or from another system (Which in some cases is the
>>> Smalltalk ByteString class)
>>>
>>>
>>> i.e. in other words
>>>
>>> *Some* individual codepoints *may* be a representation of a specific
>>> *abstract character*, but only in special cases.
>>>
>>> The general case in Unicode is that Unicode defines (a)
>>> representation(s) of a Unicode *abstract character*.
>>>
>>> The Unicode standard representation of an abstract character is a
>>> composed sequence of codepoints, where in some cases that sequence is
>>> as short as 1 codepoint.
>>>
>>> In other cases, Unicode has a compatibility alias of a single
>>> codepoint which is *also* a representation of an abstract character
>>>
>>> There are some cases where an abstract character can be represented by
>>> more than one single-codepoint compatibility codepoint.
>>>
>>> Cheers,
>>> Euan
>>>
>>>
>>> I agree you have a good grasp of the distinction between an abstract
>>> character (characters and character sequences which should be treated
>>> equivalent w.r.t. equality / sorting / display, etc.) and a character
>>> (which each have a code point assigned).
>>> That is beside the point both Sven and I tried to get through, which is
>>> the difference between a code point and the encoded form(s) of said code
>>> point.
>>> When you write:
>>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>>> and as the composed character #(0065 00b4) (all in hex) and as the
>>> same composed character as both
>>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>>> included"
>>>
>>> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
>>> character used as bom.
>>> When you state that it can be written ffef (I assume you meant FFFE),
>>> you are again confusing the code point and its encoded value (an encoded
>>> value which only occurs in UTF16/32, no less).
>>>
>>> When this distinction is clear, it might be easier to see the value in
>>> Strings being kept as arrays of Unicode code points, and converted to
>>> encoded forms when entering/exiting the system.
>>>
>>> Cheers,
>>> Henry
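
The distinction Henry draws between a code point and its encoded forms can be sketched in a few lines of Python. This is an illustration only, using the standard `str.encode` / `bytes.decode` codecs. The helper `hex_bytes` is a made-up name, and é (U+00E9) is chosen simply because it is the character mentioned in the quoted passage.

```python
import codecs

BOM = "\uFEFF"      # a single code point, U+FEFF, whatever the encoding
E_ACUTE = "\u00E9"  # é as a single code point, U+00E9


def hex_bytes(data):
    """Hypothetical helper: show a bytes object as space-separated hex."""
    return data.hex(" ")


# One code point, several encoded forms.
print(hex_bytes(E_ACUTE.encode("utf-8")))      # c3 a9
print(hex_bytes(E_ACUTE.encode("utf-16-be")))  # 00 e9
print(hex_bytes(E_ACUTE.encode("utf-16-le")))  # e9 00

# The BOM is always code point U+FEFF; only its bytes change with the encoding.
print(hex_bytes(BOM.encode("utf-16-be")))  # fe ff
print(hex_bytes(BOM.encode("utf-16-le")))  # ff fe
print(hex_bytes(BOM.encode("utf-8")))      # ef bb bf

# Decoding at the system boundary yields code points again; the generic
# "utf-16" codec uses the leading BOM to pick the byte order and drops it.
encoded = codecs.BOM_UTF16_LE + E_ACUTE.encode("utf-16-le")
decoded = encoded.decode("utf-16")
print([f"{ord(ch):04X}" for ch in decoded])  # ['00E9']
```

The byte pair ff fe is therefore the little-endian UTF-16 encoding of U+FEFF, not a second code point, which matches the point about keeping strings as code point arrays internally and encoding only on the way in or out.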