"No. A codepoint is the numerical value assigned to a character. An "encoded character" is the way a codepoint is represented in bytes using a given encoding."
No. A codepoint may represent a component part of an abstract
character, or may represent an abstract character, or it may do both
(but not always at the same time).

Codepoints represent a single encoding of a single concept.

Sometimes that concept represents a whole abstract character.
Sometimes it represents part of an abstract character.

This is the key difference between Unicode and most character encodings.

A codepoint does not always represent a whole character.

On 7 December 2015 at 13:06, Henrik Johansen
<henrik.s.johan...@veloxit.no> wrote:
>
> On 07 Dec 2015, at 1:05 , EuanM <euan...@gmail.com> wrote:
>
> Hi Henry,
>
> To be honest, at some point I'm going to long for the much
> more succinct semantics of healthcare systems and sports scoring and
> administration systems again. :-)
>
> codepoints are any of *either*
> - the representation of a component of an abstract character, *or*
> eg. "A" #(0041) as a component of
> - the sole representation of the whole of an abstract character *or* of
> - a representation of an abstract character provided for backwards
> compatibility which is more properly represented by a series of
> codepoints representing a composed character
>
> e.g.
>
> The "A" #(0041) as a codepoint can be:
> the sole representation of the whole of an abstract character "A" #(0041)
>
> The representation of a component of the composed (i.e. preferred)
> version of the abstract character Å #(0041 030a)
>
> Å (#00C5) represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
>
> Å (#212b) also represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
>
> With any luck, this satisfies both our semantic understandings of the
> concept of "codepoint"
>
> Would you agree with that?
>
> In Unicode, codepoints are *NOT* an abstract numerical representation
> of a text character.
>
> At least not as we generally understand the term "text character" from
> our experience of non-Unicode character mappings.
>
>
> I agree, they are numerical representations of what Unicode refers to as
> characters.
>
>
> codepoints represent "*encoded characters*"
>
>
> No. A codepoint is the numerical value assigned to a character. An "encoded
> character" is the way a codepoint is represented in bytes using a given
> encoding.
>
> and "a *text element* ...
> is represented by a sequence of one or more codepoints". (And the
> term "text element" is deliberately left undefined in the Unicode
> standard)
>
> Individual codepoints are very often *not* the encoded form of an
> abstract character that we are interested in. Unless we are
> communicating to or from another system (which in some cases is the
> Smalltalk ByteString class)
>
>
>
>
> i.e. in other words
>
> *Some* individual codepoints *may* be a representation of a specific
> *abstract character*, but only in special cases.
>
> The general case in Unicode is that Unicode defines (a)
> representation(s) of a Unicode *abstract character*.
>
> The Unicode standard representation of an abstract character is a
> composed sequence of codepoints, where in some cases that sequence is
> as short as 1 codepoint.
>
> In other cases, Unicode has a compatibility alias of a single
> codepoint which is *also* a representation of an abstract character
>
> There are some cases where an abstract character can be represented by
> more than one single-codepoint compatibility codepoint.
>
> Cheers,
> Euan
>
>
> I agree you have a good grasp of the distinction between an abstract
> character (characters and character sequences which should be treated as
> equivalent wrt. equality / sorting / display, etc.) and a character (each
> of which has a code point assigned).
> That is beside the point both Sven and I tried to get through, which is the
> difference between a code point and the encoded form(s) of said code point.
> When you write:
> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
> and as the composed character #(0065 00b4) (all in hex) and as the
> same composed character as both
> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
> included"
>
> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
> character used as BOM.
> When you state that it can be written ffef (I assume you meant FFFE), you
> are again confusing the code point and its encoded value (an encoded value
> which only occurs in UTF-16/32, no less).
>
> When this distinction is clear, it might be easier to see the value in
> keeping Strings as arrays of Unicode code points, and converting them to
> encoded forms when entering/exiting the system.
>
> Cheers,
> Henry
>
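
For anyone following the thread, the two distinctions being argued over (code point vs. encoded form, and one abstract character vs. several code point sequences) can be poked at directly. This is only a rough sketch in Python rather than Smalltalk, using the standard `unicodedata` module; the helper name `code_points` and the variable names are made up for illustration and are not from any message in the thread.

```python
import unicodedata

def code_points(text):
    """Return the code points of a string as 4-digit hex (hypothetical helper)."""
    return [f"{ord(ch):04X}" for ch in text]

# 1. One code point, several encoded forms.
# U+FEFF is the code point; the bytes depend on the encoding chosen.
bom = "\ufeff"
encoded_forms = {
    "utf-8": bom.encode("utf-8"),
    "utf-16-be": bom.encode("utf-16-be"),
    "utf-16-le": bom.encode("utf-16-le"),
}
for name, raw in encoded_forms.items():
    print(name, raw.hex(" "))

# 2. One abstract character, several code point sequences.
sequences = {
    "A + combining ring": "\u0041\u030a",
    "A with ring above": "\u00c5",
    "angstrom sign": "\u212b",
}
for label, text in sequences.items():
    print(label, code_points(text), text.encode("utf-8").hex(" "))

# Normalization maps all three sequences to a single form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in sequences.values()}
nfd_forms = {unicodedata.normalize("NFD", s) for s in sequences.values()}
print("NFC:", [code_points(s) for s in nfc_forms])
print("NFD:", [code_points(s) for s in nfd_forms])
```

The byte sequence FF FE only shows up as the little-endian UTF-16 encoding of the code point U+FEFF; it is not a different code point. Likewise, the three spellings of Å compare unequal as raw code point arrays, and only become equal after normalizing both sides to the same form.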