Re: Abstract character?
I disagree with Ken, but don't have time now to write a lengthy reply.. I'll try to get to that soon. Mark __ http://www.macchiato.com ◄ “Eppur si muove” ► - Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Cc: "Kenneth Whistler" <[EMAIL PROTECTED]> Sent: Tuesday, July 23, 2002 19:44 Subject: Re: Abstract character? > [snip]
Re: Abstract character?
I typo'd: > I suggest that UAX #18 be revised to > state this unambiguously. s/#18/#19/ -Doug
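The question running through this exchange, whether the surrogate code points D800..DFFF are legal in each UTF, can be probed against a present-day implementation. The following is a minimal sketch, assuming Python 3 and its built-in codecs; the helper name `strict_encode_fails` is made up for illustration, and what Python's codecs do is not a statement of what any version of the Unicode Standard says.

```python
# Two surrogate code points held as separate elements of a Python str.
lone = "\ud800\udc00"


def strict_encode_fails(text, codec):
    """Return True if the codec, in its default strict mode, refuses `text`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return True
    return False


# Strict mode: check each of the three UTFs in turn.
rejected = {
    codec: strict_encode_fails(lone, codec)
    for codec in ("utf-8", "utf-16-be", "utf-32-be")
}

# Opting out of the check with the "surrogatepass" error handler yields the
# byte sequence the thread calls "no longer legal" in UTF-8.
legacy_utf8 = lone.encode("utf-8", "surrogatepass")

print(rejected)
print(legacy_utf8.hex(" "))
```

In this implementation all three strict codecs reject the sequence, which matches Ken's position rather than the looser reading of UAX #19 that Doug describes.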
Re: Abstract character?
Kenneth Whistler wrote:

>> UTF-16 does not allow the representation of an unpaired surrogate
>> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
>> (It maps the two to U+10000.) Among the standard UTFs, only UTF-32
>> allows the two to be treated as unpaired surrogates.
>
> Actually, not that, either.
>
>> In fact, before UTF-8 was
>> "tightened up" in 3.2, the only UTF that DID NOT permit these two
>> coincidental unpaired surrogates was UTF-16.
>>
>> UTF-8: D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
>> UTF-32: D800 DC00 <==> D800 DC00
>
> This is ill-formed in UTF-32, and thereby, illegal.

I'm glad to hear that unpaired surrogates are now also illegal in UTF-32, and presumably also in UTF-16. However, I did do my homework before writing yesterday's post, and that wasn't the impression I got, so I sense another opportunity to tighten up the definitions before Unicode 4.0 is released.

In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular Sequences" starts out talking about "transformation formats such as UTF-8." However, the rest of the section deals exclusively with UTF-8; UTF-16 and UTF-32 are not mentioned.

UAX #19, "UTF-32" (written by Mark) is listed in the header block as having been updated to Unicode 3.2, but it does not state anywhere that unpaired surrogates are illegal. In particular, the following passages from UAX #19 led me to believe that all code points, from 0x0000 through 0x10FFFF inclusive, are legal in UTF-32:

"UTF-32 is restricted in values to the range 0..10FFFF₁₆, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML), and those representable by UTF-8 and UTF-16."

"(b) An illegal UTF-32 code unit sequence is any byte sequence that would correspond to a numeric value outside of the range 0 to 10FFFF₁₆.

"(c) An irregular UTF-32 code unit sequence is an eight-byte sequence where the first four bytes correspond to a high surrogate, and the next four bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-32 sequences shall not be generated by a conformant process."

I suggest that the Unicode 4.0 text specifically state, in unambiguous terms, which code points are and are not valid in UTF-8, UTF-16, and UTF-32. And if it is true that the surrogate code points 0xD800 through 0xDFFF are illegal in UTF-32, then I suggest that UAX #18 be revised to state this unambiguously.

-Doug Ewell
Fullerton, California
Re: Abstract character?
Lars Marius Garshol followed up:

> Hmmm. OK. So combining diacritics are also abstract characters?

Yes, clearly. Each encoded character in the Unicode CCS, ipso facto, associates an abstract character with a code point. So U+0300 COMBINING GRAVE ACCENT associates the code point U+0300 with an abstract character {grave accent mark that attaches above a base form}. We have agreed to associate, further, a normative name COMBINING GRAVE ACCENT with that encoded character, to facilitate transcoding U+0300 COMBINING GRAVE ACCENT with any other CCS which might include the same abstract character.

> (I was also unclear on ZWNJ and similar things, but you explicitly
> mention that above, so...)

Yep, they're all abstract characters, by the nature of the beast.

> | (Note above -- abstract characters are also a concept which applies
> | to other character encodings besides the Unicode Standard, and not
> | all encoded characters in other character encodings automatically
> | make it into the Unicode Standard, for various architectural
> | reasons.)
>
> Right. So VIQR, for example, also has abstract characters, then?

Yes, I think the character encoding model is broad enough to apply to any CCS, not just the Unicode Standard. That's basically why we can transcode between Unicode and legacy standards.

> However, it does raise a new problem. Isn't the definition of 'string'
> in the XPath specification then wrong?
>
>   Strings consist of a sequence of zero or more characters, where a
>   character is defined as in the XML Recommendation [XML]. A single
>   character in XPath thus corresponds to a single Unicode abstract
>   character with a single corresponding Unicode scalar value (see
>   [Unicode]); [...]
>   http://www.w3.org/TR/xpath#strings
>
> As far as I can tell, one of these two claims must be wrong. That is,
> either a single XPath character does not necessarily correspond to a
> single Unicode abstract character, or else a single XPath character
> need not correspond to a single scalar value.

No, I think it is correct, if somewhat convoluted. Basically, it is trying to say that each XPath character corresponds to a single Unicode encoded character. By my discussion in the previous note, a Unicode encoded character maps a (single) abstract character to a (single) code point [and because of the constraints of the UTF definitions, the only valid code points are the Unicode scalar values].

> Does that sound reasonable?

The trick for understanding Unicode -- once you have the basic encoding principle down -- is to realize that the ideal one-to-one mapping situation is violated in practice for a variety of reasons. In the clarification about abstract characters that I quoted earlier in this thread, I abridged further discussion about all the edge cases. For those strong of stomach, read on.

--Ken

formulation of CCS and exceptions

What is the CCS?

The CCS is a function f. The domain X of the function is a repertoire of abstract characters. The codomain Y of the function is the codespace. For each x [abstract character in the repertoire], f(x) [a code point in the codespace] is assigned to x.

Ideally, the CCS is one-to-one, since two different abstract characters [x] cannot be mapped to the same code point [f(x)], and two different code points are not assigned to the same abstract character. The CCS is not onto, since there are many code points which are not assigned to any abstract character.

In other words, conceptually a CCS is an injection. What we keep doing is expanding the domain (but not the codomain) of the injection. (I.e., we keep adding encoded characters, but don't expand the codespace.)

In actuality, the Unicode CCS is messier. As we know, it is not really one-to-one in practice:

a. There are instances where two different abstract characters are mapped to the same code point. These are the "false unifications", and usually engender some more-or-less vociferous discussion about requiring a disunification. (Cf. U+0192, which unified f-hook with the florin sign.) There are also legacy overunifications which result in intentionally ambiguous characters: U+002D hyphus, U+0027 apostroquote, U+0060 gravquote, and the like. False unifications engender controversy in inverse proportion to the length of time people have lived with their ambiguities, so Unicode tends to be more smothered by nitpicking about the examples which are introduced by Unicode itself.

b. There are instances where the same abstract character is mapped to more than one code point. These are the "false disunifications", and constitute the set of compatibility characters that we give singleton canonical mappings to
Re: Abstract character?
ral range for the code points, as demanded by some of the implementers.

> "code point" should be defined as an integer corresponding to an
> encoded character in any CCS, not just Unicode.

This doesn't really work, since it doesn't account for the unassigned (reserved) code points, nor the noncharacters. The Unicode architecture for its codespace is more complex than any other CCS, precisely because the encoding is more complex: only Unicode has three bijective encoding forms, and only Unicode has noncharacters. These need to be taken into account.

> The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO
> allowing them as code points (i.e. allowing any process to conformantly
> generate unpaired surrogates) is a really bad idea.

This I agree with.

> The set of code
> point sequences that are validly representable in each UTF should be
> identical (which ensures that mappings between UTFs are bijective and
> always succeed iff the input is valid in the source UTF).

I also think this is of paramount importance.

> I.e. U+D800..DFFF, like U+110000, should be undesignated and
> unrepresentable.

However, you can't go quite this far. As Markus pointed out, code points themselves may have properties -- even code points which cannot, in principle, be assigned to characters. And there are already existing APIs which handle these code points. Their function is clearly *designated* by the standard, normatively; that, however, is different from saying that an abstract character could ever be assigned to them.

--Ken
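The requirement both writers agree on, that every UTF can represent exactly the same set of code point sequences so that conversion between UTFs always succeeds on valid input, can be spot-checked. This is a rough sketch using Python 3's strict codecs as a stand-in for conformant UTF implementations (an assumption); `is_scalar_value` and `round_trips` are illustrative names, and the sample list is arbitrary.

```python
def is_scalar_value(cp):
    """True for 0..D7FF and E000..10FFFF, i.e. every code point but the surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def round_trips(cp):
    """True if the code point survives encode/decode in all three UTFs."""
    ch = chr(cp)
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        try:
            if ch.encode(codec).decode(codec) != ch:
                return False
        except UnicodeError:
            return False
    return True


# Boundary values around the surrogate range and the ends of the codespace.
samples = [0x0, 0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]
agree = all(round_trips(cp) == is_scalar_value(cp) for cp in samples)
print(agree)
```

On these samples the representable set is the same for all three codecs and coincides with the scalar values; noncharacters such as U+FFFF still round-trip, which fits Ken's point that they are designated rather than unrepresentable.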
Re: Abstract character?
So far, the Unicode Standard has defined code points to be from the contiguous range of 0..0x10FFFF. Some definitions are fuzzy in the standard, with hopes of clarification in Unicode 4.0.

It is true that UTF-16 cannot encode , but it can encode .

There are at least three reasons not to forbid the representation of surrogate code points in UTF-16 (and also UTF-32) or the code-pointed-ness of surrogates:

1. Compatibility. UTF-16 was explicitly created to be backwards compatible with UCS-2. Valid UCS-2 text must be valid UTF-16 text. In UCS-2, code points d800..dfff were legal, so they must be in UTF-16.

2. Performance. When you iterate through a UTF-16/32 string, you don't want to forbid surrogate code points because it adds complexity to your logic. In fact, iterating through UTF-16 text currently does not produce any decoding errors. When you go through <d800 0061 dc00 d800 dc01> you get code points d800, 0061, dc00, 10001. Similarly, you don't want to forbid appending d800 to a string because the application might deliberately append code units (and dc00 would follow), or the application might just be blind towards surrogates and pass code units through one by one (UCS-2 application) with reasonable hopes that a surrogate pair would be rejoined by default.

3. Properties. An API that takes a code point and returns a property for that code point must be able to deal with surrogate code points because there are non-trivial properties assigned to them, e.g., general category Cs. Surrogate code points have been listed in the UCD for a long time, which shows that they are different from illegal code point values like 0x110000 or -1.

markus
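The iteration described under "Performance" and the property lookup under "Properties" can be sketched as follows. This is an illustration under stated assumptions, not ICU's code: `code_points` is a made-up name, and the input is a code unit sequence chosen so that it yields the four code points listed in the message.

```python
import unicodedata


def code_points(units):
    """Yield code points from UTF-16 code units; unpaired surrogates pass through."""
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed pair: combine into one supplementary code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: return it unchanged, no error.
            yield unit
            i += 1


units = [0xD800, 0x0061, 0xDC00, 0xD800, 0xDC01]
result = [format(cp, "04x") for cp in code_points(units)]
print(result)  # ['d800', '0061', 'dc00', '10001']

# A property API has to answer for surrogate code points as well.
category = unicodedata.category(chr(0xD800))
print(category)  # Cs
```

The loop has no error branch at all, which is the simplification Markus is arguing for.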
Re: Abstract character?
On 07/22/2002 03:38:50 PM Kenneth Whistler wrote:

>Abstract character
>
>   that which is encoded; an element of the repertoire (existing
>   independent of the character encoding standard, and often
>   identifiable in other character encoding standards, as well
>   as the Unicode Standard); the implicit basis of transcodings.

[snip]

>> - do (Å) and (A followed by combining ring
>>   above) represent the same abstract character?
>
>Yes. That is the implicit claim behind a specification of canonical
>equivalence.

This brings to mind another question: what's the relationship between character sequences and abstract characters? Does < 0041, 030A > represent a single abstract character or a sequence of abstract characters? Ken's answer above suggests a single abstract character.

Actually, the question that's really bothering me is the next one. Moving one step further (perhaps you already guessed where I was going), what of < 1000, 102D, 102F >? Whether we consider it a single abstract character, or a sequence of abstract characters, the more important question to me is whether it is the same abstract character (sequence) as < 1000, 102F, 102D >. The only thing that makes sense is that they are the same abstract character sequences. But, they are not canonically equivalent!

Is the contrapositive to your statement true? I.e. is it true that lack of canonical equivalence implies a distinction in abstract character (sequences)?

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>
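Both halves of Peter's observation can be checked mechanically. The sketch below assumes Python 3's `unicodedata` module, whose data comes from a much later Unicode version than the 3.2 under discussion; `canonically_equivalent` is an illustrative helper that compares NFD forms.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# < 00C5 > versus < 0041, 030A >
a_ring_equivalent = canonically_equivalent("\u00c5", "A\u030a")

# < 1000, 102D, 102F > versus < 1000, 102F, 102D >
myanmar_equivalent = canonically_equivalent("\u1000\u102d\u102f", "\u1000\u102f\u102d")

# Canonical reordering only applies to marks with a non-zero combining class.
classes = [unicodedata.combining(c) for c in "\u102d\u102f"]

print(a_ring_equivalent, myanmar_equivalent, classes)
```

The two Myanmar vowel signs both have combining class 0, so normalization never reorders them. That is why the two orders stay distinct, whatever one decides about the abstract characters involved.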
Re: Abstract character?
Mark Davis wrote:

> A small correction to Ken's message:
>
> >The Unicode scalar value
> >definitionally excludes D800..DFFF, which are only code unit
> >values used in UTF-16, and which are not code points associated
> >with any well-formed UTF code unit sequences.
>
> The UTC in has decided to make scalar value mean unambiguously the
> code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points.

I think it would be a mistake for the standard to refer to "surrogate code points". The term "code point" is used for other CCS's where there may also be gaps in the code space; in that case, the gaps are not considered valid code points.

When 0xD800..0xDFFF are used in UTF-16, they are used as code units, not code points. As Unicode code points, 0xD800..0xDFFF are (or at least should be) invalid in the same sense that 0x110000 is.

I.e. IMHO "Unicode scalar value" and "Unicode code point" should be synonyms, with the set of valid values 0..0xD7FF, 0xE000..0x10FFFF. "code point" should be defined as an integer corresponding to an encoded character in any CCS, not just Unicode.

> While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

The integers 0xD800..0xDFFF are legal *as code units* in UTF-16. IMHO allowing them as code points (i.e. allowing any process to conformantly generate unpaired surrogates) is a really bad idea. The set of code point sequences that are validly representable in each UTF should be identical (which ensures that mappings between UTFs are bijective and always succeed iff the input is valid in the source UTF). I.e. U+D800..DFFF, like U+110000, should be undesignated and unrepresentable.

(As well as UTF-16, the definition of UTF-32 in UAX #19 does not specifically exclude 0xD800..0xDFFF, although the ISO 10646 definition does. In this case I think Unicode should be changed to be consistent with ISO 10646.)

> Ken is pushing for this change; I believe it would be a very bad idea.

What precisely do you think would be a bad idea?

-- 
David Hopwood <[EMAIL PROTECTED]>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
Re: Abstract character?
* Kenneth Whistler
|
| Abstract character
|
|    that which is encoded; an element of the repertoire (existing
|    independent of the character encoding standard, and often
|    identifiable in other character encoding standards, as well
|    as the Unicode Standard); the implicit basis of transcodings.
|
|    Note that while in some sense abstract characters exist a
|    priori by virtue of the nature of the units of various writing
|    systems, their exact nature is only pinned down at the point
|    that an actual encoding is done. They are not always obvious,
|    and many new abstract characters may arise as the result of
|    particular textual processing needs that can be addressed by
|    characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
|    etc., etc.)

This helps a little, but not all that much. I think spelling out the details of how the term relates to the other terms would help. The rest of the definitions were quite clear.

* Lars Marius Garshol
|
| - are all assigned Unicode characters also abstract characters?

* Kenneth Whistler
|
| Yes. Or rather: all encoded characters are assigned to abstract
| characters.

Hmmm. OK. So combining diacritics are also abstract characters? (I was also unclear on ZWNJ and similar things, but you explicitly mention that above, so...)

| (Note above -- abstract characters are also a concept which applies
| to other character encodings besides the Unicode Standard, and not
| all encoded characters in other character encodings automatically
| make it into the Unicode Standard, for various architectural
| reasons.)

Right. So VIQR, for example, also has abstract characters, then?

* Lars Marius Garshol
|
| - do (Å) and (A followed by combining ring
|   above) represent the same abstract character?

* Kenneth Whistler
|
| Yes. That is the implicit claim behind a specification of canonical
| equivalence.

Right. Then I think I've more-or-less got it. This helped a lot. Thank you!

However, it does raise a new problem. Isn't the definition of 'string' in the XPath specification then wrong?

  Strings consist of a sequence of zero or more characters, where a
  character is defined as in the XML Recommendation [XML]. A single
  character in XPath thus corresponds to a single Unicode abstract
  character with a single corresponding Unicode scalar value (see
  [Unicode]); [...]
  http://www.w3.org/TR/xpath#strings

As far as I can tell, one of these two claims must be wrong. That is, either a single XPath character does not necessarily correspond to a single Unicode abstract character, or else a single XPath character need not correspond to a single scalar value.

Does that sound reasonable?

-- 
Lars Marius Garshol, Ontopian http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC http://www.garshol.priv.no
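The tension Lars points to shows up as soon as the units are counted. Here is a minimal sketch that uses Python 3's `str`, which is a sequence of code points, as a stand-in for the XPath notion of a string of characters (an assumption; no XPath engine is involved).

```python
import unicodedata

single = "\u00c5"      # precomposed: one code point
sequence = "A\u030a"   # base letter followed by COMBINING RING ABOVE

# Counting code points gives different lengths for what the thread treats as
# the same abstract character.
lengths = (len(single), len(sequence))

# Each element of the longer string is itself an encoded character with a name.
names = [unicodedata.name(c) for c in sequence]

# Normalizing to NFC collapses the sequence to the precomposed form.
same_after_nfc = unicodedata.normalize("NFC", sequence) == single

print(lengths, names, same_after_nfc)
```

Counted this way, a string "character" lines up with a single code point (an encoded character), not with every abstract character a user might perceive. That is the reading Ken gives of the XPath wording.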
Re: Abstract character?
Mark Davis wrote:

> The UTC in has decided to make scalar value mean unambiguously the
> code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate
> code points. While surrogate code points cannot be represented in
> UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
> code points are illegal in all UTFs; notably, they are legal in
> UTF-16.

They are not legal in UTF-16 unless you believe that the two code points (0xD800, 0xDC00) are fundamentally equivalent to the single code point 0x10000 -- that is, unless you believe Unicode *is* UTF-16.

UTF-16 does not allow the representation of an unpaired surrogate 0xD800 followed by another, coincidental unpaired surrogate 0xDC00. (It maps the two to U+10000.) Among the standard UTFs, only UTF-32 allows the two to be treated as unpaired surrogates. In fact, before UTF-8 was "tightened up" in 3.2, the only UTF that DID NOT permit these two coincidental unpaired surrogates was UTF-16.

UTF-8: D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
UTF-32: D800 DC00 <==> D800 DC00

- but -

UTF-16: D800 DC00 ==> D800 DC00 ==> 10000

> Ken is pushing for this change; I believe it would be a very bad idea.
> (I think the reasons have already appeared on this list, so I am not
> trying to reopen the discussion; just state the current situation.)

I don't recall seeing the reasons conclusively discussed on this list; I'd be happy to hear them again. I've been complaining about the paragraph after D29 for two years now.

-Doug Ewell
Fullerton, California
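The two mappings in the table above follow from simple bit arithmetic. The sketch below is illustrative only: `combine_surrogates` and `three_byte_form` are hypothetical helper names, and `three_byte_form` deliberately applies the generic three-byte UTF-8 bit layout with no surrogate check, which is the pre-3.2 behaviour being described.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high surrogate followed by a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def three_byte_form(cp):
    """Three-byte UTF-8 bit layout for 0x0800..0xFFFF, with no surrogate check."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("outside the three-byte range")
    return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])


# UTF-16 reading: the two code units form one pair.
paired = combine_surrogates(0xD800, 0xDC00)

# Old UTF-8 reading: each surrogate encoded on its own as three bytes.
unpaired_bytes = three_byte_form(0xD800) + three_byte_form(0xDC00)

print(f"{paired:X}", unpaired_bytes.hex(" ").upper())
```

The same two 16-bit values come out as one code point under the first function and as six bytes for two code points under the second, which is the asymmetry Doug is objecting to.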
Re: Abstract character?
A small correction to Ken's message:

>The Unicode scalar value
>definitionally excludes D800..DFFF, which are only code unit
>values used in UTF-16, and which are not code points associated
>with any well-formed UTF code unit sequences.

The UTC in has decided to make scalar value mean unambiguously the code points 0..D7FF, E000..10FFFF, i.e., everything but surrogate code points. While surrogate code points cannot be represented in UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate code points are illegal in all UTFs; notably, they are legal in UTF-16.

Ken is pushing for this change; I believe it would be a very bad idea. (I think the reasons have already appeared on this list, so I am not trying to reopen the discussion; just state the current situation.)

Mark
__
http://www.macchiato.com
◄ “Eppur si muove” ►

- Original Message -
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, July 22, 2002 13:38
Subject: Re: Abstract character?

> [snip]
Re: Abstract character?
I usually define an abstract character in talks I give as "an element of a writing system that you care about, independent of glyphs, and certainly independent of encodings or specific code points". If it could be described more precisely than that, it wouldn't be "abstract", would it? :)

This is usually brought up in a series of definitions leading from "character" (what we are referring to here as "abstract character"), and then:

- "character list" - a list of "characters" one is interested in
- "character set" - a list of "character lists", which may or may not be ordered, but still has no code points
- "encoding scheme" - an algorithm for assigning code points to a "character set"
- "code point" - the representation of an "abstract character" in an "encoding scheme"
- "font" - a series of glyphs that are used to display the characters represented by code points, in their immediate context

All of this is filled with examples, building to an explanation of Unicode. For example, with respect to "abstract character", I ask the audience to ponder whether "upper case A" and "lower case a" are the same "abstract character". Also, I ask them to ponder whether "lower case a" displayed in "Helvetica" is the same "character" as "lower case a" in "Times Roman". Finally, how about "lower case a in 9 point Helvetica" and "lower case a in 18 point Helvetica"?

And apropos a thread from last week, Unicode introduces new concepts such as "character properties", which means the anticipation and intrigue I spend time building in the audience, that there is a neat solution to the historical morass I just spent 40 minutes describing, gets thoroughly dashed! Joy!

Implicit in this set of definitions is of course that a "character" may or may not be of interest to all "character lists", and therefore may or may not end up represented in more than one encoding. Also note that even when it does end up in more than one, this model in no way implies a round-trip capability. This leads nicely into a discussion about some very important aspects of internationalizing code and working with 3rd party components.

Barry Caplan
www.i18n.com

At 01:38 PM 7/22/2002 -0700, Kenneth Whistler wrote:
>[snip]
Re: Abstract character?
Lars Marius Garshol asked:

> I'm trying to find out what an abstract character is. I've been
> looking at chapter 3 of Unicode 3.0, without really achieving
> enlightenment.
>
> The term Unicode scalar value (apparently synonymous with code point)
> seems clear. It is the identifying number assigned to assigned
> Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

   that which is encoded; an element of the repertoire (existing
   independent of the character encoding standard, and often
   identifiable in other character encoding standards, as well
   as the Unicode Standard); the implicit basis of transcodings.

   Note that while in some sense abstract characters exist a
   priori by virtue of the nature of the units of various writing
   systems, their exact nature is only pinned down at the point
   that an actual encoding is done. They are not always obvious,
   and many new abstract characters may arise as the result of
   particular textual processing needs that can be addressed by
   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
   etc., etc.)

Code point

   A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.

Encoded character

   An *association* of an abstract character with a code point.

Unicode scalar value

   A number from 0..D7FF, E000..10FFFF; the domain of the
   functions which define UTF's. The Unicode scalar value
   definitionally excludes D800..DFFF, which are only code unit
   values used in UTF-16, and which are not code points associated
   with any well-formed UTF code unit sequences.

Assignment (of code points)

   Refers to the process of associating abstract character with
   code points. Mathematically a code point is
   "assigned to" an abstract character and an abstract
   character is "mapped to" a code point.

   This is distinguished from the vaguer sense of "assigned"
   in general parlance as meaning "a code point given some
   designated function by the standard", which would include
   noncharacters and surrogates.

> So far, so good. Some questions:
>
> - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract characters.

(See above for my distinction between "assigned" and "designated", which would apply to noncharacters and surrogate code points -- neither of which classes of code points get assigned to abstract characters.)

> - it seems that not all abstract characters have code points (since
>   abstract characters can be formed using combining characters). Is
>   that correct?

Yes. (Note above -- abstract characters are also a concept which applies to other character encodings besides the Unicode Standard, and not all encoded characters in other character encodings automatically make it into the Unicode Standard, for various architectural reasons.)

> - do (Å) and (A followed by combining ring
>   above) represent the same abstract character?

Yes. That is the implicit claim behind a specification of canonical equivalence.

--Ken

> Would be good if someone could clear this up.
>
> --
> Lars Marius Garshol, Ontopian http://www.ontopia.net
> ISO SC34/WG3, OASIS GeoLang TC http://www.garshol.priv.no
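The numeric parts of these definitions translate directly into range checks. The following sketch is one possible rendering in Python; the function names and the labels returned by `classify` are made up for illustration, and the noncharacter ranges (FDD0..FDEF plus the last two code points of each plane) are taken from the Unicode Standard, not from this message.

```python
MAX_CODE_POINT = 0x10FFFF


def is_code_point(n):
    """A number in the codespace 0..10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n):
    """D800..DFFF: designated by the standard, never assigned to a character."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n):
    """0..D7FF and E000..10FFFF: the domain of the UTF functions."""
    return is_code_point(n) and not is_surrogate(n)


def is_noncharacter(n):
    """FDD0..FDEF and every code point ending in FFFE or FFFF."""
    return is_code_point(n) and (0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE)


def classify(n):
    """Return a short label for where an integer falls in the scheme above."""
    if not is_code_point(n):
        return "outside the codespace"
    if is_surrogate(n):
        return "surrogate code point"
    if is_noncharacter(n):
        return "noncharacter"
    return "scalar value"


# The codespace minus the 2048 surrogates.
scalar_count = (MAX_CODE_POINT + 1) - (0xDFFF - 0xD800 + 1)

for n in (0x0300, 0xD800, 0xFFFE, 0x110000, -1):
    print(format(n, "X"), classify(n))
```

Noncharacters get their own label here but are still scalar values. Only the surrogates fall out of the domain of the UTFs, and that distinction is what the rest of the thread argues about.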
Re: Abstract character?
Lars Marius Garshol wrote: > I'm trying to find out what an abstract character is. http://www.unicode.org/reports/tr17/ http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode/ markus
Abstract character?
I'm trying to find out what an abstract character is. I've been looking at chapter 3 of Unicode 3.0, without really achieving enlightenment.

The term Unicode scalar value (apparently synonymous with code point) seems clear. It is the identifying number assigned to assigned Unicode characters.

So far, so good. Some questions:

 - are all assigned Unicode characters also abstract characters?

 - it seems that not all abstract characters have code points (since
   abstract characters can be formed using combining characters). Is
   that correct?

 - do (Å) and (A followed by combining ring above) represent the same
   abstract character?

Would be good if someone could clear this up.

-- 
Lars Marius Garshol, Ontopian http://www.ontopia.net
ISO SC34/WG3, OASIS GeoLang TC http://www.garshol.priv.no