Re: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
I agree. The remaining slots could very well be allocated to some notational "superscript" (spacing) marks, more or less formed like ligatures, without really being "extenders" for graphemes, since they could just as well be used in isolation. I can think of special marks for measurement units, some currencies, or honorifics; localized variants of symbols like "trademark" or "registered"; a localized "ampersand" or similar; a symbol meaning "birth/death" before or after a date; or simply superscript digits for Western Arabic, or Eastern Arabic for Persian/Urdu, which would not be "extenders" for any grapheme but would be used in isolation.

The only useful default property is the assignment of ranges for strong RTL letters/digits/punctuation/symbols, because of the complexity and stability of the BiDi algorithm, the security issues related to it, and the difficulties for UIs.

By contrast, each assigned block can contain smaller subranges (sometimes smaller than a full column) for combining marks, spread at various positions. These do not add much complexity to algorithms such as normalization, even though those algorithms must remain stable: the default combining class for all unassigned code points is simply 0, which blocks any canonical reordering that would break later documents using the newly assigned code points. Normalization will reorder or recompose them only once these code points are assigned to known characters with a known, possibly non-zero, combining class; past versions of normalizers will leave them unchanged, preserving at least the canonical equivalences.

2016-12-12 18:30 GMT+01:00 Ken Whistler:
> Forwarded Message
> Subject: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
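The point above about combining class 0 can be checked with a small sketch. It assumes Python's standard `unicodedata` module and uses the three open ranges named in Ken Whistler's message; the helper names `unassigned_in` and `blocks_reordering` are hypothetical, and which code points are still unassigned depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Open ranges named in Ken Whistler's message (Unicode 9.0 era). Later
# versions of the standard may have assigned some of these code points.
OPEN_RANGES = [(0x1ABF, 0x1AFF), (0x1DF6, 0x1DFA), (0x20F1, 0x20FF)]


def unassigned_in(ranges):
    """Yield code points this Python's Unicode database reports as gc=Cn."""
    for first, last in ranges:
        for cp in range(first, last + 1):
            if unicodedata.category(chr(cp)) == "Cn":
                yield cp


def blocks_reordering(cp):
    """True if the code point keeps two combining marks from being reordered.

    U+0301 (ccc=230) followed directly by U+0323 (ccc=220) is reordered by
    NFD. A code point with combining class 0 between them acts as a barrier,
    so the string should come back unchanged.
    """
    text = "a\u0301" + chr(cp) + "\u0323"
    return unicodedata.normalize("NFD", text) == text


if __name__ == "__main__":
    for cp in unassigned_in(OPEN_RANGES):
        ccc = unicodedata.combining(chr(cp))
        print(f"U+{cp:04X} ccc={ccc} blocks_reordering={blocks_reordering(cp)}")
```

Without the intervening code point, `unicodedata.normalize("NFD", "a\u0301\u0323")` swaps the two marks, which is the reordering an unassigned code point prevents.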
Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
On Mon, 12 Dec 2016 09:30:31 -0800 Ken Whistler wrote:

> On 12/12/2016 6:59 AM, Karl Williamson wrote:
> > These are currently GCB Other, but when assigned, don't we know that they will be Extended? So this could be done now.

> Any proposal like this then also has hidden costs on the committees, because it sets up implied requirements for what can be encoded where and what properties it has to have. Every time such defaults are set up, it makes the documentation of what is already "pre-assigned" more complicated and fragile. Already, a large proportion of the participants in the maintenance committees have very murky understandings about what can and cannot be put where in the future, and why. And that is a recipe for mistakes in encoding.

How does this differ from U+0816 SAMARITAN MARK IN changing from bidi_class=R to bidi_class=NSM upon assignment? The idea is to reduce the damage done by the use of obsolete versions of the Unicode database.

> Finally, like it or not, there currently is no actual contract guaranteeing that the remaining open ranges in blocks "reserved" for combining marks will all end up gc=Mn or gc=Me, anyway. The relevant ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to prevent the committees from deciding that one (or more) spacing combining marks might be appropriate to encode there, or possibly even spacing non-combining marks of some strange sort, like the spacing Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep those ranges free of characters that would not be Grapheme_Extend=Yes would require some guy on the committee to be aware of the arcane dependencies for segmentation properties, and then to police such decisions in perpetuity -- or at least until the blocks in question filled up with non-problematical characters.

What is the downside of a code point changing from Grapheme_Extend=Yes to Grapheme_Extend=No when it is assigned?

Richard.
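The U+0816 case Richard mentions can be illustrated with a rough sketch. It assumes Python's standard `unicodedata` module, whose `bidirectional()` returns an empty string for unassigned code points. The table `DEFAULT_BIDI_RANGES` and both function names are hypothetical, and the table lists only three default ranges as an assumption for illustration; the complete, version-specific list is in the header of DerivedBidiClass.txt.

```python
import unicodedata

# Partial, illustrative table of default Bidi_Class values for unassigned
# code points. Only three ranges are listed here (an assumption made for
# this sketch); the authoritative list is in DerivedBidiClass.txt.
DEFAULT_BIDI_RANGES = [
    (0x0590, 0x05FF, "R"),   # Hebrew
    (0x0600, 0x07BF, "AL"),  # Arabic, Syriac, Arabic Supplement, Thaana
    (0x07C0, 0x085F, "R"),   # NKo, Samaritan, Mandaic
]


def default_bidi_class(cp):
    """Default value a code point gets while it is still unassigned.

    Simplification: everything outside the table falls back to "L".
    """
    for first, last, value in DEFAULT_BIDI_RANGES:
        if first <= cp <= last:
            return value
    return "L"


def effective_bidi_class(cp):
    """Assigned Bidi_Class if the database has one, else the range default."""
    assigned = unicodedata.bidirectional(chr(cp))
    return assigned or default_bidi_class(cp)


if __name__ == "__main__":
    cp = 0x0816  # SAMARITAN MARK IN
    print("default while unassigned:", default_bidi_class(cp))
    print("value once assigned:", effective_bidi_class(cp))
```

An implementation built on a database that predates the assignment would see only the first value, which is the kind of mismatch the question is about.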
Fwd: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
Forwarded Message
Subject: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
Date: Mon, 12 Dec 2016 08:26:45 -0800
From: Ken Whistler
To: Karl Williamson

On 12/12/2016 6:59 AM, Karl Williamson wrote:
> These are currently GCB Other, but when assigned, don't we know that they will be Extended? So this could be done now.

Short answer: No.

Long answer:

Every proposal to pre-assign some range of unassigned code points a non-default character property value for that range has a bunch of hidden costs. This proposal would be particularly costly, because it would be smack in the middle of some of the properties with the hairiest dependency chains.

GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any particular change for GCB=Extend would also get reflected into WB=Extend and SB=Extend, which are also dependent on Grapheme_Extend=Yes.

Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or gc=Me, which would seem to be a natural match for the blocks "reserved" for combining marks, but it is actually also dependent on Other_Grapheme_Extend, which is a mixed bag of various spacing combining marks for normalization closure, plus ZWNJ, plus tag characters, plus spacing halfwidth dakuten.

So that would raise complicated questions about *how* GCB=Extend would itself be extended to include certain set ranges of unassigned code points. Would those simply be assigned directly to Grapheme_Extend=Yes (which would create a complicated default assignment for that derived property, and complicate both its documentation *and* its derivation)? Or would they be assigned directly to Other_Grapheme_Extend (which would create a new animal in the zoo of properties -- a contributory property which itself has ranges of unassigned code points given non-default values)? And once decided, what would be the implications for all the documentation and the tooling?
Any proposal like this then also has hidden costs on the committees, because it sets up implied requirements for what can be encoded where and what properties it has to have. Every time such defaults are set up, it makes the documentation of what is already "pre-assigned" more complicated and fragile. Already, a large proportion of the participants in the maintenance committees have very murky understandings about what can and cannot be put where in the future, and why. And that is a recipe for mistakes in encoding.

Finally, like it or not, there currently is no actual contract guaranteeing that the remaining open ranges in blocks "reserved" for combining marks will all end up gc=Mn or gc=Me, anyway. The relevant ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to prevent the committees from deciding that one (or more) spacing combining marks might be appropriate to encode there, or possibly even spacing non-combining marks of some strange sort, like the spacing Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep those ranges free of characters that would not be Grapheme_Extend=Yes would require some guy on the committee to be aware of the arcane dependencies for segmentation properties, and then to police such decisions in perpetuity -- or at least until the blocks in question filled up with non-problematical characters.

So the long answer is also: No.

--Ken
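The dependency chain described in this message can be sketched in a few lines. This is only an approximation, assuming Python's standard `unicodedata` module: the set `OTHER_GRAPHEME_EXTEND_SAMPLE` and the function `grapheme_extend` are hypothetical names, and the set is a deliberately incomplete sample of Other_Grapheme_Extend (the authoritative list is in PropList.txt of the Unicode Character Database).

```python
import unicodedata

# Incomplete, illustrative sample of Other_Grapheme_Extend, limited to the
# kinds of members named in the message. Not the full property.
OTHER_GRAPHEME_EXTEND_SAMPLE = {
    0x200C,                     # ZERO WIDTH NON-JOINER
    0xFF9E, 0xFF9F,             # halfwidth voiced / semi-voiced sound marks
    *range(0xE0020, 0xE0080),   # tag characters
}


def grapheme_extend(cp):
    """Approximate Grapheme_Extend: gc=Mn or gc=Me, plus the sample set."""
    by_category = unicodedata.category(chr(cp)) in ("Mn", "Me")
    return by_category or cp in OTHER_GRAPHEME_EXTEND_SAMPLE


if __name__ == "__main__":
    samples = [0x0301, 0x20DD, 0x200C, 0xFF9E, 0x0903, 0x0061]
    for cp in samples:
        print(f"U+{cp:04X} gc={unicodedata.category(chr(cp))} "
              f"Grapheme_Extend~{grapheme_extend(cp)}")
```

Under this derivation an unassigned code point (gc=Cn) is never Grapheme_Extend=Yes, so giving the open ranges a non-default value would mean either special-casing the derived property or adding unassigned ranges to the contributory one, which are the two options weighed above.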
Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?
These are currently GCB Other, but when assigned, don't we know that they will be Extended? So this could be done now.