date:20161212

Re: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

2016-12-12 Thread Philippe Verdy

I agree. The remaining slots could very well be allocated to notational
"superscript" (spacing) marks, more or less formed from ligatures, that
are not really "extenders" for graphemes because they can just as well be
used in isolation. I can think of special marks for measurement units,
some currencies, honorific marks, localized variants of symbols like
"trademark" or "registered", a localized "ampersand" or similar, a symbol
meaning "birth/death" placed before or after a date, or simply superscript
digits for Western Arabic, or Eastern Arabic for Persian/Urdu. None of
these would be "extenders" for any grapheme, and all could be used in
isolation.

The only useful default property is the assignment of ranges for strong
RTL letters/digits/punctuation/symbols, because of the complexity and
stability requirements of the BiDi algorithm, the security issues related
to it, and the difficulties for the UI. By contrast, each assigned block
can contain smaller subranges (sometimes smaller than a full column) for
combining marks, spread at various positions. These do not add much
complexity to algorithms such as normalization, even though those
algorithms are necessarily stabilized. The default combining class for all
unencoded characters is simply 0, which blocks any canonical reordering
that would break later-encoded documents using the newly assigned code
points. Normalization will reorder or recombine them only once these code
points are assigned to known characters with a known, possibly non-zero,
combining class; past versions of normalizers will leave them unchanged,
preserving at least the canonical equivalences.
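
The blocking effect of combining class 0 described above can be checked
with a small sketch. It assumes Python's standard unicodedata module, whose
data reflects whatever Unicode version the interpreter ships with; the
helper first_unassigned is a hypothetical name introduced only for this
illustration, and the range 20F1..20FF is used simply because it is one of
the open ranges named later in the thread.

```python
import unicodedata

def first_unassigned(lo, hi):
    """Return the first code point in lo..hi that this Python's Unicode
    data treats as unassigned (gc=Cn), or None if the range is full."""
    for cp in range(lo, hi + 1):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    return None

# Two assigned marks with different non-zero combining classes are
# reordered by normalization: U+0301 (ccc=230) and U+0323 (ccc=220).
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")

# An unassigned code point between them has ccc=0 and acts as a barrier,
# so the sequence is left exactly as it was written.
gap = first_unassigned(0x20F1, 0x20FF)
if gap is not None:
    blocked = "a\u0301" + chr(gap) + "\u0323"
    unchanged = unicodedata.normalize("NFD", blocked) == blocked
    print(f"U+{gap:04X} ccc={unicodedata.combining(chr(gap))} unchanged={unchanged}")
```

If a later Unicode version assigns that code point a non-zero combining
class, a newer normalizer may reorder the same string, while an older one
keeps it unchanged.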

2016-12-12 18:30 GMT+01:00 Ken Whistler :

> Forwarded Message
> Subject: Re: Should unassigned code points in blocks reserved for
> combining marks, etc be GCB extended?
> Date: Mon, 12 Dec 2016 08:26:45 -0800
> From: Ken Whistler  
> To: Karl Williamson  
>
> On 12/12/2016 6:59 AM, Karl Williamson wrote:
> > These are currently GCB Other, but when assigned, don't we know that
> > they will be Extended?  So this could be done now.
> >
>
> Short answer: No.
>
> Long answer:
>
> Every proposal to pre-assign some range of unassigned code points a
> non-default character property value for that range has a bunch of
> hidden costs. This proposal would be particularly costly, because it
> would be smack in the middle of some of the properties with the hairiest
> dependency chains.
>
> GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any
> particular change for GCB=Extend would also get reflected into WB=Extend
> and SB=Extend, which are also dependent on Grapheme_Extend=Yes.
>
> Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or
> gc=Me, which would seem to be a natural match for the blocks "reserved"
> for combining marks, but it is actually also dependent on
> Other_Grapheme_Extend, which is a mixed bag of various spacing combining
> marks for normalization closure, plus ZWNJ, plus tag characters, plus
> spacing halfwidth dakuten.
>
> So that would raise complicated questions about *how* GCB=Extend would
> itself be extended to include certain set ranges of unassigned code
> points. Would those simply be assigned directly to Grapheme_Extend=Yes
> (which would create a complicated default assignment for that derived
> property, and complicate both its documentation *and* its derivation)?
> Or would they be assigned directly to Other_Grapheme_Extend (which would
> create a new animal in the zoo of properties -- a contributory property
> which itself has ranges of unassigned code points given non-default
> values)? And once decided, what would be the implications for all the
> documentation and the tooling?
>
> Any proposal like this then also has hidden costs on the committees,
> because it sets up implied requirements for what can be encoded where
> and what properties it has to have. Every time such defaults are set up,
> it makes the documentation of what is already "pre-assigned" more
> complicated and fragile. Already, a large proportion of the participants
> in the maintenance committees have very murky understandings about what
> can and cannot be put where in the future, and why. And that is a recipe
> for mistakes in encoding.
>
> Finally, like it or not, there currently is no actual contract
> guaranteeing that the remaining open ranges in blocks "reserved" for
> combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
> ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
> prevent the committees from deciding that one (or more) spacing
> combining marks might be appropriate to encode there, or possibly even
> spacing non-combining marks of some strange sort, like the spacing
> Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
> those ranges free of characters that would not be Grapheme_Extend=Yes
> would require some guy on the commit

Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

2016-12-12 Thread Richard Wordingham

On Mon, 12 Dec 2016 09:30:31 -0800
Ken Whistler  wrote:

> On 12/12/2016 6:59 AM, Karl Williamson wrote:

> > These are currently GCB Other, but when assigned, don't we know that
> > they will be Extended?  So this could be done now.

> Any proposal like this then also has hidden costs on the committees,
> because it sets up implied requirements for what can be encoded where
> and what properties it has to have. Every time such defaults are set
> up, it makes the documentation of what is already "pre-assigned" more
> complicated and fragile. Already, a large proportion of the
> participants in the maintenance committees have very murky
> understandings about what can and cannot be put where in the future,
> and why. And that is a recipe for mistakes in encoding.

How does this differ from U+0816 SAMARITAN MARK IN changing from
bidi_class=R to bidi_class=NSM upon assignment?

The idea is to reduce the damage done by the use of obsolete versions of
the Unicode database.
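
The Samaritan example can be inspected with a short sketch, again assuming
Python's standard unicodedata module. Note that unicodedata.bidirectional
returns an empty string for unassigned code points rather than the default
value, so the pre-assignment default of R is taken from this thread and is
not something the snippet can show. The function name describe is
hypothetical.

```python
import unicodedata

def describe(cp):
    """Collect the name, General_Category and Bidi_Class that this
    Python's Unicode data records for one code point."""
    ch = chr(cp)
    return {
        "name": unicodedata.name(ch, "<unassigned>"),
        "gc": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
    }

# U+0816 today: a nonspacing mark whose Bidi_Class is NSM, not R.
mark_in = describe(0x0816)
print(mark_in)
```

A library built against data older than the character's assignment would
only have the block default to go on, which is the situation the question
is about.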
 
> Finally, like it or not, there currently is no actual contract
> guaranteeing that the remaining open ranges in blocks "reserved" for
> combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
> ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
> prevent the committees from deciding that one (or more) spacing
> combining marks might be appropriate to encode there, or possibly even
> spacing non-combining marks of some strange sort, like the spacing
> Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
> those ranges free of characters that would not be Grapheme_Extend=Yes
> would require some guy on the committee to be aware of the arcane
> dependencies for segmentation properties, and then to police such
> decisions in perpetuity -- or at least until the blocks in question
> filled up with non-problematical characters.

What is the down side of a code point changing from Grapheme_Extend=Yes
to Grapheme_Extend=No when it is assigned?

Richard.

Fwd: Re: Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

2016-12-12 Thread Ken Whistler

Forwarded Message
Subject: Re: Should unassigned code points in blocks reserved for
combining marks, etc be GCB extended?
Date: Mon, 12 Dec 2016 08:26:45 -0800
From: Ken Whistler
To: Karl Williamson

On 12/12/2016 6:59 AM, Karl Williamson wrote:

> These are currently GCB Other, but when assigned, don't we know that
> they will be Extended?  So this could be done now.

Short answer: No.

Long answer:

Every proposal to pre-assign some range of unassigned code points a
non-default character property value for that range has a bunch of
hidden costs. This proposal would be particularly costly, because it
would be smack in the middle of some of the properties with the hairiest
dependency chains.

GCB=Extend is dependent on Grapheme_Extend=Yes. That also means that any
particular change for GCB=Extend would also get reflected into WB=Extend
and SB=Extend, which are also dependent on Grapheme_Extend=Yes.

Grapheme_Extend=Yes is itself a mixed bag. It is derived from gc=Mn or
gc=Me, which would seem to be a natural match for the blocks "reserved"
for combining marks, but it is actually also dependent on
Other_Grapheme_Extend, which is a mixed bag of various spacing combining
marks for normalization closure, plus ZWNJ, plus tag characters, plus
spacing halfwidth dakuten.
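
The derivation just described can be sketched roughly as follows. This is
an illustration only, not the actual UCD tooling: it assumes Python's
standard unicodedata module for gc, and OTHER_GRAPHEME_EXTEND_SAMPLE is a
hypothetical, deliberately partial stand-in for the real
Other_Grapheme_Extend list in PropList.txt (it omits the spacing combining
marks added for normalization closure).

```python
import unicodedata

# Hypothetical partial sample of Other_Grapheme_Extend; the authoritative
# list lives in PropList.txt and is longer than this.
OTHER_GRAPHEME_EXTEND_SAMPLE = {
    0x200C,          # ZERO WIDTH NON-JOINER
    0xFF9E, 0xFF9F,  # halfwidth voiced / semi-voiced sound marks
} | set(range(0xE0020, 0xE0080))  # tag characters

def grapheme_extend(cp):
    """Approximate Grapheme_Extend: gc=Mn or gc=Me, plus the
    contributory Other_Grapheme_Extend set (sampled above)."""
    gc = unicodedata.category(chr(cp))
    return gc in ("Mn", "Me") or cp in OTHER_GRAPHEME_EXTEND_SAMPLE

for cp in (0x0301, 0x200C, 0xFF9E, 0x0041):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), grapheme_extend(cp))
```

An unassigned code point has gc=Cn and is in neither input, so any
non-default value for it would have to be injected into one of the two
sources, which is the choice the next paragraph discusses.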

So that would raise complicated questions about *how* GCB=Extend would
itself be extended to include certain set ranges of unassigned code
points. Would those simply be assigned directly to Grapheme_Extend=Yes
(which would create a complicated default assignment for that derived
property, and complicate both its documentation *and* its derivation)?
Or would they be assigned directly to Other_Grapheme_Extend (which would
create a new animal in the zoo of properties -- a contributory property
which itself has ranges of unassigned code points given non-default
values)? And once decided, what would be the implications for all the
documentation and the tooling?

Any proposal like this then also has hidden costs on the committees,
because it sets up implied requirements for what can be encoded where
and what properties it has to have. Every time such defaults are set up,
it makes the documentation of what is already "pre-assigned" more
complicated and fragile. Already, a large proportion of the participants
in the maintenance committees have very murky understandings about what
can and cannot be put where in the future, and why. And that is a recipe
for mistakes in encoding.

Finally, like it or not, there currently is no actual contract
guaranteeing that the remaining open ranges in blocks "reserved" for
combining marks will all end up gc=Mn or gc=Me, anyway. The relevant
ranges are 1ABF..1AFF, 1DF6..1DFA, and 20F1..20FF. There is nothing to
prevent the committees from deciding that one (or more) spacing
combining marks might be appropriate to encode there, or possibly even
spacing non-combining marks of some strange sort, like the spacing
Arabic letter diacritics that ended up at FBB2..FBC1. Trying to keep
those ranges free of characters that would not be Grapheme_Extend=Yes
would require some guy on the committee to be aware of the arcane
dependencies for segmentation properties, and then to police such
decisions in perpetuity -- or at least until the blocks in question
filled up with non-problematical characters.

So the long answer is also: No.

--Ken
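
For readers who want to see what has become of the three ranges named
above, the following sketch tallies the General_Category values that the
local Python's Unicode data reports for each of them. It assumes the
standard unicodedata module; the names OPEN_RANGES_2016 and gc_counts are
hypothetical, and the result depends on the Unicode version bundled with
the interpreter.

```python
import unicodedata

# The open ranges as listed in the message above (state as of Dec 2016).
OPEN_RANGES_2016 = [(0x1ABF, 0x1AFF), (0x1DF6, 0x1DFA), (0x20F1, 0x20FF)]

def gc_counts(lo, hi):
    """Tally General_Category values for every code point in lo..hi.
    Unassigned code points show up under 'Cn'."""
    counts = {}
    for cp in range(lo, hi + 1):
        gc = unicodedata.category(chr(cp))
        counts[gc] = counts.get(gc, 0) + 1
    return counts

for lo, hi in OPEN_RANGES_2016:
    print(f"{lo:04X}..{hi:04X}", gc_counts(lo, hi))
```

Any category other than Mn, Me or Cn in the output would be a character
that a pre-assigned GCB=Extend default would have had to give up on
assignment.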

Should unassigned code points in blocks reserved for combining marks, etc be GCB extended?

2016-12-12 Thread Karl Williamson

These are currently GCB Other, but when assigned, don't we know that 
they will be Extended?  So this could be done now.
