Re: Extended grapheme cluster stability

Asmus Freytag via Unicode Wed, 23 May 2018 07:49:10 -0700

One thing to bear in mind about breaks: Unicode is plain text and not "final rendered text".

Many types of breaks depend on things like actual font selection, column width and other factors determined by styling. They are therefore not necessarily stable from a plain text perspective (the same goes for things not specified by Unicode, like hyphenation, because hyphenation, for example, depends on the actual language associated with a text, something not part of the plain text backbone).

The moral is that if you need a frozen representation of text that does not behave differently if accessed, iterated, viewed etc. at different times, you need to have some kind of rich-text format that can represent all segmentation choices. If, on the other hand, you are doing a live interaction with the text, then Unicode segmentation gives you the "best available" algorithm - which may change over time as new information becomes available about what constitutes best practice.
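
As a rough illustration of the "frozen representation" idea, here is a minimal Python sketch that stores the segmentation choices alongside the text, so that later readers do not have to re-run whatever algorithm is current. The names SegmentedText, freeze and rules_version are hypothetical, and the segmenter is assumed to be whatever function the application already uses; none of this is defined by Unicode.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class SegmentedText:
    """Text plus the break positions that were chosen when it was frozen."""
    text: str
    boundaries: Tuple[int, ...]  # end offset (in code points) of each segment
    rules_version: str           # free-form label, e.g. the rule set used

    def segments(self) -> List[str]:
        # Rebuild the segments from stored offsets; no algorithm is re-run.
        starts = (0,) + self.boundaries[:-1]
        return [self.text[s:e] for s, e in zip(starts, self.boundaries)]


def freeze(text: str,
           segmenter: Callable[[str], Iterable[str]],
           rules_version: str) -> SegmentedText:
    """Run the (assumed) segmenter once and record where it broke the text."""
    offsets = []
    pos = 0
    for piece in segmenter(text):
        pos += len(piece)
        offsets.append(pos)
    if pos != len(text):
        raise ValueError("segmenter output does not cover the whole text")
    return SegmentedText(text, tuple(offsets), rules_version)
```

A caller would pass in its current grapheme or line segmenter once, persist the result, and from then on read segments() instead of segmenting again.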

For many writing systems, the understanding of best practice is still quite limited at this point - in the sense that even if it is known, it is not widely available and therefore there has not yet been a chance to validate and standardize it. (Setting aside areas of actual innovation, like emoji.) For these reasons, it would be outright detrimental if any of these algorithms were "frozen" -- however, the hope is that updates are handled with some sensitivity to avoid unnecessary disruption of settled practice.


A./


On 5/22/2018 5:43 AM, Martinho Fernandes via Unicode wrote:

On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:

Hello,

None of the *_Break properties are stable, as far as I can see in
https://www.unicode.org/policies/stability_policy.html. If I understand
correctly, this means that, at least in theory, it is possible that in
Unicode version X a sequence of characters AB forms an extended grapheme
cluster, i.e. A × B in the notation used in the algorithm description
and in the test data, but then in Unicode version X+1, that changes to A
÷ B.

Am I reading this correctly or is this not possible? Or is it possible
in theory but not in practice? Or maybe it has happened before?

Hmm, to answer my own question, yes, this has happened before. In
Unicode 8 there were no breaks between regional indicators. In Unicode 9
now there are no breaks "between regional indicator (RI) symbols if
there is an odd number of RI characters before the break point". It has
also happened in the direction break => no break, when emoji ZWJ
sequences were introduced.
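
To make the regional indicator change concrete, here is a small Python
sketch of the two behaviours described above, applied to a run that
consists only of RI characters. The function names are made up for
illustration, and this is not a full grapheme cluster segmenter: it
ignores every other rule in the algorithm.

```python
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER A..Z


def is_ri(ch: str) -> bool:
    return RI_FIRST <= ord(ch) <= RI_LAST


def split_ri_run_unicode8(run: str) -> list:
    """Unicode 8 behaviour: no break between any two RI characters,
    so the whole run stays together as one cluster."""
    assert all(is_ri(ch) for ch in run)
    return [run] if run else []


def split_ri_run_unicode9(run: str) -> list:
    """Unicode 9 behaviour: no break only when an odd number of RI
    characters precedes the break point, so the run splits into pairs
    (with a single leftover RI if the run length is odd)."""
    assert all(is_ri(ch) for ch in run)
    return [run[i:i + 2] for i in range(0, len(run), 2)]


# Four RI characters: D, E, F, R (the flag sequences for DE and FR).
run = "\U0001F1E9\U0001F1EA\U0001F1EB\U0001F1F7"
old = split_ri_run_unicode8(run)  # one cluster of four code points
new = split_ri_run_unicode9(run)  # two clusters of two code points each
```

For the same four code points, the first function yields one cluster
and the second yields two, which is exactly an A × B turning into
A ÷ B between versions.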
