One thing to bear in mind about breaks: Unicode deals with plain text, not "final rendered text".

Many types of breaks depend on things like actual font selection, column width and other factors determined by styling. They are therefore not necessarily stable from a plain-text perspective. The same goes for things not specified by Unicode, such as hyphenation, which depends on the actual language associated with a text, something that is not part of the plain-text backbone.

The moral is that if you need a frozen representation of text that does not behave differently if accessed, iterated, viewed etc. at different times, you need to have some kind of rich-text format that can represent all segmentation choices. If, on the other hand, you are doing a live interaction with the text, then Unicode segmentation gives you the "best available" algorithm - which may change over time as new information becomes available about what constitutes best practice.

For many writing systems, the understanding of best practice is still quite limited at this point - in the sense that even where it is known, it is not widely available, and therefore there has not yet been a chance to validate and standardize it (setting aside areas of actual innovation, like emoji). For these reasons, it would be outright detrimental if any of these algorithms were "frozen" -- however, the hope is that updates are handled with some sensitivity to avoid unnecessary disruption of settled practice.

A./


On 5/22/2018 5:43 AM, Martinho Fernandes via Unicode wrote:
On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:

Hello,

None of the *_Break properties are stable, as far as I can see in
https://www.unicode.org/policies/stability_policy.html. If I understand
correctly, this means that, at least in theory, it is possible that in
Unicode version X a sequence of characters AB forms an extended grapheme
cluster, i.e. A × B in the notation used in the algorithm description
and in the test data, but then in Unicode version X+1, that changes to A
÷ B.

Am I reading this correctly or is this not possible? Or is it possible
in theory but not in practice? Or maybe it has happened before?

Hmm, to answer my own question, yes, this has happened before. In
Unicode 8 there were no breaks between regional indicators. In Unicode 9
now there are no breaks "between regional indicator (RI) symbols if
there is an odd number of RI characters before the break point". It has
also happened in the direction break=>no break, when emoji ZWJ
sequences were introduced.
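
To make the regional indicator change concrete, here is a rough Python sketch of just that one rule, before and after. It is not an implementation of UAX #29: the function name split_ri_run and its pair_flags switch are made up for illustration, and every non-RI character is simply treated as its own cluster.

```python
def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_ri_run(text, pair_flags):
    """Split text into clusters, looking only at regional indicators.

    pair_flags=False: Unicode 8 style, never break between two RIs.
    pair_flags=True:  Unicode 9 style, no break between two RIs only if
                      an odd number of RIs precedes the break point.
    Hypothetical helper: all other characters become single clusters.
    """
    clusters = []
    ri_count = 0  # number of RIs in the cluster currently being built
    for ch in text:
        joins_previous = (
            is_regional_indicator(ch)
            and ri_count > 0
            and (not pair_flags or ri_count % 2 == 1)
        )
        if joins_previous:
            clusters[-1] += ch
            ri_count += 1
        else:
            clusters.append(ch)
            ri_count = 1 if is_regional_indicator(ch) else 0
    return clusters


# Four RIs in a row: P, T, B, R.
flags = "\U0001F1F5\U0001F1F9\U0001F1E7\U0001F1F7"
print(len(split_ri_run(flags, pair_flags=False)))  # one cluster
print(len(split_ri_run(flags, pair_flags=True)))   # two clusters
```

So the same four code points are one cluster under the old rule and two under the new one, which is exactly the × to ÷ change asked about above.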

