On Tue, 22 May 2018 14:43:23 +0200 Martinho Fernandes via Unicode <[email protected]> wrote:
> On 22.05.18 12:51, Martinho Fernandes via Unicode wrote:
> > Hello,
> >
> > None of the *_Break properties are stable, as far as I can see in
> > https://www.unicode.org/policies/stability_policy.html. If I
> > understand correctly, this means that, at least in theory, it is
> > possible that in Unicode version X a sequence of characters AB
> > forms an extended grapheme cluster, i.e. A × B in the notation used
> > in the algorithm description and in the test data, but then in
> > Unicode version X+1, that changes to A ÷ B.
> >
> > Am I reading this correctly or is this not possible? Or is it
> > possible in theory but not in practice? Or maybe it has happened
> > before?
>
> Hmm, to answer my own question, yes, this has happened before. In
> Unicode 8 there were no breaks between regional indicators. In
> Unicode 9 now there are no breaks "between regional indicator (RI)
> symbols if there is an odd number of RI characters before the break
> point". It has also happened in the direction break=>no break,
> when emoji ZWJ sequences were introduced.

These are more refinements of the algorithm than fundamental changes.
However, many of the breaks are inherently uncertain and may therefore
be tailored. English has uncertainties as to word boundaries, but the
author's decision is represented in writing, e.g. 'beam width' v.
'beamwidth'. In writing systems without visible boundaries between
words, such as Thai, such vacillation could occur between software
versions rather than between versions of Unicode. Line break
opportunities can in practice vacillate in such writing systems, e.g.
between breaks at syllable boundaries and breaks at word boundaries.

Formal extended grapheme cluster boundaries have varied in normal,
well-established text. In Thai, left matras and consonants were briefly
part of the same grapheme cluster. When that formal property was
implemented in editors, there were howls of pain from Thailand, and the
change was promptly reversed.

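As a rough illustration of the regional-indicator change quoted above,
here is a minimal Python sketch. It is not the full segmentation
algorithm: it looks only at regional indicators and, as a simplifying
assumption, treats every other character as a cluster of its own. The
names is_ri, ri_clusters and pair_rule are made up for this example.

```python
def is_ri(ch):
    """True if ch is a Regional Indicator symbol (U+1F1E6..U+1F1FF)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_clusters(text, pair_rule):
    """Split text into clusters, considering only regional indicators.

    pair_rule=False: never break between adjacent RIs (the older
    behaviour described in the quoted message).
    pair_rule=True: keep RI + RI together only when an odd number of
    RIs precedes the position, so RIs group in pairs (the newer rule).
    All other segmentation rules are ignored in this sketch.
    """
    clusters = []
    ri_run = 0  # consecutive RIs immediately before the current character
    for ch in text:
        if is_ri(ch) and ri_run > 0 and (not pair_rule or ri_run % 2 == 1):
            clusters[-1] += ch  # no break: A × B
        else:
            clusters.append(ch)  # break: A ÷ B
        ri_run = ri_run + 1 if is_ri(ch) else 0
    return clusters


def ri(letter):
    """Regional Indicator symbol for an ASCII capital letter."""
    return chr(0x1F1E6 + ord(letter) - ord("A"))


sample = ri("F") + ri("R") + ri("D") + ri("E") + ri("U")
old_style = ri_clusters(sample, pair_rule=False)
new_style = ri_clusters(sample, pair_rule=True)
print(len(old_style), [len(c) for c in new_style])
```

The same five code points form one cluster under the older behaviour
and three (two pairs plus a leftover) under the newer one, which is the
kind of A × B to A ÷ B change the original question asked about.
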
I do not believe one rule suits all Indic consonant clusters. While
splitting X virama | Y makes sense for Devanagari with its half-forms,
X | coeng Y makes no sense for scripts where it is the second consonant
that changes shape. It makes even less sense when some combinations of
'coeng Y' are encoded separately, as in mainland SE Asia. These
combinations are categorised as marks. In Burma, the syllable boundary
comes after U+1A58 TAI THAM SIGN MAI KANG LAI. In Laos, it comes before
it.

We came very close to extended grapheme clusters being extended to
whole aksharas in Unicode 11.0.

My view is that Unicode has attempted to conflate several concepts in
the grapheme cluster, and it just doesn't work.

Richard.

