When I use http://unicode.org/cldr/utility/breaks.jsp, it does show the sequence 𑒏�𑒺 as just two grapheme clusters.
In UAX #29 we are specifically not concerned about ill-formed text (or other degenerate cases). I suppose it would be possible to handle isolated surrogates in a different way (e.g. always breaking) if it represented a common problem, but someone would have to make a very good case for that.

Mark <https://google.com/+MarkDavis>
*Il meglio è l’inimico del bene*

On Sun, Oct 4, 2015 at 3:02 PM, Richard Wordingham <richard.wording...@ntlworld.com> wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character? Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F TIRHUTA LETTER KA,
> U+114BA TIRHUTA SIGN CANDRABINDU>, I appear to be getting the UTF-16 code
> unit sequence <D805 DC8F D805 D805 DCBA>, which is being interpreted as
> the codepoint sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke. I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805. Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805. However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>. At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
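
For readers who want to check the arithmetic in the quoted message, here is a minimal Python sketch of how the code unit sequence <D805 DC8F D805 D805 DCBA> could come to be read as <U+1148F, U+D805, U+114BA>. The function name `decode_utf16_units` is hypothetical, and passing an unpaired surrogate through as its own code point is an assumption about how a lenient decoder behaves; a strict decoder would reject the input or substitute U+FFFD instead.

```python
def decode_utf16_units(units):
    """Decode UTF-16 code units into code points.

    Assumption: an unpaired surrogate is passed through unchanged
    as its own code point rather than rejected or replaced.
    """
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed surrogate pair: combine into a supplementary code point.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or a lone surrogate kept as-is.
            code_points.append(unit)
            i += 1
    return code_points


# The code unit sequence reported in the quoted message.
units = [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBA]
decoded = decode_utf16_units(units)
print(["U+%04X" % cp for cp in decoded])
# prints ['U+1148F', 'U+D805', 'U+114BA']
```

The second D805 is followed by another high surrogate rather than a low one, so it cannot form a pair and is left as the isolated U+D805 that sits between the two Tirhuta code points.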