Philippe Verdy said:

> > The same thing can be said about any inserted invisible character,
> > combining or not.
> >
> > How is: <a, ring above, null, dot below> supposed to be different from
> > <a, dot below, null, ring above>
> >
> > How is: <a, ring above, LRM, dot below> supposed to be different from
> > <a, dot below, LRM, ring above>
> >
> > In display, they might not be distinct, unless you were doing some
> > kind of show-hidden display. Yet these sequences are not canonically
> > equivalent, and the presence of an embedded control character or an
> > embedded format control character would block canonical reordering.
> >
> I disagree with you, using a LRM mark in the middle of a combining
> sequence is conforming to canonicalization rules but is clearly
> ill-formed,

It is not. TUS 4.0, p. 71:

D17a Defective combining character sequence: A combining character
     sequence that does not start with a base character.

  * Defective combining character sequences occur when a sequence
    of combining characters appears at the start of a string or
    follows a control or format character. Such sequences are
    defective from the point of view of handling of combining
    marks, but are not ill-formed.
           ^^^^^^^^^^^^^^^^^^^^^^

> as well as using a NULL control in the middle, which
> breaks the combining sequence.

I'm not claiming it doesn't break the combining sequence. Of course
it does. It creates a defective combining character sequence, and
that poses a challenge for rendering, since it departs from the
usual expectations for normal combining character sequences. The
renderer has to split hairs between the fact that it is dealing
with a defective combining character sequence and the fact that
it is dealing with a default ignorable character which is supposed
to be ignored for text processes it is not immediately applicable to.

But I challenge you to find anything in the standard that *prohibits*
such sequences from occurring. And *if* they occur, they are not
canonically equivalent, which was the point I was making to Kent.

> The proposal to use CGJ however is legal: it does not break the
> combining sequences and grapheme clusters, and thus the whole
> encoded sequence encoded with CGJ will be considered by
> rendering engines, where CGJ is a no-op for rendering but not for
> the canonical ordering ...

Well, yes, which is why I have been advocating it as the solution
to the Biblical Hebrew text representation problem. I agree with
you about that. But it need not be characterized as "legal" in
opposition to the other examples I cited above. All of these
sequences are "legal" and allowed by the standard.

--Ken
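
For anyone who wants to check the canonical-equivalence point in the
message above, here is a minimal sketch using Python's standard
unicodedata module. It is not part of the original exchange. The
helper names (canonically_equivalent, pair) are made up for
illustration, and the results reflect whatever Unicode version the
local Python build ships with. It compares the two orderings of
<a, ring above, X, dot below> for several choices of the embedded
character X, including CGJ.

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220
NUL = "\u0000"         # control character, combining class 0
LRM = "\u200E"         # LEFT-TO-RIGHT MARK, combining class 0
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0


def canonically_equivalent(s, t):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


def pair(sep):
    """Hypothetical helper: build both orderings around an embedded character."""
    return ("a" + RING_ABOVE + sep + DOT_BELOW,
            "a" + DOT_BELOW + sep + RING_ABOVE)


if __name__ == "__main__":
    for name, sep in [("nothing", ""), ("NUL", NUL), ("LRM", LRM), ("CGJ", CGJ)]:
        s, t = pair(sep)
        print(f"embedded {name}: equivalent = {canonically_equivalent(s, t)}")
```

With nothing embedded, the two marks are reordered by combining class
and the strings compare equal. With NUL, LRM, or CGJ in between, the
embedded character has combining class 0, so the marks on either side
are not reordered across it and the strings stay distinct.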