Philippe Verdy said:

> > The same thing can be said about any inserted invisible character,
> > combining or not.
> > 
> > How is: <a, ring above, null, dot below> supposed to be different from
> >         <a, dot below, null, ring above>
> > 
> > How is: <a, ring above, LRM, dot below> supposed to be different from
> >         <a, dot below, LRM, ring above>
> > 
> > In display, they might not be distinct, unless you were doing some
> > kind of show-hidden display. Yet these sequences are not canonically
> > equivalent, and the presence of an embedded control character or an
> > embedded format control character would block canonical reordering.
> 
> 
> I disagree with you, using a LRM mark in the middle of a combining
> sequence is conforming to canonicalization rules but is clearly
> ill-formed, 

It is not. TUS 4.0, p. 71:

D17a Defective combining character sequence: A combining character
     sequence that does not start with a base character.
     
     * Defective combining character sequences occur when a sequence
       of combining characters appears at the start of a string or
       follows a control or format character. Such sequences are
       defective from the point of view of handling of combining
       marks, but are not ill-formed.
              ^^^^^^^^^^^^^^^^^^^^^^
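
As an illustration of D17a, here is a rough sketch in Python that flags
where a defective combining character sequence would start. It uses the
standard unicodedata module. The helper names are hypothetical, and the
"base character" test is only an approximation I am assuming for the
example (any character that is not a mark, not in a C* category, and not
a line or paragraph separator), not the standard's exact definition.

```python
import unicodedata

def is_combining_mark(ch):
    # General categories Mn, Mc, Me.
    return unicodedata.category(ch).startswith("M")

def is_base(ch):
    # Assumption: approximate "graphic, non-combining" by excluding
    # marks, all C* categories (control, format, ...), and Zl/Zp.
    cat = unicodedata.category(ch)
    return not (cat.startswith("M") or cat.startswith("C") or cat in ("Zl", "Zp"))

def defective_positions(text):
    """Return indices where a run of combining marks begins without a
    base character in front of it (string start, or after a control or
    format character)."""
    hits = []
    for i, ch in enumerate(text):
        if not is_combining_mark(ch):
            continue
        prev = text[i - 1] if i > 0 else None
        if prev is None or not (is_base(prev) or is_combining_mark(prev)):
            hits.append(i)
    return hits

# <a, ring above, LRM, dot below>: the dot below follows a format character.
after_lrm = defective_positions("a\u030A\u200E\u0323")
# <dot below, a>: the mark sits at the start of the string.
at_start = defective_positions("\u0323a")
# <a, ring above, dot below>: an ordinary combining character sequence.
ordinary = defective_positions("a\u030A\u0323")
print(after_lrm, at_start, ordinary)
```

Being flagged here means only "defective" in the D17a sense; nothing in
the check makes the string ill-formed.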

> as well as using a NULL control in the middle, which
> breaks the combining sequence.

I'm not claiming it doesn't break the combining sequence. Of
course it does. It creates a defective combining character
sequence, and that poses a challenge for rendering, since it
departs from the usual expectations for normal combining
character sequences. The renderer has to split hairs between
the fact that it is dealing with a defective combining
character sequence and the fact that it is dealing with a
default ignorable character, which is supposed to be ignored
by any text process it does not directly apply to.

But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

And *if* they occur, they are not canonically equivalent, which
was the point I was making to Kent.
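
A minimal sketch of that claim, using Python's standard unicodedata
module (my choice of tool for illustration, not anything the standard
prescribes). Ring above (U+030A) has combining class 230 and dot below
(U+0323) has 220, so the two reorder when adjacent. NULL and LRM have
combining class 0, and canonical reordering never moves a mark across
them. The helper names are hypothetical.

```python
import unicodedata

RING_ABOVE = "\u030A"  # canonical combining class 230
DOT_BELOW = "\u0323"   # canonical combining class 220
NULL = "\u0000"        # control character, combining class 0
LRM = "\u200E"         # format control character, combining class 0

def canonically_equivalent(s1, s2):
    """Two strings are canonically equivalent when their NFD forms match."""
    return unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)

def pair(separator):
    """Build <a, ring, separator, dot> and <a, dot, separator, ring>."""
    return ("a" + RING_ABOVE + separator + DOT_BELOW,
            "a" + DOT_BELOW + separator + RING_ABOVE)

plain = canonically_equivalent(*pair(""))       # marks adjacent: reordered
with_null = canonically_equivalent(*pair(NULL)) # NULL blocks reordering
with_lrm = canonically_equivalent(*pair(LRM))   # LRM blocks reordering
print(plain, with_null, with_lrm)  # True False False
```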

> The proposal to use CGJ however is legal: it does not break the
> combining sequences and grapheme clusters, and thus the whole
> encoded sequence encoded with CGJ will be considered by
> rendering engines, where CGJ is a no-op for rendering but not for
> the canonical ordering ...

Well, yes, which is why I have been advocating it as the
solution to the Biblical Hebrew text representation problem.
I agree with you about that. But it need not be characterized
as "legal" in opposition to the other examples I cited above.
All of these sequences are "legal" and allowed by the
standard.
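
For what it is worth, the difference between CGJ and the other two
characters can be read straight off the character properties. The
following is a small sketch using Python's unicodedata module; the
variable and function names are mine. All three characters have
combining class 0, so each blocks canonical reordering, but only CGJ is
itself a nonspacing mark (Mn), which is why it does not interrupt the
combining character sequence the way a control (Cc) or format (Cf)
character does.

```python
import unicodedata

CGJ = "\u034F"   # COMBINING GRAPHEME JOINER
LRM = "\u200E"   # LEFT-TO-RIGHT MARK
NULL = "\u0000"

def describe(ch):
    """Return (general category, canonical combining class) for ch."""
    return (unicodedata.category(ch), unicodedata.combining(ch))

info = {name: describe(ch)
        for name, ch in (("CGJ", CGJ), ("LRM", LRM), ("NULL", NULL))}

# <a, ring above, CGJ, dot below>: NFD leaves the order untouched,
# because the dot below cannot be moved in front of the CGJ.
with_cgj = "a\u030A" + CGJ + "\u0323"
cgj_blocks_reordering = unicodedata.normalize("NFD", with_cgj) == with_cgj

for name, (category, ccc) in info.items():
    print(name, category, ccc)
print(cgj_blocks_reordering)
```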

--Ken

