On the surface, this does seem confusing: it seems like it could imply that 
there might be an existing problem with normalization — a sequence that should 
be equivalent to its NFD form but that would have marks in a different order in 
the NFD form. 

However, wrt that apparent contradiction, it's important to keep in mind that 
canonical combining classes are used in conjunction with Unicode normalization, 
and that all defined normalization forms begin with decomposition followed by 
canonical ordering of marks.

So, for instance, consider a character sequence < 0F81, 0F84 >. The canonical 
combining class of 0F81 is 0, implying that nothing would re-order around that 
character in canonical ordering. And compare that with the equivalent 
decomposed sequence (using the decomposition mapping for 0F81), < 0F71, 0F80, 
0F84 >. The canonical combining classes of those characters, in sequence, are < 
129, 130, 9 >, and so you might expect those would canonically reorder in the 
order 9 < 129 < 130, hence a sequence < 0F84, 0F71, 0F80 >. Yet the sequence < 
0F84, 0F71, 0F80 > would appear not to be equivalent to the original sequence < 
0F81, 0F84 >.

The fallacy in that reasoning is the step of considering canonical ordering of 
the non-decomposed sequence < 0F81, 0F84 >. Canonical reordering is only ever 
intended to be done on fully decomposed sequences. Once 0F81 is decomposed, the 
marks do reorder, and the NFD form of < 0F81, 0F84 > is < 0F84, 0F71, 0F80 >, 
so the two sequences are canonically equivalent after all.
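Both steps can be observed directly with Python's standard-library unicodedata 
module (a quick illustrative sketch, not normative data):

```python
import unicodedata

# The original sequence: < 0F81, 0F84 >.
s = "\u0f81\u0f84"

# NFD first fully decomposes (0F81 -> 0F71 0F80), and only then applies
# canonical ordering to the decomposed sequence.
nfd = unicodedata.normalize("NFD", s)

print([f"U+{ord(c):04X}" for c in nfd])
# ['U+0F84', 'U+0F71', 'U+0F80']

print([unicodedata.combining(c) for c in nfd])
# [9, 129, 130] -- nondecreasing, as canonical ordering requires
```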

That explains why there isn't any contradiction regarding normalization.

But now to get to your question: isn't it a discrepancy to have a mark with 
ccc=0 decompose to a sequence of marks with ccc > 0? 

The only potential discrepancy that would matter would be if there were a 
problem with normalization. That's because canonical combining classes only 
have relevance in relation to normalization. And I've explained above why there 
isn't any such issue.
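That can also be checked mechanically: two sequences are canonically equivalent 
exactly when their NFD forms are identical. A quick sketch in Python:

```python
import unicodedata

a = "\u0f81\u0f84"        # < 0F81, 0F84 >
b = "\u0f84\u0f71\u0f80"  # < 0F84, 0F71, 0F80 >

# Canonical equivalence holds iff the NFD forms are identical.
assert unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
print("canonically equivalent")
```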

So, with that in mind... Every combining mark must be assigned some canonical 
combining class. In this case, we're considering a mark that's a precomposed 
form for a sequence of marks with different combining classes, 129 and 130. If 
0F81 were assigned ccc = 129, that would seem strange (and you or someone else 
would eventually ask for an explanation). Likewise, if 0F81 were assigned ccc = 
130. The likely reason why 0F81 was assigned to class 0 is that it needed to be 
assigned to _some_ class and class 0 was the least strange choice. 

Note that 0F81 could have been assigned to _any_ canonical combining class and 
it would not have had any effect on normalization: The canonical combining 
class of a combining mark with a canonical decomposition mapping is never used! 
Only the ccc for characters in the fully decomposed sequence matters. Even so, 
I think it's fair to say that ccc = 0 is the least strange assignment for 0F81.

Likewise for 0F73 and 0F75.
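For what it's worth, the same pattern is visible for all three characters via 
Python's unicodedata (again just a sketch over the UCD data):

```python
import unicodedata

for cp in (0x0F73, 0x0F75, 0x0F81):
    ch = chr(cp)
    nfd = unicodedata.normalize("NFD", ch)
    # Each precomposed vowel has ccc = 0, yet decomposes entirely into
    # marks with ccc > 0. The ccc of the precomposed form is never used
    # by normalization; only the cccs of the decomposed marks matter.
    print(f"U+{cp:04X}: ccc={unicodedata.combining(ch)}, "
          f"NFD=<{' '.join(f'U+{ord(c):04X}' for c in nfd)}>, "
          f"NFD cccs={[unicodedata.combining(c) for c in nfd]}")
# U+0F73: ccc=0, NFD=<U+0F71 U+0F72>, NFD cccs=[129, 130]
# U+0F75: ccc=0, NFD=<U+0F71 U+0F74>, NFD cccs=[129, 132]
# U+0F81: ccc=0, NFD=<U+0F71 U+0F80>, NFD cccs=[129, 130]
```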



Peter


-----Original Message-----
From: Unicode <[email protected]> On Behalf Of Diego Frias via 
Unicode
Sent: July 29, 2025 10:48 AM
To: [email protected]
Subject: U+0F81 Canonical Combining Class?

The Tibetan Unicode block contains a number of characters (U+0F73, U+0F75, 
U+0F81) that have a canonical combining class value of zero, and have non-empty 
decomposition mappings. This is not out of the ordinary, but upon inspecting 
the code points that they map to, I found that the canonical combining class of 
each decomposition code point is greater than zero.

In the case of U+0F81, the decomposition mapping is: U+0F71 U+0F80. Both U+0F71 
and U+0F80 have canonical combining class values greater than zero, so U+0F81 
decomposes solely into combining marks, yet has a canonical combining class 
value of zero.

What is the reasoning behind this discrepancy? It is my understanding that 
U+0F81 (TIBETAN VOWEL SIGN REVERSED II, ཱྀ) is supposed to be a combining mark. 
Also, the Tibetan block is the only block that contains code points with this 
behavior. It is likely that I'm misunderstanding the semantics of the canonical 
combining class system.


Diego Frias
