Peter Kirk asked:

> It does look very odd that 1D28 has been separated from the other pi's,
> 1D29 from the other rho's etc. Is there a good reason for that? I know
> everyone hates the UPA (except for Uralicists presumably), but these
> letters are still clearly variants of pi and rho. The same applies to
> the Latin small caps of course - why are they collated separately at the
> first level when all other font variants are not?
The reason for this is that these are *small capital* variants. Small capitals were never given compatibility decomposition mappings in UnicodeData.txt. Thus, because compatibility decomposition mappings are used for the first, automated cut at tertiary weighting distinctions, small capitals don't get autoweighted as tertiary variants. Instead, the input file is generated in such a way that they get primary weights right after the group of characters associated with the primary weight of the base character.

If you look further in the collation charts outside of Greek, you will find that this is done consistently this way for the Latin letters. So "fixing" it for the few Greek small capitals from UPA would introduce an inconsistency between Greek and Latin weighting.

Also, "fixing" it would be a non-trivial task, since it would either require introducing another distinct tertiary weight into the table or would require treating "small capital" as a secondary weight distinction. The latter would be easier to implement, but then would lead to arguments among the perfectionists as to why "small capital" should be a secondary weight distinction when capital versus small is a tertiary weight distinction. And so on and so on...

In any case, these small capitals are very, very unlikely to count much in sorting of any real corpus of data, and even if they do, the mechanism of tailoring is always available for people to tweak the table into exactly the behavior they prefer.

--Ken
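
P.S. For anyone who wants to check the decomposition point for themselves, here is a minimal sketch using Python's standard unicodedata module. It only reads the decomposition field of UnicodeData.txt as bundled with the interpreter; it does not compute collation weights, and the helper names (has_decomposition, describe) are made up for this illustration. The small capitals should come back with an empty mapping, while GREEK PI SYMBOL and MATHEMATICAL BOLD SMALL PI carry a tagged compatibility mapping to U+03C0.

```python
import unicodedata

def has_decomposition(ch):
    """True if UnicodeData.txt lists any decomposition mapping for ch."""
    return unicodedata.decomposition(ch) != ""

def describe(name):
    """Format one character's code point, name, and decomposition field."""
    ch = unicodedata.lookup(name)
    mapping = unicodedata.decomposition(ch) or "(no decomposition)"
    return f"U+{ord(ch):04X} {name}: {mapping}"

# Small capitals (no mapping expected) next to other pi variants
# that do carry a compatibility mapping.
NAMES = [
    "GREEK LETTER SMALL CAPITAL PI",   # U+1D28
    "GREEK LETTER SMALL CAPITAL RHO",  # U+1D29
    "LATIN LETTER SMALL CAPITAL B",
    "GREEK PI SYMBOL",
    "MATHEMATICAL BOLD SMALL PI",
]

if __name__ == "__main__":
    for name in NAMES:
        print(describe(name))
```

Because the first automated cut at tertiary weights keys off exactly this field, a character with an empty mapping has nothing to hang a tertiary difference on, which is why it ends up with its own primary weight.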