Peter Kirk asked:

> It does look very odd that 1D28 has been separated from the other pi's, 
> 1D29 from the other rho's etc. Is there a good reason for that? I know 
> everyone hates the UPA (except for Uralicists presumably), but these 
> letters are still clearly variants of pi and rho. The same applies to 
> the Latin small caps of course - why are they collated separately at the 
> first level when all other font variants are not?

The reason for this is that these are *small capital* variants.
Small capitals were never given compatibility decomposition mappings
in UnicodeData.txt. Thus, because compatibility decomposition
mappings are used for the first, automated cut at tertiary
weighting distinctions, small capitals don't get autoweighted
as tertiary variants. Instead, the input file is generated in
such a way that they get primary weights right after the group
of characters associated with the primary weight of the base
character.
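
As a rough way to see the missing mappings for yourself, here is a minimal
sketch using Python's standard unicodedata module, which exposes the
decomposition field of UnicodeData.txt. The helper name compat_mapping and
the choice of sample characters are mine, purely for illustration; the
output reflects whatever Unicode version your Python ships with.

```python
import unicodedata

def compat_mapping(ch):
    """Return the decomposition field for ch, or '' if it has none."""
    return unicodedata.decomposition(ch)

# Font-variant symbols (which carry a <compat> mapping) next to small capitals.
samples = [
    "\u03d6",  # GREEK PI SYMBOL
    "\u03f1",  # GREEK RHO SYMBOL
    "\u1d28",  # GREEK LETTER SMALL CAPITAL PI
    "\u1d29",  # GREEK LETTER SMALL CAPITAL RHO
    "\u1d00",  # LATIN LETTER SMALL CAPITAL A
]

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {compat_mapping(ch)!r}")
```

The pi and rho symbols report a <compat> mapping back to the base letter,
while the small capitals report an empty string, so an automated pass keyed
on those mappings has nothing to work from.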

If you look further in the collation charts outside of Greek, you
will find that the Latin small capitals are handled in exactly the
same way. So "fixing" it for the few Greek small capitals from
UPA would introduce an inconsistency between Greek and Latin
weighting. Also, "fixing" it would be a non-trivial task, since
it would either require introducing another distinct tertiary
weight into the table or would require treating "small capital"
as a secondary weight distinction. The latter would be easier
to implement, but then would lead to arguments among the
perfectionists as to why "small capital" should be a secondary
weight distinction when capital versus small is a tertiary
weight distinction. And so on and so on...

In any case, these small capitals are very, very unlikely to
matter much in the sorting of any real corpus of data, and even if
they do, the mechanism of tailoring is always available for
people to tweak the table into exactly the behavior they
prefer.
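
To illustrate what such a tailoring amounts to, here is a toy sketch in
Python. The weights are invented for the example and are not taken from the
actual table, and sort_key is a hypothetical, much simplified stand-in for
real collation key construction: it just compares all primary weights
first, then all secondary weights, then all tertiary weights.

```python
# Toy weights, invented for illustration: char -> (primary, secondary, tertiary).
DEFAULT = {
    "\u03b1": (1, 0, 0),    # alpha
    "\u03c0": (10, 0, 0),   # small pi
    "\u03a0": (10, 0, 1),   # capital pi: tertiary difference from small pi
    "\u1d28": (11, 0, 0),   # small capital pi: its own primary, after the pi group
    "\u03c9": (30, 0, 0),   # omega
}

def sort_key(s, table):
    """Build a three-level key: all primaries, then secondaries, then tertiaries."""
    weights = [table[ch] for ch in s]
    return tuple(tuple(w[level] for w in weights) for level in range(3))

# A tailoring that treats small capital pi as a tertiary variant of pi instead.
TAILORED = {**DEFAULT, "\u1d28": (10, 0, 2)}

words = ["\u1d28\u03b1", "\u03c0\u03c9", "\u03a0\u03b1"]
default_order = sorted(words, key=lambda s: sort_key(s, DEFAULT))
tailored_order = sorted(words, key=lambda s: sort_key(s, TAILORED))
print(default_order)
print(tailored_order)
```

With the default toy weights, the word starting with small capital pi sorts
after every word starting with pi, whatever follows. With the tailored
weights it interleaves with the other pi words and is separated from them
only at the tertiary level.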

--Ken

