Steffen Daode Nurpmeso observed:

> Hello, in UAX #44 i read
>
> Simple_Titlecase_Mapping ...
> Note: If this field is null, then the Simple_Titlecase_Mapping
> is the same as the Simple_Uppercase_Mapping for this character.
>
> So a parser has to be aware of this, automatically falling back to
> the uppercase mapping (index 12) when there is no explicit
> titlecase mapping (index 14).
>
> Given this the following surprised me:
>
> ?0[steffen@sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\
> {if (length($15) && $15 == $13) print}' |wc -l
> 1051
> ?0[steffen@sherwood unicode]$ <UnicodeData.txt awk 'BEGIN{FS=";"}\
> {if (length($15) && $15 != $13) print}' |wc -l
> 12
>
> (I.e., 1051 times the redundant mapping is defined.)

Prior to Unicode 5.2, the relevant documentation (in UCD.html) used to say:

  The simple titlecase may be omitted in the data file if the
  titlecase is the same as the uppercase.

Someone correctly pointed out that that statement was ambiguous. It was corrected to the current note, which is both correct and states the intention of the simple titlecase mapping: that it be equivalent to the simple uppercase mapping unless it isn't, in which case a different explicit value will be in the field (the 12 cases you noted).

The redundant titlecase mapping values were not *removed* from the data file, as there was a significant chance that that would disrupt parsers which had long been using conventions which expected explicit values in the field.

--Ken
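
For a parser, the note amounts to a one-line fallback. The following is only a minimal sketch in Python, not anything taken from UAX #44 itself: the function names and the sample lines are assumptions made for illustration. The first two sample lines follow the UnicodeData.txt layout for U+0061 and U+01C6; the third is a hypothetical line with the titlecase field emptied to exercise the fallback; the fourth has no case mappings in either field.

```python
# Zero-based field indices in a semicolon-separated UnicodeData.txt line
# (awk's $13 and $15 are the same fields, counted from one).
UPPER = 12  # Simple_Uppercase_Mapping
TITLE = 14  # Simple_Titlecase_Mapping


def simple_titlecase(fields):
    """Return the titlecase mapping, falling back to uppercase when null.

    An empty result means neither field is set, i.e. the character
    maps to itself.
    """
    return fields[TITLE] or fields[UPPER]


def classify(fields):
    """Label the explicit titlecase field as absent, redundant or distinct."""
    title, upper = fields[TITLE], fields[UPPER]
    if not title:
        return "absent"
    return "redundant" if title == upper else "distinct"


# Illustrative sample data; SAMPLE_LINES[2] is hypothetical (see lead-in).
SAMPLE_LINES = [
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;;;;N;"
    "LATIN SMALL LETTER D Z HACEK;;01C4;;01C5",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        fields = line.split(";")
        print(fields[0], classify(fields), repr(simple_titlecase(fields)))
```

A parser written this way gives the same answer whether or not the redundant values are present in the data file, which is the property the current wording of the note is meant to guarantee.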