2011/7/14 <announceme...@unicode.org>:
> The Unicode Technical Committee has posted a new issue for public review and
> comment. Details are on the following web page:
>
> http://www.unicode.org/review/
>
> Review period for the new item closes on July 27, 2011.
>
> Please see the page for links to discussion and relevant documents. Briefly,
> the new issue is:
>
>
> PRI #200 Draft UTR #49: Unicode Character Categories

Here is a copy of my comment posted to the Online Report (the issue can still be commented on):

[quote]
It looks like the subcategories for [Letter] are not very well formulated in the current CharacterCategories.txt data file, and are in fact inconsistent.

The most obvious level-2 subcategory should include [Consonant], [Vowel], and [Half-consonant]. Other distinctions like [Digraph] should be moved to a lower level.

Note that [Consonant] has been applied to the full basic Arabic abjad, but not to the similar Hebrew abjad. In fact, the file should also distinguish between true [Consonant]s and [Half-consonant]s, the latter including letters that can act, depending on their context, either as consonants (acting like a mute or stop consonant with a default inherent or implied vowel, possibly serving as a holder for an optional vocalic diacritic/mark) or as vowels (e.g. Alef and Yod in Arabic or Hebrew; Y in Latin; RA and LA in Indic scripts).

Yes, this may be fuzzy when several languages use the same script (e.g. W in German is undoubtedly a consonant, but in many languages it is most often a gliding consonant; or V in Roman Latin, where there was no distinction from U), but at least categorizing such letters as [Half-consonant] will flag the ambiguity of their use.

Then the third level should be for case distinctions: [Lowercase], [Uppercase], [Titlecase] and [Uncased] (in scripts that have case distinctions).

The last level can then be used for [Ligatured] (such as Œ and Æ; even if they are still considered plain letters, this still allows specific languages to treat them as letter pairs for collation purposes), [Digraph] (such as IJ), and [Final] (e.g. Greek final sigma).

The content of this (informative) file should also be consistent with the content of the DUCET (which obviously contains case distinctions at the third level). However, secondary differences exposed in the DUCET (e.g. diacritic differences) should probably not be categorized.

And like the DUCET, it should be tailorable in applications or for specific languages (for example in the CLDR database), so that these categories are just the defaults used when there is no tailoring. I do think that the possibility of such tailoring should be stated explicitly in the draft UTR #49!
[/quote]

-- Philippe.
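
P.S. To make the layering suggested above a bit more concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the table layout, the function name categories_for, the sample assignments and the German tailoring are hypothetical, and none of them is taken from CharacterCategories.txt or from CLDR. It only shows the idea of default categories ordered by level (phonetic class, then case, then form) with an optional tailoring that overrides them.

```python
from collections import ChainMap

# Hypothetical default table: char -> (level 2, level 3, level 4).
# Level 2 = phonetic class, level 3 = case, level 4 = form.
# None means "no distinction at this level".
# These sample values are illustrative only, not real UTR #49 data.
DEFAULT_CATEGORIES = {
    "b": ("Consonant", "Lowercase", None),
    "a": ("Vowel", "Lowercase", None),
    "w": ("Half-consonant", "Lowercase", None),
    "Y": ("Half-consonant", "Uppercase", None),
    "\u0153": ("Vowel", "Lowercase", "Ligatured"),    # œ
    "\u03c2": ("Consonant", "Lowercase", "Final"),    # Greek final sigma
    "\u05d0": ("Half-consonant", "Uncased", None),    # Hebrew alef
}

# Hypothetical language tailoring: in German, W is plainly a consonant.
GERMAN_TAILORING = {
    "w": ("Consonant", "Lowercase", None),
}


def categories_for(char, tailoring=None):
    """Return the category path for char, most general level first.

    Entries in the tailoring take precedence over the defaults, in the
    same spirit as a collation tailoring overriding the DUCET.
    """
    table = ChainMap(tailoring or {}, DEFAULT_CATEGORIES)
    phonetic, case, form = table[char]
    levels = (phonetic, case, form)
    return ["Letter"] + [level for level in levels if level is not None]


if __name__ == "__main__":
    print(categories_for("w"))
    print(categories_for("w", GERMAN_TAILORING))
    print(categories_for("\u03c2"))
```

With no tailoring, "w" falls back to the default [Half-consonant] entry; with the German tailoring it resolves to [Consonant], while every character the tailoring does not mention keeps its default categories.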