"Martin J. Dürst" <[email protected]> wrote: > > On 2010/07/29 13:33, karl williamson wrote: > > Asmus Freytag wrote: > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote: > > >>> Well, there actually is such a script, namely Han. The digits (一、 > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as > >>> decimal place-value digits, and they are scattered widely, and of > >>> course there are is a lot of modern living practice. > > >> The situation is worse than you indicate, because the same characters > >> are also used as elements in a system that doesn't use place-value, > >> but uses special characters to show powers of 10. > > No. Sequences of numeric Kanji are also used in names and word-plays, > and as sequences of individual small numbers.

(1) Existing exception:

There is one example of a digit which has numeric type = decimal AND is
encoded in a "scattered" way:

    19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N

The other nine decimal digits of the Tham variant of the New Tai Lue
digits are borrowed from another sequence of decimal digits, which starts
at U+19D0 (digit zero); only U+19D1 (digit one) is replaced. Both sets
are assigned the same "New_Tai_Lue" script property value. So the
additional stability proposal will not be enforceable.

(2) Arabic digits:

This case was avoided for the Eastern/Extended variant of the
Arabic-Indic digits in U+06F0..U+06F9. They do not borrow the common
forms from the Standard variant in U+0660..U+0669: they were encoded
again separately to create a complete sequence of 10 digits, even though
most of them (all except 4 to 6) look exactly the same and belong to the
same unified "script".

But what is even more "strange" is that the Standard Arabic digits are
assigned to the "Common" script, whereas the Eastern/Extended variant is
assigned to the "Arabic" script (look at the Unicode script property
value in the file "Scripts-5.2.0.txt" in the UCD). If you just look at
this property, you may think that the Extended/Eastern digits are the
standard ones for the Arabic script: this is a side effect of the
unification of the Western and Eastern variants of the Arabic script.

(3) Unification of the Arabic script:

Ideally, there should be two additional separate ISO 15924 script codes
for the Western and Eastern variants of the Arabic script (possibly
[Arbs] for Standard/Western and [Arbx] for Extended/Eastern), and the
Western and Eastern digits or letters should be segregated by giving
them separate Script property values (splitting the Arabic script where
it is significant, just as occurred for the Georgian and Greek/Coptic
alphabets).

Nothing will be changed for the existing Arabic script, but the
"Extended/Eastern Arabic" script (assigned a new ISO 15924 code and
mapped to a new property value alias in Unicode) will still borrow most
of its letters from the standard script without reencoding them. No
character or block will be renamed (and I DO NOT propose disunifying
existing common Arabic letters, or assigning them to the "Common"
script); it should just be a better sub-classification, where the
characters are clearly distinguished between the two variants.

Most Arabic characters should remain in the common "Arabic" script, and
those that are differentiated should be assigned to a "Standard_Arabic"
or "Extended_Arabic" script. But this may cause some complication for
script inheritance in spans of text (because the "Arabic" script
property value would behave a bit like "Common" does for alphabetic
scripts, i.e. like a group of scripts).

Such a change to the assigned script property value (if it is not
already stabilized) would require documentation, and changes in a few
other core or derived data files:

- PropertyValueAliases.txt (adding the new property values for "sc"):
    sc ; Arab ; Arabic           # (all forms; includes "sc=Arbc", "sc=Arbs" and "sc=Arbx" in regexps)
    sc ; Arbc ; Common_Arabic
    sc ; Arbs ; Standard_Arabic  # (also includes "sc=Arbc" in regexps)
    sc ; Arbx ; Extended_Arabic  # (also includes "sc=Arbc" in regexps)
- Scripts.txt (assigning the two new property values to remap existing
  "Arabic")
- ArabicShaping.txt (possibly adding comments at the end of lines where
  this is not the Common Arabic)
- Joining-Groups.txt (same remark)
- BidiMirroring.txt (same remark)

The descriptions of some standard script identification and segmentation
algorithms would need updating as well.
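
To make the "includes ... in regexps" comments in the alias list above a bit more concrete, here is a minimal Python sketch of how such grouped matching might behave. Everything in it is an assumption made for illustration: Arbc, Arbs and Arbx are only the hypothetical codes proposed in this message, the function names are invented, and the code point ranges are a toy assignment, not real UCD data.

```python
from typing import Optional

# Hypothetical values from the proposal; Arbc, Arbs and Arbx are not real
# script codes. Each query value maps to the set of values it would match.
PROPOSED_INCLUDES = {
    "Arab": {"Arab", "Arbc", "Arbs", "Arbx"},  # all forms
    "Arbc": {"Arbc"},
    "Arbs": {"Arbs", "Arbc"},                  # Standard plus shared characters
    "Arbx": {"Arbx", "Arbc"},                  # Extended plus shared characters
}


def proposed_script(cp: int) -> Optional[str]:
    """Toy assignment for a few ranges only (an assumption, not UCD data)."""
    if 0x0660 <= cp <= 0x0669:   # Standard Arabic-Indic digits
        return "Arbs"
    if 0x06F0 <= cp <= 0x06F9:   # Extended Arabic-Indic digits
        return "Arbx"
    if 0x0621 <= cp <= 0x063A:   # some letters shared by both variants
        return "Arbc"
    return None                  # everything else is out of scope here


def matches(query: str, cp: int) -> bool:
    """True if a regexp test for sc=<query> would accept code point cp."""
    value = proposed_script(cp)
    return value is not None and value in PROPOSED_INCLUDES[query]


def all_match(query: str, text: str) -> bool:
    """True if every character of text passes the sc=<query> test."""
    return all(matches(query, ord(ch)) for ch in text)
```

With this toy data, a string of shared letters followed by Standard digits passes the sc=Arbs test but fails sc=Arbx, and a string that mixes digits from both sets passes only sc=Arab.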

I don't know if IDNA, which uses such script segmentation, should
continue to use "Arab" (all forms) or if it should segregate "Arbs" and
"Arbx" to avoid mixing digits that are visually confusable (note that
these characters are canonically distinct for normalization purposes).

Philippe.

