Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

Philippe Verdy Thu, 29 Jul 2010 06:12:00 -0700

"Martin J. Dürst" <[email protected]> wrote:
>
> On 2010/07/29 13:33, karl williamson wrote:
> > Asmus Freytag wrote:
> >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
>
> >>> Well, there actually is such a script, namely Han. The digits (一、
> >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
> >>> decimal place-value digits, and they are scattered widely, and of
> >>> course there are is a lot of modern living practice.
>
> >> The situation is worse than you indicate, because the same characters
> >> are also used as elements in a system that doesn't use place-value,
> >> but uses special characters to show powers of 10.
>
> No. Sequences of numeric Kanji are also used in names and word-plays,
> and as sequences of individual small numbers.


  (1) Existing exception:

There's one example of a digit which has a numeric type = decimal, AND
is encoded in a "scattered" way:

19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N

The other nine decimal digits of the Tham variant of the New Tai Lue
digits are borrowed from another sequence of decimal digits, which
starts at U+19D0 (digit zero); only U+19D1 (digit one) is replaced,
by U+19DA. Both sets are assigned the same "New_Tai_Lue" script
property value.

So the additional stability proposal will not be enforceable.
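
The properties quoted above can be inspected with a short Python
sketch. It relies on the standard "unicodedata" module, whose data
comes from whatever UCD version is bundled with the interpreter, so
the category and numeric type it reports for U+19DA may differ from
the 5.2 values quoted here:

```python
import unicodedata

# Regular New Tai Lue digits U+19D0..U+19D9, followed by the Tham digit one.
CODE_POINTS = list(range(0x19D0, 0x19DA)) + [0x19DA]

for cp in CODE_POINTS:
    ch = chr(cp)
    name = unicodedata.name(ch, "<no name in this UCD version>")
    category = unicodedata.category(ch)
    # decimal() yields None when the character is not Numeric_Type=Decimal
    # in the UCD version bundled with this Python.
    decimal = unicodedata.decimal(ch, None)
    numeric = unicodedata.numeric(ch, None)
    print(f"U+{cp:04X} gc={category} decimal={decimal} numeric={numeric} {name}")

print("UCD version:", unicodedata.unidata_version)
```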


  (2) Arabic digits:

Such a case was avoided for the Eastern/Extended variant of the
Arabic-Indic digits in U+06F0..U+06F9: rather than borrowing the
common forms from the Standard variant in U+0660..U+0669, they were
encoded separately to form a complete sequence of 10 digits, even
though most of them (all except 4 to 6) look identical and belong to
the same unified "script".

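As a minimal sketch of what "a complete sequence of 10 digits" means
in practice, the following Python compares the two ranges using the
standard "unicodedata" module. The helper names are only illustrative.
It checks decimal values and normalization behaviour, not glyph
shapes, which cannot be read from the character properties:

```python
import unicodedata

STANDARD = [chr(cp) for cp in range(0x0660, 0x066A)]  # ARABIC-INDIC DIGIT ZERO..NINE
EXTENDED = [chr(cp) for cp in range(0x06F0, 0x06FA)]  # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE

def decimal_values(chars):
    """Decimal digit value of each character (raises ValueError if not a decimal digit)."""
    return [unicodedata.decimal(ch) for ch in chars]

def distinct_under_normalization(a, b):
    """True if none of the four normalization forms maps a and b to the same string."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(f, a) != unicodedata.normalize(f, b) for f in forms)

for std, ext in zip(STANDARD, EXTENDED):
    print(f"U+{ord(std):04X} / U+{ord(ext):04X}: "
          f"value {unicodedata.decimal(std)} / {unicodedata.decimal(ext)}, "
          f"distinct={distinct_under_normalization(std, ext)}")
```

Each range is a full run of ten decimal digits, and the pairs stay
distinct under normalization even where the glyphs are the same.
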
But what is even more "strange" is that the Standard Arabic digits are
assigned to the "Common" script, whereas the Eastern/Extended variant
is assigned to the "Arabic" script (see the Unicode script property
values in the file "Scripts-5.2.0.txt" in the UCD).

If you just look at this property, you may think that the
Extended/Eastern digits are the standard ones for the Arabic script:
this is a side-effect of unification of Western and Eastern variants
of the Arabic script.


  (3) Unification of the Arabic script:

Ideally, there should be two additional separate ISO 15924 script
codes for the Western and Eastern variants of the Arabic script
(possibly [Arbs] for Standard/Western, and [Arbx] for
Extended/Eastern), and the Western and Eastern digits or letters
should be segregated by giving them separate Unicode Script property
values (splitting the Arabic script where it is significant, just as
occurred for the Georgian and Greek/Coptic alphabets).

Nothing will be changed for the existing Arabic script, but the
"Extended/Eastern Arabic" script (assigned with a new ISO 15924 code
and mapped with a new property alias in Unicode), will still borrow
most of its letters from the standard script without reencoding them.

No character or block will be renamed (and I do NOT propose
disunifying existing common Arabic letters, or assigning them to the
"Common" script); it should just be a better sub-classification, where
the characters are clearly distinguished between the two variants.

Most Arabic characters should remain in the common "Arabic" script,
and those that are differentiated should be assigned in a
"Standard_Arabic" or "Extended_Arabic" script. But this may cause some
complication for the script inheritance in spans of text (because the
"Arabic" script property value would behave a bit like what "Common"
does for alphabetic scripts, i.e. like a group of scripts).

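The inheritance complication can be illustrated with a small Python
sketch. Everything in it is hypothetical: the "Standard_Arabic" and
"Extended_Arabic" values do not exist in the UCD, and assigning only
the two digit ranges to them is an assumption made for the example.
Characters outside the Arabic block are simply ignored:

```python
# HYPOTHETICAL property values, for illustration only (not in the UCD).
STANDARD = "Standard_Arabic"
EXTENDED = "Extended_Arabic"
GENERIC = "Arabic"

def proposed_script(ch):
    """Assumed per-character assignment: only the two digit sets are differentiated."""
    cp = ord(ch)
    if 0x0660 <= cp <= 0x0669:
        return STANDARD
    if 0x06F0 <= cp <= 0x06F9:
        return EXTENDED
    if 0x0600 <= cp <= 0x06FF:
        return GENERIC          # shared letters keep the generic value
    return None                 # outside the Arabic block: ignored here

def resolve_run(text):
    """Resolve a run's script; generic "Arabic" inherits the specific variant, if any."""
    values = {proposed_script(ch) for ch in text} - {None}
    variants = values & {STANDARD, EXTENDED}
    if len(variants) == 2:
        return "Mixed"          # both variants occur in the same run
    if len(variants) == 1:
        return next(iter(variants))
    return GENERIC if values else None

print(resolve_run("\u0633\u0661"))   # letter SEEN + standard digit one
print(resolve_run("\u0633\u06F1"))   # letter SEEN + extended digit one
```

In this sketch a run takes the variant of whichever differentiated
character it contains, which is the group-like behaviour described
above.
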
Such a change to the assigned script property value (if it's not
already stabilized) would require documentation, and changes in a few
other core or derived data files:

- PropertyValueAliases.txt (adding the new property values for "sc"):
sc ; Arab      ; Arabic # (all forms; includes "sc=Arbc", "sc=Arbs" and "sc=Arbx" in regexps)
sc ; Arbc      ; Common_Arabic
sc ; Arbs      ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
sc ; Arbx      ; Extended_Arabic # (also includes "sc=Arbc" in regexps)

- Scripts.txt (assigning the two new property values to remap existing "Arabic")
- ArabicShaping.txt (possibly adding comments at end of lines where
this is not the Common Arabic)
- Joining-Groups.txt (same remark)
- BidiMirroring.txt (same remark)

Changes would also be needed in the descriptions of some standard
script identification and segmentation algorithms. I don't know
whether IDNA should continue to use "Arab" (all forms) or should
segregate "Arbs" and "Arbx" (to avoid mixing visually confusable
digits), since it relies on such segmentation (note that these
characters are canonically distinct for normalization purposes).
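
The digit-mixing concern can be sketched in a few lines of Python that
rely only on the two existing code point ranges. This is an
illustration of the kind of check meant here, with function names
chosen for the example; it is not an implementation of any IDNA
specification:

```python
def arabic_digit_sets_used(label):
    """Return which of the two Arabic digit sets occur in a label."""
    used = set()
    for ch in label:
        cp = ord(ch)
        if 0x0660 <= cp <= 0x0669:
            used.add("standard")    # ARABIC-INDIC DIGIT ZERO..NINE
        elif 0x06F0 <= cp <= 0x06F9:
            used.add("extended")    # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
    return used

def mixes_arabic_digit_sets(label):
    """True if a label contains digits from both sets."""
    return len(arabic_digit_sets_used(label)) > 1

# Two labels that differ only in which digit one they use are distinct
# strings, even though the digits are visually confusable.
a = "\u0633\u0661"   # SEEN + U+0661
b = "\u0633\u06F1"   # SEEN + U+06F1
print(a == b, mixes_arabic_digit_sets(a), mixes_arabic_digit_sets(a + b))
```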

Philippe.
