RE: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

CE Whitehead Sun, 01 Aug 2010 18:53:58 -0700


> Date: Fri, 30 Jul 2010 06:04:26 +0200
> From: [email protected]
> To: [email protected]; [email protected]
> CC: [email protected]; [email protected]; [email protected]; 
> [email protected]
>
> For Arabic there are clearly two separate sets of digits, but the
> possibility of mixing them arbitrarily is still a problem for IDNA (if
> both sets are accepted)
Both are accepted (according to online info and Martin Duerst, as I understand 
things).
> , notably because most digits (except 4 to 6)
> are completely identical. So registries will have to:
> - either accept one set and reject the other one
> - accept both, but only one within the same domain label, reserving
> also the label using the other set (as if they were canonically
> equivalent).
The Saudi registry folds these, or at least the .sa registry plans to have 
that done. I did not realize that a registry could control folding; perhaps 
it is only recommending folding.
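
To make the second option concrete, here is a minimal sketch of my own (not
any registry's actual code). It assumes the two sets are the Arabic-Indic
digits U+0660..U+0669 and the Extended Arabic-Indic digits U+06F0..U+06F9;
the function names are hypothetical.

```python
# Assumption: the two digit sets are these two Unicode ranges.
ARABIC_INDIC = range(0x0660, 0x066A)           # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)  # U+06F0..U+06F9


def digit_sets_used(label):
    """Return the names of the Arabic digit sets that occur in the label."""
    used = set()
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            used.add("arabic-indic")
        elif cp in EXTENDED_ARABIC_INDIC:
            used.add("extended-arabic-indic")
    return used


def mixes_digit_sets(label):
    """True if the label draws digits from both sets (to be rejected)."""
    return len(digit_sets_used(label)) > 1


def variant_label(label):
    """Swap every digit for the same-valued digit of the other set.

    This is the label a registry could reserve alongside the registered one.
    """
    out = []
    for ch in label:
        cp = ord(ch)
        if cp in ARABIC_INDIC:
            out.append(chr(cp - 0x0660 + 0x06F0))
        elif cp in EXTENDED_ARABIC_INDIC:
            out.append(chr(cp - 0x06F0 + 0x0660))
        else:
            out.append(ch)
    return "".join(out)
```

With this, a label such as "\u0661\u06F2" would be flagged as mixed, and
registering "bank\u0661" would also reserve "bank\u06F1".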
>
> Such equivalences (which are definitely not canonical)
Yes.
> can be handled
> by tailored collation compares (operating at collation level 2 only,
> when non-IDN registries operate only at level 1),
So you are proposing something like folding these in string prep? But I am 
confused about why level 1 would not work (sorry to ask a dumb question).
> where IDN registries
> will use their own tailoring. I just see the IDN "StringPrep" as a
> particular application of the general concept of collation mappings
> (except that it was not designed on linguistic bases, but an IDN
> registry can be viewed as a locale for collation purposes).
The Saudi registry's policy, it seems, is to accept both digit sets and then 
fold the two varieties into the non-Eastern variety (both varieties are 
apparently available on the Saudi keyboard -- or should be). But there are 
other registries that will handle Arabic-script domains.
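
As I understand that folding, it could look roughly like the sketch below.
This is only my illustration: I am assuming the "non-Eastern" variety means
the Arabic-Indic digits U+0660..U+0669 and the Eastern variety means
U+06F0..U+06F9, and fold_digits is a hypothetical name, not anything taken
from the registry's tables.

```python
# Assumption: fold Extended (Eastern) Arabic-Indic digits U+06F0..U+06F9
# onto the Arabic-Indic digits U+0660..U+0669 of the same value.
FOLD_TO_NON_EASTERN = {0x06F0 + i: 0x0660 + i for i in range(10)}


def fold_digits(label):
    """Return the label with every Eastern digit replaced by its counterpart."""
    return label.translate(FOLD_TO_NON_EASTERN)


eastern = "bank\u06F1"      # digit one from the Eastern set
non_eastern = "bank\u0661"  # digit one from the non-Eastern set

print(eastern == non_eastern)                            # False
print(fold_digits(eastern) == fold_digits(non_eastern))  # True
```

The two labels differ as code point sequences, but after folding they compare
equal, so only one of them could be registered.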
> All these
> complex rules and mappings of IDN can be written in terms of a set
> collation rules, added on top of the DUCET.
>

O.k. -- a possibility. One can add these to the DUCET, but collation is always 
tailorable, according to the whims of the application programmer (the browser 
developer), as far as I understand things. But it's better than not having a 
standard, and not specifying what to do (so that each registry and application 
programmer might very likely handle these differently).
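
To show what I mean by comparing at different levels, here is a toy sketch.
It is NOT the UCA and it does not use the DUCET: the weights are invented,
and I am not claiming anything about the level at which the DUCET itself
distinguishes these digits. At level 1 every decimal digit is weighted by its
numeric value alone (so ASCII digits also fall together with the Arabic
ones); at level 2 the code points break the tie.

```python
import unicodedata


def primary_key(s):
    """Toy level-1 key: a decimal digit is keyed by its value only."""
    key = []
    for ch in s:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        if value is not None:
            key.append(("digit", value))
        else:
            key.append(("char", ord(ch)))
    return tuple(key)


def compare(a, b, strength=1):
    """Return -1, 0 or 1; strength 1 ignores which digit set was used."""
    ka, kb = primary_key(a), primary_key(b)
    if ka != kb:
        return -1 if ka < kb else 1
    if strength >= 2 and a != b:
        # Invented second level: fall back to code point order.
        return -1 if a < b else 1
    return 0
```

Under these assumptions compare("bank\u0661", "bank\u06F1") is 0 at
strength 1 and non-zero at strength 2.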

(NOTE: Bank1 in Persian and Bank1 in Arabic will look identical, even though 
a different digit 1 (a different code point) will be used in each case --- 
unless something can be worked out as a standard.

According to Saudi Arabia's registry [for the domain .sa] [which recommends 
something like what Philippe has suggested]:
". . . both sets should be supported in the user interface and both are folded 
to one set (Set I)  at the preparation of internationalized strings [e.g., 
"stringprep"] phase."
[But I am confused: how does Saudi Arabia's registry control stringprep?
See: http://www.iana.org/domains/idn-tables/tables/sa_ar_1.0.html]

On the other hand, tr36 recommends an alert for such confusables, if I 
understand things [ 
http://www.unicode.org/reports/tr36/proposed.html#Visual_Spoofing_Recommendation
].)

In any case the only other two countries that will be able to register 
Arabic-language domains at this point, as far as I can tell, are Egypt and the 
United Arab Emirates.
(see: http://www.itp.net/580094-gulf-countries-can-now-register-arabic-domains
http://www.idnnews.com/?p=9809). However, I do not know whether their 
policies will be the same as Saudi Arabia's, or which other countries (Iran, 
Pakistan, India) will register Arabic-script domains soon. Nor do I know what 
each browser developer will do about confusables that include digits (I 
checked a little and found various policies).


(Of course, a smart banker would not register a bank1 in the 
Mideast/Arabic-Indic digit system)


Best,

--C. E. Whitehead
[email protected]