Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

2010-07-29 Thread Werner LEMBERG

> Instead of continuing the discussion with a back and forth in email,
> I decided instead to write a Unicode Technical Note on the general
> topic, including a case study of alternative orderings for a French
> topic list.

Very nice!  It would help a lot if you add the actual collation
weights (or small code snippets for ICU) to achieve your results as an
appendix.


Werner



Re: High dot/dot above punctuation?

2010-07-29 Thread Martin J. Dürst

Hello Juanma,

On 2010/07/30 12:05, Juanma Barranquero wrote:

> On Fri, Jul 30, 2010 at 04:52, "Martin J. Dürst" wrote:
>
>> It's very clear that we would get nowhere if we wanted to encode
>> all these.
>
> The comment I responded to talked about characters that are already encoded.

Sorry, I didn't get that.

>> In simpler words, you cannot use the needs of discussions about encoding
>> (the meta-level) to determine encodings.
>
> Discussing Arabic versus Latin numerals is no more meta-level than
> talking about upper vs. lowercase.


Yes indeed. If these distinctions were only necessary when talking 
*about* these characters (meta-level) rather than when just using them 
(non-meta), then I would indeed agree that there is no reason to encode 
them separately.


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread Philippe Verdy
"karl williamson"  wrote:
> This discussion doesn't make sense to me.  The original proposal to
> encode 19DA says that there is one set of digits in New Tai Lue, but
> there is an extra digit '1' (the one that got put at 19DA), used when
> the other digit '1' is visually confusable with another character in the
> script, which it resembles.  That makes it sound like the two are
> essentially used as glyph variants of each other, and are
> interchangeable as far as the computer recognizing an input number.

Yes, the exception will work for recognizing this digit as an
exception for INPUT, but you still have a problem for output, because
your library will need to know when to output the variant : if you
always use the default digit 1, you'll create a string that is
possibly confusable to the reader, notably if it appears alone with no
other digit.

So you'll still need an exception to change one or several of these
digits 1, to use the variant, or you'll decide to always use the
variant (which causes no confusion), but I'm not sure that such use
would be valid in the target language. There are possibly complex
rules deciding when the variant is needed and accepted, or when the
default variant is preferable and not confusable.
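
The input/output asymmetry described above can be sketched in Python. This is only an illustration under stated assumptions, not an established API: the function names and the `use_variant` callback are hypothetical, and the rule for when the variant digit one is appropriate is left to the caller because the thread does not state it.

```python
NTL_ZERO = 0x19D0          # NEW TAI LUE DIGIT ZERO; the regular digits run to U+19D9
THAM_DIGIT_ONE = "\u19da"  # the extra digit one discussed in the thread


def parse_number(text):
    """Read a New Tai Lue number, accepting either form of digit one."""
    if not text:
        raise ValueError("empty number")
    value = 0
    for ch in text:
        if ch == THAM_DIGIT_ONE:
            digit = 1
        elif NTL_ZERO <= ord(ch) <= NTL_ZERO + 9:
            digit = ord(ch) - NTL_ZERO
        else:
            raise ValueError(f"not a New Tai Lue digit: U+{ord(ch):04X}")
        value = value * 10 + digit
    return value


def format_number(n, use_variant=lambda digits, index: False):
    """Write a non-negative integer with New Tai Lue digits.

    use_variant(digits, index) is a hypothetical policy hook: for each
    digit one it decides whether the variant U+19DA should be written.
    """
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    digits = str(n)
    out = []
    for i, d in enumerate(digits):
        if d == "1" and use_variant(digits, i):
            out.append(THAM_DIGIT_ONE)
        else:
            out.append(chr(NTL_ZERO + int(d)))
    return "".join(out)
```

Parsing needs no policy at all, since both forms of digit one map to the same value. Formatting is where the open question sits: the default hook never emits the variant, and a caller could pass, say, `lambda digits, i: len(digits) == 1` to use it only for a lone digit one.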

For Arabic there are clearly two separate sets of digits, but the
possibility of mixing them arbitrarily is still a problem for IDNA (if
both sets are accepted), notably because most digits (except 4 to 6)
are completely identical. So registries will have to:
- either accept one set and reject the other one
- accept both, but only one within the same domain label, reserving
also the label using the other set (as if they were canonically
equivalent).
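
The two registry options listed above could be checked mechanically along these lines. This is a rough sketch in Python; the function names are made up for illustration and do not come from any IDNA specification or registry tool.

```python
STANDARD = range(0x0660, 0x066A)   # ARABIC-INDIC DIGIT ZERO..NINE
EXTENDED = range(0x06F0, 0x06FA)   # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE
OFFSET = EXTENDED.start - STANDARD.start


def digit_sets_used(label):
    """Return which of the two Arabic digit sets occur in a label."""
    used = set()
    for ch in label:
        if ord(ch) in STANDARD:
            used.add("standard")
        elif ord(ch) in EXTENDED:
            used.add("extended")
    return used


def mixes_digit_sets(label):
    """True if the label uses both sets (rejected under either option)."""
    return len(digit_sets_used(label)) > 1


def other_set_variant(label):
    """Return the label with every Arabic digit swapped to the other set.

    Under the second option, a registry would reserve this variant
    whenever the original label is registered.
    """
    out = []
    for ch in label:
        cp = ord(ch)
        if cp in STANDARD:
            cp += OFFSET
        elif cp in EXTENDED:
            cp -= OFFSET
        out.append(chr(cp))
    return "".join(out)
```

A registry following the first option would additionally reject any label whose `digit_sets_used` result names the set it does not accept.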

Such equivalences (which are definitely not canonical) can be handled
by tailored collation compares (operating at collation level 2 only,
when non-IDN registries operate only at level 1), where IDN registries
will use their own tailoring. I just see the IDN "StringPrep" as a
particular application of the general concept of collation mappings
(except that it was not designed on linguistic bases, but an IDN
registry can be viewed as a locale for collation purposes). All these
complex rules and mappings of IDN can be written in terms of a set of
collation rules, added on top of the DUCET.
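
One way to read the idea above is as a two-level comparison key, where the first level folds the two Arabic digit sets together and the second level keeps them apart. The Python below is only a toy model of that reading: it is not the UCA, the DUCET, or StringPrep, and `label_key` is a hypothetical name.

```python
# Fold EXTENDED ARABIC-INDIC digits (U+06F0..U+06F9) onto the
# ARABIC-INDIC digits (U+0660..U+0669) for the first comparison level.
_FOLD = str.maketrans({0x06F0 + i: 0x0660 + i for i in range(10)})


def label_key(label):
    """Return a (level 1, level 2) key for a label.

    Level 1 treats the two digit sets as equivalent; level 2 is the
    label itself, so the two spellings still differ in the full key.
    """
    return (label.translate(_FOLD), label)


def same_at_first_level(a, b):
    """True if two labels differ, at most, in which digit set they use."""
    return label_key(a)[0] == label_key(b)[0]
```

A registry that compares on the first element only would treat the two spellings of a label as one name, which matches the "accept both, reserve the other" policy.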



Re: High dot/dot above punctuation?

2010-07-29 Thread Juanma Barranquero
On Fri, Jul 30, 2010 at 04:52, "Martin J. Dürst"  wrote:

> It's very clear that we would get nowhere if we wanted to encode
> all these.

The comment I responded to talked about characters that are already encoded.

> In simpler words, you cannot use the needs of discussions about encoding
> (the meta-level) to determine encodings.

Discussing Arabic versus Latin numerals is no more meta-level than
talking about upper vs. lowercase.

    Juanma




Re: High dot/dot above punctuation?

2010-07-29 Thread Martin J. Dürst



On 2010/07/29 19:51, Juanma Barranquero wrote:

> On Thu, Jul 29, 2010 at 10:15, Khaled Hosny wrote:
>
>> Also, I don't buy into the Unicode idea of
>> encoding different sets of decimal digits separately; they are all
>> different graphical presentations of the same thing.
>
> Not in a document where the author is discussing the differences
> between them, for example.


The "where the author is discussing the differences" doesn't help in 
deciding whether to encode one or two characters. A document may discuss 
the roman and italic versions of a character, or the Times and Palatino 
versions of a character, or different versions of Times fonts for the 
same character, and so on. It's very clear that we would get nowhere if 
we wanted to encode all these.


In simpler words, you cannot use the needs of discussions about encoding 
(the meta-level) to determine encodings.


Regards,   Martin.


--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread karl williamson

Asmus Freytag wrote:

Having Nd be limited to characters that

a) are used in decimal radix numbers
b) are part of a complete, ordered sequence 0..9

would make this property regular enough to serve
implementers. You could script the creation of
relevant data for your implementation based on that
property.

*Exceptions* exist and need to be documented.
Having exceptions machine readable is not as
important, but having implementers understand
them is.

Therefore, the best thing is for these to become
something other than Nd, but to retain their numeric
type of digit.

Together with a detailed explanation of each in
the appropriate script chapter, AND a complete
summary of all exceptional cases in a central
place (section 4.6 comes to mind) would provide
implementers with the information they need.

The exceptional cases that I'm aware of are

a) Arabic using two complete series of digits
b) New Tai Lue using an extra digit 1
c) Han digits being scattered and used in two
different types of numeric expressions
d) ASCII digits being used for some scripts
as preferred decimal-radix digits, because
their native number system is not, or not
exclusively decimal-radix

The above information belongs in section 4.6
in summary form, or simply as a table of pointers
to each script chapter that contains a description
of unusual numeric behavior for decimal-radix
digits.

(A separate table pulling together all the descriptions
of non-decimal radix number systems that are
discussed in the Standard would equally be useful
for the readers).
A./



This sounds good to me.



Re: Digit/letter variants in the "same" unified script

2010-07-29 Thread Martin J. Dürst



On 2010/07/30 5:00, CE Whitehead wrote:


Hi.  Regarding your proposal, for IDN's, I have a security concern:

In the list of unicode allowed characters, the Eastern set of numbers seems to 
be allowed;(http://unicode.org/reports/tr36/idn-chars.html)
Saudi Arabia has got the other set in its allowed list
(http://www.iana.org/domains/idn-tables/tables/sa_ar_1.0.html)
so I gather both are allowed in IDN's.


Yes indeed.


You would then have mixed scripts in IDN's for Arab with either Arbx alone or 
Arbs (if those are the names chosen).
You do not want to display a mixed script warning for that.
(That would be tantamount to my security event viewer's displaying a login 
failure in addition to a login success every time I log in successfully; you 
start to ignore the failure messages.)
(I cannot find these digits in the normalization charts. Sorry.   I suppose 
however that they do not normalize to one another because that would destroy 
sequential processing of them -- which is what Karl is looking for -- although 
sequential processing does not apply to idn's; too bad they cannot just be 
normalized in idn's, that there cannot be a different standard for idn's . . . 
would that be an option?  That's kind of a wild idea too.)


Yes, they indeed don't normalize. They were discussed at length on the 
IDN list. Each registry can decide what works best for it: e.g. Saudi 
Arabia only allows the Arabic digits and Iran only allows the Eastern 
Arabic digits (in both cases, this may be in addition to 0-9), while 
some other registry may allow both sequences and either reserve the 
name that uses the other sequence whenever one is registered, or 
register both in parallel (bundle). It is because of these various 
options that the IDN specs don't make a final decision here.


Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp



Re: UTS#10 (collation) : French backwards level 2, and word-breakers.

2010-07-29 Thread Kenneth Whistler
A couple of weeks ago, in this thread Philippe Verdy said:

> Breaking on words, even if it requires a very modest buffering, 
> will significantly improve the processing time, 
> because each word in the long texts will be scanned only 
> once, and all the rest will occur within the small and 
> constantly reused buffer.
...
> I don't forget that in most practical cases, sorts will operate 
> on texts whose collation keys have been only partly 
> generated and truncated, because they really speed up and 
> reduce the number of compares to perform  ...

and so on.

Instead of continuing the discussion with a back and forth in
email, I decided instead to write a Unicode Technical Note
on the general topic, including a case study of alternative
orderings for a French topic list.

Those who are interested in collation and in the particular issues
that were discussed in this thread may wish to take a look:

http://www.unicode.org/notes/tn34/

--Ken
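
For readers unfamiliar with the "French backwards level 2" behavior named in the subject line, here is a toy illustration in Python. It is an assumption-laden sketch, not ICU, not the DUCET, and not taken from the Technical Note: the accent "weights" are simply the code points of the combining marks after NFD decomposition, and `sort_key` is a hypothetical helper.

```python
import unicodedata


def _decompose(word):
    """Split a word into base letters and, per letter, a tuple of accents."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch):
            if accents:  # attach the mark to the preceding base letter
                accents[-1] += (ord(ch),)
        else:
            bases.append(ch.lower())
            accents.append(())
    return bases, accents


def sort_key(word, backwards_secondary=False):
    """Two-level key: base letters first, then accents.

    With backwards_secondary=True the accent level is compared from the
    end of the word, which is the French convention under discussion.
    """
    bases, accents = _decompose(word)
    if backwards_secondary:
        accents = accents[::-1]
    return (tuple(bases), tuple(accents))


if __name__ == "__main__":
    words = ["c\u00f4t\u00e9", "cote", "cot\u00e9", "c\u00f4te"]
    print(sorted(words, key=sort_key))
    print(sorted(words, key=lambda w: sort_key(w, backwards_secondary=True)))
```

With forward accents the four words sort as cote, coté, côte, côté; with the accent level reversed they sort as cote, côte, coté, côté, because the last accent difference in the word is then decisive.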




Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread Asmus Freytag

Having Nd be limited to characters that

a) are used in decimal radix numbers
b) are part of a complete, ordered sequence 0..9

would make this property regular enough to serve
implementers. You could script the creation of
relevant data for your implementation based on that
property.

*Exceptions* exist and need to be documented.
Having exceptions machine readable is not as
important, but having implementers understand
them is.

Therefore, the best thing is for these to become
something other than Nd, but to retain their numeric
type of digit.

Together with a detailed explanation of each in
the appropriate script chapter, AND a complete
summary of all exceptional cases in a central
place (section 4.6 comes to mind) would provide
implementers with the information they need.

The exceptional cases that I'm aware of are

a) Arabic using two complete series of digits
b) New Tai Lue using an extra digit 1
c) Han digits being scattered and used in two
different types of numeric expressions
d) ASCII digits being used for some scripts
as preferred decimal-radix digits, because
their native number system is not, or not
exclusively decimal-radix

The above information belongs in section 4.6
in summary form, or simply as a table of pointers
to each script chapter that contains a description
of unusual numeric behavior for decimal-radix
digits.

(A separate table pulling together all the descriptions
of non-decimal radix number systems that are
discussed in the Standard would equally be useful
for the readers).
A./
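
The remark above that implementers "could script the creation of relevant data" from the Nd property can be sketched as follows. This Python version relies on the standard `unicodedata` module, so its results reflect whichever Unicode version the running interpreter bundles, not the 5.2 data discussed in this thread; the function name is illustrative.

```python
import sys
import unicodedata


def decimal_digit_runs():
    """Scan all code points for Nd characters.

    Returns (zeros, exceptions): zeros lists the code point of each digit
    zero that starts a contiguous, ordered run 0..9 of Nd characters;
    exceptions lists Nd code points that are not part of such a run.
    """
    zeros, exceptions = [], []
    cp = 0
    while cp <= sys.maxunicode:
        if unicodedata.category(chr(cp)) != "Nd":
            cp += 1
            continue
        is_run = cp + 9 <= sys.maxunicode and all(
            unicodedata.category(chr(cp + i)) == "Nd"
            and unicodedata.decimal(chr(cp + i), None) == i
            for i in range(10)
        )
        if is_run:
            zeros.append(cp)
            cp += 10
        else:
            exceptions.append(cp)
            cp += 1
    return zeros, exceptions


if __name__ == "__main__":
    zeros, exceptions = decimal_digit_runs()
    print(len(zeros), "complete runs;", len(exceptions), "stray Nd characters")
```

Given the list of zeros, converting any digit in a run is a subtraction, which is the kind of regularity the two conditions on Nd are meant to guarantee.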



Re: [ISO15924] Typo for Egyptian_Hierog(l)yphs

2010-07-29 Thread Kenneth Whistler
Philippe Verdy noted:

>
> Everywhere below, the Unicode property value alias is missing an 'l'.
> 
> - In HTML table 1:
> Egyp  050  Egyptian hieroglyphs  hiéroglyphes égyptiens  Egyptian_Hierogyphs  2009-06-01

etc.

These errors in the tables have been corrected by the Registration 
Authority.

--Ken





Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread karl williamson

Mark Davis ☕ wrote:


Mark

/— Il meglio è l’inimico del bene —/


On Thu, Jul 29, 2010 at 05:57, Philippe Verdy wrote:


"Martin J. Dürst" wrote:
 >
 > On 2010/07/29 13:33, karl williamson wrote:
 > > Asmus Freytag wrote:
 > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
 >
 > >>> Well, there actually is such a script, namely Han. The digits (一、
 > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
 > >>> decimal place-value digits, and they are scattered widely, and of
 > >>> course there is a lot of modern living practice.
 >
 > >> The situation is worse than you indicate, because the same characters
 > >> are also used as elements in a system that doesn't use place-value,
 > >> but uses special characters to show powers of 10.
 >
 > No. Sequences of numeric Kanji are also used in names and word-plays,
 > and as sequences of individual small numbers.

 (1) Existing exception :

There's one example of a digit which has a numeric type = decimal, AND
is encoded in a "scattered" way:

19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N

The other decimal nine digits for the Tham variant of the New Tai Lue
digits are borrowed from another sequence of decimal digits, starting
at U+19D0 (for digit zero) with the exception of U+19D1 which is
replaced (for digit one). Both sets are assigned in the same
"New_Tai_Lue" script property value.

So the additional stability proposal will not be enforceable.


On the contrary. Were we to want such a policy, the implication would be 
either to:
(a) change the type of 19DA from Nd to No (what I think would be the 
right thing to do)

(b) grandfather in the character.


This discussion doesn't make sense to me.  The original proposal to 
encode 19DA says that there is one set of digits in New Tai Lue, but 
there is an extra digit '1' (the one that got put at 19DA), used when 
the other digit '1' is visually confusable with another character in the 
script, which it resembles.  That makes it sound like the two are 
essentially used as glyph variants of each other, and are 
interchangeable as far as the computer recognizing an input number.


Thus, it is appropriate to keep it as Nd, and it isn't scattered, 
because it is adjacent to the block of 10 digits.  My original proposal 
accounted for this case, asking that the slot or two immediately above 
the digit '9' be unassigned initially in a new script encoding, just in 
case a situation like this one arises again.


One thing that I should have brought up earlier in this discussion is 
that, as an implementor, I can deal with existing exceptions.  I may not 
want to, and may choose not to if my subjective calculation of 
benefit/cost indicates it's not worthwhile.  Given the existing pattern 
of code point assignments, I saw an efficient way to implement things. 
 And, if future Unicode versions retain this pattern, neither I nor my 
successors will have to change our code to move to that new version. 
Changing code takes a significant amount of time and effort.  Keeping 
new versions of Unicode using the same paradigms as previous versions 
means that implementations of those new versions will be available 
sooner than otherwise, and makes it more likely that they get adopted at all.  I was 
unaware of the subtleties in Han and Arabic; those can be handled as 
exceptions, but making new exceptions is really contrary to Unicode's 
interests.  So it really isn't about current counter examples; there's 
nothing much that can be done about them.  It's about adopting 
guidelines to keep from unnecessarily creating new exceptions.




RE: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread CE Whitehead





> Date: Thu, 29 Jul 2010 14:57:17 +0200
> Subject: Digit/letter variants in the "same" unified script (was: stability 
> policy on numeric type = decimal)
> From: verd...@wanadoo.fr
> To: due...@it.aoyama.ac.jp; pub...@khwilliamson.com
> CC: asm...@ix.netcom.com; kent.karlsso...@telia.com; unicode@unicode.org
> 
> "Martin J. Dürst"  wrote:
>>
>> On 2010/07/29 13:33, karl williamson wrote:
>>> Asmus Freytag wrote:
>>>> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
>>
>>>>> Well, there actually is such a script, namely Han. The digits (一、
>>>>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
>>>>> decimal place-value digits, and they are scattered widely, and of
>>>>> course there is a lot of modern living practice.
>>
>>>> The situation is worse than you indicate, because the same characters
>>>> are also used as elements in a system that doesn't use place-value,
>>>> but uses special characters to show powers of 10.
>>
>> No. Sequences of numeric Kanji are also used in names and word-plays,
>> and as sequences of individual small numbers.
> 
> (1) Existing exception :
> 
> There's one example of a digit which has a numeric type = decimal, AND
> is encoded in a "scattered" way:
> 
> 19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
> 
> The other decimal nine digits for the Tham variant of the New Tai Lue
> digits are borrowed from another sequence of decimal digits, starting
> at U+19D0 (for digit zero) with the exception of U+19D1 which is
> replaced (for digit one). Both sets are assigned in the same
> "New_Tai_Lue" script property value.
> 
> So the additional stability proposal will not be enforceable.
> 
> 
> (2) Arabic digits :
> 
> Such case was avoided for the Eastern/Extended variant of Arabo-Indic
> digits in U+06F0..U+06F9, without borrowing the common forms for the
> Standard variant in U+0660..U+0669: they were reencoded separately to
> create a complete sequence of 10 digits, even if most of them (all
> except 4 to 6) are exactly similar and belong to the same unified
> "script".
> 
> But what is even more "strange" is that the Standard Arabic digits are
> assigned to the "Common" script, when the Eastern/Extended variant is
> assigned to the "Arabic" script (look at the Unicode script property
> value, from the file "Scripts-5.2.0.txt" in the UCD).
> 
> If you just look at this property, you may think that the
> Extended/Eastern digits are the standard ones for the Arabic script:
> this is a side-effect of unification of Western and Eastern variants
> of the Arabic script.
> 
> 
> (3) Unification of the Arabic script:
> 
> Ideally, there should be two additional separate ISO 15924 script
> codes for the Western and Eastern variants of the Arabic script (possibly
> [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
> Unicode "script" property value alias for the Western and Eastern
> . . . 
> Most Arabic characters should remain in the common "Arabic" script,
> and those that are differentiated should be assigned in a
> "Standard_Arabic" or "Extended_Arabic" script. But this may cause some
> complication for the script inheritance in spans of texts (because the
> "Arabic" script property value would behave a bit like what the
> "Common" does for alphabetic scripts, i.e. like a group of scripts).
> 
> Such change for the assigned script property value (if it's not
> already stabilized) would require documentation, and changes in a few
> other core or derived datafiles:
> 
> - PropertyValueAliases.txt (adding two new property values for "sc"):
> sc ; Arab ; Arabic # All forms, includes "sc=Arbc", "sc=Arbs" and
> "sc=Arbx" in regexps)
> sc ; Arbc ; Common_Arabic
> sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
> sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)
> 
> - Script.txt (assigning the two new property values to remap existing 
> "Arabic")
> - Arabic-Shaping.txt (possibly adding comments at end of lines where
> this is not the Common Arabic)
> - Joining-Groups.txt (same remark)
> - Bidi-Mirroring.txt (same remark)
> 
> And in the description of some standard script identification and
> segmentation algorithms. I don't know if IDNA should continue to use
> "Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to
> avoid mixing digits that are visually confusable), as it uses such
> segmentation (note that these characters are canonically different,
> for normalization purposes).
> 
> Philippe.
> 
 
Hi.  Regarding your proposal, for IDN's, I have a security concern:

In the list of unicode allowed characters, the Eastern set of numbers seems to 
be allowed;(http://unicode.org/reports/tr36/idn-chars.html)
Saudi Arabia has got the other set in its allowed list 
(http://www.iana.org/domains/idn-tables/tables/sa_ar_1.0.html)
so I gather both are allowed in IDN's. 
 
You would then have mixed scripts in IDN's for Arab with either Arbx alone or 
Arbs (if those are the names chosen).
You do not want to display a mix

Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread Mark Davis ☕
That just really isn't a script issue; it is more an issue of which language
orthographies use which characters, and we have provision for that
information in CLDR.

Mark

*— Il meglio è l’inimico del bene —*


On Thu, Jul 29, 2010 at 09:07, Philippe Verdy  wrote:

> "Mark Davis ☕" 
> > It is not so strange. Read
> > http://www.unicode.org/reports/tr24/proposed.html#Multiple_Script_Values
> ,
> > and other parts of #24 describing Common.
>
> It is exactly because I had read this proposed update for UTS#24 that
> I used my argument (if not, I would have not spoken about the
> ExtendedScript property in my report : isn't it made to use more
> precise mappings to ISO 15924, including script variants ?).
>
> Nothing would be special about "Common" : "sc=Arabic" alias "sc=Arab"
> could use the same formalism (also used for "Hani" and "Jpan", which
> are defined as multiple scripts or script variants) to subdivide it
> with the new "extended script" property.
>
> It's true that for now, Unicode is unable to make distinctions between
> "Hans" and "Hant" on just the encoded abstract characters (so for them
> we have "sc=Hani" only, but an "extended script" property could make
> more precise mappings, without being completely bound to the stability
> policy).
>
> But it does not mean that texts and localization resources can't make
> such distinctions by external tagging, or in stylesheets, or in
> romanization schemes. And librarians (and book readers) already make
> distinctions as well between  Eastern and Western versions of the
> unified Arabic.
>
> It could even have benefit within IDNA to help diagnose those digits
> that have confusable forms in the two variants (even if there's a work
> in progress for defining the confusables needed for IDNA), and adding
> the extra ISO 15924 codes (for Arabic variants) won't break Unicode
> (after all there are already variants for Latin and Sinograms, exactly
> because of these "font variants").
>


Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread Philippe Verdy
"Mark Davis ☕" 
> It is not so strange. Read
> http://www.unicode.org/reports/tr24/proposed.html#Multiple_Script_Values,
> and other parts of #24 describing Common.

It is exactly because I had read this proposed update for UTS#24 that
I used my argument (if not, I would have not spoken about the
ExtendedScript property in my report : isn't it made to use more
precise mappings to ISO 15924, including script variants ?).

Nothing would be special about "Common" : "sc=Arabic" alias "sc=Arab"
could use the same formalism (also used for "Hani" and "Jpan", which
are defined as multiple scripts or script variants) to subdivide it
with the new "extended script" property.

It's true that for now, Unicode is unable to make distinctions between
"Hans" and "Hant" on just the encoded abstract characters (so for them
we have "sc=Hani" only, but an "extended script" property could make
more precise mappings, without being completely bound to the stability
policy).

But it does not mean that texts and localization resources can't make
such distinctions by external tagging, or in stylesheets, or in
romanization schemes. And librarians (and book readers) already make
distinctions as well between  Eastern and Western versions of the
unified Arabic.

It could even have benefit within IDNA to help diagnose those digits
that have confusable forms in the two variants (even if there's a work
in progress for defining the confusables needed for IDNA), and adding
the extra ISO 15924 codes (for Arabic variants) won't break Unicode
(after all there are already variants for Latin and Sinograms, exactly
because of these "font variants").




Re: Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread Mark Davis ☕
Mark

*— Il meglio è l’inimico del bene —*


On Thu, Jul 29, 2010 at 05:57, Philippe Verdy  wrote:

> "Martin J. Dürst"  wrote:
> >
> > On 2010/07/29 13:33, karl williamson wrote:
> > > Asmus Freytag wrote:
> > >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
> >
> > >>> Well, there actually is such a script, namely Han. The digits (一、
> > >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
> > >>> decimal place-value digits, and they are scattered widely, and of
> > >>> course there is a lot of modern living practice.
> >
> > >> The situation is worse than you indicate, because the same characters
> > >> are also used as elements in a system that doesn't use place-value,
> > >> but uses special characters to show powers of 10.
> >
> > No. Sequences of numeric Kanji are also used in names and word-plays,
> > and as sequences of individual small numbers.
>
>  (1) Existing exception :
>
> There's one example of a digit which has a numeric type = decimal, AND
> is encoded in a "scattered" way:
>
> 19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
>
> The other decimal nine digits for the Tham variant of the New Tai Lue
> digits are borrowed from another sequence of decimal digits, starting
> at U+19D0 (for digit zero) with the exception of U+19D1 which is
> replaced (for digit one). Both sets are assigned in the same
> "New_Tai_Lue" script property value.
>
> So the additional stability proposal will not be enforceable.
>

On the contrary. Were we to want such a policy, the implication would be
either to:
(a) change the type of 19DA from Nd to No (what I think would be the right
thing to do)
(b) grandfather in the character.


>
>  (2) Arabic digits :
>
> Such case was avoided for the Eastern/Extended variant of Arabo-Indic
> digits in U+06F0..U+06F9, without borrowing the common forms for the
> Standard variant in U+0660..U+0669: they were reencoded separately to
> create a complete sequence of 10 digits, even if most of them (all
> except 4 to 6) are exactly similar and belong to the same unified
> "script".
>
> But what is even more "strange" is that the Standard Arabic digits are
> assigned to the "Common" script, when the Eastern/Extended variant is
> assigned to the "Arabic" script (look at the Unicode script property
> value, from the file "Scripts-5.2.0.txt" in the UCD).
>
> If you just look at this property, you may think that the
> Extended/Eastern digits are the standard ones for the Arabic script:
> this is a side-effect of unification of Western and Eastern variants
> of the Arabic script.
>

It is not so strange. Read
http://www.unicode.org/reports/tr24/proposed.html#Multiple_Script_Values,
and other parts of #24 describing Common.


>
>
>  (3) Unification of the Arabic script:
>
> Ideally, there should be two additional separate ISO 15924 script
> codes for the Western and Eastern variants of the Arabic script (possibly
> [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
> Unicode "script" property value alias for the Western and Eastern
> digits or letters should be segregated, using a separate Script
> property value (splitting the Arabic script, where it is significant,
> just like it occurred for Georgian and Greek/Coptic alphabets).
>

There is no likelihood of that happening, simply for the sake of these
digits.

The original characters were just font variants; they were really split to a
large extent because of the UBA (which I think in retrospect was a mistake,
but c'est la vie, n'est-ce pas?).



> Nothing will be changed for the existing Arabic script, but the
> "Extended/Eastern Arabic" script (assigned with a new ISO 15924 code
> and mapped with a new property alias in Unicode), will still borrow
> most of its letters from the standard script without reencoding them.
>
> No character or block will be renamed (and I DO NOT propose
> disunifying existing common Arabic letters, or assigning them in the
> "Common" script), it should just be a better sub-classification, where
> the characters are clearly distinguished between the two variants.
>
> Most Arabic characters should remain in the common "Arabic" script,
> and those that are differentiated should be assigned in a
> "Standard_Arabic" or "Extended_Arabic" script. But this may cause some
> complication for the script inheritance in spans of texts (because the
> "Arabic" script property value would behave a bit like what the
> "Common" does for alphabetic scripts, i.e. like a group of scripts).
>
> Such change for the assigned script property value (if it's not
> already stabilized) would require documentation, and changes in a few
> other core or derived datafiles:
>
> - PropertyValueAliases.txt (adding two new property values for "sc"):
> sc ; Arab  ; Arabic # All forms, includes "sc=Arbc", "sc=Arbs" and
> "sc=Arbx" in regexps)
> sc ; Arbc  ; Common_Arabic
> sc ; Arbs  ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
> sc ; Arbx  ; Extended_Arabic # (also includes "sc=Arbc" in

Re: Why not just change the glyph of 20A8 RUPEE SIGN?

2010-07-29 Thread Shriramana Sharma
Thanks all for responding. Of course it was my mistake to forget that 
there are other countries using the Rupee currency. Maybe the new 
character will be named the INDIAN RUPEE SIGN to underline the fact that 
this sign is only for India.


--
Shriramana Sharma



Digit/letter variants in the "same" unified script (was: stability policy on numeric type = decimal)

2010-07-29 Thread Philippe Verdy
"Martin J. Dürst"  wrote:
>
> On 2010/07/29 13:33, karl williamson wrote:
> > Asmus Freytag wrote:
> >> On 7/25/2010 6:05 PM, Martin J. Dürst wrote:
>
> >>> Well, there actually is such a script, namely Han. The digits (一、
> >>> 二、三、四、五、六、七、八、九、〇) are used both as letters and as
> >>> decimal place-value digits, and they are scattered widely, and of
> >>> course there is a lot of modern living practice.
>
> >> The situation is worse than you indicate, because the same characters
> >> are also used as elements in a system that doesn't use place-value,
> >> but uses special characters to show powers of 10.
>
> No. Sequences of numeric Kanji are also used in names and word-plays,
> and as sequences of individual small numbers.

  (1) Existing exception :

There's one example of a digit which has a numeric type = decimal, AND
is encoded in a "scattered" way:

19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N

The other decimal nine digits for the Tham variant of the New Tai Lue
digits are borrowed from another sequence of decimal digits, starting
at U+19D0 (for digit zero) with the exception of U+19D1 which is
replaced (for digit one). Both sets are assigned in the same
"New_Tai_Lue" script property value.

So the additional stability proposal will not be enforceable.
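
To see how a given implementation classifies the two forms of digit one, the properties can be queried directly. The sketch below uses Python's standard `unicodedata` module; what it reports depends on the Unicode version bundled with the interpreter, which may differ from the 5.2 data line quoted above, and `describe` is just an illustrative helper.

```python
import unicodedata


def describe(ch):
    """Collect the numeric-related properties of one character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # Each lookup returns None when the character lacks that property.
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }


if __name__ == "__main__":
    # U+19D1 is the regular New Tai Lue digit one, U+19DA the variant.
    for ch in "\u19d1\u19da":
        print(describe(ch))
    print("database version:", unicodedata.unidata_version)
```

Comparing the "category" and "decimal" fields of the two characters shows whether the data in use treats U+19DA as a full decimal digit or only as a digit-valued exception.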


  (2) Arabic digits :

Such a case was avoided for the Eastern/Extended variant of the Arabic-Indic
digits in U+06F0..U+06F9, which do not borrow the common forms from the
Standard variant in U+0660..U+0669: they were reencoded separately to
create a complete sequence of 10 digits, even though most of them (all
except 4 to 6) look exactly alike and belong to the same unified
"script".

But what is even more "strange" is that the Standard Arabic digits are
assigned to the "Common" script, when the Eastern/Extended variant is
assigned to the "Arabic" script (look at the Unicode script property
value, from the file "Scripts-5.2.0.txt" in the UCD).

If you just look at this property, you may think that the
Extended/Eastern digits are the standard ones for the Arabic script:
this is a side-effect of unification of Western and Eastern variants
of the Arabic script.
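
As a rough illustration of the two complete digit runs, here is a sketch in
Python. The standard library does not expose the Script property, so this
only shows that both runs are full sequences with the same decimal values;
the names STANDARD, EXTENDED and same_values() are mine.

```python
import unicodedata

STANDARD = [chr(cp) for cp in range(0x0660, 0x066A)]  # ARABIC-INDIC DIGIT ZERO..NINE
EXTENDED = [chr(cp) for cp in range(0x06F0, 0x06FA)]  # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE

def same_values(a, b):
    """True if both digit runs map position by position to the same decimal values."""
    return [unicodedata.decimal(x) for x in a] == [unicodedata.decimal(y) for y in b]

# Print each decimal value next to the character names from both runs.
for s, e in zip(STANDARD, EXTENDED):
    print(unicodedata.decimal(s), unicodedata.name(s), "|", unicodedata.name(e))
```

Looking up the Script value itself would need the Scripts.txt data file or a
third-party library, which this sketch deliberately avoids.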


  (3) Unification of the Arabic script:

Ideally, there should be two additional separate ISO 15924 script
codes for the Western and Eastern variants of the Arabic script (possibly
[Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the
Unicode "script" property value alias for the Western and Eastern
digits or letters should be segregated, using a separate Script
property value (splitting the Arabic script, where it is significant,
just as occurred for the Georgian and Greek/Coptic alphabets).

Nothing will be changed for the existing Arabic script, but the
"Extended/Eastern Arabic" script (assigned with a new ISO 15924 code
and mapped with a new property alias in Unicode), will still borrow
most of its letters from the standard script without reencoding them.

No character or block will be renamed (and I DO NOT propose
disunifying existing common Arabic letters, or assigning them to the
"Common" script); it should just be a better sub-classification, where
the characters are clearly distinguished between the two variants.

Most Arabic characters should remain in the common "Arabic" script,
and those that are differentiated should be assigned in a
"Standard_Arabic" or "Extended_Arabic" script. But this may cause some
complication for the script inheritance in spans of texts (because the
"Arabic" script property value would behave a bit like what the
"Common" does for alphabetic scripts, i.e. like a group of scripts).

Such change for the assigned script property value (if it's not
already stabilized) would require documentation, and changes in a few
other core or derived datafiles:

- PropertyValueAliases.txt (adding two new property values for "sc"):
sc ; Arab  ; Arabic # (all forms; includes "sc=Arbc", "sc=Arbs" and "sc=Arbx" in regexps)
sc ; Arbc  ; Common_Arabic
sc ; Arbs  ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
sc ; Arbx  ; Extended_Arabic # (also includes "sc=Arbc" in regexps)

- Scripts.txt (assigning the two new property values to remap existing "Arabic")
- ArabicShaping.txt (possibly adding comments at end of lines where
this is not the Common Arabic)
- Joining-Groups.txt (same remark)
- BidiMirroring.txt (same remark)

And in the description of some standard script identification and
segmentation algorithms. I don't know if IDNA should continue to use
"Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to
avoid mixing digits that are visually confusable), as it uses such
segmentation (note that these characters are canonically different,
for normalization purposes).
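
To make the intended "includes" semantics concrete, here is a purely
hypothetical sketch in Python. None of Arbc, Arbs or Arbx exists in
ISO 15924 or in the UCD; the table below only restates the comments from the
proposed alias lines, and the function names are mine.

```python
# Hypothetical inclusion table taken from the proposed comments above;
# none of Arbc, Arbs, Arbx exist in ISO 15924 or the UCD.
INCLUDES = {
    "Arab": {"Arbc", "Arbs", "Arbx"},
    "Arbc": {"Arbc"},
    "Arbs": {"Arbs", "Arbc"},
    "Arbx": {"Arbx", "Arbc"},
}

def matches(query, assigned):
    """True if a character whose assigned value is `assigned` matches sc=`query`."""
    return assigned in INCLUDES.get(query, {query})

def single_variant(values):
    """True if every assigned value in a run fits under sc=Arbs, or every one under sc=Arbx."""
    return any(all(matches(q, v) for v in values) for q in ("Arbs", "Arbx"))
```

Under these assumptions a run mixing Standard and Extended digits would fail
single_variant(), which is the kind of check a stricter segmentation could
apply.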

Philippe.




Re: Indian Rupee Sign (U+20B9) proposal

2010-07-29 Thread N. Ganesan
On Wed, Jul 28, 2010 at 10:43 PM, Tulasi  wrote:

>  I do not see any Unicode role on India
> Rupee symbol  :)


why?

Here is the rupee sign in a Tamil newspaper:
http://epaper.dinakaran.com/pdf/2010/07/28/20100728c_014101005.jpg

N. Ganesan


[ISO15924] Typo for Egyptian_Hierog(l)yphs

2010-07-29 Thread Philippe Verdy
Everywhere below, the Unicode property value alias is missing an 'l'.

- In HTML table 1:
Egyp    050    Egyptian hieroglyphs    hiéroglyphes égyptiens    Egyptian_Hierogyphs    2009-06-01
- In HTML table 2:
050    Egyp    Egyptian hieroglyphs    hiéroglyphes égyptiens    Egyptian_Hierogyphs    2009-06-01
- In HTML table 3:
Egyptian hieroglyphs    Egyp    050    hiéroglyphes égyptiens    Egyptian_Hierogyphs    2009-06-01
- In HTML table 4:
hiéroglyphes égyptiens    Egyp    050    Egyptian hieroglyphs    Egyptian_Hierogyphs    2009-06-01
- In the downloadable normative plain-text file:
Egyp;050;Egyptian hieroglyphs;hiéroglyphes égyptiens;Egyptian_Hierogyphs;2009-06-01

Also, I suggest that this table column does not include any SHY
control (before underscores), it should reflect the exact Unicode
property value.
(or that the page clearly indicates that SHY controls present in that
column are not part of the Unicode property value and are just there
for readability/accessibility of the table in browsers with small
windows).

I discovered these while comparing data sources for UCD related files.
There's no error in the code change page
(http://www.unicode.org/iso15924/codechanges.html)

Philippe.




Character corrections for Kannada

2010-07-29 Thread K.S. Naveen
Sirs,
There are certain corrections required for Kannada characters. Actually,
correctness varies with the version of the Windows OS and the version of MS
Office!
Now, my questions are:
- Is this problem the concern of the Unicode Consortium?
- Can a private party develop an engine that generates Unicode letters and
works across platforms (or is platform-free)?
There are a couple of old Kannada characters, and some special characters,
that also need to be displayed. Further, any information along these lines
would be very helpful.
Regards,
KS Naveen


  

Re: High dot/dot above punctuation?

2010-07-29 Thread Juanma Barranquero
On Thu, Jul 29, 2010 at 10:15, Khaled Hosny  wrote:

> Also, I don't buy in Unicode idea of
> encoding different sets of decimal digits separately, they are all
> different graphical presentations of the same thing.

Not in a document where the author is discussing the differences
between them, for example.

    Juanma




RE: Indian Rupee Sign (U+20B9) proposal

2010-07-29 Thread Jonathan Rosenne
I had thought that the glyphs were not part of the UNICODE or ISO 10646
standards and only serve as a reference.

See the Disclaimer in http://www.unicode.org/versions/Unicode5.2.0/ch17.pdf.

Jony

> -Original Message-
> From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
> Behalf Of Michael Everson
> Sent: Thursday, July 29, 2010 10:53 AM
> To: unicode Unicode Discussion
> Subject: Re: Indian Rupee Sign (U+20B9) proposal
> 
> On 29 Jul 2010, at 04:43, Tulasi wrote:
> 
> >> It is good that M. Everson's proposal and Govt. of India proposal
> are converging.
> >
> > No its not good!
> 
> I'm so sorry to disappoint you.
> 
> > M. Everson's proposal be withdrawn, he rushed hastily.
> 
> No, I didn't.
> 
> > His font design is out of his mind, not from the drawing.
> 
> That's because my font design takes a Times-like Latin R as the basis
> for the design.
> 
> > He cut left vertical bar of English alphabet "R" from an existing TTF
> font, then place two rectangler bar in parallel :-')
> 
> Yes, indeed I did. And this is just what we do for the EURO SIGN (a C
> with bars), the YEN SIGN (a Y with bars) and so on. See
> http://www.evertype.com/standards/euro/euroglyph.html for example.
> 
> > This is not what it is in the drawing or JPG image.
> 
> That's because what THEY did was to take an Arial-like Latin R as the
> basis for the design.
> 
> > Enlarge both in PDF, see yourself before encouraging. ISO technical
> committee shall place it on "Currency block" only to keep stuff
> uniform.
> 
> No. We will put it in the "Currency Symbols" block because the
> character does not belong to either the Devanagari or the Latin script.
> Please note that all of the reference glyphs
> 
> > You may not like my critic ->
> 
> Since you ask, I (for my part) do not like an attitude of wilfully
> antagonistic hostility, particularly when it is pointless.
> 
> > I do not see any Unicode role on India Rupee symbol  :)
> 
> It doesn't really matter. Though the Government of India have put
> forward a proposal.
> 
> > ISO will approve it anyway.
> 
> You'd better hope so.
> 
> Have a nice day,
> Michael Everson * http://www.evertype.com/
> 
> 





Re: Indian Rupee Sign (U+20B9) proposal

2010-07-29 Thread Michael Everson

On 29 Jul 2010, at 08:53, Michael Everson wrote:

> No. We will put it in the "Currency Symbols" block because the character does 
> not belong to either the Devanagari or the Latin script. Please note that all 
> of the reference glyphs 

... in that block use a Times-like font as the basis for the design.

Michael Everson * http://www.evertype.com/





Re: High dot/dot above punctuation?

2010-07-29 Thread Khaled Hosny
On Thu, Jul 29, 2010 at 10:01:37AM +0200, Kent Karlsson wrote:
> 
> Den 2010-07-29 08.47, skrev "Khaled Hosny" :
> 
> > I have few fonts where I implemented a 'locl' OpenType feature that maps
> > European to Arabic digits, and contextual substitution feature that
> > replaces the dot with Arabic decimal separator when it comes between two
> > Arabic numbers, so I think it is doable.
> 
> Doable is not the same thing as a good idea. Your example here is one of the
> not-at-all-good ideas.

This was done for a GUI font; the main aim is to have Arabic numbers in
Arabic contexts and vice versa. Since the numbers here are generated on
the fly (dates, percentages, etc.), it is not possible (or even
desirable) to change the input. Also, I don't buy into Unicode's idea of
encoding different sets of decimal digits separately; they are all
different graphical presentations of the same thing.

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer



Re: High dot/dot above punctuation?

2010-07-29 Thread André Szabolcs Szelp
2010/7/28 Asmus Freytag 

> On 7/28/2010 9:30 AM, André Szabolcs Szelp wrote:
>
>> You really all say, that general property Sk (DOT ABOVE) rather than Po
>> (FULL STOP, COMMA, MIDDLE DOT) (compared with all other decimal point
>> characters) can not cause any problems ever in certain algorithms?
>>
> No, we say that this is equivalent to a decimal comma - it's the same as
> regular comma, and well-designed algorithms can tell the difference.
>
> Distinguishing identically looking punctuation marks by their function in
> text on the level of character encoding is not something that has proven
> workable.
>


Well,

I have replied to Ken Whistler privately on a question of his; however, in the
particular case I'm trying to digitize, comma is used as
millions separator, period as thousands separator, and high dot as decimal
separator.
(formally: #,###.###˙##(###...) ) (I can post the example if necessary).

Seeing a number encoded as 1.000 (even knowing this locale!!), you cannot
tell whether it's "thousand" or "one" with explicit post-decimal zeros, IF
you encode both the period and the high dot as FULL STOP.
This poses a problem for _any_ contextual alternates-based approach in
display as well.

So in this case, actually, I think it's well arguable that encoding the
decimal point with FULL STOP and treating it as a glyph variant is not
viable.
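
To show the ambiguity concretely, here is a small sketch in Python. It
assumes the high dot is encoded as U+02D9 DOT ABOVE, and the function names
parse() and flatten() are mine; it is not a claim about how any real
implementation handles this notation.

```python
from decimal import Decimal

HIGH_DOT = "\u02d9"  # DOT ABOVE, assumed here to encode the decimal separator

def parse(text):
    """Parse '#,###.###˙##' style numbers: ',' and '.' only group digits,
    while the high dot marks the start of the fractional part."""
    integer, _, fraction = text.partition(HIGH_DOT)
    integer = integer.replace(",", "").replace(".", "")
    return Decimal(integer + ("." + fraction if fraction else ""))

def flatten(text):
    """Simulate encoding the high dot as FULL STOP."""
    return text.replace(HIGH_DOT, ".")

print(parse("1.000"), parse("1" + HIGH_DOT + "000"))
print(flatten("1" + HIGH_DOT + "000") == "1.000")
```

With distinct characters the two readings parse differently; after flatten()
both strings are identical, so no later analysis can tell them apart.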

/Szabolcs


Re: Pashto yeh characters

2010-07-29 Thread André Szabolcs Szelp
On Wed, Jul 28, 2010 at 7:20 PM, Murray Sargent <
murr...@exchange.microsoft.com> wrote:
Andreas Prilop commented "A native speaker of English does not
/automatically/ know better about English grammar, English punctuation than
an informed Frenchman." So true, so true. Most native speakers of English
have only limited understanding of English grammar.

I very recently read an anecdote about Radloff, the Russian Turkologist.
One day a Turk visited him and told him his theories and ideas about the
Turkic languages. It soon became apparent that he was not to be taken
seriously. So Radloff asked:
- Why do you think your ideas are right?
- Because I'm a Turkologist, the man replied.
- And what makes you a Turkologist?
- Well, I'm a Turk, and my mother tongue is Turkish.
- Oh no, my friend, a bird is not an ornithologist either...

... Actually, in general birds know pretty little about birds :-)

Szabolcs


Re: High dot/dot above punctuation?

2010-07-29 Thread Kent Karlsson

Den 2010-07-29 08.47, skrev "Khaled Hosny" :

> I have few fonts where I implemented a 'locl' OpenType feature that maps
> European to Arabic digits, and contextual substitution feature that
> replaces the dot with Arabic decimal separator when it comes between two
> Arabic numbers, so I think it is doable.

Doable is not the same thing as a good idea. Your example here is one of the
not-at-all-good ideas.

/Kent K





Re: Pashto yeh characters

2010-07-29 Thread André Szabolcs Szelp
"Persian and Urdu write [g] using a kaf character with a line above U+06AF,
while Pashto uses kaf with a ring U+06AB. It really should be that simple."

I seem to remember, that Persian used kaf with three dots above (like your
Moroccan example) at least in the 19th century. No idea when they switched
to the double-lined version. (And I can well imagine how the three dots
would have merged into a line, though this might as well not be the origin of
that character.)

Szabolcs


Re: Indian Rupee Sign (U+20B9) proposal

2010-07-29 Thread Michael Everson
On 29 Jul 2010, at 04:43, Tulasi wrote:

>> It is good that M. Everson's proposal and Govt. of India proposal are 
>> converging.
> 
> No its not good!

I'm so sorry to disappoint you.

> M. Everson's proposal be withdrawn, he rushed hastily.

No, I didn't.

> His font design is out of his mind, not from the drawing.

That's because my font design takes a Times-like Latin R as the basis for the 
design.

> He cut left vertical bar of English alphabet "R" from an existing TTF font, 
> then place two rectangler bar in parallel :-')

Yes, indeed I did. And this is just what we do for the EURO SIGN (a C with 
bars), the YEN SIGN (a Y with bars) and so on. See 
http://www.evertype.com/standards/euro/euroglyph.html for example.

> This is not what it is in the drawing or JPG image.

That's because what THEY did was to take an Arial-like Latin R as the basis for 
the design.

> Enlarge both in PDF, see yourself before encouraging. ISO technical committee 
> shall place it on "Currency block" only to keep stuff uniform.

No. We will put it in the "Currency Symbols" block because the character does 
not belong to either the Devanagari or the Latin script. Please note that all 
of the reference glyphs

> You may not like my critic ->

Since you ask, I (for my part) do not like an attitude of wilfully antagonistic 
hostility, particularly when it is pointless. 

> I do not see any Unicode role on India Rupee symbol  :)

It doesn't really matter. Though the Government of India have put forward a 
proposal.

> ISO will approve it anyway.

You'd better hope so.

Have a nice day,
Michael Everson * http://www.evertype.com/





Re: High dot/dot above punctuation?

2010-07-29 Thread Khaled Hosny
On Wed, Jul 28, 2010 at 11:37:28AM -0700, Asmus Freytag wrote:
> On 7/28/2010 10:09 AM, Murray Sargent wrote:
> >Contextual rendering is getting to be more common thanks to
> >adoption of OpenType features. For example, both MS Publisher 2010
> >and MS Word 2010 support various contextually dependent OpenType
> >features at the user's discretion. The choice of glyph for U+002E
> >could be chosen according to an OpenType style.
> I know that the technology exists that (in principle) can overcome
> an early limitation of 1:1 relation between characters and glyphs in
> a single font. I also know that this technology has been implemented
> for certain (but not all) types of mappings that are not 1:1.
> >It's worth remembering that plain text is a format that was introduced due 
> >to the limitations of early computers. Books have always been rendered with 
> >at least some degree of rich text. And due to the complexity of Unicode, 
> >even Unicode plain text often needs to be rendered with more than one font.
> However, the question I raised here is whether such mechanisms have
> been implemented to date for FULL STOP. Which implementation makes
> the required context analysis to determine whether 002E is part of a
> number during layout? If it does make this determination, which
> OpenType feature does it invoke? Which font supports this particular
> OpenType feature?

I have a few fonts where I implemented a 'locl' OpenType feature that maps
European to Arabic digits, and a contextual substitution feature that
replaces the dot with the Arabic decimal separator when it comes between two
Arabic numbers, so I think it is doable.
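
As a text-level analogy only, the two steps can be sketched in Python. A font
does this on glyphs and leaves the underlying characters untouched, whereas
this sketch rewrites the string itself; the name localise() and the choice of
U+066B ARABIC DECIMAL SEPARATOR as the target are assumptions of the sketch.

```python
import re

# European digits 0-9 mapped to ARABIC-INDIC DIGIT ZERO..NINE (U+0660..U+0669).
TO_ARABIC = {ord("0") + i: 0x0660 + i for i in range(10)}
ARABIC_DECIMAL_SEPARATOR = "\u066b"
# A FULL STOP that has an Arabic-Indic digit on both sides.
DOT_BETWEEN_ARABIC_DIGITS = re.compile(r"(?<=[\u0660-\u0669])\.(?=[\u0660-\u0669])")

def localise(text):
    """Map the digits first, then replace only dots flanked by mapped digits."""
    mapped = text.translate(TO_ARABIC)
    return DOT_BETWEEN_ARABIC_DIGITS.sub(ARABIC_DECIMAL_SEPARATOR, mapped)

print(localise("12.5"))
```

A dot that ends a sentence or sits between letters is left alone, which is
the contextual part of the substitution.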

Regards,
 Khaled

-- 
 Khaled Hosny
 Arabic localiser and member of Arabeyes.org team
 Free font developer