Re: CJK stroke order data: kRSUnicode v. kRSKangXi

2014-03-12 Thread Adam Nohejl
Mr. Cook,

Thank you for all the information (and for bringing Wenlin to my attention as well).

 Where a specific value attested in a specific Kangxi edition is missing from 
 kRSUnicode, it would indeed be useful to add it,

Great to hear that.

 Since kRSUnicode is a Normative property, a formal proposal to modify that 
 data is required, for review in WG2. I have added notes on the items you 
 mention below, for consideration in that process, and in the meantime, if you 
 identify any other issues, please bring them to our attention.

OK, I will prepare a more comprehensive list. Do you mean that you would submit 
such a formal proposal? Or can I submit it myself somehow?

 PS: About the subject line of your message.

Yes, of course, the subject should have read CJK radical-stroke count data. 
Not one of my brightest moments, I guess...


-- 
Adam Nohejl
___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode


Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Per Starbäck
Ken Whistler wrote:
 Ah, I see what the interpretation problem was. Yes, that is
 a straightforward kind of improvement -- easily enough done.
 Look for a change the next time the file is updated. (It will not
 be immediately changed, pending other review comments.)

Thanks! Then I'll skip making a formal request about this.

Regarding these names in ISO 6429 again, how come these control
characters don't have Unicode names? For many uses of names, the control
characters have as much need for them as any other character.
Since it seems so straightforward it must have been suggested several
times to introduce names like

  CONTROL CHARACTER NULL
  CONTROL CHARACTER START OF HEADING
  CONTROL CHARACTER START OF TEXT

etc., so I assume there are good reasons for not doing that, but I can't
see what they are.

Since applications want names they will use other things as names when
there isn't a real name, and that leads to problems. Take Emacs where
the command describe-char currently describes U+0007 as

  name: <control>
  old-name: BELL

(I reported the misuse of <control> as a name here in 2009, but it
wasn't fixed until this year, so the fix is not yet in a released version.)
The usage of BELL here invites confusion with U+1F514 BELL.

Emacs should do better regarding this, but still, with a proper name
all of this would have been averted.


Re: Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Mark Davis ☕
They do have aliases in NameAliases.txt

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation

...


Mark https://google.com/+MarkDavis

 *— Il meglio è l’inimico del bene —*





Re: Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Eli Zaretskii
 From: starb...@stp.lingfil.uu.se (Per Starbäck)
 Date: Wed, 12 Mar 2014 13:32:15 +0100
 Cc: unicode@unicode.org unicode@unicode.org
 
 Regarding these names in ISO 6429 again, how come these control
 characters don't have Unicode names?

They have a non-empty old name field:

  0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
   

 Emacs should do better regarding this

As you yourself say, it already does, so I don't see the point in this
rant.


RE: Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Whistler, Ken
Please be very careful here. Having a non-empty value in field 1 of
UnicodeData.txt is *not* the same as having a Unicode name.

See:

http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

for the gory details.

The Unicode name is formally defined in terms of the Name property,
which itself is a combination of enumerated values extracted from
UnicodeData.txt, plus a number of rules.

For all characters whose General_Category=Cc, the formal definition
of the Name property is a null string. The string "<control>" is *never*
to be interpreted as a Unicode name. It is a field placeholder with
legacy status. See "Interpretation of Field 1 of UnicodeData.txt" in
the section I cited above.
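
The point about the Name property can be checked with a minimal Python sketch. It assumes the standard-library unicodedata module reflects the Name property of whatever Unicode version the interpreter ships with; the helper name formal_name is hypothetical.

```python
import unicodedata


def formal_name(ch: str) -> str:
    """Return the Name property value, or "" when the character has none."""
    # unicodedata.name() raises ValueError for unnamed characters unless
    # a default is supplied; "" stands in for the null string here.
    return unicodedata.name(ch, "")


# Every General_Category=Cc code point among the first 256 code points.
controls = [chr(cp) for cp in range(0x100)
            if unicodedata.category(chr(cp)) == "Cc"]

for ch in controls[:3]:
    print(f"U+{ord(ch):04X} gc=Cc name={formal_name(ch)!r}")

# A graphic character, for contrast.
print(f"U+0020 name={formal_name(' ')!r}")
```

Under that assumption, each control character reports an empty name, while U+0020 reports SPACE.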

As far as user interfaces and other applications needing names for
Unicode control characters -- one of the reasons that the namespace
for Unicode characters includes all of the formal name aliases provided
in NameAliases.txt is so that applications can safely treat any formal
name alias for a control character (or the other abbreviations, etc.,
also listed in NameAliases.txt) *as if* they were Unicode names, without
running into name collisions with the actual Name property value
for Unicode characters.

The history of the name collision for the (relatively) recently encoded
U+1F514 BELL with the traditional usage for the U+0007 control function
BELL led the UTC to extend the namespace as noted, so we won't be
running into more such problems in the future.

If Emacs were to use ALERT or the abbreviation BEL for U+0007,
instead of <control>, that would avoid the collision with U+1F514 BELL,
be conformant to the Unicode Standard, and presumably be helpful
to users, as well. See the entries for U+0007 in NameAliases.txt:

# Note that no formal name alias for the ISO 6429 BELL is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation
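
As an illustration only, an application could read entries in this format and pick a display label roughly as follows. This is a sketch, not how Emacs or any other program does it: the sample data is limited to the lines quoted in this thread, and the function names and the preference order are hypothetical.

```python
from collections import defaultdict

# Sample lines in NameAliases.txt format: code point;alias;type
SAMPLE = """\
# comment lines start with '#'
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0007;ALERT;control
0007;BEL;abbreviation
"""


def parse_aliases(text):
    """Map each code point to its (alias, type) pairs, in file order."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        code, alias, kind = line.split(";")
        aliases[int(code, 16)].append((alias, kind))
    return dict(aliases)


def display_label(cp, aliases, prefer=("control", "abbreviation")):
    """Return the first alias of the most preferred type (hypothetical policy)."""
    for kind in prefer:
        for alias, alias_kind in aliases.get(cp, []):
            if alias_kind == kind:
                return alias
    return f"U+{cp:04X}"  # no alias known: fall back to the code point


ALIASES = parse_aliases(SAMPLE)
print(display_label(0x07, ALIASES))
print(display_label(0x07, ALIASES, prefer=("abbreviation",)))
```

With the sample data, the default preference yields ALERT for U+0007, and preferring abbreviations yields BEL.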

--Ken







Re: Names for control characters (Was: (in 6429) in allkeys.txt)

2014-03-12 Thread Eli Zaretskii
 From: Whistler, Ken ken.whist...@sap.com
 Date: Wed, 12 Mar 2014 16:48:25 +
 Cc: Whistler, Ken ken.whist...@sap.com,
 unicode@unicode.org unicode@unicode.org
 
 Please be very careful here. Having a non-empty value in field 1 of
 UnicodeData.txt is *not* the same as having a Unicode name.

You will see that I didn't refer to the Name attribute, I referred to
the old name attribute (called Unicode_1_Name in UAX#44).
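
To make the distinction between the two fields concrete, here is a minimal Python sketch that splits one UnicodeData.txt record. The sample line is the U+0000 record written out in the 15-field layout described in UAX #44 (an assumption to verify against the actual file); the function name and dictionary keys are hypothetical.

```python
# U+0000 record in UnicodeData.txt layout (15 semicolon-separated fields).
SAMPLE_LINE = "0000;<control>;Cc;0;BN;;;;;N;NULL;;;;"


def parse_record(line):
    """Split one UnicodeData.txt line and pick out the fields discussed here."""
    fields = line.split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return {
        "code_point": int(fields[0], 16),
        "field_1": fields[1],            # placeholder "<control>", not a name
        "general_category": fields[2],
        "unicode_1_name": fields[10],    # the old name (Unicode_1_Name)
    }


record = parse_record(SAMPLE_LINE)
print(record["field_1"], record["unicode_1_name"])
```

Field 1 holds the placeholder, and field 10 holds the old name NULL.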


Re: Names for control characters

2014-03-12 Thread Per Starbäck
Ken Whistler wrote:
 Please be very careful here. Having a non-empty value in field 1 of
 UnicodeData.txt is *not* the same as having a Unicode name.

 See:

 http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

I know it's not a name. My question was *why* control characters don't
*have* names like

  CONTROL CHARACTER NULL
  CONTROL CHARACTER START OF HEADING
  CONTROL CHARACTER START OF TEXT
  etc.

It would be so obvious to have it like that, so I assume there is some
specific reason not to, but I still can't figure it out. To me there is
no less reason for these characters to have names than any others, so
it's as if Linear B characters didn't have names, and I got the
answer "no problem, they have aliases, so that's OK!" This is just
strange to me. If names aren't needed, why do almost all characters have
them?

This is not about Emacs. Emacs was an example of a program that has use
for character names, and has a harder job because of this strangeness.
Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against
Emacs when I mention that this property of Unicode has led to
longstanding (small) bugs there, but I think real examples are better
than made-up ones.

 If Emacs were to use ALERT or the abbreviation BEL for U+0007, ...

Yes, programs could have their own lists of preferred aliases to use,
or have a rule such as "always use the first alias", but why? Why not have
a name, so programs don't have to choose which alias to use?

(I may be coming off as having a mission about this, "it should be done
like this!!", but mostly this is just a question: it seems obvious it
should be done like this, so what am I missing?)


Re: Names for control characters

2014-03-12 Thread Eli Zaretskii
 From: starb...@stp.lingfil.uu.se (Per Starbäck)
 Cc: Eli Zaretskii e...@gnu.org, unicode@unicode.org unicode@unicode.org
 Date: Wed, 12 Mar 2014 20:37:57 +0100
 
 This is not about Emacs. Emacs was an example of a program that has use
 for character names, and has a harder job because of this strangeness.
 Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against
 Emacs when I mention that this property of Unicode has led to
 longstanding (small) bugs there, but I think real examples are better
 than made-up ones.

What I saw as a rant is only this part (which was the only one I
quoted):

  Emacs should do better regarding this

"Should do better" means it still doesn't, although it's expected to.



Re: Names for control characters

2014-03-12 Thread Markus Scherer
On Wed, Mar 12, 2014 at 12:37 PM, Per Starbäck
starb...@stp.lingfil.uu.se wrote:

 My question was *why* control characters don't
 *have* names


That's because formally the ISO control codes do not have one fixed,
normative meaning; implementers may or may not follow ISO 6429. That is why
these don't have names in ISO 10646 and in Unicode.

http://www.unicode.org/faq/casemap_charprop.html#15

Of course, a few control codes (e.g., U+000A) are very widely used, and
have Unicode properties according to that use. (e.g., White_Space)
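
A small Python sketch of that point: a control code can carry the White_Space property while still having no name. The list of White_Space control characters is hard-coded here from PropList.txt as an assumption (check it against the UCD version in use), and the helper describe is hypothetical.

```python
import unicodedata

# Control characters assumed to have White_Space=Yes in PropList.txt:
# U+0009..U+000D and U+0085.
WHITE_SPACE_CONTROLS = (0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x85)


def describe(cp):
    """Summarize category, Name, and (assumed) White_Space for a code point."""
    ch = chr(cp)
    return {
        "code": f"U+{cp:04X}",
        "category": unicodedata.category(ch),
        "name": unicodedata.name(ch, ""),   # "" when there is no Name value
        "white_space": cp in WHITE_SPACE_CONTROLS,
    }


for cp in (0x0A, 0x07):
    print(describe(cp))
```

Both U+000A and U+0007 report category Cc and an empty name; only U+000A is marked as white space.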

markus


Re: CJK stroke order data: kRSUnicode v. kRSKangXi

2014-03-12 Thread Richard COOK
On Mar 12, 2014, at 2:59 AM, Adam Nohejl wrote:
 
 Since kRSUnicode is a Normative property, a formal proposal to modify that 
 data is required, for review in WG2. I have added notes on the items you 
 mention below, for consideration in that process, and in the meantime, if 
 you identify any other issues, please bring them to our attention.
 
 OK, I will prepare a more comprehensive list. Do you mean that you would 
 submit such a formal proposal? Or can I submit it myself somehow?

You are welcome to prepare a proposal, or just send us your list.

We have already started a proposal to augment kRSUnicode, but I'm not sure 
about the timeframe for completion.

The proofing of the various Kangxi properties is separate from this, but is 
aimed at the specific KX edition used by IRG in Extension B work.

-Richard




RE: Names for control characters

2014-03-12 Thread Whistler, Ken
Per continued:

 I know it's not a name. My question was *why* control characters don't
 *have* names like
 
   CONTROL CHARACTER NULL
   CONTROL CHARACTER START OF HEADING
   CONTROL CHARACTER START OF TEXT
   etc.
 
 It would be so obvious to have it like that, so I assume there is some
 specific reason not to, but I still can't figure it out. To me there is
 no less reason for these characters to have names than any others, so
 it's as if Linear B characters didn't have names, and I got the
 answer "no problem, they have aliases, so that's OK!" This is just
 strange to me. If names aren't needed, why do almost all characters have
 them?

Ah, so this is a "Why is the sky blue?" kind of question. ;-)
And perhaps the correct response is then a "Just So" story...

Once upon a time, there was an ISO framework for character
encoding. Officially his name was "ISO 2022 Information technology --
Character code structure and extension techniques". But we'll
think of him as the troll that lives under the bridge and just call
him "2022" for short.

Now 2022 had his favorite collection of code points that he
kept in buckets under the bridge. But he was very, very particular
about how he organized his collection. All the code points 00 to 1F
had to go in the bucket labeled "C0", and all the code points 20
to 7E had to go in the bucket labeled "G0" (or "GL" --
sometimes the troll would get confused). He had other, even
bigger code points, too, but we can save those for another story.

2022 said all the code points in the G0 bucket could get names.
In fact they could get lots of names, if they wanted. So 2022 also
started collecting sets of characters, where all those names were
written down. Sometimes he would escape to one set and admire
all those pretty names, and then he would escape to another set
and admire other pretty names. 2022 was a great admirer of
escaping, by the way, as well as pretty names.

But the code points in the C0 bucket were different. 2022
insisted that those code points weren't like the ones in the
G0 bucket, and they couldn't have names at all.
Indeed, these were very odd code points -- 2022 called
them "control functions". Sometimes when the troll took one
out of the C0 bucket and examined it, it did one thing, but
the next time it might do something completely different.
Only 2022's friend, the troll named 6429 living under the next
bridge to the north, really understood what they might be doing from
one week to the next.

One day an aspiring young wizard named Unicode was crossing
the bridge. As an aspiring young wizard, he was rather observant.
And he noticed that there was a troll living under the bridge and
that that troll had stolen all the code points and was hoarding
them in strangely labeled buckets under the bridge. Being a wizard
and all, he knew that it was his duty to slay the troll and free all the
code points. So he set about writing down the appropriate spell in
his brand new spellbook.

Now Unicode was a very egalitarian wizard -- it just seemed right
to him that all code points should be able to have names, and it
would be better if each one had just one, unique name. That way,
none of them would get jealous of all the names some other code
point had acquired, and besides, each code point would know its name
and could come when you called it. So in the first version of
Unicode's spellbook, he wrote the spell down just that way. He
called his spell "Unicode 1.0", because, well, it was his spell,
after all, and the very first complete spell that he would be trying
to use. 00 could be called "NULL" and 01 could be called "START OF
HEADING", just like 20 could be called "SPACE" and 2D could be
called "HYPHEN-MINUS".

You may be wondering why Unicode would use such odd names
for all the code points, but then there is no accounting for the whims 
of wizards, I guess.

Well, once Unicode had finished writing down the Unicode 1.0 spell,
he started casting it on the troll:

Shazm! Ffffppfft!

To Unicode's surprise, the spell only partly worked, but then fizzled.
The troll had been badly hurt, but he was still limping around under
the bridge, and he still clung tightly to his buckets of code points.

Unicode looked around to see what the problem could be, and
noticed that there was a warlock at the other end of the bridge.
It was an infamous warlock who had taken to calling himself 10646,
and from all appearances he was *also* trying to cast a spell to
kill the troll and free all the code points. Apparently, casting the
two spells at the same time had resulted in interference in the ley lines.
That was why neither spell had fully worked, and was why the troll
2022 was still limping around with his code point buckets.

The wizard Unicode headed across the bridge to speak to the
warlock 10646:

"Look, we both want to slay that troll and free his code points.
Why don't we team up and cast synchronized spells?"

But 10646 was a suspicious warlock. He wasn't sure that *all*
of the code points could be