Re: CJK stroke order data: kRSUnicode v. kRSKangXi
Mr. Cook,

Thank you for all the information (and bringing Wenlin to my attention as well).

> Where a specific value attested in a specific Kangxi edition is missing from kRSUnicode, it would indeed be useful to add it,

Great to hear that.

> Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention.

OK, I will prepare a more comprehensive list. Do you mean that you would submit such a formal proposal? Or can I submit it myself somehow?

> PS: About the subject line of your message.

Yes, of course, the subject should have read "CJK radical-stroke count data". Not one of my brightest moments, I guess...

--
Adam Nohejl

___
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode
Names for control characters (Was: (in 6429) in allkeys.txt)
Ken Whistler wrote:

> Ah, I see what the interpretation problem was. Yes, that is a straightforward kind of improvement -- easily enough done. Look for a change the next time the file is updated. (It will not be immediately changed, pending other review comments.)

Thanks! Then I'll skip making a formal request about this.

Regarding these names in ISO 6429 again, how come these control characters don't have Unicode names? For many uses of names, the control characters have as much need for them as any other character.

Since it seems so straightforward it must have been suggested several times to introduce names like

CONTROL CHARACTER NULL
CONTROL CHARACTER START OF HEADING
CONTROL CHARACTER START OF TEXT

etc., so I assume there are good reasons for not doing that, but I can't see what they are.

Since applications want names they will use other things as names when there isn't a real name, and that leads to problems. Take Emacs, where the command describe-char currently describes U+0007 as

name: <control>
old-name: BELL

(I reported the misusage of <control> here as a name in 2009, but it wasn't fixed until this year, so still not in a released version.) The usage of BELL here invites confusion with U+1F514 BELL. Emacs should do better regarding this, but still, with a proper name all of this would have been averted.
Re: Names for control characters (Was: (in 6429) in allkeys.txt)
They do have aliases in NameAliases.txt:

0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
...

Mark
https://google.com/+MarkDavis
*Il meglio è l’inimico del bene*

On Wed, Mar 12, 2014 at 1:32 PM, Per Starbäck starb...@stp.lingfil.uu.se wrote:

> Regarding these names in ISO 6429 again, how come these control characters don't have Unicode names? For many uses of names, the control characters have as much need for them as any other character.
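The alias entries quoted in this reply follow a simple semicolon-separated layout (code point; alias; alias type). As a minimal sketch, assuming that three-field layout and `#` comments, the following Python groups aliases by code point. The `SAMPLE` text is a hand-copied excerpt written with code point 0000 for the first two entries, and `parse_name_aliases` is a hypothetical helper name, not part of any Unicode tool.

```python
from collections import defaultdict

# Hand-copied excerpt in the NameAliases.txt layout: code;alias;type
SAMPLE = """\
# comment lines and blank lines are skipped
0000;NULL;control
0000;NUL;abbreviation
0001;START OF HEADING;control
0001;SOH;abbreviation
0002;START OF TEXT;control
0002;STX;abbreviation
"""

def parse_name_aliases(text):
    """Return {code point: [(alias, alias_type), ...]} in file order."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        code, alias, kind = line.split(";")
        aliases[int(code, 16)].append((alias, kind))
    return dict(aliases)

ALIASES = parse_name_aliases(SAMPLE)
```

With this table, `ALIASES[0x0001]` holds both the `control` alias and the `abbreviation` alias for U+0001, so a program still has to decide which one to display.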
Re: Names for control characters (Was: (in 6429) in allkeys.txt)
> From: starb...@stp.lingfil.uu.se (Per Starbäck)
> Date: Wed, 12 Mar 2014 13:32:15 +0100
> Cc: unicode@unicode.org
>
> Regarding these names in ISO 6429 again, how come these control characters don't have Unicode names?

They have a non-empty "old name" field:

0000;<control>;Cc;0;BN;;;;;N;NULL;;;;

> Emacs should do better regarding this

As you yourself say, it already does, so I don't see the point in this rant.
RE: Names for control characters (Was: (in 6429) in allkeys.txt)
Please be very careful here.

Having a non-empty value in field 1 of UnicodeData.txt is *not* the same as having a Unicode name. See:

http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

for the gory details.

The Unicode name is formally defined in terms of the Name property, which itself is a combination of enumerated values extracted from UnicodeData.txt, plus a number of rules. For all characters whose General_Category=Cc, the formal definition of the Name property is a null string.

The string "<control>" is *never* to be interpreted as a Unicode name. It is a field placeholder with legacy status. See "Interpretation of Field 1 of UnicodeData.txt" in the section I cited above.

As far as user interfaces and other applications needing names for Unicode control characters -- one of the reasons that the namespace for Unicode characters includes all of the formal name aliases provided in NameAliases.txt is so that applications can safely treat any formal name alias for a control character (or the other abbreviations, etc., also listed in NameAliases.txt) *as if* they were Unicode names, without running into name collisions with the actual Name property value for Unicode characters.

The history of the name collision for the (relatively) recently encoded U+1F514 BELL with the traditional usage for the U+0007 control function BELL led the UTC to extend the namespace as noted, so we won't be running into more such problems in the future.

If Emacs were to use ALERT or the abbreviation BEL for U+0007, instead of <control>, that would avoid the collision with U+1F514 BELL, be conformant to the Unicode Standard, and presumably be helpful to users, as well.

See the entries for U+0007 in NameAliases.txt:

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation

--Ken

> > Regarding these names in ISO 6429 again, how come these control characters don't have Unicode names?
>
> They have a non-empty "old name" field:
>
> 0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
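The approach described in this message (use the Name property where it is non-empty, otherwise treat a formal alias as if it were a name) might look roughly like the following in Python. This is only a sketch under stated assumptions: `CONTROL_ALIASES` is a hypothetical, hand-copied subset of the `control` entries in NameAliases.txt, and `display_name` is an illustrative helper, not how Emacs or any other program implements it.

```python
import unicodedata

# Hypothetical hand-copied subset of the "control" aliases in NameAliases.txt.
CONTROL_ALIASES = {
    0x0000: "NULL",
    0x0007: "ALERT",  # not "BELL": that is the Name of U+1F514
}

def display_name(ch):
    """Pick a label for ch: the Name property, else an alias, else U+XXXX."""
    # unicodedata.name() returns the default for characters without a name,
    # which includes every General_Category=Cc character.
    name = unicodedata.name(ch, "")
    if name:
        return name
    alias = CONTROL_ALIASES.get(ord(ch))
    if alias:
        return alias
    return "U+%04X" % ord(ch)
```

Here `display_name("\x07")` yields the alias ALERT while `display_name("\U0001F514")` yields the actual name BELL, so the two labels cannot collide.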
Re: Names for control characters (Was: (in 6429) in allkeys.txt)
> From: Whistler, Ken ken.whist...@sap.com
> Date: Wed, 12 Mar 2014 16:48:25 +
> Cc: Whistler, Ken ken.whist...@sap.com, unicode@unicode.org
>
> Please be very careful here.
>
> Having a non-empty value in field 1 of UnicodeData.txt is *not* the same as having a Unicode name.

You will see that I didn't refer to the Name attribute, I referred to the "old name" attribute (called Unicode_1_Name in UAX#44).
Re: Names for control characters
Ken Whistler wrote:

> Please be very careful here.
>
> Having a non-empty value in field 1 of UnicodeData.txt is *not* the same as having a Unicode name. See:
>
> http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf#G135207

I know it's not a name. My question was *why* control characters don't *have* names like

CONTROL CHARACTER NULL
CONTROL CHARACTER START OF HEADING
CONTROL CHARACTER START OF TEXT

etc. It would be so obvious to have it like that, so I assume there is some specific reason not to, but I still can't figure it out.

For me there is no less reason for these characters to have names than any others, so for me it's as if Linear B characters didn't have names, and I got the answer "no problem, they have aliases, so that's OK!" This is just strange to me. If names aren't needed, why do almost all characters have them?

This is not about Emacs. Emacs was an example of a program that has use for character names, and has a harder job because of this strangeness. Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against Emacs when I mention that this property of Unicode has led to longstanding (small) bugs there, but I think real examples are better than made-up ones.

> If Emacs were to use ALERT or the abbreviation BEL for U+0007, ...

Yes, programs could have their own lists of preferred aliases to use, or have a rule such as "always use the first alias", but why? Why not have a name, so programs don't have to choose which alias to use?

(I may be coming off as having a mission about this: "it should be done like this!!", but mostly this is just a question: "it seems obvious it should be done like this, so what am I missing?")
Re: Names for control characters
> From: starb...@stp.lingfil.uu.se (Per Starbäck)
> Cc: Eli Zaretskii e...@gnu.org, unicode@unicode.org
> Date: Wed, 12 Mar 2014 20:37:57 +0100
>
> This is not about Emacs. Emacs was an example of a program that has use for character names, and has a harder job because of this strangeness. Too bad that (Emacs developer) Eli Zaretskii sees it as a rant against Emacs when I mention that this property of Unicode has led to longstanding (small) bugs there, but I think real examples are better than made-up ones.

What I saw as a rant is only this part (which was the only one I quoted):

> Emacs should do better regarding this

"Should do better" means it still doesn't, although it's expected to.
Re: Names for control characters
On Wed, Mar 12, 2014 at 12:37 PM, Per Starbäck starb...@stp.lingfil.uu.se wrote:

> My question was *why* control characters don't *have* names

That's because formally the ISO control codes do not have one fixed, normative meaning; implementers may or may not follow ISO 6429. That is why these don't have names in ISO 10646 and in Unicode.

http://www.unicode.org/faq/casemap_charprop.html#15

Of course, a few control codes (e.g., U+000A) are very widely used, and have Unicode properties according to that use. (e.g., White_Space)

markus
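The point that a control code can carry properties without carrying a name can be checked with Python's standard `unicodedata` module. A minimal sketch follows; `describe` is a hypothetical helper, and note that Python exposes no direct White_Space lookup, so `str.isspace()` is used here only as an approximation of that property.

```python
import unicodedata

def describe(ch):
    """Collect a few properties of ch; 'name' is None when there is no Name."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "name": unicodedata.name(ch, None),
        "category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        # Approximation only: str.isspace() is not the White_Space property.
        "isspace": ch.isspace(),
    }

LINE_FEED = describe("\n")
```

For U+000A the `name` entry is `None` even though the category, bidirectional class, and whitespace behaviour are all defined.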
Re: CJK stroke order data: kRSUnicode v. kRSKangXi
On Mar 12, 2014, at 2:59 AM, Adam Nohejl wrote:

>> Since kRSUnicode is a Normative property, a formal proposal to modify that data is required, for review in WG2. I have added notes on the items you mention below, for consideration in that process, and in the meantime, if you identify any other issues, please bring them to our attention.
>
> OK, I will prepare a more comprehensive list. Do you mean that you would submit such a formal proposal? Or can I submit it myself somehow?

You are welcome to prepare a proposal, or just send us your list. We have already started a proposal to augment kRSUnicode, but I'm not sure about the timeframe for completion. The proofing of the various Kangxi properties is separate from this, but is aimed at the specific KX edition used by IRG in Extension B work.

-Richard
RE: Names for control characters
Per continued:

> I know it's not a name. My question was *why* control characters don't *have* names like
>
> CONTROL CHARACTER NULL
> CONTROL CHARACTER START OF HEADING
> CONTROL CHARACTER START OF TEXT
>
> etc. It would be so obvious to have it like that, so I assume there is some specific reason not to, but I still can't figure it out.
>
> For me there is not less reason for these characters to have names than any others, so for me it's like Linear B characters didn't have names, and I got the answer "no problem, they have aliases, so that's OK!" This is just strange to me. If names aren't needed, why do almost all characters have them?

Ah, so this is a "Why is the sky blue?" kind of question. ;-) And perhaps the correct response is then a "Just So" story...

Once upon a time, there was an ISO framework for character encoding. Officially his name was ISO 2022 Information technology -- Character code structure and extension techniques. But we'll think of him as the troll that lives under the bridge and just call him 2022 for short.

Now 2022 had his favorite collection of code points that he kept in buckets under the bridge. But he was very, very particular about how he organized his collection. All the code points 00 to 1F had to go in the bucket labeled C0, and all the code points 20 to 7E had to go in the bucket labeled G0 (or GL -- sometimes the troll would get confused). He had other, even bigger code points, too, but we can save those for another story.

2022 said all the code points in the G0 bucket could get names. In fact they could get lots of names, if they wanted. So 2022 also started collecting sets of characters, where all those names were written down. Sometimes he would escape to one set and admire all those pretty names, and then he would escape to another set and admire other pretty names. 2022 was a great admirer of escaping, by the way, as well as pretty names.

But the code points in the C0 bucket were different.
2022 insisted that those code points weren't like the ones in the G0 bucket, and they couldn't have names at all. Indeed, these were very odd code points -- 2022 called them control functions. Sometimes when the troll took one out of the C0 bucket and examined it, it did one thing, but the next time it might do something completely different. Only 2022's friend, the troll named 6429 living under the next bridge to the north, really understood what they might be doing from one week to the next.

One day an aspiring young wizard named Unicode was crossing the bridge. As an aspiring young wizard, he was rather observant. And he noticed that there was a troll living under the bridge and that that troll had stolen all the code points and was hoarding them in strangely labeled buckets under the bridge. Being a wizard and all, he knew that it was his duty to slay the troll and free all the code points. So he set about writing down the appropriate spell in his brand new spellbook.

Now Unicode was a very egalitarian wizard -- it just seemed right to him that all code points should be able to have names, and it would be better if each one had just one, unique name. That way, none of them would get jealous of all the names some other code point had acquired, and besides, each code point would know its name and could come when you called it.

So in the first version of Unicode's spellbook, he wrote the spell down just that way. He called his spell Unicode 1.0, because, well, it was his spell, after all, and the very first complete spell that he would be trying to use. 00 could be called NULL and 01 could be called START OF HEADING, just like 20 could be called SPACE and 2D could be called HYPHEN-MINUS. You may be wondering why Unicode would use such odd names for all the code points, but then there is no accounting for the whims of wizards, I guess.

Well, once Unicode had finished writing down the Unicode 1.0 spell, he started casting it on the troll:

Shazm! Ffffppfft!
To Unicode's surprise, the spell only partly worked, but then fizzled. The troll had been badly hurt, but he was still limping around under the bridge, and he still clung tightly to his buckets of code points.

Unicode looked around to see what the problem could be, and noticed that there was a warlock at the other end of the bridge. It was an infamous warlock who had taken to calling himself 10646, and from all appearances he was *also* trying to cast a spell to kill the troll and free all the code points. Apparently, casting the two spells at the same time had resulted in interference in the ley lines. That was why neither spell had fully worked, and was why the troll 2022 was still limping around with his code point buckets.

The wizard Unicode headed across the bridge to speak to the warlock 10646: "Look, we both want to slay that troll and free his code points. Why don't we team up and cast synchronized spells?"

But 10646 was a suspicious warlock. He wasn't sure that *all* of the code points could be