Liana ruminated:

> [STD13] defines LDH as the DNS identifiers, so what are the IDN
> identifiers? The UCS is too big and contains many semantically
> equivalent characters for IDN. Should we ask the Unicode Consortium
> for a table defining semantically equivalent character sets?
In my opinion, no. This concept of "semantically equivalent character
sets" is way too imprecisely defined to make sense. What the Unicode
Consortium provides is a large number of precisely defined data
tables, giving various properties of the entire set of characters in
the UCS. It is then up to a group such as this, in the context of its
particular requirements (as for IDN identifiers), to make use of those
property tables to pick and choose among the characters as appropriate
to its application(s). (As has been done for nameprep.)

> If we agree on the first RFC in Dan's list, I suggest asking the
> Unicode group to provide a table of "semantically equivalent
> characters of the UCS", where we can define which characters are
> used for
> 1) label separators, i.e. punctuation and formatting marks
> 2) structured data indicators, i.e. $/%/& ...
> 3) unstructured data identifiers, i.e. alphabets, CJK, sound
>    marks...
> "IDN identifiers" should be a subset of such a table, to determine
> the IDN normalization protocol in the RFC.

The Unicode Consortium's take on identifiers is already published in
section 5.16 of the Unicode Standard, Version 3.0. Updated summary
table information, covering the repertoire of Unicode 3.1.1, can be
found in:

  ftp://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Look for the ID_Start and ID_Continue properties.

By the way, "format[t]ing marks" are typically not label separators.
They are mostly ignorable for identifier formation -- they can either
be omitted from your identifier repertoire or be ignored (included
silently) if not omitted. In either case they do not *delimit*
identifiers.

> "Semantically equivalent characters of the UCS" means characters
> are equivalent for use as IDN identifiers when they are
> 1) case insensitive,
> 2) size or width insensitive,
> 3) font insensitive (including the majority of TC/SC),
> 4) language insensitive (including CJK),
> 5) combination insensitive (regardless of NFC or KNFC).
>
> Case, size, and font insensitivity are easy to understand, and have
> been addressed.

What you seem to be aiming at here is a collection of various kinds
of character foldings. Character foldings -- even case foldings --
are a rather murky area. The UTC position on case folding is
summarized in:

  ftp://www.unicode.org/Public/UNIDATA/CaseFolding.txt

Regarding other kinds of foldings, the UTC is currently working on a
Unicode Technical Report on the subject. Nameprep involves a number
of foldings -- but these issues are not, in fact, all that easy to
understand.

> TC/SC shall be under the font category, which is not addressed in
> Unicode.

That you characterize TC/SC as a font folding illustrates part of the
problem. It is not a font folding, and cannot be handled that way,
except in the grossest manner.

> But language and combination insensitivity are the ones I'd like to
> explain.
>
> Language insensitive: i.e. circled numbers, circled Han numerals,
> Dingbats, a subset of CJK. But other subsets of CJK will differ
> semantically for each language, so we have to have separate tables
> to work with for each of them.

Even with your examples, it isn't clear what you are talking about
here with the term "language insensitive".
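
As a minimal sketch of consulting those property tables directly
(this is an illustration only, not nameprep): the following Python
assumes a local copy of DerivedCoreProperties.txt downloaded from the
URL above; the file path and function names are made up for the
example.

    # Read ID_Start / ID_Continue ranges out of a local copy of
    # DerivedCoreProperties.txt and test whether a label's characters
    # carry those properties.

    def load_property(path, wanted):
        """Collect (start, end) code point ranges for one property."""
        ranges = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()   # drop comments
                if not line:
                    continue
                fields = [field.strip() for field in line.split(";")]
                if len(fields) < 2 or fields[1] != wanted:
                    continue
                if ".." in fields[0]:
                    start, end = fields[0].split("..")
                else:
                    start = end = fields[0]
                ranges.append((int(start, 16), int(end, 16)))
        return ranges

    def has_property(ranges, ch):
        cp = ord(ch)
        return any(start <= cp <= end for start, end in ranges)

    id_start = load_property("DerivedCoreProperties.txt", "ID_Start")
    id_continue = load_property("DerivedCoreProperties.txt", "ID_Continue")

    def is_identifier(label):
        """First character ID_Start, remaining ones ID_Continue."""
        return (bool(label)
                and has_property(id_start, label[0])
                and all(has_property(id_continue, c) for c in label[1:]))

    print(is_identifier("exämple"))   # True: all letters
    print(is_identifier("a-b"))       # False: HYPHEN-MINUS is not ID_Continue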
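
On the case-folding side, a quick way to see what the CaseFolding.txt
mappings do in practice is Python's str.casefold(), which applies the
full case-folding mappings from the Unicode Character Database:

    # Full case folding is already more aggressive than lowercasing:
    # e.g. U+00DF LATIN SMALL LETTER SHARP S folds to "ss", and final
    # and non-final Greek sigma fold to the same letter.
    for s in ["Straße", "STRASSE", "strasse", "ΣΊΣΥΦΟΣ", "σίσυφος"]:
        print(repr(s), "->", repr(s.casefold()))

Folding is one-way and lossy: it is meant for caseless matching, not
for producing a form users ever see, and it says nothing about the
width, font, or script foldings (TC/SC and the like) discussed above.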
> I think we are designing the future IDN; we assume all IDN has to
> be loaded somehow. If Japanese users agree on the semantic
> equivalence of the symbol to be used in IDN, then we can ask
> whether the current <business2> handled by the existing JIS local
> system can stay local without leaking into the new IDN, and let
> <business2> be in the semantically equivalent set for global
> communication. The Unicode group has to make such a choice for the
> IETF.

Why? Language use and country conventions are not areas in which the
Unicode Consortium holds expertise, nor ones in which it wishes to
establish standards.

> Case study 3): Whether Armenian small n should be in with Latin n
> or not depends on the users' decision; that is, we take the Unicode
> group's advice on this, since they are the language-usage experts
> to make such a decision.

Well, we aren't language use experts. But if you want an expert
opinion on *character* identity, ARMENIAN SMALL LETTER NOW is not and
never will be grouped with, confused with, or interchanged with LATIN
SMALL LETTER N, any more than it would be with U+30F3 KATAKANA LETTER
N or, for that matter, the Han character U+5C3C ni2, used in Chinese
transliteration of Nepal, Nero, Nile, nylon, nicotine, Nicaragua,
Nietzsche, Nice, and Nixon.

> If the Armenian small n is in with Latin, then we have another case
> similar to CJK unification case 1). If they are not in with Latin,
> then we have another case like Bengali and similar scripts.

?? I presume this is an allusion to your concern that U+09EA BENGALI
DIGIT FOUR is confusable (out of context) with the appearance of
U+0038 DIGIT EIGHT. But Armenian doesn't look the slightest bit like
Latin, so it isn't clear what you are on about here.

> Combination insensitive: <i><acute on top>, <i><acute>, and
> <acute on top><i> shall be the same,

Well, these aren't all the same. This is a fundamental
misunderstanding of how combining characters work in Unicode.

> all in the set <i+acute on top>. This is the basis for normalizing,
> either from a table (TC/SC-like) or by a procedure (NFC- or
> KNFC-like).

The Unicode Normalization Algorithm (which defines NFC and NFKC --
not "KNFC") is based on tables, too. And it is not done by some vague
notion of assembling all the sets that we think ought to be "the
same". It is done by applying the algorithm rigorously to the
defining tables (of decompositions and of composition exclusions).

> So the format is something like:
> <i>: <I>, <tilt i>, <fat i>, <Greek i>, <Greek I>, ...

Wrong from the start. Trying to mix scripts together like that is
completely unextensible.

> <i with acute>: <I with acute>, <i><acute>, <i><acute on top>,
> <I><acute>, ...
>
> To keep the request reasonable, I suggest we limit our scope to UCS
> Plane 0 characters. And we will end up with a nice display on the
> Web for us to read and for the public to judge, instead of an IETF
> draft full of U+E456.. notation, which is meant for sorting data
> and spot checks.

The relevant data tables regarding Unicode normalization are all
already posted and are public for anyone to judge. Beyond that, this
effort to get someone (who?) to define all the "semantically
equivalent characters" just seems like an ill-destined detour from
actually getting the work on IDN accomplished.

--Ken
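
A minimal sketch of the normalization points above, assuming Python's
unicodedata module (which implements the Unicode Normalization
Algorithm from those same defining tables):

    import unicodedata

    i_plus_acute = "i\u0301"      # <i> followed by <combining acute on top>
    precomposed  = "\u00ed"       # <i with acute>, the precomposed character
    acute_plus_i = "\u0301i"      # acute *before* the i: not equivalent

    # NFC composes the base-plus-combining-mark sequence into the
    # precomposed form...
    print(unicodedata.normalize("NFC", i_plus_acute) == precomposed)   # True

    # ...but a combining mark written before its base is not canonically
    # equivalent, and no normalization form turns it into <i with acute>.
    print(unicodedata.normalize("NFC", acute_plus_i) == precomposed)   # False

    # NFKC additionally applies compatibility decompositions: a circled
    # digit folds to a plain digit, a different and lossier equivalence.
    print(unicodedata.normalize("NFKC", "\u2461"))   # CIRCLED DIGIT TWO -> 2
    print(unicodedata.normalize("NFC",  "\u2461"))   # unchanged under NFC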
