Hi, James,

[STD13] defines LDH as the DNS identifiers, so what are the IDN identifiers? UCS is too big and contains many semantically equivalent characters for IDN. Should we ask the Unicode Consortium for a table defining the sets of semantically equivalent characters?

If we agree on the first RFC in Dan's list, I suggest asking the Unicode group to provide a table of "semantically equivalent characters of UCS", in which we can define which characters are used as:

1) label separators, i.e. punctuation and formatting marks
2) structured data indicators, e.g. $, %, & ...
3) unstructured data identifiers, e.g. alphabets, CJK, sound marks ...

The "IDN identifiers" should be a subset of such a table, and the table would determine the IDN normalization protocol in the RFC.

"Semantically equivalent characters of UCS" means characters that are equivalent when used as an IDN identifier, that is, when they are:

1) case insensitive,
2) size or width insensitive,
3) font insensitive (including the majority of TC/SC),
4) language insensitive (including CJK),
5) combination insensitive (regardless of NFC or NFKC).

Case, size and font insensitivity are easy to understand and have been addressed. TC/SC should fall under the font category, which is not addressed in Unicode. Language and combination insensitivity are the ones I would like to explain.

Language insensitive: e.g. circled numbers, circled Han numerals, Dingbats, and a subset of CJK. But another subset of CJK will differ semantically between languages, so we have to have separate tables to work with for each of them.

Case study 1): Kanji <business> has three forms, <business1>, <business1'> and <business2>, which are the same as Chinese <business1>, <business1'> and <business2>. Chinese uses <business2> as the IDN id for all three. Japanese agrees to put <business1> and <business1'> in, but wants <business2> as a different semantic set, since they are semantically different in their accounting databases. The issue is which class Kanji <business2> should be in. The current [TSconv] takes it out of the table, so it is undecided. Since we are designing the future IDN, I think we can assume that all IDN data has to be loaded somehow.
If Japanese agree on the semantic equivalence of the symbols to be used in IDN, then we can ask whether the current <business2> handled by the existing local JIS systems can stay local without leaking into the new IDN, and let <business2> be in the semantically equivalent set for global communication. The Unicode group has to make such a choice for the IETF.

Case study 2): If there is a <business3> in Kanji but not in Chinese, then <business3> is a set by itself.

Case study 3): Whether Armenian small n should be in the same set as Latin n depends on the users' decision; that is, we take the Unicode group's advice on this, since they are the language usage experts who can make such a decision. If Armenian small n is in with Latin, then we have another case similar to CJK unification, case 1). If it is not in with Latin, then we have another case like Bengali and the like.

Combination insensitive: <i><acute on top>, <i><acute> and <acute on top><i> shall be the same, all in the set <i+acute on top>. This is the basis for normalizing either from a table (TC/SC like) or by a procedure (NFC or NFKC like).

So the format is something like:

<i>: <I>,<tilt i>,<fat i>,<Greek i>,<Greek I>,...
<i with acute>: <I with acute>,<i><acute>,<i><acute on top>,<I><acute>...

To keep the request reasonable, I suggest we limit our scope to UCS Plane 0 characters. We will then end up with a table that displays nicely on the Web for us to read and for the public to judge, instead of an IETF draft full of U+E456.., which is meant for sorting data and spot checks.

Regards,
Liana
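
As a rough illustration of the "table plus procedure" idea above, here is a minimal Python sketch. It assumes the standard unicodedata module for the NFD/NFC steps; the table entries and the helper name idn_key are hypothetical, chosen only to mirror the <i> and <i with acute> rows, and are not taken from any Unicode data file. A mark that precedes its base, as in <acute on top><i>, is not handled here.

```python
import unicodedata

# Illustrative entries only: base-character variant -> representative of its set.
# A real table would have to come from the Unicode group; these two rows are
# assumptions modelled on the <Greek i> and width-variant examples.
VARIANT_TO_REPRESENTATIVE = {
    "\u03b9": "i",  # GREEK SMALL LETTER IOTA
    "\uff49": "i",  # FULLWIDTH LATIN SMALL LETTER I
}


def idn_key(label: str) -> str:
    """Hypothetical helper: reduce a label to the representative of its set."""
    folded = label.casefold()                          # case insensitive
    decomposed = unicodedata.normalize("NFD", folded)  # split base + marks
    mapped = "".join(VARIANT_TO_REPRESENTATIVE.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFC", mapped)        # recompose base + marks


# <I with acute>, <i><acute> and <I><acute> share one key.
print(idn_key("\u00cd") == idn_key("i\u0301") == idn_key("I\u0301"))
# <I>, <Greek I> and the fullwidth capital share the key of <i>.
print(idn_key("I") == idn_key("\u0399") == idn_key("\uff29"))
```

Decomposing before the table lookup means the table only needs rows for base characters; accented variants are covered by the recomposition step rather than by extra rows.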
