First of all, allow me to say that I am not proposing to include Character Equivalence preparation in the "protocol". In fact, I intend this topic to be more "operational", and I have mentioned that this might not be the best list on which to discuss it. I did, however, want to clarify what I said during the SLC meeting as noted in the minutes (which is what started this discussion), especially since the parties interested in this topic are likely to be on this list too. I apologize for lingering on the discussion, but allow me to say the following:

I do believe very much that, since this issue sparks so much contention, we should archive the problem and the discussions better, perhaps in an informational document pointing out that there is a set of codepoints that may be perceived as equivalent. This would in fact include issues for Latin-Greek-Cyrillic (LGC) characters as well as at least JPCHAR, HangulChar and T/S Chinese. Because we do not have much information on Arabic and the other General Scripts, I said that this document should be a living document. If it becomes apparent that some Character Equivalence preparation is beneficial for these scripts, then we can amend the document.

The document will contain a discussion of the different issues surrounding LGC and CJK characters in Unicode respectively, as well as a set of tables listing the equivalent characters. With this document, a zone operator (whether a local zone manager or a TLD manager) can choose one of three approaches to deal with the issue (and can choose whether to implement the mappings, and which mappings to use, for their own zone):

1. do nothing
2. multiple registrations
3. consolidate characters before name matching within the name server

It will alarm me if anyone thinks that this is useless, because these are real issues that an implementor should be aware of.

Edmon

----- Original Message -----
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Edmon" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, January 03, 2002 2:04 PM
Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)

> You do not seem to have read the material I suggested. There was quite a
> discussion on this list on why it is hopeless AND counterproductive to try
> to identify all the characters whose glyphs could be visually
> indistinguishable to a user, and create equivalence classes based upon
> that identification.
>
> Hopeless: For this to be used in IDN, one would have to collect a massive
> amount of data; and all at once -- the mappings can't simply change over
> time. There are a huge number of questionable cases, where for accuracy one
> would have to survey a substantial range of fonts in the common sizes used
> on different platforms to determine visual distinguishability. Examples are
> quotation dash and the CJK character for "one", the katakana KA and the CJK
> character for power, etc. And in many of those cases, one would still have
> to make a judgment call, since even if the pixels are always somewhat
> different, users may perceive the glyphs as being the same.
>
> Counterproductive: As I noted earlier, the N / V problem arises hundreds of
> times -- a lowercase Greek nu in common fonts (e.g. Arial Unicode MS) is
> visually indistinguishable from a Latin v. That causes N, V, n, v, NU, nu to
> be part of the same equivalence class, so a company could not register
> NIA.com if VIA.com were already registered.
>
>
> Although it is certainly "in character" for this list to endlessly (and
> pointlessly) repeat earlier discussions, I'd suggest you look at the email
> archives on this topic, since this subject has arisen many times. After you
> have read them, if you have any new information it might be worth continuing
> the discussion.
>
> Mark
>
> P.S. Unfortunately, there appears to be no web interface for viewing the
> archive, so it is a bit clumsy to search for particular topics. I think this
> topic was discussed at length sometime in 2000, with repeated comments
> through 2001.
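
The N / V chain described in the quoted message can be reproduced with a small sketch. It is only an illustration under stated assumptions: the two pair lists below (CASE_PAIRS and VISUAL_PAIRS) are hypothetical and hand-picked, not taken from any Unicode or IDN table, and the helper names are made up for this example. Merging the pairs transitively with a simple union-find shows how N and V end up in one class.

```python
# Hypothetical relations, for illustration only (not an official table).
CASE_PAIRS = [
    ("N", "n"),            # Latin N / n
    ("V", "v"),            # Latin V / v
    ("\u039d", "\u03bd"),  # Greek capital NU / small nu
]
VISUAL_PAIRS = [
    ("N", "\u039d"),       # Latin N looks like Greek capital NU
    ("v", "\u03bd"),       # Latin v looks like Greek small nu
]


def build_find(pairs):
    """Merge the given pairs transitively and return a 'find' function."""
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)      # unseen characters form their own class
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path compression
            ch = parent[ch]
        return ch

    for a, b in pairs:
        parent[find(a)] = find(b)      # union the two classes
    return find


def names_collide(find, x, y):
    """True if two names are equal character by character, up to class."""
    return len(x) == len(y) and all(find(a) == find(b) for a, b in zip(x, y))


case_only = build_find(CASE_PAIRS)
case_and_visual = build_find(CASE_PAIRS + VISUAL_PAIRS)
```

With case folding alone, N and V stay in separate classes. Once the two visual pairs are added, the chain N, NU, nu, v, V links them, so names_collide(case_and_visual, "NIA.com", "VIA.com") reports a collision.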
>
> —————
>
> Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
> http://www.macchiato.com
>
> ----- Original Message -----
> Sent: Thursday, January 03, 2002 10:39
>
> > From: "Edmon" <[EMAIL PROTECTED]>
> > To: "Mark Davis" <[EMAIL PROTECTED]>
> > Subject: Re:
> > Date: Thu, 3 Jan 2002 13:48:13 -0500
> >
> > Hi Mark,
> >
> > But I am not suggesting any transliteration or transformation. I am only
> > suggesting that some characters which might be "perceived" or "confused"
> > as equivalent/identical be collected together and regarded as
> > "equivalent" during the DNS name matching process within the DNS server.
> > It should not hinder the ability to have different characters or
> > representations send different data over the wire and to maintain the
> > original information.
> >
> > Take the current DNS, for example: neither uppercase nor lowercase is the
> > "primary" case; they are considered equal. This means that you can have a
> > domain that is DoMaIn.CoM as well as a domain that is domain.COM, but
> > during the matching process the DNS will declare that they are the same.
> > It is not that either is mapped to (or declared to be the "same as")
> > domain.com or DOMAIN.COM; they are simply considered equivalent (at
> > least, that is what it is supposed to do, I think).
> >
> > So, what is the complexity except for coming up with the exact list?
> > The "exact list", I believe, should be a living document and be revised
> > over time, but we need to start somewhere. It should be relatively
> > uncomplicated to come up with a list of Latin-Greek-Cyrillic equivalent
> > characters. In fact, I don't mind working on a table that contains the
> > ones that could be "perceived" as equivalent to start with, and allowing
> > other people to scrutinize each one's equivalence in form. Or do you know
> > if there is such a list already?
> >
> > Edmon
> >
> > ----- Original Message -----
> > From: "Mark Davis" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>
> > Sent: Thursday, January 03, 2002 12:00 PM
> > Subject: Re:
> >
> > > It is not that simple. I suggest you review the messages on this topic
> > > in the archive for this list, and also read both of the following:
> > >
> > > http://www.unicode.org/unicode/standard/where/
> > > http://www.unicode.org/unicode/reports/tr17/
> > >
> > > Mark
> > > —————
> > >
> > > Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
> > > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> > >
> > > http://www.macchiato.com
> > >
> > > ----- Original Message -----
> > > Sent: Thursday, January 03, 2002 08:03
> > >
> > > > From: "Edmon" <[EMAIL PROTECTED]>
> > > > To: "Mark Davis" <[EMAIL PROTECTED]>
> > > > Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)
> > > > Date: Thu, 3 Jan 2002 11:12:53 -0500
> > > >
> > > > Hi Mark
> > > >
> > > > From: "Mark Davis" <[EMAIL PROTECTED]>
> > > > > 2. Moreover, stop and think about the implications; using both case
> > > > > folding and visual confusability would have some very unpleasant
> > > > > consequences. For example, it would force the ASCII letters N and V
> > > > > to be in the same equivalence class:
> > > >
> > > > I will argue that <nu> and v have some subtle difference, while
> > > > <ALPHA> and A, and <NU> and N, are truly identical. Perhaps
> > > > "character equivalence" was not a good choice of words; let's try
> > > > "Character Identicality". I think there are some characters that we
> > > > can truly say are "identical" under scrutiny. Do you think so?
> > > >
> > > > Edmon
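
As a reference point for approach 3 in the message at the top of this thread (consolidating characters before name matching within the name server), here is a minimal sketch. It is not a proposal for the protocol and not how any real name server works: the EQUIVALENT table holds a few hand-picked, hypothetical entries, and match_key and Zone are names invented for this example. It follows the DNS case analogy above: names are compared through a key, while the name as registered is stored unchanged.

```python
# Hypothetical equivalence table: character -> class representative.
# The entries are illustrative assumptions, not a vetted list.
EQUIVALENT = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u039d": "N",  # GREEK CAPITAL LETTER NU
}


def match_key(name):
    """Build a key used only for comparison; the name is never rewritten."""
    out = []
    for ch in name:
        ch = EQUIVALENT.get(ch, ch)                      # consolidate equivalents
        out.append(ch.lower() if ch.isascii() else ch)   # ASCII case-insensitivity
    return "".join(out)


class Zone:
    """Toy zone: one record per equivalence key, original spelling preserved."""

    def __init__(self):
        self._records = {}  # match key -> (name as registered, data)

    def register(self, name, data):
        key = match_key(name)
        if key in self._records:
            return False    # an equivalent name is already registered
        self._records[key] = (name, data)
        return True

    def lookup(self, name):
        return self._records.get(match_key(name))


zone = Zone()
first = zone.register("\u0391BC.example", "192.0.2.1")   # Greek ALPHA + "BC"
second = zone.register("ABC.example", "192.0.2.2")       # all-Latin spelling
found = zone.lookup("abc.example")
```

Here the first registration succeeds and the second is refused because both names share one key, while the lookup returns the record with the originally registered spelling (Greek ALPHA included). Approach 2 (multiple registrations) would instead store every equivalent spelling as its own record, and approach 1 would skip match_key altogether.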
