In addition:

1. This issue was debated at length some time ago. I suggest that the people arguing for visual confusability as a criterion for matching look at that discussion in detail before proceeding.
2. Moreover, stop and think about the implications: using both case folding and visual confusability would have some very unpleasant consequences. For example, it would force the ASCII letters N and V to be in the same equivalence class:

- N is in the same class as GREEK NU, by visual confusability.
- GREEK NU is in the same class as greek nu, by case folding.
- greek nu is in the same class as v, by visual confusability.
- v is in the same class as V, by case folding.

Mark
-----
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα -- Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, January 02, 2002 17:51
Subject: Character equivalence mapping (was: Re: [idn] SLC minutes)

> Edmon suggested:
>
> > Character Equivalence mapping is to deal with this issue:
> >
> > A registrant registers a domain <ALPHA><BETA>.example
> > Advertises it to other people as their capital form AB.example
> > An end user will not know whether it was Greek or English and attempts to
> > access the site with ab.example and does not get to it.
> >
> > With Character Equivalence mapping, this situation would not occur. No
> > matter how a domain name is represented, it is always unique.
>
> I think this example nicely points up the contrary problem that
> cross-script mapping has. If you start doing cross-script equivalence
> mapping to eliminate differences between (to Latin-trained eyes) confusable
> letters, you violate the integrity of other scripts and start mapping
> the set of possible strings in those scripts even more confusably into
> the already crowded domain namespace of Latin strings.
>
> In this particular example, suppose I was a Greek and actually wanted
> to register <ALPHA><BETA>.com, in addition to <ALPHA><BETA>.gr for
> the <ALPHA><BETA> construction company in Athens. Whoops!
> I'd be out of luck since ab.com already exists and is registered to
> Allen-Bradley. (See www.ab.com ) Why should I, as a Greek, find my
> own Greek namespace unpredictably polluted by some arbitrary list
> of equivalences between Greek letters and Latin letters?
>
> And exactly what equivalences would you suggest? Greek uppercase
> eta is basically indistinguishable in shape from a Latin uppercase "H".
> So do I equivalence map it to Latin "H", which would make no sense at
> all for transliteration and serve only the purposes of dumb equations
> for people who know nothing about Greek whatsoever? Or do I equivalence map it
> to Latin "I", which is the normal transliteration for eta in Modern Greek?
> Or do I equivalence map it to Latin "E", which is the normal transliteration
> for eta in Ancient Greek?
>
> So does: <ALPHA><BETA>.<OMICRON><MU><ETA><RHO><OMICRON><SIGMA>
>
> equate to: ab.omhpo<sigma> or ab.omiro<sigma> or ab.omero<sigma> or
> ab.omhpos or ab.omiros or ab.omeros ?
>
> By the way, the 5th example is how the Greeks themselves would Latinize
> it. (see www.omiros.gr )
>
> The problem of "AB.example" is generally dealt with by context. First
> of all "example" would be in Greek if I was really dealing with Greek.
> Second, if I wanted people to enter "ab.whatever" I'd be advertising
> in *English* to set the expectations. If I wanted people to enter
> "<alpha><beta>.....", I'd be advertising in *Greek* to set the expectations,
> and people would be using Greek keyboards and expect to enter Greek.
>
> Furthermore, visual confusability quickly runs off the road as the basis
> for determining equivalence classes when you start to deal with scripts
> that have more complicated rules for the presentation of glyphs than
> is typical for the Latin script. Which of several possible forms is
> the basis for the confusability used to determine the equivalence?
> And this turns into an N-body problem, because you start having to
> account for visual confusability between N different scripts -- not just
> between N scripts and Latin characters. Where do you draw the line, in
> principle? Or do we just end up arguing for the next decade about all
> the edge cases?
>
> > Bear in mind that this needs to happen only during matching of names within
> > the DNS server.
> >
> > A registrant can register <ALPHA><b>.example all they want. This is the
> > misconception that I wanted to point out. Character Equivalence mapping
> > does not prohibit mixed scripts.
>
> But it does severe damage to the integrity of namespaces in other scripts.
>
> This is Latin- and English-centric thinking, in my opinion, that would
> damage the whole point of having IDN's by folding other scripts towards
> Latin characters.
>
> --Ken
>
> > Edmon
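
As a rough illustration of point 2 at the top of this message, the chain from N to V can be reproduced with a small union-find sketch in Python. Everything here is an assumption made for the example: the two confusable pairs are hand-picked (not taken from any published table), Python's built-in str.casefold() stands in for whatever case folding a matching scheme would actually use, and the names DisjointSet, build_classes and same_class are hypothetical.

```python
class DisjointSet:
    """Minimal union-find: each character belongs to exactly one class."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own class, then walk to the root.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


# Hypothetical, hand-picked confusable pairs (assumption for illustration):
# Latin N vs GREEK CAPITAL LETTER NU, and GREEK SMALL LETTER NU vs Latin v.
CONFUSABLE_PAIRS = [("N", "\u039D"), ("\u03BD", "v")]

# The six characters involved in the chain.
CHARS = ["N", "n", "V", "v", "\u039D", "\u03BD"]


def build_classes(chars, use_casefold, use_confusables):
    """Merge characters into classes under the selected rules."""
    ds = DisjointSet()
    for ch in chars:
        ds.find(ch)
        if use_casefold:
            # Case folding: a character is equivalent to its folded form.
            ds.union(ch, ch.casefold())
    if use_confusables:
        for a, b in CONFUSABLE_PAIRS:
            ds.union(a, b)
    return ds


def same_class(ds, a, b):
    return ds.find(a) == ds.find(b)


if __name__ == "__main__":
    for fold, confuse in [(True, False), (False, True), (True, True)]:
        ds = build_classes(CHARS, fold, confuse)
        print(f"casefold={fold} confusables={confuse} "
              f"N~V: {same_class(ds, 'N', 'V')}")
```

With either rule on its own, N and V stay in separate classes; it is only when both rules are combined that transitivity merges them, which is the consequence described in point 2.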
