Doug Ewell <[EMAIL PROTECTED]> wrote:

> Is it really possible that we spent a year and a half, two years on
> putting together an IDN architecture, and during all that time nobody
> ever gave the slightest thought to the possibility of someone using
> IDNs for spoofing purposes,

No, it was thought about, and it was decided that the IDNA protocol was
not the place to address those issues; that they should be addressed in
registries and user interfaces. IDNA could have addressed the easier
portion of the problem (prohibiting punctuation and symbols), and for a
while I was arguing for that, but it still would have left the harder
part of the problem (dealing with script mixtures and homographs among
letters) for the registries and user interfaces to deal with, so why
not let them deal with the easier part too? (Of course, one could then
ask why that argument doesn't apply to all the invisible characters
that IDNA does prohibit. I have no good answer at the moment. Maybe
invisibility was the only disqualifying attribute that everyone could
agree on.)

John C Klensin <[EMAIL PROTECTED]> wrote:

> I hope that those who wrote the IDNA specs will agree with the
> statement of those principles I'm about to make, or at least that they
> are close... they may not.
>
> (1) To the extent possible, we should accommodate all Unicode
> characters, excluding as little as possible.

That (or something very similar) was a principle that went into the
IDNA spec. I personally was inclined to define both internationalized
domain names and internationalized host names, where the former would
be completely general (allowing *all* Unicode characters, even the
invisible ones), and the latter would be much narrower (excluding most
punctuation and symbols). This would be analogous to traditional domain
names (which allow all ASCII characters, even control characters) and
traditional host names (which allow only the ASCII letters, digits, and
one punctuation mark, the hyphen-minus).

On the other hand, there was an argument that the traditional
distinction between domain names and host names was the source of
endless confusion and debate, and was a mistake that should not be
repeated with IDNs. I have some sympathy for that argument.
In any case, we ended up with just one set of non-ASCII characters for
IDNs, between the two extremes: only invisible characters are excluded.
(I think there's one exception--a visible space character that is also
excluded.)

> (2) When code points had been identified by UTC as the same as, or
> equivalent to, others, we tended to map them together, rather than
> picking one and prohibiting the others.

This was more than a tendency; it was strictly followed.

> This has caused more problems than most of us expected, with people
> being surprised when they register or query using one character and
> the result that comes back uses another.

I think this happens only for the case-folding mappings. The
normalization mappings should not surprise anyone.

> It also creates a near-homograph problem that we haven't "discovered"
> in the last couple of weeks: If we have character X mapping to
> character Y, but X looks vaguely like Z, then there may be no Y-Z
> homograph, but there may be an X-Z one.

True. And again, I think it's just the case-folding mappings that do
this, not the normalization mappings.

> Curiously, if we followed existing precedents, we could even move
> IDNA from Proposed to Draft and change the tables to eliminate many
> mappings and characters: no change to the algorithm, just elimination
> of some features that didn't work in practice.

If we want to place further restrictions on the set of characters used
in IDNs, I think it would be pretty rude of us to simply add them to
the set of prohibited characters in Nameprep. What about the guy who
registered <not_equal>.com? What if people had already bookmarked that
site, and created links to it? Are we just going to break those links?

A less rude approach would be to recommend that domain labels
containing certain characters not be displayed. Their ACE forms could
still be displayed, and they could still be looked up.
The domain holder in this example could register a new displayable
domain name, and could put an HTTP redirector at the old site, and
existing bookmarks and links would continue to work.

Erik van der Poel <[EMAIL PROTECTED]> wrote:

> I believe it would be difficult to reach consensus on a relatively
> narrow extension of the LDH rule. Just for starters, the hyphen used
> to separate names and other strings in the Western world is not used
> in Japan for Katakana, because Katakana uses a middle dot (U+30FB) to
> separate 2 Katakana strings. In fact, this character is allowed in
> .jp.

But notice how seldom the hyphen-minus is actually used in domain
names. People prefer to just run words together, even in languages that
customarily use word breaks. Maybe the analogous characters in other
scripts (like the katakana middle dot) would likewise be very seldom
used in practice (especially in Japan where the lack of word breaks is
the norm), and would not be missed if they were deprecated.

> It may be possible to "tune" the tables, but nowhere in your email do
> I find any reference to the ACE prefix. I think that we should also
> figure out exactly which types of changes would absolutely require a
> new ACE prefix,

Coming up with the necessary and sufficient conditions will be tricky,
but now that you've got me thinking about it, I think I can supply one
sufficient condition: If the only changes you make are to add
characters to the prohibited table, I don't think you need to change
the ACE prefix. This would cause some valid IDN labels under the old
spec to become invalid under the new spec, and would cause some valid
ACE labels under the old spec to become bogo-ACE labels under the new
spec. (The bogo-ACE phenomenon already exists: there are labels that
begin with the ACE prefix but don't validate during ToUnicode and
therefore display as literal ASCII strings.) It would not cause
anything to encode or decode to something different than it used to.
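
To make the bogo-ACE fallback concrete, here is a minimal Python sketch
of a display rule along those lines. It is an illustration only, not
something the IDNA specs define: display_form and its hidden parameter
are hypothetical names, and the sketch leans on Python's built-in
"idna" codec (an implementation of the original IDNA ToUnicode with
Nameprep) instead of implementing ToUnicode itself. It assumes a single
label with no dots. A label that carries the ACE prefix but fails to
decode is shown as its literal ASCII string, and the same fallback is
applied to labels that decode to any character in a caller-supplied
set.

```python
ACE_PREFIX = "xn--"


def display_form(label: str, hidden: frozenset = frozenset()) -> str:
    """Return the string a user interface might show for one domain label.

    `hidden` is a hypothetical set of characters that should not be
    displayed; labels containing them are shown in ACE form instead.
    """
    # Labels without the ACE prefix are ordinary ASCII labels: show as-is.
    if not label.lower().startswith(ACE_PREFIX):
        return label
    try:
        # Python's "idna" codec decodes the Punycode part and rejects
        # labels that do not validate, raising a UnicodeError subclass.
        decoded = label.encode("ascii").decode("idna")
    except UnicodeError:
        # Bogo-ACE: has the prefix but does not validate, so it is
        # displayed as a literal ASCII string.
        return label
    # Assumption for this sketch: characters in `hidden` force ACE display.
    if any(ch in hidden for ch in decoded):
        return label
    return decoded


print(display_form("xn--bcher-kva"))                  # valid ACE label
print(display_form("xn--bcher-kv"))                   # truncated, bogo-ACE
print(display_form("xn--bcher-kva", frozenset("ü")))  # hidden character
```

With an empty hidden set this mirrors the fallback described above.
Treating the characters in hidden as newly prohibited would have the
same visible effect: the affected labels would display as literal
ASCII, and nothing would encode or decode differently.
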
But I don't advocate making such a change (see my argument above about
rudeness).

AMC
