This discussion is being sent to two mailing lists, the IDN list and the Unicode list. Some Unicode people don't know about IDNs, so I'll give a brief introduction. At the end I'll address Klensin's comments.
IDNs are ``internationalized domain names.'' Imagine, for example, a Greek pi in http://pi.cr.yp.to. IETF created the IDN working group and the IDN mailing list---allegedly open to the public, but the group chair sometimes censors objections---to try to make IDNs work. The group chair has been pushing a specific proposal, called IDNA; he has recently issued a Last Call for IDNA despite objections to IDNA from many group participants.

There are three big issues surrounding IDNs:

(1) How should characters be encoded? The obvious choice is UTF-8. Many existing programs already work with UTF-8 domain names. See http://pi.cr.yp.to for a discussion of what works and what has to be fixed.

The IDNA proposal uses a new special-purpose 7-bit character encoding. The proponents claim that adding this new encoding to a huge number of programs will be less expensive than fixing the programs that have trouble with UTF-8.

(2) Should two strings be treated separately as domain names if they are visually indistinguishable? For example, should someone be allowed to register aol.com, with o replaced by a Greek omicron? An answer of ``yes'' would introduce many new errors into common uses of domain names. Domain names are not hidden inside the computer; they are displayed for users, so that users can recognize known names. Visually indistinguishable domain names would make this function inherently unreliable.

The IDNA proposal ignores this problem. The proponents observe that we're already faced with digit 0 and capital O, and digit 1 and lowercase l; they then leap to the incredible conclusion that there's no harm in adding a bunch of new characters whose glyphs are similar or even identical. (On the other hand, IDNA prohibits all characters that look like hyphens, periods, etc.)

The conservative approach, familiar in security contexts, is to have all new characters prohibited by default.
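
To make the default-deny idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `REVIEWED` set and the function names are hypothetical, invented for this example, and do not describe any registry's actual software or policy.

```python
import unicodedata

# Hypothetical registry policy: every character is prohibited unless it
# appears in this reviewed set. The set is an assumption for illustration.
REVIEWED = set("abcdefghijklmnopqrstuvwxyz0123456789-")

def rejected_characters(label):
    """Return the characters in a label that have not been reviewed."""
    return [ch for ch in label if ch not in REVIEWED]

def may_register(label):
    """Default-deny: accept a label only if no character is rejected."""
    return len(label) > 0 and not rejected_characters(label)

latin = "aol"
spoof = "a\u03bfl"  # the middle character is a Greek omicron, not Latin o

print(latin == spoof)        # the two strings differ as data
print(may_register(latin))   # every character has been reviewed
print(may_register(spoof))   # the omicron has not been reviewed
for ch in rejected_characters(spoof):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under this sketch a new character becomes registrable only when someone deliberately adds it to the reviewed set; nothing slips in merely because nobody thought to forbid it.
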
The domain-name registries will then allow selected characters that have been carefully reviewed for problems.

(3) Should two strings be treated identically as domain names if they are visually different but semantically similar? For example, should uppercase Pi.cr.yp.to, with a Greek Pi, be treated the same way as lowercase pi.cr.yp.to?

The only existing semantic-similarity rule is that uppercase ASCII is treated the same way as lowercase ASCII. We're forced to continue treating uppercase ASCII the same way as lowercase ASCII for interoperability. But should there be more rules?

There are several reasons to say no. First, new rules would have to be added to a huge number of programs that handle domain names. Second, new rules would make issue #2 substantially more difficult to solve: for example, if Greek lowercase alpha is treated the same way as uppercase Alpha, then alpha ol.com will conflict with aol.com, because Alpha OL.COM is visually identical to AOL.COM. Third, semantic similarity depends on the reader: for example, lowercase delta and uppercase Delta are semantically similar in Greece, but not in the United States. Fourth, semantic similarity is not transitive: for example, I see signs in France using capital E as a capitalization of e-accent-aigu, and signs using capital E as a capitalization of e without an accent, but omitting the accent from e-accent-aigu would be a misspelling.

The IDNA proposal imposes a global set of uppercase-lowercase conversions. The proponents claim that, if we don't allow uppercase, we'll be flooded with complaints from users who tried typing domain names in uppercase. They don't respond when it is pointed out that users already handle case-sensitive lowercase URLs without trouble: blah.html works and BLAH.HTML doesn't.

(It's funny how the IDNA proponents express such concern about the accuracy of users _typing_ domain names, but completely ignore the accuracy of users _reading_ domain names. See issue #2.
Has it occurred to them that the domain names typed are, in almost all cases, copies of domain names previously read?)

Meanwhile, the IDNA proposal ignores other questions of semantic similarity. There have already been a huge number of complaints about this from Chinese users. The IDNA proponents say that the complaints ``don't count.''

The conservative---and cost-effective---approach is to start without any new rules, and have the registries prohibit characters that may be affected by new rules. Then new rules can be safely added later _if_ that turns out to be a good idea. For example, uppercase non-ASCII letters won't be treated the same way as lowercase, but they also won't be allowed in registrations.

Now, back to the current thread. The ``Unicode and Security'' message explained one way that visually indistinguishable characters can be exploited by attackers. This isn't a flaw in Unicode; it is a flaw in careless protocol designs such as IDNA.

John C Klensin writes:
> This is _really_ old news, old enough that some companies have
> mail gateways set up to trap and reject outgoing mail that uses
> spoofed variations on the company's name.

Here Klensin is admitting that IDNA will let attackers breach security. These gateways are---unless modified in the last few months by an IDNA proponent to use visual-similarity tables that the IDNA proponents say don't exist---unaware of IDNA, and therefore unable to stop the attack described in the ``Unicode and Security'' message.

> The solution to them, along with the rest of the large catalog of ways
> to spoof email, is signed and encrypted mail.

Speaking as a cryptographer: I find this ``cryptography solves all security problems'' attitude to be astonishingly naive.

The problem here is not message forgery. The computer has accurately identified the name of the sender. A cryptographic guarantee of authenticity would do nothing to stop the attack.
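
The gap between what the computer compares and what the reader sees can be sketched in a few lines of Python. This is a toy illustration, not a real gateway: the `LOOKALIKES` table is a hypothetical three-entry visual-similarity table made up for the example, and the function names are likewise assumptions.

```python
# Hypothetical, deliberately tiny visual-similarity table. A real table
# would need careful review; these entries are only for illustration.
LOOKALIKES = {
    "\u03bf": "o",  # Greek small omicron
    "\u039f": "O",  # Greek capital Omicron
    "\u0391": "A",  # Greek capital Alpha
}

def skeleton(name):
    """Map each character to the character it is assumed to look like."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in name)

def same_name(sender, known):
    """What the computer compares: the exact sequence of characters."""
    return sender == known

def looks_like(sender, known):
    """What the reader sees: a different name with the same appearance."""
    return sender != known and skeleton(sender) == skeleton(known)

known = "aol.com"
sender = "a\u03bfl.com"  # Greek omicron in place of Latin o

print(same_name(sender, known))   # the names are different strings
print(looks_like(sender, known))  # yet, under the table, they look alike
```

Authentication operates at the level of `same_name`: it can confirm that the message really comes from the second name. Only a check at the level of `looks_like`, or a registration policy that never admits the second name, addresses what the reader actually sees.
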
The problem is in the design of the name system itself: how names are assigned and used.

The attacker was, as a matter of policy, allowed to use a name visually indistinguishable from the target name. The name was displayed on a computer screen. The display was read by the victim, and used to authorize access.

One long-term solution is to drop all reliance on global name displays in favor of local name displays (address books) defined entirely by the user. In the short term, however, recipients will continue to use global name displays to recognize known senders.

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
