Re: [idn] Re: Unicode and Security

D. J. Bernstein Thu, 07 Feb 2002 21:44:40 -0800

This discussion is being sent to two mailing lists, the IDN list and the
Unicode list. Some Unicode people don't know about IDNs, so I'll give a
brief introduction. At the end I'll address Klensin's comments.


IDNs are ``internationalized domain names.'' Imagine, for example, a
Greek pi in http://pi.cr.yp.to.

IETF created the IDN working group and the IDN mailing list---allegedly
open to the public, but the group chair sometimes censors objections---
to try to make IDNs work. The group chair has been pushing a specific
proposal, called IDNA; he has recently issued a Last Call for IDNA
despite objections to IDNA from many group participants.

There are three big issues surrounding IDNs:

   (1) How should characters be encoded?

       The obvious choice is UTF-8. Many existing programs already work
       with UTF-8 domain names. See http://pi.cr.yp.to for a discussion
       of what works and what has to be fixed.

       The IDNA proposal uses a new special-purpose 7-bit character
       encoding. The proponents claim that adding this new encoding to a
       huge number of programs will be less expensive than fixing the
       programs that have trouble with UTF-8.

   (2) Should two strings be treated separately as domain names if they
       are visually indistinguishable? For example, should someone be
       allowed to register aol.com, with o replaced by a Greek omicron?

       An answer of ``yes'' would introduce many new errors into common
       uses of domain names. Domain names are not hidden inside the
       computer; they are displayed for users, so that users can
       recognize known names. Visually indistinguishable domain names
       would make this function inherently unreliable.

       The IDNA proposal ignores this problem. The proponents observe
       that we're already faced with digit 0 and capital O, and digit 1
       and lowercase l; they then leap to the incredible conclusion that
       there's no harm in adding a bunch of new characters whose glyphs
       are similar or even identical.

       (On the other hand, IDNA prohibits all characters that look like
       hyphens, periods, etc.)

       The conservative approach, familiar in security contexts, is to
       have all new characters prohibited by default. The domain-name
       registries will then allow selected characters that have been
       carefully reviewed for problems.

   (3) Should two strings be treated identically as domain names if
       they are visually different but semantically similar? For
       example, should uppercase Pi.cr.yp.to, with a Greek Pi, be
       treated the same way as lowercase pi.cr.yp.to?

       The only existing semantic-similarity rule is that uppercase
       ASCII is treated the same way as lowercase ASCII. We're forced to
       continue treating uppercase ASCII the same way as lowercase ASCII
       for interoperability. But should there be more rules?

       There are several reasons to say no. First, new rules would have
       to be added to a huge number of programs that handle domain
       names. Second, new rules would make issue #2 substantially more
       difficult to solve: for example, if Greek lowercase alpha is
       treated the same way as uppercase Alpha, then alpha ol.com will
       conflict with aol.com, because Alpha OL.COM is visually identical
       to AOL.COM. Third, semantic similarity depends on the reader: for
       example, lowercase delta and uppercase Delta are semantically
       similar in Greece, but not in the United States. Fourth, semantic
       similarity is not transitive: for example, I see signs in France
       using capital E as a capitalization of e-accent-aigu, and signs
       using capital E as a capitalization of e without an accent, but
       omitting the accent from e-accent-aigu would be a misspelling.

       The IDNA proposal imposes a global set of uppercase-lowercase
       conversions. The proponents claim that, if we don't allow
       uppercase, we'll be flooded with complaints from users who tried
       typing domain names in uppercase. They don't respond when it is
       pointed out that users already handle case-sensitive lowercase
       URLs without trouble: blah.html works and BLAH.HTML doesn't.

       (It's funny how the IDNA proponents express such concern about
       the accuracy of users _typing_ domain names, but completely
       ignore the accuracy of users _reading_ domain names. See issue
       #2. Has it occurred to them that the domain names typed are, in
       almost all cases, copies of domain names previously read?)

       Meanwhile, the IDNA proposal ignores other questions of semantic
       similarity. There have already been a huge number of complaints
       about this from Chinese users. The IDNA proponents say that the
       complaints ``don't count.''

       The conservative---and cost-effective---approach is to start
       without any new rules, and have the registries prohibit
       characters that may be affected by new rules. Then new rules can
       be safely added later _if_ that turns out to be a good idea. For
       example, uppercase non-ASCII letters won't be treated the same
       way as lowercase, but they also won't be allowed in
       registrations.
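
To make the three issues concrete, here is a small Python sketch. It uses
only the standard library; the variable names are made up for
illustration, and aol.com and pi.cr.yp.to are simply the examples used
above. It is meant as an illustration of the code points involved, not as
a description of any proposal.

```python
import unicodedata

# Issue 1: a domain name containing a Greek pi, encoded as UTF-8.
pi_name = "\u03c0.cr.yp.to"
pi_utf8 = pi_name.encode("utf-8")      # pi becomes the two bytes CF 80

# Issue 2: aol.com with o replaced by a Greek omicron.
latin = "aol.com"
spoof = "a\u03bfl.com"
same_string = (latin == spoof)         # different code points

# Issue 3: uppercasing Greek alpha yields Greek Alpha, which many fonts
# draw exactly like Latin A, yet the two strings still differ.
greek_upper = "\u03b1ol.com".upper()
latin_upper = "aol.com".upper()

if __name__ == "__main__":
    print(pi_utf8)
    print(same_string, unicodedata.name(spoof[1]))
    print(greek_upper == latin_upper, unicodedata.name(greek_upper[0]))
```

The comparisons are made on code points, which is all a program sees; the
visual identity exists only on the screen.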

Now, back to the current thread. The ``Unicode and Security'' message
explained one way that visually indistinguishable characters can be
exploited by attackers. This isn't a flaw in Unicode; it is a flaw in
careless protocol designs such as IDNA.
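
The conservative approach from issue (2), where new characters are
prohibited by default and a registry allows only reviewed characters,
might be sketched as follows. The allow-list REVIEWED and the function
names are hypothetical, and the sketch assumes that ASCII letters have
already been lowercased before the check.

```python
import string

# Hypothetical allow-list: ASCII lowercase letters, digits, and hyphen.
# A registry would extend this set only after reviewing each character.
REVIEWED = frozenset(string.ascii_lowercase + string.digits + "-")

def registrable(label, allowed=REVIEWED):
    """Return True only if the label is nonempty and every character
    in it is on the allow-list."""
    return len(label) > 0 and all(ch in allowed for ch in label)

def registrable_name(name, allowed=REVIEWED):
    """Apply the default-deny check to each dot-separated label."""
    return all(registrable(label, allowed) for label in name.split("."))

if __name__ == "__main__":
    print(registrable_name("aol.com"))            # plain ASCII name
    print(registrable_name("a\u03bfl.com"))       # omicron not reviewed
    # Allowing Greek pi after review is a one-line policy change.
    print(registrable_name("\u03c0.cr.yp.to", REVIEWED | {"\u03c0"}))
```

Anything not on the list is rejected, so a character that turns out to be
troublesome never needs to be withdrawn after the fact.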

John C Klensin writes:
> This is _really_ old news, old enough that some companies have
> mail gateways set up to trap and reject outgoing mail that uses
> spoofed variations on the company's name.

Here Klensin is admitting that IDNA will let attackers breach security.
These gateways are---unless modified in the last few months by an IDNA
proponent to use visual-similarity tables that the IDNA proponents say
don't exist---unaware of IDNA, and therefore unable to stop the attack
described in the ``Unicode and Security'' message.
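
A rough sketch of the difference, in Python: a gateway that only knows a
list of ASCII spoofs, next to one that consults a visual-similarity
table. Everything here is hypothetical: the list KNOWN_SPOOFS, the
three-entry table LOOKALIKE, and the function names are invented for
illustration and do not describe any real gateway or any published table.

```python
# Hypothetical list of spoofed variations a pre-IDN gateway might carry.
KNOWN_SPOOFS = {"a0l.com"}               # digit zero in place of o

# Hypothetical, deliberately tiny visual-similarity table.
LOOKALIKE = {
    "\u03bf": "o",   # Greek small omicron
    "\u0430": "a",   # Cyrillic small a
    "\u0435": "e",   # Cyrillic small ie
}

def skeleton(name):
    """Replace each listed lookalike by the ASCII letter it resembles."""
    return "".join(LOOKALIKE.get(ch, ch) for ch in name.lower())

def old_gateway_rejects(name):
    """Gateway unaware of non-ASCII lookalikes: exact list lookup."""
    return name.lower() in KNOWN_SPOOFS

def table_gateway_rejects(name, protected):
    """Reject a name that differs from the protected name but maps to
    the same skeleton under the table."""
    return name != protected and skeleton(name) == skeleton(protected)

if __name__ == "__main__":
    spoof = "a\u03bfl.com"
    print(old_gateway_rejects(spoof))               # slips through
    print(table_gateway_rejects(spoof, "aol.com"))  # caught
```

The second gateway is only as good as its table, which is the point of
the objection above.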

> The solution to them, along with the rest of the large catalog of ways
> to spoof email, is signed and encrypted mail.

Speaking as a cryptographer: I find this ``cryptography solves all
security problems'' attitude to be astonishingly naive.

The problem here is not message forgery. The computer has accurately
identified the name of the sender. A cryptographic guarantee of
authenticity would do nothing to stop the attack.

The problem is in the design of the name system itself: how names are
assigned and used. The attacker was, as a matter of policy, allowed to
use a name visually indistinguishable from the target name. The name was
displayed on a computer screen. The display was read by the victim, and
used to authorize access.

One long-term solution is to drop all reliance on global name displays
in favor of local name displays (address books) defined entirely by the
user. In the short term, however, recipients will continue to use global
name displays to recognize known senders.
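
As a minimal sketch of what a local name display could look like, assume
the user keeps an address book mapping exact addresses to labels of the
user's own choosing. The function name, the dictionary layout, and the
"[unknown sender]" marker are all assumptions made for this example.

```python
def display_name(address, address_book):
    """Show the user's own label for a known address; flag all others."""
    label = address_book.get(address)    # exact code-point match only
    if label is not None:
        return label
    # Unknown address: show it in an ASCII-escaped form so that it
    # cannot pass for an entry the user created.
    escaped = address.encode("unicode_escape").decode("ascii")
    return "[unknown sender] " + escaped

if __name__ == "__main__":
    book = {"joe@aol.com": "Joe"}
    print(display_name("joe@aol.com", book))
    print(display_name("joe@a\u03bfl.com", book))
```

The lookalike address is not in the address book, so it is never shown
under the label the user trusts.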

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
