RE: [idn] homograph attacks

John C Klensin Wed, 16 Feb 2005 09:23:07 -0800

Responding to several messages together, rather than sending a
series of fragmented messages...

Folks, I'm pleased to see so much interest on this list, but
note that the issue has already been extensively discussed on
the IETF list and elsewhere.  I'm copying Tina Dam, who has the
IDN lead for ICANN, on this message; if there is further
traffic, you might want to include her as well (or, for the
reasons discussed below, she might want to move part of the
conversation to an appropriate ICANN list).

--On Wednesday, 16 February, 2005 15:09 +0100 "JFC (Jefsey)
Morfin" <[EMAIL PROTECTED]> wrote:

>...
> Thank you for your response. Let us assume this list is the
> missing International coordination list.

While I think that an "International coordination list" would be
helpful, my assumption is that ICANN's idn-discuss list (see
http://www.icann.org/topics/idn.html for subscription
information and some other information that might be useful) is
intended to serve that purpose.  It is pretty clearly not this,
or any other, IETF list: the IETF made a series of fairly
explicit decisions to not get into the business of deciding what
things could or could not be registered within the very broad
coding mechanisms specified as part of IDNA.

>  As you know we are
> in a gray situation regarding the IANA and language tags.
> There is the RFC 3066 which accepts ISO 639 as a reference and
> which permits the language "specialists" (ISO and W3C) of the
> http://www.alvestrand.no/mailman/listinfo/ietf-languages
> mailing list to discuss missing languages. This is clean.
> 
> But then we have a confusion.
> 
> 1. When you register a IANA tag, you are to register it by
> language, script and ccTLD. This is clear.

Actually, that is not the ICANN/IANA requirement, nor is it
clear where their requirement applies (e.g., such registrations
are optional).  Their requirement is unclear and, IMO, needs
updating.  I hope that they will get to that updating process
RSN.  On the other hand, I have had that hope for well over a
year.  Progress is unlikely to be made on those subjects by
further discussion here.

>...
> 1. internationalisation: this is the IDN and the ccTLD (quoted
> in the language/script/ccTLD sequence is the authority). So
>...

Jefsey, as you know, we do not agree on your analysis of
language and cultural issues and how they relate to the
Internet's protocol suites or the DNS and agree even less with
your model of how the DNS works and its role in the world.  We
also disagree on what appears to be your desire --exhibited by
citing OPES as an Internet norm-- to move all sorts of issues
from the edges of the network to the center.  For me, one of the
major strengths of the Internet, and a key reason it has been
deployed worldwide and achieved as much penetration as it has,
is that it strongly resembles an "edge"-based network and has
avoided the many traps associated with a design in which
significant functionality, especially significant applications
functionality, is located in between peers or clients and
servers.  I won't reprise the rest of my side of that
disagreement here.

In a separate note, at Tuesday, February 15, 2005 7:44 PM, you
wrote, in part...

>> Should it not be supported on the IANA server and common to
>> all the gTLDs?

I agree with Pat that commonality --among gTLDs and even more
generally -- would be a good thing, especially for the would-be
registrant who wishes to register the same name in multiple
domains.  But ICANN has so far chosen not to try to impose
it, and I hope they don't.  We have so far discovered at least
two things that may argue for caution in this area:

        (i) As I trust everyone on this list knows, phishing is
        only part of a far more general set of issues that can
        cause end-user confusion (whether through accident or
        malice).  For any given domain, there is a tradeoff
        between maximum safety (which might require permitting
        only a small number of characters and imposing
        significant restrictions on how they are used) and
        maximum registration flexibility (which might argue for
        much more flexible rules).  In my personal opinion, at
        least, it is far better for the Internet to let domains
        compete on how protective or flexible they want to be,
        assuming the advantages and risks of whatever solution
        point they pick, than to try to impose some Procrustean
        solution.

        (ii) An interesting distinction has been identified
        between the needs of a domain that must serve the
        requirements of a particular country and a domain that
        supports the language commonly associated with that
        country.  For the first case, of which .DE is the
        best-worked-out example, there is a legitimate
        requirement for registration of common names, company
        names, street names, etc., in Germany.  Given history,
        that list will include strings and characters that don't
        exist in the German language.  It may include strings
        that contain combinations of characters that do not appear
        together in any contemporary language that uses
        Roman-based characters.   By contrast, if a gTLD creates
        a language table defined around the German language,
        many of the characters needed by .DE are simply invalid.
        That contrast (which Martin has identified in the form
        of the difference between the "German" tables used in
        the TLDs for Austria and Switzerland relative to those
        used in Germany) may require taking a different look at
        the rules and guidelines (and table registration models)
        than we have heretofore taken: either for rather
        different guidelines for ccTLDs than for gTLDs, or for
        rethinking the registration model, or both.

        The issue that Pat identifies with Tajik is another
        piece of the same puzzle: many of us may believe that
        there is no possible reason to mix the three scripts in
        which that language can be written in a single label,
        and I certainly trust Roozbeh's knowledge and experience
        in that area. Certainly, it would make things safer to
        prohibit any mixing (note that IDNA's BIDI restrictions
        essentially prohibit mixing an Arabic-derived script
        with anything other than itself, another Arabic-derived
        script, or Hebrew).   However, we have a long history of
        DNS labels that could not possibly be words in any
        language.  Whether or not to permit mixed-script labels
        is presumably an issue that the registry for .TJ will
        need to sort out (I have been told, for example, that
        mixed Cyrillic and Latin-character labels are likely to
        be a requirement in Serbia and Montenegro, although this
        illustration might give them pause).  And the best
        answer for them might or might not be the best answer
        for a gTLD.

In addition, as Hotta-san's very helpful note points out, one
could considerably reduce the scope of the identified
confusion/phishing problems by aggressively applying a variant
model across scripts, restricting the registration of homographs
to the same registrant.  I personally suspect that will not
prove practical, from a policy standpoint, in the collection of
alphabetic scripts that share Old Semitic origins, but that is,
IMO, just another argument for giving different registries the
flexibility to develop their own policies and take
responsibility for the consequences of those policies.
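
The variant idea above can be illustrated with a minimal sketch.  It
assumes a tiny, hypothetical confusables table (a real registry would
need a far larger, carefully reviewed one) and a hypothetical Registry
class; neither comes from IDNA or from any registry's actual policy.
Labels that reduce to the same "skeleton" are treated as variants and
may only be held by the same registrant.

```python
# Hypothetical, deliberately tiny table: lower-case Cyrillic letters that
# render like Latin letters in many fonts. Illustration only.
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(label: str) -> str:
    """Lower-case the label and replace each listed look-alike with its Latin twin."""
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in label.lower())


class Registry:
    """Hypothetical registry that reserves all variants of a label for one registrant."""

    def __init__(self) -> None:
        self._owner_by_skeleton: dict[str, str] = {}

    def register(self, label: str, registrant: str) -> bool:
        """Return True if accepted, False if a variant belongs to someone else."""
        owner = self._owner_by_skeleton.setdefault(skeleton(label), registrant)
        return owner == registrant


registry = Registry()
print(registry.register("pay", "alice"))                   # all-Latin label
print(registry.register("\u0440\u0430\u0443", "mallory"))  # all-Cyrillic look-alike
print(registry.register("\u0440\u0430\u0443", "alice"))    # same registrant
```

Keying on the skeleton rather than on the label itself is what makes
the restriction apply across scripts; the policy difficulty noted above
lies in agreeing on the table, not in the lookup.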

Again, these issues need to be worked out in an ICANN forum; the
IETF has thrown the problem over the wall and shows no signs of
wanting it back.

--On Wednesday, 16 February, 2005 08:07 +0100 "\"Martin v.
Löwis\"" <[EMAIL PROTECTED]> wrote, responding to Soobok Lee:

>> All Cyrillic  label  "HP" (.com)  can be registered even in
>> Russian  language pack.
>> 
>> Cyrillic "HP".COM  in its uppercase form  looks the same as
>> all ASCII   "HP.COM".
>> 
>> Any Registration Process should filter out these "HP" like
>> combinations..

But the only way to do that would require that a domain that
permits Cyrillic characters must ban ASCII characters and vice
versa.   I would predict that will just never happen, if only
because every domain that exists today has a long list of all-ASCII
labels in it.  It is not a very good example (see below), but I
note that this particular example is only one of the reasons why
"identify a mixed-script label in the application" may be a
useful tool, but is not a solution -- this is not a mixed-script
label.
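
To make the last point concrete, here is a rough sketch of a
mixed-script check.  It assumes, as a shortcut, that the first word of
a character's Unicode name ("LATIN", "CYRILLIC", "GREEK") is a good
enough stand-in for its script; the Python standard library does not
expose the real Unicode Script property, so the helper names and the
heuristic are illustrative only.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Approximate the scripts in a label from the first word of each letter's name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def is_mixed_script(label: str) -> bool:
    """True if the letters of the label come from more than one script."""
    return len(scripts_in(label)) > 1


latin_hp = "HP"                 # LATIN CAPITAL LETTER H, LATIN CAPITAL LETTER P
cyrillic_hp = "\u041d\u0420"    # CYRILLIC CAPITAL LETTER EN, CYRILLIC CAPITAL LETTER ER
mixed_hp = "H\u0420"            # Latin H followed by Cyrillic ER

for label in (latin_hp, cyrillic_hp, mixed_hp):
    print(sorted(scripts_in(label)), is_mixed_script(label))
```

The all-Cyrillic label passes the check because every letter in it
belongs to a single script, even though it renders like the ASCII one.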

> I think this is unreasonable. The lower-case forms ("нр" vs
> "hp")
> look quite differently, and browsers typically display domain
> names in all-lower in the address bar.

Regardless of what browsers do (there is a case to be made that,
for traditional labels, if they get an all-upper-case label back
from a DNS query and display it in lower case, they are
violating the intent of the spec), note that, for IDNs, mapping
through IDNA and back (ToUnicode followed by ToASCII) will
always result in lower case and application of several other
mappings.
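
The case mapping can be observed with Python's built-in "idna" codec,
which implements the IDNA of RFC 3490.  This is only a sketch: the
to_ascii and to_unicode wrappers are names chosen here for
illustration, and the codec stands in for whatever an application
actually uses.

```python
def to_ascii(name: str) -> bytes:
    """Apply ToASCII to each label of a domain name (RFC 3490 via the stdlib codec)."""
    return name.encode("idna")


def to_unicode(ace: bytes) -> str:
    """Apply ToUnicode to each label of an ACE-encoded domain name."""
    return ace.decode("idna")


upper = "\u041d\u0420.com"   # Cyrillic capital EN + ER, renders like "HP"
lower = "\u043d\u0440.com"   # the corresponding lower-case Cyrillic letters

ace = to_ascii(upper)
print(ace)                               # an "xn--" label followed by ".com"
print(to_ascii(lower) == ace)            # both cases map to the same ACE form
print(to_unicode(ace) == lower)          # the round trip comes back lower-cased
```

Because Nameprep folds case before the Punycode step, the upper-case
form never survives the round trip for a non-ASCII label.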

>...
> just because they are homograph with a latin combination. Then,
> the same would apply to Greek vs. Cyrillic.

Yes, banning a mixture of Latin characters with those of any
other script won't work either because one can get homographs
among other pairs of scripts.  However, it might be realistic to
ban the combination of Greek and Cyrillic in the same zone,
while it would not be practical to ban the presence of
Roman-based characters with either.  Mostly, again, this points
out the importance of zone-specific policies that are
well-tailored to the needs of that particular zone.
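
A zone-specific policy of that kind might be sketched as follows.  The
ZonePolicy class, its fields, and the two example zones are
hypothetical, and the script detection uses the same rough
Unicode-name heuristic rather than the real Script property.

```python
import unicodedata
from dataclasses import dataclass


def scripts_in(label: str) -> frozenset[str]:
    """Approximate the scripts in a label from the first word of each letter's name."""
    return frozenset(unicodedata.name(ch).split()[0] for ch in label if ch.isalpha())


@dataclass(frozen=True)
class ZonePolicy:
    """Hypothetical per-zone rule: which scripts may appear, and whether they may mix."""

    allowed_scripts: frozenset[str]
    allow_mixed_labels: bool = False

    def accepts(self, label: str) -> bool:
        scripts = scripts_in(label)
        if not scripts <= self.allowed_scripts:
            return False  # the label uses a script this zone does not permit
        return self.allow_mixed_labels or len(scripts) <= 1


# Two illustrative zones: each keeps Roman-based characters, but neither
# admits both Greek and Cyrillic.
cyrillic_zone = ZonePolicy(frozenset({"LATIN", "CYRILLIC"}))
greek_zone = ZonePolicy(frozenset({"LATIN", "GREEK"}))

greek_label = "\u03b1\u03b2"        # GREEK SMALL LETTER ALPHA, BETA
cyrillic_label = "\u043d\u0440"     # CYRILLIC SMALL LETTER EN, ER

print(cyrillic_zone.accepts(cyrillic_label), cyrillic_zone.accepts(greek_label))
print(greek_zone.accepts(greek_label), greek_zone.accepts(cyrillic_label))
```

Each zone makes its own tradeoff; nothing in the sketch prevents the
whole-script look-alikes discussed earlier within a single zone.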

To repeat what I have said on the IETF list and elsewhere,
nothing is going to make these issues 100% foolproof or easy.  A
number of tools may help.  Certainly carefully-designed
restrictions on what can be registered in a particular zone and
what characters can be used together in the same label will
help.  Intelligent and well-thought-out use of variant models
may help a good deal.  If labels that can be identified as
having no use other than to confuse or defraud can be rejected
at registration time, that would eliminate a lot of problems
down the line -- that may be practical in some cases but not in
others.  I'm a bit skeptical about identification of
mixed-script labels in applications, not because I think it
wouldn't be useful, but because carrying those tables around and
keeping them up to date could be a bar to implementation and
performance, but there may be ways to make it practical.  I
think users --both those whose preferred scripts are written in
Roman-based characters and those whose preferred scripts are
not--  are going to need to become educated about some of these
issues, not just to protect themselves but to understand when
IDNs or IRIs can be exchanged with others with high odds that
they will be usable and unambiguous.   I think there is a lot of
potential in distinctions like the Firefox one between "copy
link location" / "paste link" and "copy"/"paste" and that we may
discover that the former pair should convert to punycode and
URIs and back to UTF-8 (or whatever) and IRIs to prevent
inter-application and inter-system cut and paste problems.

And I hope we can all figure out a way to work together to make
this work.  It is important; "don't use IDNs" isn't an answer
now and never has been, and the alternatives are just a choice
among ways to fragment the Internet.

     john
