--On Thursday, 17 February, 2005 09:58 +0000 "Adam M. Costello" <[EMAIL PROTECTED]> wrote:
>...
>> ...assuming we can make the language tag available via some
>> dns tricks or some API...
>
> I don't see that happening.  The IDN working group decided
> quite deliberately that domain names would not contain any
> meta-info like language tags; they're just text strings.

Concur.  I also concur with the several folks who have pointed
out that, at browser time, language information won't do a whole
lot of good, at least without access to every per-domain table
for the characters that are assumed to be bound to a given
browser.

> Still, I expect that some not-terribly-complex heuristics,
> based only on the bare character strings, could go a long way
> toward exposing suspicious domain names.

I used to be convinced of this, but have become increasingly
skeptical about how far it would take us.  The easiest tests in
principle are for homogeneity of characters within a label (the
"one label, one script" test, more or less).  Those tests are
indeed fairly simple if the label contains characters that form
a contiguous block in Unicode that conforms, more or less, to
one script.  That requires a somewhat fuzzy definition of
"script", which is probably ok, but it also isn't a very good
test once one gets out of the low end of the BMP and the scripts
taken over, as blocks, from prior standards.  After that, the
tests get more complex, to the point that one can imagine
needing all of the Unicode script tables (assuming they are
adequate, which they probably are for this purpose) within the
application to make a good test.  A test based on those tables
wouldn't be terribly complex computationally, but the notion of
carrying those tables around in resource-limited devices, or
even in a browser whose footprint one was trying to minimize,
gets a little dicey.

But the important question, I think, is what attacks that would
protect us against.  Certainly, it would provide protection
against a name accidentally registered with an odd mix of
characters.
But I suggest that is a null set -- if a label is entered into
the DNS with a heterogeneous collection of characters, it is
because someone decided to do it and the registrar and registry
decided to permit them to do it.  That isn't an accident, that
is a deliberate set of decisions, for whatever reasons.

As the bad guys get more sophisticated, there are going to be
attacks that will be far harder to detect than either the paypal
or yah00 examples that have shown up on this list, many of them
probably involving mixtures of scripts neither of which is
Roman-based.

So, should the browsers (or other UI programs) take whatever
precautions and issue whatever warnings are reasonably feasible?
Sure.  Should we assume that will help very much against a
determined and sophisticated attacker?  Nope.  Can a lot more be
done at registration time than at lookup (or user inspection)
time?  Yes, certainly -- not only is the code base easier to
control and the "language" information available, as has been
pointed out, but taking some time on _those_ servers to look up
script tables, compare them to labels, and even apply variant or
other cross-checking rules if the domain considers that
appropriate would be perfectly rational.

I think we will find ourselves, a few years down the line, in a
situation in which users discover that names in some domains, by
virtue of tighter registration policies, will be safer to use
than others.  If that results in competitive pressures on the
more relaxed ones, so much the better.  But, in the general
case, it may be about as much as can be done.

    john
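
For concreteness, here is a rough sketch of the "one label, one
script" test described above, in Python.  It rests on a loud
assumption: Python's standard library does not expose the Unicode
Script property, so the sketch uses the first word of each
character's Unicode name (LATIN, CYRILLIC, GREEK, ...) as a crude
stand-in for its script.  The function names are hypothetical and
are not taken from any browser or registry.

```python
import unicodedata


def rough_script(ch):
    """Return a crude script tag for ch, or None for 'common' characters.

    ASSUMPTION: the first word of the Unicode character name is used
    as a proxy for the script. This is not the real Script property.
    """
    if ch.isdigit() or ch == "-":
        return None  # digits and hyphen are shared by all scripts
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def label_scripts(label):
    """Set of rough script tags that occur in one DNS label."""
    return {s for s in map(rough_script, label) if s is not None}


def is_single_script(label):
    """True if the label draws on at most one rough script."""
    return len(label_scripts(label)) <= 1


def suspicious_labels(domain):
    """Labels of a dotted name that fail the one-script test."""
    return [lab for lab in domain.split(".") if not is_single_script(lab)]


if __name__ == "__main__":
    spoof = "www.p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A
    print(suspicious_labels(spoof))
    print(suspicious_labels("www.yah00.com"))
```

The mixed Latin/Cyrillic "paypal" label is flagged, but "yah00"
passes because it is all Latin letters and digits, which shows
how little a homogeneity test catches on its own.  The name-based
proxy would also flag legitimate Japanese labels that mix
hiragana, katakana, and kanji, which is one reason a serious
check needs the real script tables and some per-domain policy.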
