--On Saturday, November 1, 2025 13:58 -0700 Rob Sayre
<[email protected]> wrote:

> On 11/1/25 1:18 PM, John C Klensin wrote:
>
> [snip]
>
>>
>> The Cyrillic paypal example was chosen, not because it was a
>> realistic name but because it is extremely familiar to many of
>> those who might be reading this discussion and/or the final
>> document.  However, and probably sadly, you have just made my point
>> (or three of them):
>>
> Hi,
>
> That's a real thing. I think the general term is "homograph attack".

Rob, despite the common usage of that term, going back to the paper
that first pointed it out in the "paypal" context [1], I, and many
others who have found themselves deeply involved in i18n issues, have
tended to avoid the term, especially in discussions that might be
heard or read by people who are not thoroughly familiar with the
issues and the various subtle distinctions that have arisen around it.
No time to go into the details now and it would probably accomplish
little or nothing.  Maybe the two of us could work out a tutorial the
next time we both show up at an IETF meeting in person :-(.

> But it is a known and old problem:
>
> https://www.mozilla.org/en-US/security/advisories/mfsa2013-61/
>
> (2013).

Lots older than that.  See above and the reference below.

> The general recipe is usually not to allow mixed character ranges.

What we have learned over the years -- actually from the first time
people started pushing on the examples in the paper cited in [1], if
not earlier -- is that prohibiting mixed character ranges, or
characters from more than one script, in a string really does not help
much.  In particular, there are enough similarities in character forms
in the collection of scripts derived from ancient Greek (contemporary
Greek, Latin, and Cyrillic, often referred to in the relevant contexts
as "GLC") that it is not hard to construct complete strings from a
single script (range) that look like strings from another.  While many
of the early "paypal" discussions just involved Cyrillic substitutions
for "p" and/or "a", the Cyrillic example I used in the response to
Martin from which you quoted did not use characters from more than one
script (or, unless it is defined very narrowly, "range").
Specifically, as I spelled out in an earlier message about 7997bis-05,
that was "раураӏ", i.e., \u0440\u0430\u0443\u0440\u0430\u04CF.  All but
the last of those characters appear, IIR, in contemporary Russian.

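To make the single-script point concrete, here is a minimal Python
sketch.  It is an illustration only, not a check taken from 7997bis or
from any deployed system.  The helper names (script_words,
passes_single_script_rule) are hypothetical, and because the Python
standard library does not expose the Unicode Script property, the
sketch assumes that the first word of each character's Unicode name
("CYRILLIC", "LATIN") is an adequate stand-in for its script.

```python
import unicodedata


def script_words(text):
    """Return the set of first words of each character's Unicode name.

    Assumption: the first word of the name approximates the script.
    This is a rough stand-in for the real Script property.
    """
    return {unicodedata.name(ch).split()[0] for ch in text}


def passes_single_script_rule(text):
    """Naive "no mixed scripts" rule: accept if only one script word appears."""
    return len(script_words(text)) == 1


# The all-Cyrillic string from the message, and the Latin string it imitates.
spoof = "\u0440\u0430\u0443\u0440\u0430\u04CF"
latin = "paypal"

# An early-style mixed string: Cyrillic "р" and "а" substituted into Latin text.
mixed = "\u0440\u0430ypal"

print(script_words(spoof))               # {'CYRILLIC'}
print(passes_single_script_rule(spoof))  # True: the rule does not catch it
print(passes_single_script_rule(mixed))  # False: only the mixed form is caught
print(spoof == latin)                    # False: different code points

# Every character also sits in the main Cyrillic block, U+0400..U+04FF.
print(all(0x0400 <= ord(ch) <= 0x04FF for ch in spoof))  # True
```

Under those assumptions, the mixed-script rule rejects the older
substitution-style string but accepts the all-Cyrillic one, which is
the case described above.
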
The Palochka is part of what Unicode calls "Extended Cyrillic", but it
is, according to a couple of sources I just checked, including the
Unicode standard, used in many Caucasian languages.  And it is part of
the main Cyrillic block (range?) in Unicode (U+0400 through U+04FF).
I've never written or spoken any of those languages, so I am not sure,
but I assume it could be a plausible single-range string in any of
them, even if not in Russian.

> So, is there something useful we can say here?

Since you asked, I'm afraid so.  For the document, see my earlier
notes, some of Martin's comments, and a response to Jean I hope to get
finished and posted soon.

More generally, you've just provided an example for the general point
I've been trying to get across: when one starts talking about "any
Unicode character", or even "any displayable Unicode character", it is
fairly easy to do a bit of research on one subtopic, dig down a layer
or two into the issues, and come up with a rule or two (e.g., in this
case, "watch out for homographs" and "avoid mixed-script strings").
Those rules can actually be quite useful but, at the same time, miss
important cases.  In the i18n area, if there is any substitute for
skilled and knowledgeable human judgment, ideally from a team whose
members have different experience and perspectives and who listen to
each other, I don't think we have found it yet.  And the many
analogues to the example you provided above are the reason I'm
reluctant to say things that amount to "just trust the RPC to get it
right".  That is not a criticism of them in any way, at least until
the unlikely day that they claim omniscience, just recognition of the
complexity and diversity of the issues.

best,
   john

[1] The 2002 Gabrilovich and Gontmakher CACM paper "The Homograph
Attack".  See [CACM-Homograph] in RFC 8324 for details and to show
this is not new to the community and discussion.
Aspects of the problem were discussed in the IETF much earlier,
certainly during IDNA 2008 development and, IIR, even before that
paper appeared.

--
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
