--On Saturday, November 1, 2025 13:58 -0700 Rob Sayre
<[email protected]> wrote:
 
> On 11/1/25 1:18 PM, John C Klensin wrote:
> 
> [snip]
> 
>> 
>> The Cyrillic paypal example was chosen, not because it was a
>> realistic name but because it is extremely familiar to many of
>> those who might be reading this discussion and/or the final
>> document. However, and probably sadly, you have just made my point
>> (or three of them):
>> 
> Hi,
> 
> That's a real thing. I think the general term is "homograph attack".

Rob, despite the common usage of that term, going back to the paper
that first pointed it out in the "paypal" context [1], I, and many
others who have found themselves deeply involved in i18n issues, have
tended to avoid the term, especially in discussions that might be
heard or read by people who are not thoroughly familiar with the
issues and various subtle distinctions that have arisen around it.
No time to go into the details now and it would probably accomplish
little or nothing.  Maybe the two of us could work out a tutorial the
next time we both show up at an IETF meeting in person :-(.

> But it is a known and old problem:
> 
> https://www.mozilla.org/en-US/security/advisories/mfsa2013-61/
> 
> (2013).

Lots older than that.  See above and the reference below.
 
> The general recipe is usually not to allow mixed character ranges.

What we have learned over the years -- actually since the first time
people started pushing on the examples in the paper cited in [1], if
not earlier -- is that prohibiting mixed character ranges, or
characters from more than one script, in a string really does not
help much.  In
particular, there are enough similarities in character forms in the
collection of scripts derived from ancient Greek (contemporary Greek,
Latin, and Cyrillic, often referred to in the relevant contexts as
"GLC") that it is not hard to construct complete look-alike strings
from a single script (range).

In particular, while many of the early "paypal" discussions just
involved Cyrillic substitutions for "p" and/or "a", the Cyrillic
examples I used in the response to Martin from which you quoted did
not use characters from more than one script (or, unless it is
defined very narrowly, "range").  Specifically, as I spelled out in an
earlier message about 7997bis-05, that was "раураӏ", i.e.,
\u0440\u0430\u0443\u0440\u0430\u04CF.  All but the last of those
characters appear, IIR, in contemporary Russian.  The Palochka is
part of what Unicode calls "Extended Cyrillic", but it is, according
to a couple of sources I just checked, including the Unicode
standard, used in many Caucasian languages.  And it is part of the
main Cyrillic block (range?) in Unicode (U+0400 through U+04FF).
I've never written or spoken any of those languages, so am not sure,
but assume it could be a plausible, single-range, string in any of
them, even if not in Russian.
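
For anyone who wants to check that claim mechanically, here is a
minimal Python sketch.  It rests on assumptions of my own: Python's
standard library does not expose the Unicode Script property, so the
first word of each character name is used as a rough stand-in for
"script", and the helper names (rough_script, scripts_used,
in_main_cyrillic_block) are hypothetical, not taken from any
specification or library.

```python
import unicodedata

# The single-script look-alike string from the paragraph above.
SPOOF = "\u0440\u0430\u0443\u0440\u0430\u04CF"


def rough_script(ch):
    """Return the first word of the character name, e.g. 'CYRILLIC'.

    Assumption: this only approximates the Unicode Script property.
    """
    return unicodedata.name(ch).split()[0]


def scripts_used(s):
    """Return the set of rough script labels appearing in the string."""
    return {rough_script(ch) for ch in s}


def in_main_cyrillic_block(s):
    """True if every code point lies in U+0400 through U+04FF."""
    return all(0x0400 <= ord(ch) <= 0x04FF for ch in s)


def passes_no_mixed_script_rule(s):
    """The naive 'no mixed scripts' recipe: accept any one-script string."""
    return len(scripts_used(s)) == 1


if __name__ == "__main__":
    mixed = "p\u0430ypal"  # Latin letters with one Cyrillic 'a' swapped in
    for label, s in (("mixed", mixed), ("single-script", SPOOF)):
        print(label, sorted(scripts_used(s)), passes_no_mixed_script_rule(s))
```

Under those assumptions, the mixed Latin/Cyrillic string is rejected
by the rule, while the all-Cyrillic string passes it even though it is
not the ASCII string "paypal".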

> So, is there something useful we can say here?

Since you asked, I'm afraid so.  For the document, see my earlier
notes, some of Martin's comments, and a response to Jean I hope to
get finished and posted soon.  More generally, you've just provided
an example for the general point I've been trying to get across: when
one starts talking about "any Unicode character", or even "any
displayable Unicode character", it is fairly easy to do a bit of
research on one subtopic, dig down a layer or two into the issues,
and come up with a rule or two (e.g., in this case, "watch out for
homographs" and "avoid mixed-script strings").  Those rules are
can actually be quite useful but, at the same time, miss important cases.
In the i18n area, if there is any substitute for skilled and
knowledgeable human judgment, ideally from a team whose members have
different experience and perspectives and who listen to each other, I
don't think we have found it yet.  And the many analogues to the
example you provided above are the reason I'm reluctant to say things
that amount to "just trust the RPC to get it right".  That is not a
criticism of them in any way, at least until the unlikely day that
they claim omniscience, just recognition of the complexity and
diversity of the issues.

best,
 john

[1] The 2002 CACM paper by Gabrilovich and Gontmakher, "The Homograph
Attack".  See [CACM-Homograph] in RFC 8324 for details; I cite it to
show that this is not new to the community or to this discussion.
Aspects of the
problem were discussed in the IETF much earlier, certainly during
IDNA 2008 development and, IIR, even before that paper appeared.

-- 
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
