--On Monday, November 3, 2025 16:03 +0900 "Martin J. Dürst"
<[email protected]> wrote:

> Hello John, others,
> 
> [replying to your earlier mail, but having read the follow up
> answer to Rob about 5h later]

Martin, 

I'm going to give a response that is much more detailed than you need
because much of my goal in these notes (despite a stupid error that
you caught, see below) is in the hope of educating and explaining
some of the subtle problems in the i18n space to those who don't
know.  That part of this message and others is not directed at you:
I'd be astonished if there were any part of it you don't know already.

> The first part of this mail is about homographs. The term has been
> used since 2002. If you know a better term, please feel free to
> propose it. I just feel it's better to have a word when talking
> about something. I don't see any danger of confusion on this
> mailing list.

Ok.  Maybe it is useful for me to explain at least part of why I
avoided it.  As you know, there is a great deal of specialized
terminology in the i18n space and the Unicode space in particular.
Drawing a couple of examples from this week, if you say "bidi" or
"homograph", Pete and I (as well as many others) immediately know
what you are talking about and know to some precision.  However, I
don't believe that everyone reading this list (or, for the future,
everyone reading whatever documents will emerge from this I-D or
relevant Style Guide sections) will necessarily have the same
background and consequent vocabulary.  So I chose to try to use
terminology in my attempted explanations that would be generally
understandable to everyone, even if it is somewhat less precise.  If
we need a term in the final document or the Style Guide, I'd advocate
using the precise i18n terminology, "homograph" included, but being
careful to supply a definition or reference to one for non-expert
readers.  If I were writing a document specifically addressed to an
i18n audience, I would just use the term and not bother with the
definition.

Maybe I should have introduced the term and defined it earlier in the
discussion as Rob partially did in a recent note.   Matter of
judgment.  I didn't expect this set of threads to go on nearly this
long or into this much depth.  That prediction was obviously a bad
guess on my part.

Does that make any sense?
 
> I wasn't worried about homographs, but it took me some time to find
> the main reason why. The main reason is that for names in RFCs,
> they don't constitute an attack surface.

But they might, because it is not hard to imagine misquotations and
misattributions whose authors/perpetrators try to avoid
responsibility by using look-alike names (names containing
homographs).  Since others, and I think you, have commented on the
importance of copy and paste as a tool, what is copied should also be
what was intended, rather than some look-alike from mistranscription
(maybe especially important if we are going down the recording path
or speech-to-text for other reasons).  We should also remember that
names -- personal, company, etc. -- do not appear in RFCs only in
"author" sections, examples, and discussion text.  They also appear
in references.  If those references are to documents published in
languages very different from English, written by authors who used
non-Latin-script names, many of the same issues we have been
discussing arise whether we decide to deal with them or not.

And that, and your next paragraphs, are another reason I didn't use
the "homograph" terminology in my earlier notes.  The term is almost
always used, as you essentially illustrate below, in the context of
attacks.
I might be more worried about such attacks in this context, but not
much if at all.  I am, however, concerned about look-alike issues
that might be the result of accidental, careless, or uninformed
errors.  Those are not "attacks" in any usual sense, but they can
still be problematic (or at least confusing for readers and a problem
for copy-and-paste operations).
 
> [I'm not a security expert, so maybe 'attack surface' is the wrong
> word; if you have a better one, please tell me.]
> 
> The reason is simple: If we have both paypal.com and
> раураӏ.com (the first all Latin, the second all Cyrillic),
> there is a clear danger of spoofing. But if we have an author named
> 'P. Paypal' and another author named P. Раураӏ (R. Raura')
> (*) (again the first all Latin, the second all Cyrillic), that's
> not worse than having two authors named John Smith.
> 
> [(*) I used "Raupai" as a transcription in an earlier mail, which
> was wrong because the second Cyrillic 'р' also corresponds to a
> Latin 'r' and because the last letter (which I see as an upper case
> 'I', but is actually lower case and could look more like a 'l' or
> '1') doesn't stand for an 'i'.]
> 
> Having two authors named John Smith isn't an ideal situation, but
> I'm sure the authors and the RPC would be able to handle this, if
> it ever comes up.

In case it isn't clear from my comments above, I completely agree
with all of that.

> What's more important to understand here is that there is no
> incentive for anybody to falsely pretend to be a second John Smith
> just to get a leg up on the first John Smith or somebody else. Why
> would somebody want to write an RFC as John Smith when their real
> name is Bill Miller?
> 
> Somebody may want to create a website offering consulting,
> pretending to be John Smith and having written that RFC, but that's
> already possible today.

Indeed.  And that is closer to my own concern about an "attack".  As
you say, the "John Smith" problem definitely exists, as does, in the
DNS context and in many discussions when ICANN was being put
together, the "Joe's Pizza" problem.   There is, however, a question
of whether we want to risk making that problem worse by introducing
more of a homograph problem than already exists in Basic Latin
script.  When,
for example, IDNs were being designed, there were quite explicit
decisions that the answer should be "no" despite understanding of
those existing risks.  And very little of that has to do with the RFC
problem, as you point out: again, I'm much more concerned about
accidents and misunderstandings than about malicious attacks.

> According to my limited knowledge of DNS,
> e.g. klensin.com is still up for grabs. smith.com isn't, but of
> course not because of an RFC author named Smith.
 
> So while homograph attacks are a thing to be careful about in the
> DNS, and therefore in the URI handling code of browsers, they are
> not something to worry about in RFC names.

See above.


> [more below]
> 
> On 2025-11-02 05:18, John C Klensin wrote:
> 
>> --On Friday, October 31, 2025 17:26 +0900 "Martin J. Dürst"
>> <[email protected]> wrote:
> 
>>>> 3. Whether the policy is aimed "for the reader" or "for the
>>>> author":  Consensus seems to me that the doc should say something
>>>> about authors.  Some explicit support from Brian's suggestion in
>>>> <https://mailarchive.ietf.org/arch/msg/rswg/zF2-lBMYYDPj-igQivMo3O5ivWo>.
>>>> Might  also want something saying, "The RPC style guide will
>>>> define which  characters authors may use and how."
>>> 
>>> As long as the style guide is an RFC, or something with similar
>>> change rate, I think this is way too inflexible. We already have
>>> successful use of non-ASCII characters at least in the Latin
>>> script that were used without any explicit guidance.
>> 
>> I think the plan is to make the Style Guide more of a web page or
>> set of them, rather than publishing it as an RFC.  I hope the RPC
>> (or Style Guide approval mechanism if something else) will
>> recognize that too-frequent changes can cause general confusion
>> and harm to authors but, otherwise, I agree.  More about this
>> below.
> 
> I think we should try to write this draft/RFC under the assumption
> that things may change, but they may as well stay as is, or may
> stay as is longer than we hoped.

Agreed.  I think.


>> If the document were consistent about that, this would work for me.
>> And, again, I have no problem pushing the whole discussion to the
>> Style Manual as long as (as you indicated) it isn't too static
>> _and_ whatever is said in this document not be misleading or
>> confuse things.   For this case in particular, I'd rather see the
>> whole name/example distinction go to the Style Guide because there
>> are some special nuances there, ones that might evolve as
>> understanding increases.
> 
> I don't mind discussing how to handle nuances. But I'm definitely
> against punting on some basic guidelines just because there might
> be "special nuances".

I don't think we have any disagreement there either.

>>>> 8. JCK's 4(a) - 4(c) on NFC, directionality, naming: No
>>>> discussion so  far, but again, with chair hat off, this sounds
>>>> like style guide  material, not policy.
>>> 
>>> In particular with respect to (4c), I'd argue that there's
>>> *nobody* with a name such as Cyrillic "раураӏ".
>>> 
>>> (Just in case there were, it would be required to also have a
>>> Latin script equivalent (most probably something like "Raupai"),
>>> at which point it would be clear that it's not Latin script, and
>>> any interested user could cut-and-paste it into a tool that would
>>> reveal the exact code points if needed.)
>> 
>> The Cyrillic paypal example was chosen, not because it was a
>> realistic name but because it is extremely familiar to many of
>> those who might be reading this discussion and/or the final
>> document. However, and probably sadly, you have just made my point
>> (or three of them):
>> 
>> (i)  The document says "names",  While it distinguishes
>>      between names of authors and names of companies and
>>      geographic entities, it does not draw further distinctions.
>>      I.e., it does not clearly distinguish among, e.g., personal
>>      (or family, etc.) names of authors or editors (of documents
>>      and maybe in references), organization names in those
>>      contexts, section titles and document titles in references,
>>      or even names in examples or running text or quotations.
>>      Increasingly broad readings of "names" along those lines
>>      increase the odds of just such a string appearing.   Such
>>      lack of precision about the category is a problem and, in
>>      particular, "раураӏ" (with or without something like
>>      "Raurai") might plausibly occur in some of them, even if not
>>      the first.  "paypal" (ASCII) is certainly a company name and
>>      "раураӏ" (Cyrillic) might be too, but the document
>>      makes the presence or absence of an ASCII interpretation a
>>      matter of discretion of the author (I trust the RPC there,
>>      but your separate comments suggests that you see the point).
>>      In particular, see the overlap between this and your comment
>>      about company names in your other note, so we might be close
>>      to agreement on the subject after all.
 
> [repeating myself] My understanding is that the document currently
> does not, and should not, make the presence of a Latin equivalent a
> matter of author discretion.

I think we are still discussing "Latin" versus "ASCII".  And the
current draft does make those "equivalents" discretionary for company
and geographic names and does not address names in references at all.
Other threads, I think.

> If you think 'names' isn't clear enough, I'm happy to discuss text
> that makes this clearer.

>> (ii) A construction like "name (something)" does not imply
>>      only "'name' not in Latin script" but could also be, e.g.,
>>      "'string that might be a name' followed by a pronunciation
>>      hint or explanation".  Consider, e.g., "King Charles II (of
>>      France)". Because many people get the pronunciation wrong, I
>>      might even want to write "Klensin" in text followed by a
>>      phonetic alphabet presentation.  So that construction does
>>      not automatically imply that the name is other than Latin
>>      script.
 
> This is just a minor side issue, but I'm looking forward to "King
> Charles II (of France)", or for that matter "King Charles III (of
> the United Kingdom)" to write his first RFC.

Again, I picked an example that would be easily understood (or looked
up if needed) by almost all readers of this thread.  And I avoided
the more complicated, mixed issue, case.  But I was not thinking only
of RFC Authors.   While, if I recall, RFC 7997 strongly discouraged
it, this document appears to allow non-ASCII characters in running
text while providing little guidance about their use.   So, as a
slightly far-fetched example for an RFC but to make the point,
suppose an author, trying to be precise, wrote, "The protocols, as
recorded by حمورابی specified...".  With the understanding
that I'd prefer to use the original script in that example (Unicode
has not, to my knowledge, standardized code points for Akkadian
cuneiform), my lack of knowledge of Farsi, or even of the ability to
recognize that string as Farsi, causes me to hope there would be a
requirement
for an all-ASCII equivalent.  The document does not address that, or
even say something vaguely like "authors are encouraged to supply
ASCII (or Latin, or ...) equivalents when readers might want those
and the RPC may require such equivalents".

Now, with that as background, consider a similar sentence referring
to דוד המלך.   Unlike Hammurabi, "King David" is ambiguous --
there have been a lot of those, in multiple countries.  So, I'd
expect a parenthetical note containing "King David" _and_, unless it
were extremely clear from context, an explanation of which one.  Now
that would enable the presence of the parenthetical string to be a
clue that the prior string was a name but the contents of that
string would not be just an equivalent name.   And, if "King David"
were written in ASCII, as I'd expect, the parenthetical, perhaps
"(the hotel in Jerusalem)" would still be there and would be clearing
up an ambiguity, not providing an equivalent name.


>> (iii) "раураӏ" (or, worse, "раүраӏ") is immediately
>>    recognizable as Cyrillic (or not) depending on
>>      the renderer's choice of display type styles or
>>      fonts (something over which we have little control) and, even
>>      then, far more easily by those are sensitive to such things
>>      than those who are not (reader distinctions over which we
>>      have even less control).  Obvious to you and me might not be
>>      obvious to a reader and, depending on the script, might not
>>      even be obvious to the RPC.
> 
> Homograph attacks of course assume that there is no difference, or
> the difference is too small to be recognized.

Right.
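
For anyone who has not tried it, the sort of cut-and-paste tool
mentioned above, one that reveals the exact code points, can be
sketched in a few lines of Python using only the standard unicodedata
module.  This is an illustration, not a recommendation of any
particular tool: the function name describe() is made up for this
note, and the escaped string assumes the Cyrillic example in this
thread is U+0440 U+0430 U+0443 U+0440 U+0430 U+04CF.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode character name) pairs for each character.

    describe() is a hypothetical helper name used only for this sketch.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]

latin = "paypal"
# Assumed code points for the all-Cyrillic look-alike discussed in the thread.
cyrillic = "\u0440\u0430\u0443\u0440\u0430\u04cf"

for label, s in (("Latin", latin), ("Cyrillic", cyrillic)):
    print(label)
    for code_point, name in describe(s):
        print(f"  {code_point}  {name}")
```

The character names make the script obvious even when the rendering
does not, which is the point: the distinction is in the code points,
not in what a given font happens to show.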

>> And, of course, none of that addresses the directionality issues.
> 
> I'm not aware of any serious directionality issues. If a name is
> RTL (e.g. Arabic or Hebrew script), then if it's inline (e.g. in an
> Acknowledgement section), the Unicode bidi algorithm should just
> take care of it. If it's in the header, it also should work, even
> in the ASCII version. If it's alone on a single line, such as in
> the Author's Address section, an LRM (left-to-right mark) may be
> needed at the start in the ASCII version to keep the name
> left-justified. In HTML, that can be solved with CSS.

While I hope and trust that they wouldn't show up in names, the
recent discussions we have been party to about IRIs and isolates
suggest that more subtle directionality issues may exist than the one
you identify above.  See below.
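
To make the simple case concrete (a name alone on a line that needs
an LRM to stay left-justified), here is a rough Python sketch.  It
looks only at the first strongly directional character, which is an
approximation of how the Unicode bidi algorithm picks a paragraph
direction; it deliberately ignores isolates and embeddings, i.e.,
exactly the subtler cases.  The names starts_rtl() and left_justify()
are invented for this illustration.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def starts_rtl(text):
    """True if the first strongly directional character is right-to-left.

    Simplification: isolates, embeddings, and overrides are not handled.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):   # Hebrew-like or Arabic-like strong RTL
            return True
        if bidi == "L":           # strong left-to-right
            return False
    return False                  # no strong character found

def left_justify(line):
    """Prepend an LRM when the line would otherwise be treated as RTL."""
    return LRM + line if starts_rtl(line) else line
```

A line starting with Hebrew or Arabic letters gets the mark; a line
starting with Latin letters is returned unchanged.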

>> All of this could be taken as strong arguments for moving far more
>> of the discussion to the Style Guide but then to be sure this
>> document and the Style Guide do not diverge (or even appear to do
>> so) and probably to include explicit pointers to the Style Guide
>> for details.
>> 
>>>> 9. JCK's 5 on making a list of scripts and languages: No
>>>> discussion.  Silence is not a good basis on which to judge
>>>> consensus.

>>> Already said so above, but I think this makes things too
>>> inflexible. If the RPC really feels it would help them, they can
>>> always start such a list, but there's also a danger this would be
>>> interpreted as exclusionary (somebody claiming somewhere "RFCs can
>>> be written by Chinese and Japanese, but not Koreans" just because
>>> the RPC didn't yet have a case of a Korean author and therefore
>>> didn't yet put Korean/Hangul in the list).
>> 
>> I'm not sure I understand the lack of flexibility you are seeing.
>> I did not propose making that list part of an RFC, nor even of a
>> more easily updated Style Guide, but simply a list, updated
>> whenever the RPC considers that appropriate.  Under any
>> circumstances I can easily imagine, it would be updated only by
>> adding languages and/or scripts, not removing them (unless, I
>> suppose, a language or script about which they thought they were
>> confident turned out to be more problematic than they had assumed,
>> but, while I wouldn't want to prohibit that, I'd expect it to be
>> so rare as to be irrelevant).  You seem to have inferred a  "you
>> can't write text in something that is not on the list" situation.
>> I never intended that.

> I'm not claiming that you intended it. What I wrote is that some
> third party may interpret it that way. Or they may think that it
> would be a major hassle to be the first to use a particular script
> or language.

I think that concern is reasonable.  However, I think it is easily
addressed by either a comment in this document or an introduction to
the list that discourages that interpretation, preferably both.  I'm
happy to draft either or both if this gets traction.  I also think
that the reality about this is suggested by several of your and
Carsten's recent comments.  With the way the IETF (and, to a
considerable degree the other streams) work these days, we pretty
much know the scripts, and even the languages, that are likely to
turn up in I-Ds being handed off to the RFC Editor.  To borrow from
what I think was your example, if the RPC suddenly needed an expert
on Korean, it would take me (and probably others) under five minutes
to turn one up, and turn one up with prior IETF experience.  The
languages and scripts that could turn into major hassles, or even
significant delays, are, from the standpoint of IETF participants
and those likely to be writing I-Ds that could turn into RFCs,
almost certain to be a little obscure... with some delays or major
hassles no matter what we do.  Part of that doesn't involve the RPC
at all: If I come to one of the streams with text, even my name, in
Klingon (to avoid calling out any earthly script), I assume I'm going
to get some curious comments and maybe pushback such that, even if it
is discriminatory to any Klingons who might want to participate in
the IETF, those characters would probably never get to the RPC.
That makes one approach to this close to "any displayable character":
saying explicitly in the text that while that is our aspiration, the
more the languages and scripts used are unfamiliar in the IETF and to
the RPC, the more difficulties and delays may be encountered in the
process.

From that point of view, there should be an expectation of a hassle
and/or delays for any new-to-the-RPC obscure scripts, with or without
the list I suggested.  The advantage of the list is the avoidance of
surprises and informing authors and streams about what is likely to
go easily and what might take longer and, as part of that, where
early discussions with the RPC might be useful for all concerned.

And, frankly, as far as author names are concerned, I'd like authors
who are tempted to write their names in really obscure scripts to
think about whether doing so would be of value to readers and, if
not, to just go with the all-ASCII (or all common Latin characters)
"equivalent" option and bypass all of this.

>> Instead, think
>> about it as a convenience for authors and reviewers, especially
>> document shepherds.  If a language/script combination, or, where
>> relevant, just a script, are on the list, people in the document
>> development process can have reasonable assurance that the text
>> will be handled smoothly and efficiently.   If it isn't, then that
>> should serve as a recommendation to consult with the RPC earlier
>> in the process than handoff from the stream.  If that
>> recommendation were ignored, the authors and stream should expect
>> the possibility of delays in processing as the RPC checks the text
>> strings and finds advice about them if needed.
> 
> It may not only be about scripts/languages, but also about specific
> (groups of) characters. See also the discussion of Latin script in
> a separate thread.

> There are a lot of scripts/languages where most if not all
> characters are highly unproblematic. There are some scripts where
> most characters are highly unproblematic, but some may be tricky in
> some situations. It can be expected that the authors should be
> familiar with the issues, either because it's about their names or
> because it's in their examples. The examples will be there for a
> purpose. An example to show some specific bidi issue in a protocol
> may need different treatment from a name, even if both are in the
> same language and script.

Yes.

> The RPC certainly is good about asking authors.
> Shepherds/chairs/ADs will also either have the relevant knowledge,
> or will have asked questions, or will have read the answers to
> questions from others. Issues with non-ASCII examples,... should
> have surfaced long before going to the RPC, and it's just a matter
> of telling the RPC about these if there are any. There shouldn't be
> any significantly longer delays than for other issues when
> publishing an RFC (of which we all know there are many).

And, I think I've said this before, but the cases I'm worried about
are the ones in which things do not go as smoothly as you describe
above.  I don't know how recently, but I've seen cases in which there
were i18n-related issues with a document that didn't surface until
what is now called the RPC started digging into it.  I've also seen a
few near-misses in which authors or WG Chairs reached out for advice,
but might not have done so.

Maybe the right solution is for this document to explicitly say that
there can be complicated cases involving non-ASCII scripts and that
authors (and shepherds, ADs, etc.) are urged to have early
discussions with the RPC if they see any potential problems or are in
doubt about whether such problems might exist.

Please also see the last (new) paragraph of my note to you and Brian
[1]

    john

[1]
https://mailarchive.ietf.org/arch/msg/rswg/JLBVAcz-do08QgqyuWnSA1DLS4M/

-- 
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
