Paul,
First, just to establish some context around what sounds to me like
irritation on your part, please note that my first note in this
thread contained some very specific suggestions. I avoided
suggesting specific text because I recognize that our writing styles
(and even how we organize documents) are very different and was
concerned that such specific suggestions might result in arguments
(more with others than with you) about style and phrasing rather than
about principles and substance. Text, and careful explanation of the
proposed changes, are below, as requested, but first a general
observation and related explanations because I believe that, if we
are in disagreement about fundamental principles, trying to resolve
those disagreements by arguing about phrasing is not a good use of
time. YMMV, of course.
My apologies for the delay in getting this posted. It had to
compete with other work I owed the IETF for priority and required
considerable time and checking of sources. I didn't expect you to do
that work, I was just hoping that the WG could agree that the general
principles for which I was arguing could be accepted (or not) before
the work was done.
Whatever text I might have proposed more than a week ago, comments
since my initial response have made two things clear to me. One is
that there is a preference in the WG (at least among those who are
speaking up) for saying and explaining less, rather than more. The
other is that there seems to be a general sense in the WG to leave
details on many issues to the discretion of the RPC (a position with
which I strongly agree). However, for some participants, that
extends into leaving figuring out what the details are to which RPC
discretion should be applied to their discretion as well -- a sort of
distinction between resolving problems and defining and framing those
problems. I think there are probably many topic/issue areas in which
ignoring that distinction is entirely reasonable but I suggest that
there is a boundary, one that probably connects to the Socratic
principle about knowing what one does not know, to which we should be
sensitive (and to which the current draft does not appear to be
sensitive).
The problem with i18n issues in particular is that, at least for
unfamiliar scripts and some familiar ones, the path may appear
reasonably wide but is narrower than it appears with a selection of
dragons on both sides. That is one reason why documents such as RFCs
8264-8266 and related work are as complicated as they are even while
they do not cover all of the issues. Many collections of
use-sensitive, language-sensitive, and script-sensitive Unicode and
W3C specifications and sets of suggestions exist and provide other
illustrations of the complexities and potential issues involved as
well as remedies for specific issues. While I'm confident that you
are familiar with those documents and the issues that motivated them,
I don't believe that is generally true in the community or that we've
made that level of knowledge a requirement for RPC staff. I believe
it would not be wise to do so (or even to try). However, without it,
knowing what they can and cannot handle may, itself, require expert
advice. That is distinct from the expert advice needed should the
RPC conclude that they cannot efficiently and accurately handle
particular cases internally.
Those issues should have particular emphasis since the RSWG/RSAB have
claimed to be concerned about accessibility to RFCs (as, FWIW, I
believe we should be). However, accessibility extends into the need
to support, or at least not impede, text to speech efforts. Those
efforts depend on mechanisms that are language-sensitive and having
things written correctly in situations in which simple phonetic
reading of text does not work or does not work unless the characters
are presented correctly. And, of course, it interacts with "can
interpret that text" (more on that below).
What I am proposing doing is to identify and clarify some issues a
bit more in the document and explicitly ask the RPC to do something
that will help them -- and help authors of documents that are
expected to evolve into RFCs -- identify the languages and scripts
that they are confident they can handle well and easily, implicitly
warning authors that they had best get advice when needed as well as
providing some internal guidance about when advice might be needed.
Because it is in everyone's interest to have documents be as complete
as possible before they reach the RPC, it seems to me that we want
authors (and reviewers) to have that information long before stream
approval and the beginning of in-depth RPC processing. And I think it is
useful to explicitly ask for / require that, not just allow the RPC
to decide that it should be done (or not) and what priority it should
have.
Final general comment: there are at least two places, possibly more, in
the text where "Unicode characters" is used in ways that can be taken
as "non-ASCII characters" and not "any characters in the Unicode
repertoire". Even if that interpretation is not intended, the
phrasing should be avoided. I've identified specific examples below
and proposed fixes, but might have missed some (and the advice may be
of use for other documents and even the evolving Style Guide). In
addition to the ASCII/non-ASCII issues, there is potential confusion
between "Unicode code point" and "Unicode character" (e.g.,
\u006F\u0336 is a "Unicode character" by most definitions, but two
code points). The definition in Section 1.1 ("...characters define
in [UnicodeCurrent]") is not helpful in that regard, at least without
a very specific pointer to a definition, not a wave of the hand in
the direction of the current version of The Unicode Standard (now
13Mb of text in PDF form). I recommend avoiding "Unicode character",
at least without a specific definition in a nearby sentence, entirely.
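As a minimal sketch of the code point versus character distinction, assuming Python's standard unicodedata module and a hypothetical helper named describe(), the sequence mentioned above can be inspected as follows. The e-plus-acute contrast is an added illustration, not something taken from the draft.

```python
import unicodedata


def describe(s: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in s (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]


# One user-perceived character, written as two code points.
struck_o = "\u006F\u0336"
struck_o_nfc = unicodedata.normalize("NFC", struck_o)

# Contrast (illustrative assumption): e + combining acute has a precomposed form.
e_acute = "e\u0301"
e_acute_nfc = unicodedata.normalize("NFC", e_acute)

if __name__ == "__main__":
    for entry in describe(struck_o_nfc):
        print(entry)
    # NFC leaves struck_o as two code points but folds e_acute into one.
    print(len(struck_o_nfc), len(e_acute_nfc))
```

Even after NFC, the first sequence remains two code points, which is why "Unicode character" without a definition can be read either way.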
Specific text identification and suggestions follow.
(1) Section 2, second paragraph: "support appropriate Unicode string
matching behaviors" is fairly serious handwaving, since the question
of what constitutes such behaviors is inherently very controversial.
It is also not clear what is meant. If the intent is to say that
people who are searching for, or within, RFCs should make sure their
search tools are up to the task, say that (although I'm not sure if
that fits in this document). If it is an instruction to those who
might be developing new RFC-specific tools (e.g., in conjunction with
the web page redesign), say that and provide some references. I
might start with
https://www.w3.org/TR/charmod-norm/ and
https://icu4c-demos.unicode.org/icu-bin/scompare
but you might have better (or additional) ideas.
Or, if we don't know what that phrase actually means, we should
consider stopping the handwaving and revise the sentence to say
"should return accurate and predictable results" and stop there.
Still handwaving, but at least more obvious about it.
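To illustrate why "appropriate" matching is underspecified, here is a rough sketch, assuming Python's unicodedata module and two hypothetical comparison functions. Each is only one possible definition of a match, and they give different answers for the same pair of strings.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Match after canonical normalization only (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a: str, b: str) -> bool:
    """Match after normalization and case folding (one possible definition)."""
    def fold(s: str) -> str:
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return fold(a) == fold(b)


precomposed = "\u00e9"   # e with acute as a single code point
decomposed = "e\u0301"   # e followed by a combining acute

if __name__ == "__main__":
    print(precomposed == decomposed)               # raw comparison
    print(nfc_equal(precomposed, decomposed))      # normalized comparison
    print(nfc_equal("Stra\u00dfe", "STRASSE"))     # normalization alone
    print(caseless_equal("Stra\u00dfe", "STRASSE"))  # with case folding
```

Which of these, if any, a search tool for RFCs should use is exactly the kind of question the phrase leaves open.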
(2) Section 3, first paragraph (and its repetition in the Abstract):
"as long as the reader of an RFC can interpret that text" is
aspirational and not a basis on which the RPC (or draft authors) can
make decisions without the ability to read the minds and know the
skills of, and tools available to, future readers. In conjunction
with other suggestions below, that phrase might be adjusted to "as
long as there is a high expectation that readers of an RFC will be
able to interpret its text as intended". That still leaves open the
question of who would have those expectations and on what basis, but
at least makes it clear that this is somewhat subjective and
predictive.
(3) Section 3: The Unicode issues with display of a single character
are very different from, and generally easier than, those that involve
strings (more precisely, multiple code point sequences). Some of the
existing text appears to be confused (or confusing) about that. For
example, the second sentence of the second paragraph says
(parenthetical material added to explain the concern):
'If an RFC includes such characters in normative or
descriptive text (("characters", plural, and "text", hence
maybe implying a string)), the RFC needs to also clearly
describe the character (("character", singular, i.e., an
isolated one)).'
It is just not clear what is being talked about. I think the document
really needs a separate subsection to talk about isolated characters,
probably a new 3.2, pushing "Examples" down to 3.3 (see below).
Suggestions (you may be able to do better and I remain concerned
about differences in style, but, at least as a starting point):
(3a) Add to first paragraph of Section 3:
'Isolated characters, such as the commonly used WHITE SMILING
FACE "☺" (U+263A), raise different issues than text strings
(or strings of code points more generally). They are
discussed separately below for the cases in which the
distinction may be important.'
Variations of a phrase similar to the above and referring to a
specific character could easily appear in a draft RFC as either part
of an example or as a declarative (possibly normative) statement. It
raises a few stylistic issues (e.g., should the symbol, which is not
a member of the emoji category, be quoted or not -- while I've done
so here, the examples of currency symbols and colors in Section 3.3
are not quoted) and I have no idea if advice on those issues should
be in this document or in the Style Guide. But it should be in at
least one or the other.
(3b) In the third paragraph, change "identifying Unicode characters"
to "identifying non-ASCII characters"
(3c) Add a new 3.2 (as follows or something similar) above the
current subsection with that number and renumbering.
"3.2 Individual characters
"Many examples, and some normative statements, will be about
particular characters (or Unicode code points) rather than,
e.g., using them in a name or as part of a discussion of
something else. In this document, the color display sentence
at the end of Section 3.3 and the WHITE SMILING FACE example
early in this Section illustrate that point. Unless it is
obvious, those characters should be identified by name or
numerical code point identifier(s) or, if there is doubt as
to what will be most clear, both. When characters are part
of strings, their exact interpretations may be clear from
context or more easily deduced, but, when they are called out
as single characters, precise identification is important.
It is perhaps even more important when a single character is
composed of multiple Unicode code points (even, or
especially, if NFC is used as suggested above)."
I think the introduction to Section 3 should also make that
distinction, maybe just with a new statement like:
"With the exception of code points that cannot be displayed
at all and isolated combining forms, the optimal conventions
for mention of individual characters may differ somewhat from
those for strings of characters. This is discussed in more
detail below."
I'd be inclined to put it after the first sentence of the fourth
paragraph ("Note that this policy only applies...") but there might
be better places.
(4) Other Section 3 issues:
As I mentioned in my earlier note, NFC is not sufficient to keep
people out of trouble and the issue is not use of unnormalized text
in an RFC. I also have doubts about the general advice that names do
not require character identification.
Suggested changes:
(4a) At least add a paragraph or two to the end of the introductory
material in Section 3 (above 3.1) reading something like:
"In addition to normalization issues, characters and strings
with directional properties other than those for Latin Script
(e.g., right to left (RtoL)) can pose special challenges when
embedded in conventional left to right text (including ASCII
text). Careful attention should be given to them. Authors
are expected to give the RPC specific advice on how they
should be handled including whether direction-specifying code
points should be used with such strings.
"Authors who intend to use emoji sequences in RFCs should be
aware that NFC is, in general, not useful for establishing
canonical forms for those sequences."
Remembering that we had a rather controversial snit about a then
newly added Arabic character some years ago, a reference to at least
some code points that are problematic for NFC and their use would
probably be a good idea, as would references to further discussion of
the emoji combining sequence issues. If you want such references and
don't have them handy, I'll try to dig something out. And, for
whatever it is worth, if either RPC staff or other readers of this
note have no idea what I'm talking about, that reinforces my point
about making this document somewhat more explicit about them.
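For readers unfamiliar with these issues, a small sketch may help. It assumes Python's unicodedata module; the helper has_rtl() and the choice of the heart symbol with and without a variation selector are illustrative assumptions, not examples drawn from the draft.

```python
import unicodedata

# Bidi classes for strongly right-to-left characters (Hebrew-like and Arabic-like).
RTL_CLASSES = {"R", "AL"}


def has_rtl(s: str) -> bool:
    """Flag strings that contain strongly right-to-left characters (hypothetical helper)."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in s)


# The same symbol with and without an emoji variation selector (U+FE0F).
heart_text = "\u2764"
heart_emoji = "\u2764\uFE0F"
heart_text_nfc = unicodedata.normalize("NFC", heart_text)
heart_emoji_nfc = unicodedata.normalize("NFC", heart_emoji)

if __name__ == "__main__":
    print(has_rtl("abc"), has_rtl("abc \u05D0\u05D1"))
    # NFC leaves both sequences untouched, so they still compare unequal.
    print(heart_text_nfc == heart_emoji_nfc)
```

A check like has_rtl() only signals that directional handling needs attention; it says nothing about how the text should be embedded, which is the advice authors would need to supply.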
(4b) Section 3.1 does not seem quite right. As I first read it, it
would allow an author who normally writes their name in some script
with very non-European origins (i.e., not a Greek-Latin-Cyrillic
based script) to simply use that name without either an ASCII
interpretation or a code point list. When I read it three or four
more times, I don't think that was the intent. Suggestion:
Replace second sentence by:
"These authors can either give their names using only ASCII
characters, or may use a string consisting of non-ASCII
characters followed by an ASCII interpretation of their name.
In the second case, if they believe it is important for
clarity, they should also provide numerical code point
identification for the non-ASCII characters in their names."
(4c) I think the text should not so easily dismiss code point
identifications of names and assume that an ASCII interpretation is
always a good substitute. To use a time-worn example, "раураӏ"
is certainly a name, but I believe one would want to see either a
careful textual explanation (in English, something the document does
not suggest) and/or "\u0440\u0430\u0443\u0440\u0430\u04cf" along with
it, lest the name in running text be construed as the all-ASCII
"paypal". A code point list for the latter might not be a terrible
idea either. At a very minimum, add a paragraph to the end of 3.1
reading something like:
"Despite the above, a name, other than a personal name, that
involves non-ASCII characters should always be accompanied by
either a list of the code points involved (in some form
specified in [BCP137]) or a careful explanation if there is
any plausible possibility of its being construed as an
all-ASCII string. Where there is doubt, the document should
err on the side of including a code point list."
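The lookalike example above can be made concrete with a short sketch. The helper code_point_list() is hypothetical, and the U+XXXX notation is used here as an assumption about what a code point list might look like.

```python
def code_point_list(s: str) -> str:
    """Render a string as space-separated U+XXXX identifiers (hypothetical helper)."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# The Cyrillic lookalike from the example above, written with escapes.
lookalike = "\u0440\u0430\u0443\u0440\u0430\u04cf"
ascii_name = "paypal"

if __name__ == "__main__":
    print(lookalike == ascii_name)      # False: visually similar, different code points
    print(lookalike.isascii())          # False
    print(code_point_list(lookalike))   # U+0440 U+0430 U+0443 U+0440 U+0430 U+04CF
    print(code_point_list(ascii_name))  # U+0070 U+0061 U+0079 U+0070 U+0061 U+006C
```

Printing both lists side by side is what makes the difference visible when the rendered glyphs do not.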
(4d) Section 3.2, first paragraph, replace "monetary symbol" (twice)
by "currency symbol". The latter terminology is generally preferred
in discussions of typography including, IIRC, in The Unicode Standard.
(5) Add a new Section 4 above the current one and renumber the
following sections. The new material is somewhat more explanatory
than I would prefer, but I don't believe there is another IETF
document that spells the issues out and, at least IMO, the very fact
that we are having this discussion suggests that there is not general
understanding in the community of the implications of the issues
involved.
"4. Scripts and Languages
"The diversity of the world's languages and writing systems
includes not just differences in pronunciation and the
shapes used to represent characters but, in Unicode and other
digital encoding systems not developed specifically for those
languages, issues such as the sequencing of code points and
the relationships among them. To assure accuracy and
comprehensibility of RFCs that use non-English languages or
non-Latin scripts, the RPC will maintain a list of languages
and scripts that they can evaluate and process easily (either
internally or with the aid of experts they have identified)
and share that list with the community. That list is not
intended to discourage use of any language or script that is
appropriate to the draft in which it is embedded, only to
alert authors and those in the various streams evaluating
draft documents that languages and scripts not on the list
may require additional effort and consequent costs and
delays."
I have not made the distinction in that paragraph between single
characters, names, and strings discussed in (3) and (4) above. It
would be relatively easy to do so, perhaps by inserting something
like "as implied by Sections ... above" in some appropriate place,
but I am guessing that the cost of increased length to get more
precision would not be worth it. Up to you and the WG.
(6) Finally, a nit: "Workgroup" at the top of the document should
probably show "RSWG", not "Network Working Group", especially since
this is not a technical document about networking.
thanks,
john
--On Wednesday, October 15, 2025 19:34 +0000 Paul Hoffman
<[email protected]> wrote:
> On Oct 15, 2025, at 12:15, John C Klensin <[email protected]> wrote:
>> explicitly allow (and
>> request) the RPC to establish a list of scripts (and, where
>> applicable, languages) that they are ready to handle expeditiously.
>
> The current draft currently allows them to do that. If you believe
> it does not, please point to the sentences that you think does not
> allow that so that the draft can be fixed.
>
>> It just seems to me that having the RPC identify what they can
>> handle well and easily (in some way I'm happy to have them work
>> out) so that document writers will know that other things may
>> require some additional time, expertise, and possible rewriting,
>> or that consulting the RPC while the writing work is still in
>> progress would be a good idea, would benefit everyone.
>
> We disagree here. There are *plenty* of non-i18n things that the
> RPC has a problem with handling well and easily. When a draft with
> those things comes in, the RPC learns how to handle them. i18n
> things are no different, except maybe to i18n experts.
>
>> And it would probably save
>> the RPC work rather than putting more burden on them.
>
> If that's actually the case (I'm skeptical), they can already do
> that within the wording of the draft.
>
> Again: if you really want this and you want support for your
> desire, please suggest specific text changes so the WG can evaluate
> them.
>
> --Paul Hoffman