--On Tuesday, October 14, 2025 15:29 -0700 Alexis Rossi
<[email protected]> wrote:

> On Tue, Oct 14, 2025 at 2:11 PM John R Levine <[email protected]>
> wrote:
> 
>> On Tue, 14 Oct 2025, Carsten Bormann wrote:
>> >> In cases of outright errors in
>> >> character names such as misspellings, a character may be given
>> >> a formal name alias.
>> > 
>> > Right.  Do we use the original, possibly broken name (which is
>> > at least
>> promised to be stable) or the corrected one?  There are a couple
>> hundred alias names, so this isn't entirely theoretical.
>> > 
>> > (Having to ask that question puts me in the camp of liking
>> > U+NNNN more,
>> but putting the most corrected name at the time of writing *as
>> well* might help readers.
>> > [1] might give us some easy ways to make that happen, but
>> unfortunately the [2] referenced from [1] does not indicate how
>> the name is chosen.
>> > This is a defect.)
>> 
>> I was hoping we would expect the authors and editors to use a
>> little common sense.  In the usual case that the name is not
>> broken, they can use it.  If the name might be confusing or the
>> number is important to the point they're making, they can use the
>> number or maybe both.  Let's not try to specify this down to the
>> last pixel.

> Strong seconding to John's point about allowing common sense and
> avoiding over-specification in an RSWG doc.


Alexis,

I don't know if I agree with Carsten or not (would take more
thinking), but [just] allowing and relying on common sense seems a
bit naive to me.   The second paragraph of the Abstract actually
highlights part of the problem.  

Let me see if I can take "all displayable text is allowed as long as
the reader of an RFC can interpret that text" apart to illustrate the
problem.

First, in many cases, interpreting text requires, at a minimum, that
the reader be able to distinguish among characters in a world filled
with look-alike characters.  If the text is a string, "interpret"
normally means the ability to read and understand that string (i.e.,
to have language understanding), not just to see the glyphs.
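
(A quick illustration of the look-alike problem, offered as a rough
sketch rather than a claim about anyone's tooling; the specific
characters are arbitrary examples:)

    import unicodedata

    # Two code points that render identically in most fonts but are
    # entirely different characters to any comparison or search tool.
    for ch in ("o", "\u043E"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+006F  LATIN SMALL LETTER O
    # U+043E  CYRILLIC SMALL LETTER O

A reader who sees only the rendered glyphs has no way to tell those
two apart; the code points make the difference explicit.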

Second, there are a huge number of languages and writing systems,
present and past, in the world with different rules for forming
strings (and, for strings that consist of more than one "word",
separating words or whatever gets separated).   With a general rule
like that, and a term as ambiguous as "interpret", there are almost
endless ways in which an author trying to be clever, prove a point,
and/or show off their knowledge and skills could leave readers with
no idea what they were looking at, whether a string extracted from
an RFC was equivalent to a string found elsewhere, and so on.
<sarcasm> Of course, there has never been anyone with a personality
like that participating in the IETF. </sarcasm>

We used to think that NFC was sufficient to sort out string ordering
and relationships, but we've discovered a large number of cases
where we were wrong about that level of sufficiency and, IIRC, about
some aspects of NFC's relative stability and its implications.  I'd
still recommend NFC, but only if people are aware of its
limitations, including limitations for specific scripts and
language-script combinations.
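
(To make the NFC point concrete, a minimal sketch, assuming nothing
beyond Python's standard unicodedata module; it shows what NFC does
and does not unify:)

    import unicodedata

    # NFC unifies composed and decomposed forms of the "same" character...
    composed = "\u00E9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT
    assert unicodedata.normalize("NFC", composed) == \
           unicodedata.normalize("NFC", decomposed)

    # ...but it does nothing about cross-script look-alikes, ordering,
    # or language-specific notions of equivalence.
    assert unicodedata.normalize("NFC", "a") != \
           unicodedata.normalize("NFC", "\u0430")  # CYRILLIC SMALL LETTER A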

In that context, "common sense" works only if the person applying it
has a reasonable understanding of the writing system involved and,
where relevant, the language.   If there are people in the world with
that level of understanding of all of the world's scripts and
languages, past or present, I've seen little evidence of many of them
as active IETF participants, much less as active participants who are
likely to try writing I-Ds and RFCs.

As soon as one moves away from relatively common scripts, the other
problem is that some browsers (and other rendering engines), in some
parts of the world, may be able to correctly render a given string
(available fonts are important there, but not the only issue), others
may not, and the capabilities may vary over time.   Not all rendering
engines are browsers either and there is no guarantee that the IETF
will even have heard of rendering engines that might sensibly be used
with particular scripts.  As just one obvious example, if you were to
allow a script that was first standardized in Unicode 17.0 (i.e.,
five weeks ago), the odds that such a script would be broadly
supported across all rendering engines, and that such support would
appear everywhere at the same time, are, to put it mildly, pretty
low even if some rendering engines got ahead of formal publication
of the standard.

In addition, the RPC, in practice, needs to be able to work with
whatever comes their way, presumably including being able to render
characters and strings with their tools and to "interpret" them.

And, in case it is not obvious, the right criteria for the use
(inclusion in RFCs) of single, isolated characters are almost
certainly different from those for strings.  For starters, NFC
and related choices about string composition and rendering are almost
always going to be irrelevant to isolated characters but important to
strings.

An attempt at a few constructive suggestions (if relevant, please
read to the end before exploding):

(1) Create two lists of scripts and languages.  The first should
consist of very common code point/script/language cases, ones with
which a non-trivial number of IETF participants and, to the extent
that we can guess, likely readers will be familiar.  The second
should include scripts and probably languages that are in common use
in the world, common enough that the rules and conventions for
reading and writing them are well understood, stable, and easily
accessible.   Reliance on common sense alone works only for the
first list, and then only with the common sense of people who can be
relied upon to know what they don't know.  Use of the second should
require the author to provide a rationale (before stream approval)
for including the characters in an RFC at all, and to work with the
RPC on inclusion of any multiple-codepoint string in the text _and_
inclusion of numeric code point information (presumably U+nnnn for
individual characters and probably \unnnn\upppp\uqqqq etc. for
longer strings; see the trivial sketch after these suggestions).
Characters and strings that fall into neither list are either not
allowed or require a strong rationale for why there are not better
choices.  If accepted, they must always be accompanied by a numeric
code point list.

(2) While I think this document should contain list definitions and
rules that at least roughly conform to the above, the contents of
those lists (characters, scripts, languages) can and should be left
to the RPC with the understanding that they will evolve over time.
The choices should be up to them, but I'd advise that the initial
version of the first group contain only scripts and languages with
which they are very comfortable working.  Later expansion is likely
and would be reasonable, but how, what, and when should be up to them.

(3) As suggested above, single characters are probably different from
strings, especially strings more than a few code points in length.
If it is necessary to identify particular characters/code points,
using names is fine, especially if they make what is going on
clearer, but they should generally be accompanied by numeric code
point identifiers (in practice, someone unskilled in navigating the
Unicode specifications may find looking up a name by itself more
challenging than is probably appropriate).  That restriction is
likely to be consistent with "interpretation" for many readers.  For
characters on the second list (and those on neither), the numeric
values should always be present whether the names are or not.  For
those on the first
list, the choices can be left to common sense and the traditional
conventions about author choices and consistency within a document.
As with the lists themselves, I don't think that needs to be written
into the document as long as it is generally understood that the RPC
has discretion to apply their good sense, insist on intra-document
consistency when that appears to them to be important, and be
resistant to bullying.

(4) Nothing above should prevent the RPC from requiring numeric code
point identifiers (instead of or in addition to renderable
characters)
in situations where there is any possibility, any possibility at all,
of confusion among look-alike characters (in any set of type styles)
reducing the clarity of a document.  Indeed, they should be strongly
encouraged to impose that requirement.
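
(For concreteness about the numeric forms mentioned in (1) and (3),
here is the trivial sketch.  It assumes nothing beyond Python's
standard library, is not a proposal for RPC tooling or exact syntax,
and the helper names are made up purely for illustration.)

    def as_code_points(s):
        # Single characters or short strings as U+NNNN identifiers.
        return " ".join(f"U+{ord(ch):04X}" for ch in s)

    def as_escapes(s):
        # Longer strings in the \unnnn\upppp... style mentioned in (1).
        return "".join(f"\\u{ord(ch):04X}" for ch in s)

    print(as_code_points("caf\u00E9"))  # U+0063 U+0061 U+0066 U+00E9
    print(as_escapes("caf\u00E9"))      # \u0063\u0061\u0066\u00E9

Code points outside the BMP would obviously need more than four hex
digits, which is one more detail that would have to be pinned down
somewhere.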

Does that help us move forward?

   john




