--On Tuesday, October 14, 2025 15:29 -0700 Alexis Rossi <[email protected]> wrote:
> On Tue, Oct 14, 2025 at 2:11 PM John R Levine <[email protected]>
> wrote:
>> On Tue, 14 Oct 2025, Carsten Bormann wrote:
>> >> In cases of outright errors in character names such as
>> >> misspellings, a character may be given a formal name alias.
>> >
>> > Right.  Do we use the original, possibly broken name (which is
>> > at least promised to be stable) or the corrected one?  There are
>> > a couple hundred alias names, so this isn't entirely theoretical.
>> >
>> > (Having to ask that question puts me in the camp of liking
>> > U+NNNN more, but putting the most corrected name at the time of
>> > writing *as well* might help readers.
>> > [1] might give us some easy ways to make that happen, but
>> > unfortunately the [2] referenced from [1] does not indicate how
>> > the name is chosen.  This is a defect.)
>>
>> I was hoping we would expect the authors and editors to use a
>> little common sense.  In the usual case that the name is not
>> broken, they can use it.  If the name might be confusing or the
>> number is important to the point they're making, they can use the
>> number or maybe both.  Let's not try to specify this down to the
>> last pixel.
>
> Strong seconding to John's point about allowing common sense and
> avoiding over-specification in an RSWG doc.

Alexis,

I don't know if I agree with Carsten or not (would take more
thinking), but [just] allowing and relying on common sense seems a
bit naive to me.  The second paragraph of the Abstract actually
highlights part of the problem.  Let me see if I can take "all
displayable text is allowed as long as the reader of an RFC can
interpret that text" apart to illustrate the problem.

First, in at least many cases, for a reader to interpret text
requires, at least, that they be able to distinguish among characters
in a world filled with look-alike characters.
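To make the look-alike problem concrete, here is a quick, purely
illustrative Python check; the strings below are my own example, not
anything from the draft:

```python
import unicodedata

# Two strings that render identically in most fonts:
latin = "pay"          # all Latin letters
mixed = "p\u0430y"     # U+0430 CYRILLIC SMALL LETTER A in the middle

print(latin == mixed)  # False: visually identical, distinct code points
for ch in mixed:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

A reader who sees only the rendered text has no way to tell those two
strings apart, which is exactly why numeric code points matter.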
If the text is a string, "interpret" normally means the ability to
read and understand that string (i.e., have language understanding),
not just see the glyphs.

Second, there are a huge number of languages and writing systems,
present and past, in the world, with different rules for forming
strings (and, for strings that consist of more than one "word",
separating words or whatever gets separated).  With a general rule
like that, and a term as ambiguous as "interpret", there are almost
endless ways in which an author who was trying to be clever and/or
prove a point and/or show their knowledge and skills off to others
could get readers into situations where they had no idea what they
were looking at, whether a string that was extracted from an RFC was
equivalent to a string found elsewhere, and so on.  <sarcasm> Of
course, there has never been anyone with a personality like that
participating in the IETF. </sarcasm>

We used to think that NFC was sufficient to sort out string ordering
and relationships, but we've discovered a large number of cases where
we were wrong about that level of sufficiency and, IIRC, about some
aspects of NFC's relative stability and its implications.  I'd still
recommend NFC, but only if people are aware of its limitations,
including limitations for specific scripts and language-script
combinations.

In that context, "common sense" works only if the person applying it
has a reasonable understanding of the writing system involved and,
where relevant, the language.  If there are people in the world with
that level of understanding of all of the world's scripts and
languages, past or present, I've seen little evidence of many of them
as active IETF participants, much less as active participants who are
likely to try writing I-Ds and RFCs.
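For what it's worth, the NFC point is easy to demonstrate (again,
this is just an illustrative Python sketch of mine, not text from the
draft): NFC unifies canonically equivalent sequences, but it says
nothing about cross-script look-alikes, and a few characters change
identity under it entirely:

```python
import unicodedata

# NFC does unify composed and decomposed forms of the same character...
composed = "\u00E9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# ...but some characters map to different code points under NFC:
# U+212B ANGSTROM SIGN becomes U+00C5 LATIN CAPITAL LETTER A WITH
# RING ABOVE, so a "stable" character name can vanish on normalization.
print(f"U+{ord(unicodedata.normalize('NFC', chr(0x212B))):04X}")
```

And, of course, NFC leaves the Latin/Cyrillic confusables in the
previous example completely untouched.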
As soon as one moves away from relatively common scripts, the other
problem is that some browsers (and other rendering engines), in some
parts of the world, may be able to correctly render a given string
(available fonts are important there, but not the only issue), others
may not, and the capabilities may vary over time.  Not all rendering
engines are browsers, either, and there is no guarantee that the IETF
will even have heard of rendering engines that might sensibly be used
with particular scripts.  As just one obvious example, if you were to
allow a script that first became standardized in Unicode 17.0 (i.e.,
five weeks ago), the odds that such scripts would be broadly
supported across all rendering engines, and that they would become
available at the same time, are, to put it mildly, pretty low even if
some rendering engines got ahead of formal publication of the
standard.  In addition, the RPC, in practice, needs to be able to
work with whatever comes their way, presumably including being able
to render characters and strings with their tools and to "interpret"
them.

And, in case it is not obvious, the right criteria for the use
(inclusion in RFCs) of single, isolated characters are almost
certainly different from those for strings.  For starters, NFC and
related choices about string composition and rendering are almost
always going to be irrelevant to isolated characters but important to
strings.

An attempt at a few constructive suggestions (if relevant, please
read to the end before exploding):

(1) Create two lists of scripts and languages.  The first should
consist of very common code point/script/language cases, ones with
which a non-trivial number of IETF participants and, to the extent to
which we can guess, likely readers, will be familiar.
The second should include scripts and probably languages that are in
common use in the world, common enough that the rules and conventions
for reading and writing them are well-understood, stable, and easily
accessible.

Reliance on common sense alone works for the first list only, and
then only the common sense of people who can be relied upon to know
what they don't know.  Use of the second should require the author to
provide a rationale (before stream approval) for including the
characters in an RFC at all, working with the RPC on inclusion of any
multiple-codepoint string in the text, _and_ inclusion of numeric
code point information (presumably U+nnnn for individual characters
and probably \unnnn\upppp\uqqqq etc. for longer strings).

Characters and strings that fall into neither list are either not
allowed or require strong rationale for why there are not better
choices.  If accepted, they must always be associated with a numeric
code point list.

(2) While I think this document should contain list definitions and
rules that at least roughly conform to the above, the contents of
those lists (characters, scripts, languages) can and should be left
to the RPC with the understanding that they will evolve over time.
The choices should be up to them, but I'd advise that the initial
version of the first group contain only scripts and languages with
which they are very comfortable working.  Later expansion is likely
and would be reasonable, but how, what, and when should be up to
them.

(3) As suggested above, single characters are probably different from
strings, especially strings more than a few code points in length.
If it is necessary to identify particular characters/code points,
using names is fine, especially if they make what is going on
clearer, but they should generally be accompanied by numeric code
point identifiers (in practice, someone unskilled in Unicode
specification navigation may find looking up the name more
challenging than is probably appropriate).  That restriction is
likely to be consistent with "interpretation" for many readers.  For
characters on the second list (and those on neither), the numeric
values should always be present whether the names are or not.  For
those on the first list, the choices can be left to common sense and
the traditional conventions about author choices and consistency
within a document.  As with the lists themselves, I don't think that
needs to be written into the document as long as it is generally
understood that the RPC has discretion to apply their good sense,
insist on intra-document consistency when that appears to them to be
important, and be resistant to bullying.

(4) Nothing above should prevent the RPC from requiring numeric code
point identifiers (instead of, or in addition to, renderable
characters) in situations where there is any possibility, any
possibility at all, of confusion among look-alike characters (in any
set of type styles) reducing the clarity of a document.  Indeed, they
should be strongly encouraged to impose that requirement.

Does that help us move forward?

   john

--
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
