(top post) FWIW, I found this explanation, and drawing together of the material, very helpful. The one thing I might have added explicitly (although the explanation below comes close) is that, when an abstract character requires a sequence of code points to represent, things get _really_ muddy when "character" terms are used outside very specific contexts. I am far more concerned about the other issues I tried to raise and about those Russ raised in his second and third notes today, particularly those he identified as RFC 7997 being better and clearer.
   john

--On Monday, October 27, 2025 17:00 +0100 Carsten Bormann
<[email protected]> wrote:

> On Oct 27, 2025, at 16:14, Russ Housley <[email protected]>
> wrote:
>>
>> I conclude that "Unicode characters" is expected to be well
>> understood term.
>
> We all wish that, but the Unicode consortium did not bring up the
> normative will to make this happen.
>
> For our discussion, "Unicode character" is a synonym of what
> the Unicode consortium actually does define precisely, the
> "Unicode scalar value".
>
> For the details of what people that are closer to the Unicode
> consortium might say, you may or may not want to read the small
> write-up below that an LLM just gave me on the subject, and which
> is astonishingly useful. Just make sure you never confuse "code
> point" with "code unit" :-)
>
> Grüße, Carsten
>
>
> In Unicode, a "character" is an abstract unit of textual
> information; the encoded unit is a code point (or scalar value).
>
> The Unicode documents use "character" in a very specific,
> layered way. The core idea is the abstract character -- a unit of
> information used to organize, control, or represent text -- while
> the thing you actually encode is a numeric code point (and, more
> precisely, a Unicode scalar value, which is any code point except
> surrogates). The standard prefers these precise terms over the
> vague "Unicode character."
> https://www.unicode.org/glossary/
>
> What the Unicode Standard calls a "character"
> • The Unicode Glossary defines character in multiple senses,
> including: "synonym for abstract character" and "the basic
> unit of encoding for the Unicode character encoding." In
> practice, the standard strongly distinguishes the abstract notion
> (character) from its encoding (code point/scalar). The glossary
> also defines abstract character, encoded character, and grapheme
> cluster (the user-perceived character).
> https://www.unicode.org/glossary/
>
> Related precise terms used in the spec
> • Code point: Any value in the Unicode codespace
> (U+0000..U+10FFFF). This is the primary encoded identifier for
> characters. The core specification's "Characters and
> Encoding" chapter enumerates these definitions.
> https://www.unicode.org/standard/standard.html
> • Unicode scalar value: Any code point except the surrogate range
> (U+D800..U+DFFF). This is the set of code points that can actually
> represent textual data (i.e., can appear in strings in
> UTF-8/UTF-32 and as non-surrogate values in UTF-16). The
> core specification's encoding chapters define this distinction.
> https://www.unicode.org/standard/standard.html
> • Abstract Character Repertoire (ACR): The set of abstract
> characters to be encoded; Unicode's repertoire is open and
> versioned. This is formalized in Unicode Technical Report #17,
> which clarifies the model and terminology across layers.
> http://www.unicode.org/reports/tr17/tr17-3.html
>
> Where the documents say it
> • Glossary of Unicode Terms: authoritative short definitions for
> "character," "abstract character," "encoded character,"
> "code point," "Unicode scalar value," "grapheme
> cluster," etc.
> https://www.unicode.org/glossary/
> • The Unicode Standard (Core Specification): "Characters and
> Encoding" and "Unicode Encoding Forms" chapters define the
> codespace, code points, code unit concepts, and scalar values, and
> set conformance language around what may appear in text.
> https://www.unicode.org/standard/standard.html
> • Unicode Technical Report #17 (Character Encoding Model): formal
> five-layer model (ACR, CCS, CEF, CES, TES) and the repertoire vs.
> coding distinctions; useful when you need the precise architecture
> behind "character" versus "encoded unit."
> http://www.unicode.org/reports/tr17/tr17-3.html
>
> In short, when the Consortium is being exact, "character"
> refers to an abstract entity; the precise encoded unit you index
> and manipulate is a code point, and the subset valid in text is the
> scalar value. The user-perceived "character" you see is a
> grapheme cluster, which can span multiple encoded characters.
> https://www.unicode.org/glossary/

--
rswg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
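
For anyone who wants to poke at the distinctions in the quoted write-up, here is a minimal Python sketch. It is only an illustration under stated assumptions, not anything taken from the Unicode documents: the helper names (is_scalar_value, utf8_code_units, utf16_code_units, encodes_as_utf8) are made up for this example, and it relies on Python 3 strings being sequences of code points. Grapheme cluster segmentation is not in the standard library, so it is left out.

```python
import unicodedata


def is_scalar_value(cp: int) -> bool:
    """Hypothetical helper: a code point that is not in the surrogate range."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def utf8_code_units(s: str) -> int:
    """Number of 8-bit code units (bytes) in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of s (no BOM)."""
    return len(s.encode("utf-16-le")) // 2


def encodes_as_utf8(s: str) -> bool:
    """True if every code point in s can be encoded as UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# One user-perceived character written as a sequence of two code points:
# LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# NFC normalization happens to have a single precomposed code point for it.
precomposed = unicodedata.normalize("NFC", decomposed)

# One code point outside the BMP: one scalar value, several code units.
emoji = "\U0001F600"

print(len(decomposed), len(precomposed))             # code point counts
print(utf8_code_units(emoji), utf16_code_units(emoji))
print(is_scalar_value(0xD800), encodes_as_utf8("\ud800"))
```

Here len() counts code points, the two *_code_units helpers count code units, and the lone surrogate U+D800 is a code point that is not a scalar value, which is why Python refuses to encode it as UTF-8.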
