On Oct 27, 2025, at 16:14, Russ Housley <[email protected]> wrote:
>
> I conclude that "Unicode characters" is expected to be a well understood term.

We all wish that, but the Unicode Consortium never mustered the normative will to make this happen.

For our discussion, “Unicode character” is a synonym of what the Unicode Consortium actually does define precisely, the “Unicode scalar value”.

For the details of what people who are closer to the Unicode Consortium might say, you may or may not want to read the small write-up below that an LLM just gave me on the subject, and which is astonishingly useful.

Just make sure you never confuse “code point” with “code unit” :-)

Regards, Carsten


In Unicode, a “character” is an abstract unit of textual information; the encoded unit is a code point (or scalar value).

The Unicode documents use “character” in a very specific, layered way. The core idea is the abstract character, a unit of information used to organize, control, or represent text, while the thing you actually encode is a numeric code point (and, more precisely, a Unicode scalar value, which is any code point except surrogates). The standard prefers these precise terms over the vague “Unicode character.”
https://www.unicode.org/glossary/

What the Unicode Standard calls a “character”

• The Unicode Glossary defines character in multiple senses, including: “synonym for abstract character” and “the basic unit of encoding for the Unicode character encoding.” In practice, the standard strongly distinguishes the abstract notion (character) from its encoding (code point/scalar). The glossary also defines abstract character, encoded character, and grapheme cluster (the user‑perceived character).
  https://www.unicode.org/glossary/

Related precise terms used in the spec

• Code point: Any value in the Unicode codespace (U+0000..U+10FFFF). This is the primary encoded identifier for characters. The core specification’s “Characters and Encoding” section enumerates these definitions.
  https://www.unicode.org/standard/standard.html

• Unicode scalar value: Any code point except the surrogate range (U+D800..U+DFFF). This is the set of code points that can actually represent textual data (i.e., the only values that well‑formed UTF‑8, UTF‑16, and UTF‑32 can encode; in UTF‑16, scalar values above U+FFFF are carried as pairs of surrogate code units). The core specification’s encoding sections define this distinction.
  https://www.unicode.org/standard/standard.html

• Abstract Character Repertoire (ACR): The set of abstract characters to be encoded; Unicode’s repertoire is open and versioned. This is formalized in Unicode Technical Report #17, which clarifies the model and terminology across layers.
  http://www.unicode.org/reports/tr17/tr17-3.html

Where the documents say it

• Glossary of Unicode Terms: authoritative short definitions for “character,” “abstract character,” “encoded character,” “code point,” “Unicode scalar value,” “grapheme cluster,” etc.
  https://www.unicode.org/glossary/

• The Unicode Standard (Core Specification): the “Characters and Encoding” and “Unicode Encoding Forms” sections of the Conformance chapter define the codespace, code points, code unit concepts, and scalar values, and set conformance language around what may appear in text.
  https://www.unicode.org/standard/standard.html

• Unicode Technical Report #17 (Character Encoding Model): formal five‑layer model (ACR, CCS, CEF, CES, TES) and the repertoire vs. coding distinctions; useful when you need the precise architecture behind “character” versus “encoded unit.”
  http://www.unicode.org/reports/tr17/tr17-3.html

In short, when the Consortium is being exact, “character” refers to an abstract entity; the precise encoded unit you index and manipulate is a code point, and the subset valid in text is the scalar value. The user‑perceived “character” you see is a grapheme cluster, which can span multiple encoded characters.
https://www.unicode.org/glossary/
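
To make the code point / scalar value / code unit distinction concrete, here is a minimal Python sketch. The helper names (is_code_point, is_scalar_value, code_units) are made up for illustration and do not come from any Unicode document; the sketch also assumes Python 3, where len() on a str counts code points and the strict UTF-8 codec rejects lone surrogates.

```python
# Illustrative sketch only; helper names are hypothetical, not Unicode-defined.

SURROGATES = range(0xD800, 0xE000)  # U+D800..U+DFFF


def is_code_point(n: int) -> bool:
    """True for any value in the Unicode codespace U+0000..U+10FFFF."""
    return 0 <= n <= 0x10FFFF


def is_scalar_value(n: int) -> bool:
    """True for any code point outside the surrogate range."""
    return is_code_point(n) and n not in SURROGATES


def code_units(text: str) -> dict:
    """Count the code units needed for text in each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# U+D800 is a code point but not a scalar value, so UTF-8 cannot encode it.
assert is_code_point(0xD800) and not is_scalar_value(0xD800)
try:
    chr(0xD800).encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")

# U+1F600: one scalar value, but a different number of code units per form.
grinning = "\U0001F600"
print(len(grinning), code_units(grinning))
# 1 {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}

# "e" + U+0301 COMBINING ACUTE ACCENT: two code points, yet a reader sees
# a single user-perceived character (one grapheme cluster).
e_acute = "e\u0301"
print(len(e_acute), code_units(e_acute))
# 2 {'utf-8': 3, 'utf-16': 2, 'utf-32': 2}
```

The UTF-16 count of 2 for a single scalar value is the surrogate pair at work, which is exactly where “code point” and “code unit” part ways. Counting grapheme clusters is deliberately left out: the Python standard library has no segmentation API for them, so that step would need a third-party library.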
