The various definitions below are the result of a technology that has to
deal with a wide range of writing systems, map them to bits and bytes,
and take into account that people use words in whatever way is easiest
for them, without always looking up definitions.
For the purposes of rfc7997bis, I think that using "characters" (note
the plural) when talking about the stuff that's directly readable (¥, €,
and so on) and "code point" when talking about the U+XXXX notation will
be easiest for our readers to understand, and not wrong.
[It would be good to have a name for the U+XXXX notation; I don't know
if one exists; maybe we should define one because it might make the
document easier to write.]
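To make the distinction concrete, here is a minimal Python sketch of going
back and forth between a directly readable character and its U+XXXX form.
The helper names to_u_notation and from_u_notation are made up for
illustration; they are not terms from rfc7997bis or from the Unicode
Standard. The sketch assumes the usual convention of at least four
uppercase hexadecimal digits.

```python
def to_u_notation(ch: str) -> str:
    """Return the U+XXXX form of one character (hypothetical helper)."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    # ord() yields the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


def from_u_notation(text: str) -> str:
    """Return the character named by a U+XXXX string (hypothetical helper)."""
    if not text.startswith("U+"):
        raise ValueError("expected a string starting with 'U+'")
    return chr(int(text[2:], 16))


if __name__ == "__main__":
    for ch in "¥€":
        print(ch, to_u_notation(ch))
```

With these helpers, "¥" maps to U+00A5 and "€" maps to U+20AC, which is the
pairing of "character" and "code point" suggested above.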
Regards, Martin.
On 2025-10-28 01:00, Carsten Bormann wrote:
On Oct 27, 2025, at 16:14, Russ Housley <[email protected]> wrote:
I conclude that "Unicode characters" is expected to be a well-understood term.
We all wish that, but the Unicode consortium did not muster the normative
will to make this happen.
For our discussion, “Unicode character” is a synonym of what the Unicode
consortium actually does define precisely, the “Unicode scalar value”.
For the details of what people who are closer to the Unicode consortium might
say, you may or may not want to read the small write-up below that an LLM just
gave me on the subject; it is astonishingly useful. Just make sure you
never confuse “code point” with “code unit” :-)
Grüße, Carsten
In Unicode, a “character” is an abstract unit of textual information; the
encoded unit is a code point (or scalar value).
The Unicode documents use “character” in a very specific, layered way. The core
idea is the abstract character—a unit of information used to organize, control,
or represent text—while the thing you actually encode is a numeric code point
(and, more precisely, a Unicode scalar value, which is any code point except
surrogates). The standard prefers these precise terms over the vague “Unicode
character.” https://www.unicode.org/glossary/
What the Unicode Standard calls a “character”
• The Unicode Glossary defines character in multiple senses, including:
“synonym for abstract character” and “the basic unit of encoding for the
Unicode character encoding.” In practice, the standard strongly distinguishes
the abstract notion (character) from its encoding (code point/scalar). The
glossary also defines abstract character, encoded character, and grapheme
cluster (the user‑perceived character). https://www.unicode.org/glossary/
Related precise terms used in the spec
• Code point: Any value in the Unicode codespace (U+0000..U+10FFFF). A code
point is the number that identifies an encoded character, although not every
code point is assigned to one. The core specification’s “Characters and
Encoding” section (in Chapter 3, Conformance) gives these definitions.
https://www.unicode.org/standard/standard.html
• Unicode scalar value: Any code point except the surrogate range
(U+D800..U+DFFF). This is the set of code points that can actually represent
textual data, i.e., that can be encoded in well‑formed UTF‑8, UTF‑16, and
UTF‑32. In UTF‑16, surrogate code units appear only in pairs that together
encode one scalar value above U+FFFF. The core specification’s sections on
surrogates and encoding forms define this distinction.
https://www.unicode.org/standard/standard.html
• Abstract Character Repertoire (ACR): The set of abstract characters to be
encoded; Unicode’s repertoire is open and versioned. This is formalized in
Unicode Technical Report #17, which clarifies the model and terminology across
layers. http://www.unicode.org/reports/tr17/tr17-3.html
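The three terms above, plus the code unit that should not be confused with
the code point, can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function names are invented here,
and the ranges are the ones quoted in the bullets above.

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def is_code_point(value: int) -> bool:
    """True for any value in the codespace U+0000..U+10FFFF."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """True for any code point outside the surrogate range."""
    return is_code_point(value) and not (
        SURROGATE_FIRST <= value <= SURROGATE_LAST
    )


def code_unit_counts(ch: str) -> dict:
    """Code units needed for one scalar value in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


if __name__ == "__main__":
    for ch in ("€", "\U0001F600"):
        print(f"U+{ord(ch):04X}", code_unit_counts(ch))
```

U+D800 passes is_code_point but fails is_scalar_value, and Python refuses to
encode it as UTF‑8. U+1F600 is a single code point and a single scalar value,
yet it takes two UTF‑16 code units (a surrogate pair) and four UTF‑8 code
units.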
Where the documents say it
• Glossary of Unicode Terms: authoritative short definitions for “character,”
“abstract character,” “encoded character,” “code point,” “Unicode scalar
value,” “grapheme cluster,” etc. https://www.unicode.org/glossary/
• The Unicode Standard (Core Specification): the “Characters and Encoding” and
“Unicode Encoding Forms” sections (both in Chapter 3, Conformance) define the
codespace, code points, code units, and scalar values, and set the conformance
language around what may appear in text.
https://www.unicode.org/standard/standard.html
• Unicode Technical Report #17 (Character Encoding Model): formal five‑layer
model (ACR, CCS, CEF, CES, TES) and the repertoire vs. coding distinctions;
useful when you need the precise architecture behind “character” versus
“encoded unit.” http://www.unicode.org/reports/tr17/tr17-3.html
In short, when the Consortium is being exact, “character” refers to an abstract
entity; the precise encoded unit you index and manipulate is a code point, and
the subset valid in text is the scalar value. The user‑perceived “character”
you see is a grapheme cluster, which can span multiple encoded characters.
https://www.unicode.org/glossary/
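A small Python sketch of that last point: one user‑perceived character that
is made of two code points. The describe helper is invented for this
example. Note that Python's standard library does not segment text into
grapheme clusters, so the sketch only shows the code points involved and
what NFC normalization does to them; it is not a grapheme cluster
implementation.

```python
import unicodedata

# One user-perceived character written as two code points:
# U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)


def describe(text: str) -> list:
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    print(len(decomposed), describe(decomposed))
    print(len(composed), describe(composed))
```

Both strings display as the same "é", but len() reports 2 for the first and
1 for the second, because Python strings are indexed by code point, not by
grapheme cluster.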
--
rswg mailing list -- [email protected]