The various definitions below are the result of a technology that has to deal with a wide range of writing systems, has to map them to bits and bytes, and has to take into account that people use words the way that is easiest for them, without always looking up definitions.

For the purposes of rfc7997bis, I think that using "characters" (note the plural) when talking about the stuff that's directly readable (¥, €, and so on) and "code point" when talking about the U+XXXX notation will be easiest to understand for our readers and not wrong.

[It would be good to have a name for the U+XXXX notation; I don't know if one exists; maybe we should define one because it might make the document easier to write.]
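
To make the distinction concrete, here is a minimal Python sketch that maps
directly readable characters to the U+XXXX notation. The helper name
to_u_notation is made up for this illustration; it assumes the usual
convention of at least four uppercase hexadecimal digits.

```python
def to_u_notation(text: str) -> list[str]:
    """Return the U+XXXX form of every code point in text."""
    # ord() yields the code point as an integer; 04X pads to at least
    # four uppercase hex digits and uses more only when needed.
    return [f"U+{ord(ch):04X}" for ch in text]


print(to_u_notation("¥€"))          # ['U+00A5', 'U+20AC']
print(to_u_notation("\U0001F600"))  # ['U+1F600']
```

The left-hand side of each call is what I would call "characters"; the strings
in the result are what I would call code points.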

Regards,    Martin.

On 2025-10-28 01:00, Carsten Bormann wrote:
On Oct 27, 2025, at 16:14, Russ Housley <[email protected]> wrote:

I conclude that "Unicode characters" is expected to be a well understood term.

We all wish that, but the Unicode Consortium did not muster the normative 
will to make this happen.

For our discussion, “Unicode character” is a synonym for what the Unicode 
Consortium actually does define precisely, the “Unicode scalar value”.

For the details of what people that are closer to the Unicode consortium might 
say, you may or may not want to read the small write-up below that an LLM just 
gave me on the subject, and which is astonishingly useful.  Just make sure you 
never confuse “code point” with “code unit” :-)

Grüße, Carsten


In Unicode, a “character” is an abstract unit of textual information; the 
encoded unit is a code point (or scalar value).

The Unicode documents use “character” in a very specific, layered way. The core 
idea is the abstract character—a unit of information used to organize, control, 
or represent text—while the thing you actually encode is a numeric code point 
(and, more precisely, a Unicode scalar value, which is any code point except 
surrogates). The standard prefers these precise terms over the vague “Unicode 
character.” https://www.unicode.org/glossary/

What the Unicode Standard calls a “character”
  • The Unicode Glossary defines character in multiple senses, including: 
“synonym for abstract character” and “the basic unit of encoding for the 
Unicode character encoding.” In practice, the standard strongly distinguishes 
the abstract notion (character) from its encoding (code point/scalar). The 
glossary also defines abstract character, encoded character, and grapheme 
cluster (the user‑perceived character). https://www.unicode.org/glossary/

Related precise terms used in the spec
  • Code point: Any value in the Unicode codespace (U+0000..U+10FFFF). This is 
the primary encoded identifier for characters. The core specification’s 
“Characters and Encoding” section enumerates these definitions. 
https://www.unicode.org/standard/standard.html
  • Unicode scalar value: Any code point except the surrogate range 
(U+D800..U+DFFF). This is the set of code points that can actually represent 
textual data (i.e., the values that UTF‑8, UTF‑16, and UTF‑32 can encode; 
UTF‑16 uses a pair of surrogate code units for each value above U+FFFF). The 
core specification’s “Unicode Encoding Forms” section defines this 
distinction. https://www.unicode.org/standard/standard.html
  • Abstract Character Repertoire (ACR): The set of abstract characters to be 
encoded; Unicode’s repertoire is open and versioned. This is formalized in 
Unicode Technical Report #17, which clarifies the model and terminology across 
layers. http://www.unicode.org/reports/tr17/tr17-3.html
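
The terms above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the names MAX_CODE_POINT, SURROGATES,
is_code_point, is_scalar_value, and code_units are invented here and do not
come from the Unicode documents.

```python
MAX_CODE_POINT = 0x10FFFF            # top of the codespace
SURROGATES = range(0xD800, 0xE000)   # U+D800..U+DFFF inclusive


def is_code_point(n: int) -> bool:
    """Any integer in the Unicode codespace."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """A code point that is not a surrogate."""
    return is_code_point(n) and n not in SURROGATES


def code_units(ch: str) -> dict[str, int]:
    """Count the code units one scalar value needs per encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


print(is_code_point(0xD800), is_scalar_value(0xD800))  # True False
print(code_units("€"))           # {'utf-8': 3, 'utf-16': 1, 'utf-32': 1}
print(code_units("\U0001F600"))  # {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}
```

The last two lines show why "code point" and "code unit" must not be
confused: each input is a single code point, but the number of code units
depends on the encoding form.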

Where the documents say it
  • Glossary of Unicode Terms: authoritative short definitions for “character,” 
“abstract character,” “encoded character,” “code point,” “Unicode scalar 
value,” “grapheme cluster,” etc. https://www.unicode.org/glossary/
  • The Unicode Standard (Core Specification): “Characters and Encoding” and 
“Unicode Encoding Forms” sections define the codespace, code points, code unit 
concepts, and scalar values, and set conformance language around what may 
appear in text. https://www.unicode.org/standard/standard.html
  • Unicode Technical Report #17 (Character Encoding Model): formal five‑layer 
model (ACR, CCS, CEF, CES, TES) and the repertoire vs. coding distinctions; 
useful when you need the precise architecture behind “character” versus 
“encoded unit.” http://www.unicode.org/reports/tr17/tr17-3.html

In short, when the Consortium is being exact, “character” refers to an abstract 
entity; the precise encoded unit you index and manipulate is a code point, and 
the subset valid in text is the scalar value. The user‑perceived “character” 
you see is a grapheme cluster, which can span multiple encoded characters. 
https://www.unicode.org/glossary/

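As a rough illustration of a user-perceived character spanning several code
points, the Python sketch below counts code points before and after NFC
normalization. The function name counts is made up for this example. Note the
assumption: the standard library does not segment grapheme clusters, so
normalization is used here only to show that the code point count and the
perceived count can differ.

```python
import unicodedata


def counts(text: str) -> dict[str, int]:
    """Count code points as given and after NFC normalization."""
    return {
        "code_points": len(text),  # Python str length counts code points
        "code_points_after_nfc": len(unicodedata.normalize("NFC", text)),
    }


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one perceived character.
e_acute = "e\u0301"
# Two regional indicator symbols (J, P): one perceived flag.
flag = "\U0001F1EF\U0001F1F5"

print(counts(e_acute))  # {'code_points': 2, 'code_points_after_nfc': 1}
print(counts(flag))     # {'code_points': 2, 'code_points_after_nfc': 2}
```

The flag stays at two code points even after normalization, so neither count
is a substitute for real grapheme cluster segmentation.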

--
rswg mailing list -- [email protected]