[Rswg] Re: [Ext] draft-rswg-rfc7997bis-05

John C Klensin Mon, 27 Oct 2025 11:27:32 -0700

(top post)
FWIW, I found this explanation, and drawing together of the material,
very helpful.  The one thing I might have added explicitly (although
the explanation below comes close) is that, when an abstract
character requires a sequence of code points to represent it, things
get _really_ muddy when "character" terms are used outside very
specific contexts.  I am far more concerned about the other issues I
tried to raise and about those Russ raised in his second and third
notes today, particularly those where he identified RFC 7997 as being
better and clearer.
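
To make the "sequence of code points" point concrete, here is a minimal sketch, assuming Python 3.9+ and its standard unicodedata module. The sample strings and the code_points() helper are illustrative assumptions, not anything taken from the draft or from this thread.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Illustrative samples (assumptions, not from the thread):
# "e" followed by U+0301 COMBINING ACUTE ACCENT, which has a precomposed
# equivalent, and "g" followed by U+0308 COMBINING DIAERESIS, which does not.
e_acute_decomposed = "e\u0301"
g_diaeresis = "g\u0308"

# NFC composes a sequence only where a precomposed code point exists.
e_acute_composed = unicodedata.normalize("NFC", e_acute_decomposed)
g_diaeresis_nfc = unicodedata.normalize("NFC", g_diaeresis)

samples = [
    ("e + acute, as typed", e_acute_decomposed),
    ("e + acute, after NFC", e_acute_composed),
    ("g + diaeresis, after NFC", g_diaeresis_nfc),
]
for label, text in samples:
    # len() on a Python str counts code points, not user-perceived characters.
    print(f"{label}: {len(text)} code point(s) {code_points(text)}")
```

A reader sees one "character" in each sample, yet the count of code points depends on normalization in the first case and cannot be reduced to one in the last.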


  john




--On Monday, October 27, 2025 17:00 +0100 Carsten Bormann
<[email protected]> wrote:

> On Oct 27, 2025, at 16:14, Russ Housley <[email protected]>
> wrote:
>> 
>> I conclude that "Unicode characters" is expected to be a
>> well-understood term.
> 
> We all wish that, but the Unicode Consortium did not muster the
> normative will to make this happen.
> 
> For our discussion, "Unicode character" is a synonym for what
> the Unicode Consortium actually does define precisely, the
> "Unicode scalar value".
> 
> For the details of what people who are closer to the Unicode
> Consortium might say, you may or may not want to read the small
> write-up below, which an LLM just gave me on the subject and which
> is astonishingly useful.  Just make sure you never confuse "code
> point" with "code unit" :-)
> 
> Grüße, Carsten
> 
> 
> In Unicode, a "character" is an abstract unit of textual
> information; the encoded unit is a code point (or scalar value).
> 
> The Unicode documents use "character" in a very specific,
> layered way. The core idea is the abstract character (a unit of
> information used to organize, control, or represent text), while
> the thing you actually encode is a numeric code point (and, more
> precisely, a Unicode scalar value, which is any code point except
> the surrogates). The standard prefers these precise terms over the
> vague "Unicode character."
> https://www.unicode.org/glossary/
> 
> What the Unicode Standard calls a "character"
>  • The Unicode Glossary defines character in multiple senses,
> including: "synonym for abstract character" and "the basic
> unit of encoding for the Unicode character encoding." In
> practice, the standard strongly distinguishes the abstract notion
> (character) from its encoding (code point/scalar). The glossary
> also defines abstract character, encoded character, and grapheme
> cluster (the user-perceived character).
> https://www.unicode.org/glossary/
> 
> Related precise terms used in the spec
>  • Code point: Any value in the Unicode codespace
> (U+0000..U+10FFFF). This is the primary encoded identifier for
> characters. The core specification's "Characters and Encoding"
> section gives these definitions.
> https://www.unicode.org/standard/standard.html
>  • Unicode scalar value: Any code point except the surrogate range
> (U+D800..U+DFFF). This is the set of code points that can actually
> represent textual data: well-formed UTF-8, UTF-16, and UTF-32 text
> encodes only scalar values, and UTF-16 uses pairs of surrogate code
> units to encode the scalar values above U+FFFF. The core
> specification's "Unicode Encoding Forms" section defines this
> distinction.
> https://www.unicode.org/standard/standard.html
>  • Abstract Character Repertoire (ACR): The set of abstract
> characters to be encoded; Unicode's repertoire is open and
> versioned. This is formalized in Unicode Technical Report #17,
> which clarifies the model and terminology across layers.
> http://www.unicode.org/reports/tr17/tr17-3.html
> 
> Where the documents say it
>  • Glossary of Unicode Terms: authoritative short definitions for
> "character," "abstract character," "encoded character,"
> "code point," "Unicode scalar value," "grapheme cluster," etc.
> https://www.unicode.org/glossary/
>  • The Unicode Standard (Core Specification): the "Characters and
> Encoding" and "Unicode Encoding Forms" sections define the
> codespace, code points, code unit concepts, and scalar values, and
> set conformance language around what may appear in text.
> https://www.unicode.org/standard/standard.html
>  • Unicode Technical Report #17 (Character Encoding Model): formal
> five-layer model (ACR, CCS, CEF, CES, TES) and the repertoire vs.
> coding distinctions; useful when you need the precise architecture
> behind "character" versus "encoded unit."
> http://www.unicode.org/reports/tr17/tr17-3.html
> 
> In short, when the Consortium is being exact, "character"
> refers to an abstract entity; the precise encoded unit you index
> and manipulate is a code point, and the subset valid in text is the
> scalar value. The user-perceived "character" you see is a
> grapheme cluster, which can span multiple encoded characters.
> https://www.unicode.org/glossary/
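
The code point / scalar value / code unit distinction drawn in the quoted write-up can be sketched as follows, assuming Python 3.9+. The helper names (is_code_point, is_scalar_value, code_unit_counts, encodes_as_utf8) are hypothetical and exist only for this illustration; the ranges are the ones quoted above.

```python
def is_code_point(value: int) -> bool:
    """True for any value in the Unicode codespace, U+0000..U+10FFFF."""
    return 0x0000 <= value <= 0x10FFFF


def is_scalar_value(value: int) -> bool:
    """True for any code point outside the surrogate range U+D800..U+DFFF."""
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)


def code_unit_counts(text: str) -> dict[str, int]:
    """Count code points and the code units needed in each encoding form."""
    return {
        "code_points": len(text),                           # str indexes code points
        "utf8_units": len(text.encode("utf-8")),            # 8-bit code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def encodes_as_utf8(text: str) -> bool:
    """True if every code point in text can be encoded as UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # U+D800 is a code point but not a scalar value.
    print(is_code_point(0xD800), is_scalar_value(0xD800))
    # One code point above U+FFFF needs several code units in UTF-8 and UTF-16.
    print(code_unit_counts("\U0001F600"))
    # A lone surrogate code point is rejected by Python's UTF-8 encoder.
    print(encodes_as_utf8("\ud800"))
```

Python's str happens to be a sequence of code points (lone surrogates included), which is why the surrogate case can be shown at all; languages whose strings are sequences of UTF-16 code units would report different lengths.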


