On 1/25/2025 9:25 AM, Karl Wagner via Unicode wrote:
### "Is `x` Normalized?"
It's helpful to start by considering what it means when we say a
string "is normalised". It's very simple; literally all it means is
that normalising the string returns the same string.
```
isNormalized(x):
    normalize(x) == x
```
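As a concrete sketch of that definition, here it is in Python with the standard `unicodedata` module. The function name `is_normalized` and the NFC default are assumptions made for illustration; the answer is computed from whatever Unicode data ships with the running interpreter.

```python
import unicodedata

def is_normalized(x: str, form: str = "NFC") -> bool:
    """True if normalizing x under `form` returns x unchanged."""
    return unicodedata.normalize(form, x) == x

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(is_normalized(precomposed))         # True: already in NFC
print(is_normalized(decomposed))          # False: NFC would compose it
print(is_normalized(decomposed, "NFD"))   # True: already in NFD
print(unicodedata.unidata_version)        # the Unicode version these answers are relative to
```

Python 3.8 and later also provide `unicodedata.is_normalized(form, s)`, which answers the same question without necessarily building the normalized copy.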
For me, it was a bit of a revelation to grasp that, in general, the
result of `isNormalized` is _only locally meaningful_. Asking the same
question at another point in space or in time may yield a different
result:
- Two machines communicating over a network may disagree about whether
x is normalised.
- The same machine may think x is normalised one day, then after an OS
update, suddenly think the same x is not normalised.
That is exactly how it should not be. Unicode's normalization stability
guarantee amounts to: "once a string is normalized, it remains normalized".
The corollary is that you cannot normalize a "string" that contains
characters that are unassigned in the version of Unicode you know about.
So your two systems must agree on `isNormalized` whenever the string can
be normalized on both of them.
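One way to act on that corollary is to refuse to normalize anything containing a code point the local Unicode data does not know. A minimal Python sketch, assuming general category `Cn` is an acceptable test for "unassigned" (it also covers noncharacters); `UnassignedCodePointError` and `normalize_strict` are hypothetical names, not part of any library:

```python
import unicodedata

class UnassignedCodePointError(ValueError):
    """Raised when a string cannot be reliably normalized with local data."""

def unassigned_code_points(s: str) -> list:
    # General category "Cn" means "not assigned" in the local Unicode data.
    return [f"U+{ord(ch):04X}" for ch in s if unicodedata.category(ch) == "Cn"]

def normalize_strict(s: str, form: str = "NFC") -> str:
    unknown = unassigned_code_points(s)
    if unknown:
        raise UnassignedCodePointError(
            f"unassigned in Unicode {unicodedata.unidata_version}: {unknown}"
        )
    return unicodedata.normalize(form, s)

# Assigned characters only: the result is stable across later versions.
print(normalize_strict("e\u0301") == "\u00e9")   # True

# U+50000 lies in a plane with no assignments so far; a later version
# could give it properties that this machine cannot know about.
try:
    normalize_strict("a" + chr(0x50000))
except UnassignedCodePointError as err:
    print("refused:", err)
```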
### "Are `x` and `y` Equivalent?"
Normalisation is how we define equivalence. Two strings, x and y, are
equivalent if normalising each of them produces the same result:
```
areEquivalent(x, y):
    normalize(x) == normalize(y)
```
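The same definition as a runnable Python sketch, again assuming NFC and a hypothetical function name `are_equivalent`. Python's `==` on `str` compares code points, so the two calls below give different answers for the same pair:

```python
import unicodedata

def are_equivalent(x: str, y: str, form: str = "NFC") -> bool:
    """True if x and y normalize to the same string under `form`."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                # False: different code points
print(are_equivalent(precomposed, decomposed))  # True: canonically equivalent
print(are_equivalent("\u212b", "\u00c5"))       # True: ANGSTROM SIGN vs. Å
```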
And so, following from the previous section, when we deal in pairs (or
larger collections) of strings:
- Two machines communicating over a network may disagree about whether
x and y are equivalent or distinct.
- The same machine may think x and y are distinct one day, then after
an OS update, suddenly think that the same x and y are equivalent.
This has some interesting implications. For instance:
- If you encode a `Set<String>` in a JSON file, when you (or another
machine) decode it later, the resulting Set's `count` may be less
than what it was when it was encoded.
- And if you associate values with those strings, such as in a
`Dictionary<String, SomeValue>`, some values may be discarded because
we would think they have duplicate keys.
- If you serialise a sorted list of strings, they may not be
considered sorted when you (or another machine) load them. Sorting
involves normalisation, since equivalent strings sort identically.
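The mechanism behind the first two bullets can be sketched in Python. This does not simulate two Unicode versions; it only shows how a collection shrinks once two of its members are treated as equivalent. Python's own `set` and `dict` compare code points, so the sketch keys on the NFC form to imitate a string type that compares by canonical equivalence. `canonical_set` is a hypothetical helper.

```python
import json
import unicodedata

def canonical_set(strings, form="NFC"):
    """Set of strings in which canonically equivalent members collapse."""
    return {unicodedata.normalize(form, s) for s in strings}

raw = ["\u00e9", "e\u0301", "f"]           # two spellings of é, plus f
decoded = json.loads(json.dumps(raw))      # round-trip through JSON

print(len(set(decoded)))                   # 3: distinct by code point
print(len(canonical_set(decoded)))         # 2: the two é spellings collapse

# The same collapse silently drops a value in a keyed collection.
values = {}
for key, value in zip(decoded, [1, 2, 3]):
    values[unicodedata.normalize("NFC", key)] = value
print(len(values))                         # 2: the value 1 was overwritten by 2
```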
Other than code point order, two systems cannot apply any common sort to
a list of strings if some of those strings contain characters that are
unassigned for at least one of the systems.
That restriction also applies to any linguistic sorting.
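To make the contrast concrete, here is a small Python sketch. Python compares `str` values by code point, so plain `sorted` needs no character properties at all. The normalization-keyed sort stands in, as an assumption, for any ordering that depends on Unicode data. U+50000 is used as an example of a code point with no assignment so far.

```python
import unicodedata

strings = ["\u00e9", "f", "e\u0301", "e" + chr(0x50000)]

# Code point order: depends only on the numeric values, so every system
# produces the same result regardless of its Unicode version.
by_code_point = sorted(strings)

# Property-dependent order: each key is computed from local Unicode data.
# The key for the string containing U+50000 is only a guess, since a later
# version could assign that code point a decomposition or combining class.
by_nfc_key = sorted(strings, key=lambda s: unicodedata.normalize("NFC", s))

print([ascii(s) for s in by_code_point])
print([ascii(s) for s in by_nfc_key])
```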
The (overwhelming) majority of data will be from the subset that both
systems know about. You might think of ways to mediate the interaction
by putting a Unicode version number on your list.
If you are trying to process a list that requires knowing
as-yet-undefined equivalences, you would then be able to flag that as an error.
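A rough Python sketch of that kind of mediation follows. The JSON field names `unicode_version` and `strings`, and the functions `dump_strings` and `load_strings`, are hypothetical; "unassigned" is again approximated by general category `Cn`.

```python
import json
import unicodedata

def version_tuple(version: str) -> tuple:
    return tuple(int(part) for part in version.split("."))

def dump_strings(strings) -> str:
    """Serialize the list together with the producer's Unicode version."""
    return json.dumps({
        "unicode_version": unicodedata.unidata_version,
        "strings": list(strings),
    })

def load_strings(payload: str) -> list:
    """Return the strings, or raise if any needs data this system lacks."""
    doc = json.loads(payload)
    producer, local = doc["unicode_version"], unicodedata.unidata_version
    for s in doc["strings"]:
        unknown = [f"U+{ord(ch):04X}" for ch in s
                   if unicodedata.category(ch) == "Cn"]
        if unknown:
            newer = version_tuple(producer) > version_tuple(local)
            raise ValueError(
                f"{unknown} unassigned in local Unicode {local}; "
                f"list was produced with Unicode {producer}"
                + (" (newer than local)" if newer else "")
            )
    return doc["strings"]

# Data from the subset both sides know about loads without complaint.
print(load_strings(dump_strings(["\u00e9", "f"])))

# A list from a (made-up) future version that uses a code point this
# system does not know is flagged instead of being silently processed.
future = json.dumps({"unicode_version": "99.0.0", "strings": [chr(0x50000)]})
try:
    load_strings(future)
except ValueError as err:
    print("rejected:", err)
```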
A./
PS: there are lots of things you shouldn't do with unassigned code
points, as you cannot produce results that are "correct".