On 12/1/2024 5:48 PM, Dominikus Dittes Scherkl via Unicode wrote:
On 30.11.24 at 18:16, Asmus Freytag via Unicode wrote:
On 11/27/2024 12:15 PM, Dominikus Dittes Scherkl via Unicode wrote:
However, speaking of this as a "default" is confusing to readers who
think in terms of text processing or authoring environments where a
different set of requirements applies. Here, the proper "default" is the
best implementation of a culturally appropriate case transform.
NO. I really mean "default" in a technical sense, not something someone
tailors to local needs.
The ẞ was introduced to have an invertible casing, just like
compatibility codepoints were assigned to preserve old formatting
information in case a translation back to some obsolete charset is
necessary.
_This new letter was invented to allow for 1:1 roundtrip conversion._
The letter was not *invented*. It was discovered (= identified as
occurring in actual writing) and encoded.
It was encoded to match a character with a unique shape and properties.
One of them is *being* a capital letter, and the other is ß being its
lowercase equivalent.
toUpper() shall change "ß" to "ẞ" instead of "SS", just to allow
toLower() to produce "ß" again instead of a wrong spelling with "ss"
(which at the moment can only be avoided by using a German dictionary - a
really heavy constraint on a small function like toLower - and for
family names is simply not possible at all - the information is lost).
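For illustration, a minimal Python sketch of the round trip described above. Python's built-in str.upper() and str.lower() take no locale argument and apply the default Unicode mappings; upper_keep_eszett is a hypothetical helper introduced only for this example, not an existing API.

```python
# Default mappings: "ß" uppercases to "SS", and "SS" lowercases to "ss",
# so the original spelling cannot be recovered.
word = "maße"                             # "measures"; "masse" means "mass"
upper_default = word.upper()              # "MASSE"
round_trip_default = upper_default.lower()  # "masse"


def upper_keep_eszett(text: str) -> str:
    """Hypothetical helper: map ß to capital ẞ (U+1E9E) before uppercasing."""
    return text.replace("ß", "\u1e9e").upper()


upper_capital = upper_keep_eszett(word)   # "MAẞE"
round_trip_capital = upper_capital.lower()  # "maße"

print(upper_default, round_trip_default)
print(upper_capital, round_trip_capital)
```

The second round trip works because the default lowercase mapping of ẞ (U+1E9E) is ß (U+00DF).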
Your problem is that you assume an implementation of toUpper that takes
no argument. For purposes like text design, publication etc. you want an
implementation that selects which locale should set the rules. (Or one,
where that setting is done behind the scenes, which is logically
equivalent). Without specifying the locale, your beautiful toUpper()
does not know that in Turkish, 'i' is not mapped to 'I' but to CAPITAL I
WITH DOT ABOVE (İ).
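A rough Python sketch of what a locale parameter changes. The function to_upper and its locale argument are assumptions made for this example (Python's own str.upper() accepts no locale), and only the Turkish and Azerbaijani tailoring of 'i' is handled.

```python
def to_upper(text: str, locale: str = "") -> str:
    """Hypothetical locale-aware uppercase; only the 'tr'/'az' rule for 'i' is tailored."""
    if locale in ("tr", "az"):
        # Dotted lowercase i maps to İ (U+0130) instead of plain I.
        text = text.replace("i", "\u0130")
    # Everything else falls through to the default Unicode mapping.
    return text.upper()


print(to_upper("istanbul"))        # default mapping
print(to_upper("istanbul", "tr"))  # Turkish tailoring
```

A production implementation would normally delegate to a library such as ICU rather than hand-code individual rules.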
Because your beautiful toUpper() cannot handle at least one language,
it should not need to handle any language. Instead, it should
be stable.
What you are describing is a change to the toUpper() that is invoked
with the German locale as parameter (or selected behind the scenes).
There's not the same requirement for that one to be stable, although
sometimes transitions are implemented by creating a separate locale for
"old" and "new" orthographies and the like.
When it comes to case conversion, purpose matters.
This doesn't detract from the need to have implementations that do the
"right" thing (as currently defined) for a given language. And from the
need to enable these by default for ordinary text manipulation.
But it's not the same thing as overriding an "identifier-safe" or
"filesystem-safe" implementation, just because that's incorrectly viewed
as a "default" that should be applicable to text manipulation.
A./
This is a really bad situation, which should be fixed as soon as
possible, not a matter of taste.
And it should be fixed explicitly in automatic text processing - because
this is where errors are produced today that can now be avoided.
In private letters it doesn't matter what form is used - the people
write whatever they want anyway. But automatic processing shall not drop
information that cannot be brought back (except by reintroducing
this knowledge manually).
And what is "best" can change over time.
No. Fixing this round-trip bug is in the best interest of Unicode and
that won't change over time. Using "SS" in all-uppercase text was always
a bad workaround that became a source of spelling errors in automatic
text processing and for which a fix was invented some ten years ago. So
let's use it everywhere - at least now that it is officially allowed
(since 2017) and even preferred (since this year).