On 1/17/2026 5:31 PM, Marius Spix via Unicode wrote:
Sent: Sunday, 18 Jan 2026 at 02:30
From: "Marius Spix" <[email protected]>
To: "Jukka K. Korpela" <[email protected]>
Subject: Re: Re: General Categories Pe, Pf, Pi, Ps

I see. Another example would be the French quotation marks (guillemets), which point outwards and are separated from the quoted text by a space in French texts, but point inwards and have no additional spaces in German texts (especially in books). So these categories come from a time when Unicode was very English-centric and can be considered “historical heritage”, correct?
Unicode was never "English-centric" by design. And nearly all participants in the early effort were familiar with or even experts in software localization (within the limitations of what that meant in the late '80s).
There are many problems with General_Category; ultimately, they reflect how limited the experience with character properties was at the time.
Also, a solid understanding of the differences between properties inherent in a character and properties assumed by a character in the context of a specific orthography emerged over time.
There are some properties that are inherent in a character (or exceptions, if they exist, are very limited). I'm not aware of any orthography that treats "A" as a lowercase letter (there are some that use smallcaps forms, but those would have the lowercase property in Unicode).
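As a small illustration of that point, here is a sketch using Python's standard unicodedata module. The choice of U+1D00 as the small-capital example is mine, not something taken from a particular orthography.

```python
import unicodedata

CAPITAL_A = "A"             # U+0041 LATIN CAPITAL LETTER A
SMALL_CAPITAL_A = "\u1d00"  # U+1D00 LATIN LETTER SMALL CAPITAL A (example chosen for illustration)

def case_category(ch: str) -> str:
    """Return the General_Category value recorded for a single character."""
    return unicodedata.category(ch)

for ch in (CAPITAL_A, SMALL_CAPITAL_A):
    # "Lu" means uppercase letter, "Ll" means lowercase letter.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {case_category(ch)}")
```

The small-capital form looks like a capital, but it is recorded as a lowercase letter, while "A" itself is uppercase everywhere.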
When Unicode was created, what set it apart was the insistence that encoded characters had properties beyond their appearance, name and code point value. No other widely used standard at the time did anything like that. It meant that Unicode had a lot of attributes that could be used to identify "what" was being encoded at a given code point, something that required reliance on "customary knowledge" for other standards.
You had to infer that DIGIT ZERO had the numeric value of 0, but Unicode spells that out. And so on.
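To make "spells that out" concrete, here is a minimal sketch that reads those properties through Python's standard unicodedata module. The helper name numeric_properties and the extra sample characters are my own choices for illustration.

```python
import unicodedata

def numeric_properties(ch: str) -> dict:
    """Collect the name, General_Category and numeric values recorded for ch."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        # decimal() and numeric() return the supplied default (None here)
        # when the character has no such property value.
        "decimal": unicodedata.decimal(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }

for ch in ("0", "\u0660", "\u00bd"):
    print(f"U+{ord(ch):04X}", numeric_properties(ch))
```

DIGIT ZERO and ARABIC-INDIC DIGIT ZERO both carry the decimal value 0, and VULGAR FRACTION ONE HALF carries the numeric value 0.5 but no decimal digit value; none of this has to be inferred from the glyph or the name.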
Unicode also refused to encode a "decimal period", arguing that the overloaded use of the full stop is indeed the norm and what was encoded is the full stop across all its uses. Of course this went along with a widely shared understanding that many languages use different conventions.
For some reason, the full range of conventions for quotation marks in particular was less well known, presumably because applying language specific quotation marks by software wasn't as much a "thing" as it is today with autocorrect, etc.
There's another reading of General_Category: this interpretation assumes that these are "defaults", to be applied in contexts where information on language is not available. So, you could think of these properties as applying to the language code "unknown".
There is nothing "historic" about having a default - anytime the language is not specified (and cannot be determined) you need to do something.
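The "default" reading could be sketched roughly as follows. This is only an illustration under stated assumptions: the LANGUAGE_QUOTES table is hypothetical and hand-written from the English, German and Polish examples in this thread (real software would more likely take such data from something like CLDR), and folding Pi into "open" and Pf into "close" is a simplification of mine.

```python
import unicodedata
from typing import Optional

# Hypothetical per-language overrides, hand-written from the examples in this
# thread; this is not CLDR data and not any library's API.
LANGUAGE_QUOTES = {
    "en": {"\u201c": "open", "\u201d": "close"},
    "de": {"\u201e": "open", "\u201c": "close"},
    "pl": {"\u201e": "open", "\u201d": "close"},
}

# Fallback used when the language is unknown: derive a role from
# General_Category alone (a simplifying assumption).
DEFAULT_ROLE = {"Ps": "open", "Pi": "open", "Pe": "close", "Pf": "close"}

def quote_role(ch: str, lang: Optional[str] = None) -> Optional[str]:
    """Return "open", "close", or None, preferring language-specific data."""
    by_lang = LANGUAGE_QUOTES.get(lang or "", {})
    if ch in by_lang:
        return by_lang[ch]
    return DEFAULT_ROLE.get(unicodedata.category(ch))

print(quote_role("\u201c"))        # language unknown: falls back to Pi
print(quote_role("\u201c", "de"))  # German data overrides the default
```

With no language given, U+201C gets the role implied by its Pi category; with "de" it is treated as a closing mark, because the language-specific table wins over the default.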
A./
Sent: Friday, 16 Jan 2026 at 20:46
From: "Jukka K. Korpela via Unicode" <[email protected]>
To: "Marius Spix" <[email protected]>
Cc: [email protected]
Subject: Re: General Categories Pe, Pf, Pi, Ps

My guess is that Pe, Pf, Pi and Ps were based on the usage of punctuation in English and some other languages. If this subclassification is taken too seriously, problems will arise. For example, software that takes U+201D too seriously as Pf treats texts like xxx ”xxx” xxx badly: since U+201D is Pf, a line break is not permitted before it, even when a space intervenes. This is what MS Word does, irrespective of language settings, even for a language for which it knows that U+201D is both “start quotation” and “end quotation”.

Generally, whether a character is closing, final, initial, or opening punctuation should be based on language-specific information, such as CLDR.

Yucca

On Fri, 16 Jan 2026 at 18:09, Marius Spix via Unicode ([email protected]) wrote:

I wonder what is the point of the General Categories Pe, Pf, Pi and Ps? Different languages use different quotation marks, for example:

English: “ (U+201C, Pi) + ” (U+201D, Pf)
German: „ (U+201E, Ps) + “ (U+201C, Pi)
Polish: „ (U+201E, Ps) + ” (U+201D, Pf)

How does a character classify as closing, final, initial, or opening punctuation? Are there any general criteria?

Best regards,
Marius
