Re: Fw: Aw: Re: General Categories Pe, Pf, Pi, Ps

Asmus Freytag via Unicode Sun, 18 Jan 2026 19:42:55 -0800

On 1/17/2026 5:31 PM, Marius Spix via Unicode wrote:

Sent: Sunday, 18.01.2026 at 02:30
From: "Marius Spix" <[email protected]>
To: "Jukka K. Korpela" <[email protected]>
Subject: Aw: Re: General Categories Pe, Pf, Pi, Ps


I see. Another example would be the French quotation marks (guillemets), which
point outwards and are separated from the quoted text by a space in French
texts, but point inwards and have no additional spaces in German texts
(especially in books). So these categories come from a time when Unicode was
very English-centric and can be considered “historical heritage”, correct?

Unicode was never "English-centric" by design. And nearly all participants in the early effort were familiar with or even experts in software localization (within the limitations of what that meant in the late '80s).

There are many problems with the General_Category and ultimately, they reflect that experience with character properties was limited.

Also, a solid understanding of the differences between properties inherent in a character and properties assumed by a character in the context of a specific orthography emerged over time.

There are some properties that are inherent in a character (or exceptions, if they exist, are very limited). I'm not aware of any orthography that treats "A" as a lowercase letter (there are some that use smallcaps forms, but those would have the lowercase property in Unicode).

When Unicode was created, what set it apart was the insistence that encoded characters had properties beyond their appearance, name and code point value. No other widely used standard at the time did anything like that. It meant that Unicode had a lot of attributes that could be used to identify "what" was being encoded at a given code point, something that required reliance on "customary knowledge" for other standards.

You had to infer that DIGIT ZERO had the numeric value of 0, but Unicode spells that out. And so on.
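
As a small illustration of properties being spelled out rather than inferred, the following sketch reads a few of them through Python's standard unicodedata module. The helper name describe is made up for this example, and the values returned depend on the Unicode Character Database version bundled with the Python build in use.

```python
import unicodedata

def describe(ch):
    """Return (name, General_Category, numeric value) for one character.

    `describe` is a hypothetical helper for this sketch, not part of any API.
    The numeric value is None when the character has no numeric property.
    """
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.numeric(ch, None),
    )

# DIGIT ZERO carries its numeric value as an explicit property.
print(describe("\u0030"))
# So does a digit from another script, such as ARABIC-INDIC DIGIT ZERO.
print(describe("\u0660"))
# A letter has a name and a category, but no numeric value.
print(describe("A"))
```

Nothing here has to be guessed from the glyph or from convention; the name, category and numeric value are all looked up from the character database.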

Unicode also refused to encode a "decimal period", arguing that the overloaded use of the full stop is indeed the norm and what was encoded is the full stop across all its uses. Of course this went along with a widely shared understanding that many languages use different conventions.

For some reason, the full range of conventions for quotation marks in particular was less well known, presumably because applying language-specific quotation marks by software wasn't as much a "thing" as it is today with autocorrect, etc.

There's another reading on General_Category: this interpretation assumes that these are "defaults", to be applied in contexts where information on language is not available. So, you could think of these properties as applying to the language code "unknown".

There is nothing "historic" about having a default - anytime the language is not specified (and cannot be determined) you need to do something.
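
The "default for language unknown" reading can be sketched roughly as follows. This is only an illustration under stated assumptions: the QUOTE_ROLES table, the DEFAULT_ROLE mapping and the quote_role function are hypothetical names, and the per-language entries are just the three examples quoted further down in this thread. A real implementation would presumably take its language data from something like CLDR rather than a hand-written table.

```python
import unicodedata

# Hypothetical per-language overrides, limited to the examples in this thread.
QUOTE_ROLES = {
    "en": {"\u201c": "open", "\u201d": "close"},
    "de": {"\u201e": "open", "\u201c": "close"},
    "pl": {"\u201e": "open", "\u201d": "close"},
}

# Assumed default: derive a role from General_Category when nothing better
# is known about the language.
DEFAULT_ROLE = {"Ps": "open", "Pi": "open", "Pe": "close", "Pf": "close"}

def quote_role(ch, lang=None):
    """Return 'open', 'close', or None for a character.

    Language-specific data wins when available; otherwise the
    General_Category acts as the fallback for language "unknown".
    """
    override = QUOTE_ROLES.get(lang, {})
    if ch in override:
        return override[ch]
    return DEFAULT_ROLE.get(unicodedata.category(ch))

print(quote_role("\u201c"))        # no language given: falls back to Pi
print(quote_role("\u201c", "de"))  # German data overrides the default
```

With this shape, U+201C is treated as opening by default but as closing once the text is known to be German, which is the distinction between an inherent default and an orthography-specific role.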

A./

Sent: Friday, 16.01.2026 at 20:46
From: "Jukka K. Korpela via Unicode" <[email protected]>
To: "Marius Spix" <[email protected]>
Cc: [email protected]
Subject: Re: General Categories Pe, Pf, Pi, Ps

My guess is that Pe, Pf, Pi and Ps were based on the usage of punctuation
in English and some other languages. If this subclassification is taken too
seriously, problems will arise. For example, software that takes U+201D too
seriously as Pf, treats texts like xxx ”xxx” xxx badly: since U+201D is
Pf, a line break is not permitted before it, even when a space intervenes.
This is what MS Word does, irrespective of language settings, even for a
language for which it knows that U+201D is both “start quotation” and “end
quotation”.

Generally, whether a character is closing, final, initial, or opening
punctuation should be based on language-specific information, such as CLDR.

Yucca


On Fri, 16.1.2026 at 18.09, Marius Spix via Unicode ([email protected])
wrote:

I wonder what is the point of the General Categories Pe, Pf, Pi and Ps?

Different languages use different quotation marks, for example:

English:  “ (U+201C, Pi) + ” (U+201D, Pf)
German: „ (U+201E, Ps) + “ (U+201C, Pi)
Polish: „ (U+201E, Ps) + ” (U+201D, Pf)

How does a character classify as closing, final, initial, or opening
punctuation? Are there any general criteria?

Best regards,

Marius
