The design of the general category predates the fuller understanding of
how languages (orthographies) actually use quotation marks. Whether
something is opening, closing or even paired with another quotation mark
is ultimately language dependent, as Jan writes. Mathematical notation
will use brackets in their normal or reversed sense, which also makes
any generalized opening or closing property useless.
The GC values can be understood, at best, to represent the most common
usage of a given punctuation mark.
They are moderately useful when no other information is available
(unknown language, or language set to "none" in metadata).
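To make the "fallback only" reading concrete, here is a minimal Python
sketch. It assumes a small hand-written table, LANGUAGE_QUOTES, and a
helper, quote_role; both are hypothetical names made up for this
illustration and belong to no library or standard. The "fi" entry is
likewise an assumption, standing in for a language that uses U+201D on
both sides. A real implementation would presumably take such data from
locale information (for example CLDR) rather than hard-code it.

```python
import unicodedata

# Hypothetical per-language table: (opening, closing) quotation marks.
# Illustrative only; real data would come from locale information.
LANGUAGE_QUOTES = {
    "en": ("\u201C", "\u201D"),
    "de": ("\u201E", "\u201C"),
    "pl": ("\u201E", "\u201D"),
    "fi": ("\u201D", "\u201D"),  # assumed: same mark opens and closes
}

# Fallback reading of the General Category when nothing better is known.
GC_ROLES = {"Ps": "open", "Pi": "open", "Pe": "close", "Pf": "close"}


def quote_role(ch, language=None):
    """Return 'open', 'close', 'ambiguous' or 'none' for one character."""
    pair = LANGUAGE_QUOTES.get(language)
    if pair is not None and ch in pair:
        opening, closing = pair
        if opening == closing:
            # Position in the text, not the character, decides the role.
            return "ambiguous"
        return "open" if ch == opening else "close"
    # Unknown language, or a character the table does not cover.
    return GC_ROLES.get(unicodedata.category(ch), "none")
```

With no language given, quote_role("\u201C") falls back to the General
Category (Pi) and reports "open"; with language "de" the same character
is reported as "close".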
Specific applications may face the problem that changing behavior would
reflow existing documents when they are opened in a later version.
For Unicode, the issue is similar. Changes to long-established
properties are increasingly restricted to cases where the side effects
on existing documents can be balanced against the benefit, based on the
severity of the issue and the practical relevance of the fix in actual use.
Any motivation such as "this could have been done better" or "these
characters are not treated in a perfectly consistent manner" is
increasingly seen as insufficient to justify adjustments to standard
(language-neutral) properties and algorithms.
Instead, the focus will be on fixing actual use cases that have been or
could be raised as bug reports against implementations, assuming that
there's no impact on other users.
One exception is that consistency between the segmentation algorithms is
useful. This gives a small window to fix inconsistent treatment of edge
cases.
A./
On 1/16/2026 11:46 AM, Jukka K. Korpela via Unicode wrote:
My guess is that Pe, Pf, Pi and Ps were based on the usage of
punctuation in English and some other languages. If this
subclassification is taken too seriously, problems will arise. For
example, software that takes U+201D too seriously as Pf, treats texts
like xxx ”xxx” xxx badly: since U+201D is Pf, a line break is not
permitted before it, even when a space intervenes. This is what MS
Word does, irrespective of language settings, even for a language for
which it knows that U+201D is both “start quotation” and “end quotation”.
Generally, whether a character is closing, final, initial, or opening
punctuation should be based on language-specific information, such as CLDR.
Yucca
On Fri 16.1.2026 at 18.09, Marius Spix via Unicode
([email protected]) wrote:
I wonder what is the point of the General Categories Pe, Pf, Pi
and Ps?
Different languages use different quotation marks, for example:
English: “ (U+201C, Pi) + ” (U+201D, Pf)
German: „ (U+201E, Ps) + “ (U+201C, Pi)
Polish: „ (U+201E, Ps) + ” (U+201D, Pf)
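The categories listed above can be checked with Python's standard
unicodedata module. This is only a sketch: the PAIRS table repeats the
three examples from this message, and describe is a helper name made up
for the illustration.

```python
import unicodedata

# (opening, closing) quotation marks, copied from the examples above.
PAIRS = {
    "English": ("\u201C", "\u201D"),
    "German": ("\u201E", "\u201C"),
    "Polish": ("\u201E", "\u201D"),
}


def describe(ch):
    """Format a character as its code point plus its General Category."""
    return f"U+{ord(ch):04X} {unicodedata.category(ch)}"


for language, (opening, closing) in PAIRS.items():
    print(f"{language}: opens with {describe(opening)}, "
          f"closes with {describe(closing)}")
```

U+201C is reported as Pi in every row, although it opens a quotation in
English and closes one in German.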
How does a character classify as closing, final, initial, or
opening punctuation? Are there any general criteria?
Best regards,
Marius