Re: General Categories Pe, Pf, Pi, Ps

Asmus Freytag via Unicode Fri, 16 Jan 2026 16:25:30 -0800

The design of the general category predates the fuller understanding of how languages (orthographies) actually use quotation marks. Whether something is opening, closing or even paired with another quotation mark is ultimately language dependent, as Jan writes. Mathematical notation will use brackets in their normal or reversed sense, which also makes any generalized opening or closing property useless.

The GC values can be understood, at best, to represent the most common usage of a given punctuation mark.

They are moderately useful when no other information is available (unknown language, or language set to "none" in metadata).
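
As a rough illustration of that fallback role, here is a minimal Python sketch. It assumes a small hand-written per-language table (`QUOTE_ROLES`, hypothetical, filled in only from the quotation examples quoted further down in this thread) and consults the General Category only when the language is unknown or has no entry for the character. The names `QUOTE_ROLES`, `GC_ROLES` and `quote_role` are made up for this example; a real application would more likely take its per-language data from something like CLDR.

```python
import unicodedata

# Hypothetical per-language table, hand-written from the examples in this
# thread. This is an assumption for illustration, not data read from CLDR.
QUOTE_ROLES = {
    "en": {"\u201C": "open", "\u201D": "close"},
    "de": {"\u201E": "open", "\u201C": "close"},
    "pl": {"\u201E": "open", "\u201D": "close"},
}

# Fallback derived from General_Category, used only when nothing better
# is known about the language.
GC_ROLES = {"Ps": "open", "Pi": "open", "Pe": "close", "Pf": "close"}


def quote_role(ch, lang=None):
    """Return "open", "close", or None for a single character."""
    roles = QUOTE_ROLES.get(lang)
    if roles is not None and ch in roles:
        return roles[ch]
    # Unknown language (or no entry for ch): fall back to the GC value.
    return GC_ROLES.get(unicodedata.category(ch))
```

Under these assumptions, `quote_role("\u201C", "de")` gives `"close"`, while `quote_role("\u201C")` with no language falls back to `"open"` because U+201C is Pi.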

Specific applications may face the issue that changing behavior will reflow existing documents when they are opened with a downstream version.

For Unicode, the issue is similar. Changes to long-established properties get more and more restricted to cases where side effects on existing documents can be balanced against the benefit, based on the severity of the issue and the practical relevance of the fix in actual use.

Any motivation such as "this could have been done better" or "these characters are not treated in a perfectly consistent manner" is increasingly seen as insufficient to make any adjustments in standard (language neutral) properties and algorithms.

Instead, the focus will be on fixing actual use cases that have been or could be raised as bug reports against implementations, assuming that there's no impact on other users.

One exception is that consistency between the segmentation algorithms is useful. This gives a small window to fix inconsistent treatment of edge cases.


A./

On 1/16/2026 11:46 AM, Jukka K. Korpela via Unicode wrote:

My guess is that Pe, Pf, Pi and Ps were based on the usage of punctuation in English and some other languages. If this subclassification is taken too seriously, problems will arise. For example, software that takes U+201D too seriously as Pf treats texts like xxx ”xxx” xxx badly: since U+201D is Pf, a line break is not permitted before it, even when a space intervenes. This is what MS Word does, irrespective of language settings, even for a language for which it knows that U+201D is both “start quotation” and “end quotation”.
Generally, whether a character is closing, final, initial, or opening punctuation should be based on language-specific information, such as CLDR.
Yucca
On Fri 16.1.2026 at 18.09, Marius Spix via Unicode ([email protected]) wrote:
    I wonder what is the point of the General Categories Pe, Pf, Pi
    and Ps?

    Different languages use different quotation marks, for example:

    English:  “ (U+201C, Pi) + ” (U+201D, Pf)
    German: „ (U+201E, Ps) + “ (U+201C, Pi)
    Polish: „ (U+201E, Ps) + ” (U+201D, Pf)

    How does a character classify as closing, final, initial, or
    opening punctuation? Are there any general criteria?

    Best regards,

    Marius
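
The category values quoted in the examples above can be checked against the Unicode Character Database. The following is a minimal Python sketch using the standard `unicodedata` module; the `PAIRS` table and the `describe` helper are names made up for this illustration, and the pairs themselves are simply the three examples from the message above.

```python
import unicodedata

# Quotation-mark pairs taken from the examples in this thread
# (first mark of the pair, second mark of the pair).
PAIRS = {
    "English": ("\u201C", "\u201D"),
    "German": ("\u201E", "\u201C"),
    "Polish": ("\u201E", "\u201D"),
}


def describe(ch):
    """Format one character as code point plus General Category."""
    return f"U+{ord(ch):04X} ({unicodedata.category(ch)})"


for language, (first, second) in PAIRS.items():
    print(f"{language}: {describe(first)} + {describe(second)}")
```

Note that U+201C comes out as Pi in both the English and the German row, although it is the second mark of the German pair; the General Category is a single per-character value and cannot vary with the language.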
