On 12/14/2025 10:47 AM, Phil Smith III via Unicode wrote:

Well, I’m sorta “asking for a friend” – a coworker who is deep in the weeds of working with something Unicode-related. I’m blaming him for having told me that :)


This actually deserves a deeper answer, or a more "bird's-eye" one, if you want. Read to the end.

The way you asked the question seems to hint that you and your friend conflate the concepts of "combining mark" and "diacritic". That would not be surprising if you are mainly familiar with European scripts and languages, because in that case the equivalence more or less holds.

And you may also be thinking mainly of languages and their orthographies, and not of notations, phonetic or otherwise, that give rise to unusual combinations. Most European languages do have a reasonably small, fixed set of letters with diacritics in their orthographies, even though in many languages, if you ask native users to list all the combinations, they will fall short. An example is the use of an accent on the letter 'e' in some Scandinavian languages to distinguish two identically spelled short words that have very different syntactic functions. You will see that accent used in books and formal writing, but I doubt people bother in a text message.

The focus on code space is a red herring, to a degree. The real difficulty would lie in cataloging all of the rare combinations and getting all fonts to be aware of them. It is much easier to encode the diacritic as a combining character and have general rules for layout. With modern fonts you can, in principle, get acceptable display even for unexpected combinations, without the effort of first cataloging, then publishing, and then having every font vendor explicitly add an implementation for each combination before it can be used.
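To make that concrete, here is a quick Python sketch (using the standard unicodedata module) contrasting an uncatalogued combination with a common one; the particular characters are just illustrative.

    import unicodedata

    # Any base letter may be followed by a combining mark; no precomposed
    # character has to exist for the sequence to be well-formed text.
    unusual = "m\u030A"   # 'm' + COMBINING RING ABOVE -- no precomposed form
    common  = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT -- precomposed U+00E9 exists

    for s in (unusual, common):
        nfc = unicodedata.normalize("NFC", s)
        print([unicodedata.name(c) for c in s],
              "-> NFC code points:", [hex(ord(c)) for c in nfc])

The first sequence stays as two code points even under NFC, yet a modern font with general mark-positioning rules can still render it acceptably; nobody had to catalog it first.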

Other languages and scripts have combinatorics as part of their DNA, so to speak. Their structural unit is not the letter (with or without decorations) but the syllable, which is naturally built up from components that graphically attach to each other or even fuse into a combined shape. Because that process is not random, it's easier to encode these structural elements (some of which are combining characters) than to try to enumerate all the possible combinations. It doesn't hurt that the components map nicely onto discrete keys on the respective keyboards.
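Hangul is one script where that component structure is directly visible to software: the jamo compose into syllable blocks by a fixed rule, and Unicode normalization knows that rule. A small sketch, again in Python:

    import unicodedata

    # The syllable 'han' spelled from its components (jamo):
    jamo = "\u1112\u1161\u11AB"   # CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
    syllable = unicodedata.normalize("NFC", jamo)

    print(syllable, hex(ord(syllable)))                     # 한 0xd55c (HANGUL SYLLABLE HAN)
    print(unicodedata.normalize("NFD", syllable) == jamo)   # True: the composition is reversible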

Notations, such as scientific notation, also often assign a discrete identity to the combining mark. A dot above can mean the first derivative with respect to time, which can be applied to any letter designating a variable, which can be, at a minimum, any letter of the Latin or Greek alphabets, but why stop there? There's nothing in the notation itself that would prevent a scientist from combining that dot with any character they find suitable. The only sensible solution is encoding a combining mark, even though some letters that have a dot above as part of an orthography are also encoded in precomposed form.
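If it helps to see what that means in encoded text, here is a quick Python sketch: the same COMBINING DOT ABOVE applied to a few arbitrary variable letters, only some of which happen to have precomposed counterparts (ė and ẋ do; q̇ and ω̇, as far as I know, do not). All of the sequences are equally valid.

    import unicodedata

    DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE, e.g. 'derivative with respect to time'

    for base in "exq\u03C9":          # e, x, q and Greek omega as sample variables
        dotted = base + DOT_ABOVE
        nfc = unicodedata.normalize("NFC", dotted)
        print(dotted, "precomposed form exists:", len(nfc) == 1)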

In contrast, Chinese ideographs, while visually composed of identifiable elements, are treated by their users as units, and well before Unicode came along there was an established approach to managing things like keyboard entry while encoding the ideographs as precomposed entities rather than as their building blocks.

A big part of the encoding decision is always to do what makes sense for the writing system or notation (and the script it is based on).

For a universal encoding, such as Unicode, there simply isn't a "one-size-fits-all" solution that would work. But if you look at this universal encoding only from a very narrow perspective of the orthographies that you are most familiar with, then, understandably, you might feel that anything that isn't directly required (from your point of view) is an unnecessary complication.

However, once you adopt a more universal perspective, it's much easier not to rat-hole on seeming inconsistencies, because you can always discover how certain decisions relate to the specific requirements of one or more writing systems. Importantly, this often includes requirements based on de-facto implementations for these systems before the advent of Unicode. Being universal, Unicode needed to be designed to allow easy conversion from all existing data sets.

For European scripts, the business community and the librarians had competing systems: one with limited sets of precomposed characters, the other with combining marks for diacritics. That is the ultimate source of the duality, and the two communities had different goals. One wanted to handle the common case efficiently (primarily mapping the modern national typewriters into character encoding), while the other was interested in a full representation of anything that could appear in printed book titles (for cataloging), including unusual or historic combinations.
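The practical upshot of that duality is that the same user-visible text can legitimately be stored either way, and Unicode declares the two spellings canonically equivalent, with normalization converting between them. A quick Python sketch of what that looks like:

    import unicodedata

    precomposed = "\u00E9"     # é as a single code point (the typewriter-style legacy)
    combining   = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT (the library-style legacy)

    print(precomposed == combining)                                # False: different code points
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True
    print(unicodedata.normalize("NFD", precomposed) == combining)  # True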

In conclusion, the question isn't a bad one, but the real answer is that complexity is very much part of human writing, and when you design (and extend) a universal character encoding, you will need to be able to represent that full degree of complexity. Therefore, what seem like obvious simplifications really aren't feasible, unless you give up on attempting to be universal.

A./
