Re: Another take on the English apostrophe in Unicode

Marcel Schneider Tue, 16 Jun 2015 10:17:16 -0700

On Sat, Jun 13, 2015, Mark Davis wrote:

> In particular, I see no need to change our recommendation on the character
> used in contractions for English and many other languages (U+2019).
> Similarly, we wouldn't recommend use of anything but the colon for marking
> abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for
> "supercali...docious".

> (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the 
> confusion.)

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️ wrote:

> On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote:

>> When we take the topic down again from linguistics to the core mission of 
>> Unicode, that is character encoding and text processing standardisation, 
>> ellipsis and Swedish abbreviation colon differ from the single closing 
>> quotation mark in this, that they are not to be processed.

>> [...]

> Quite nice of you to inform me of the core mission of Unicode—I must have 
> somehow missed that.

I was rather astonished and amused to read that I might have meant to inform
you of Unicodeʼs core mission. My goal was only to check that I was arguing at
the right level. Well, there would have been another way to say it... which
did not come to mind.

However, what surprises me even more, now that I think about it, is that while
knowing everything about Unicode, youʼve got only a weak opinion on which
apostrophe recommendation is the right one...

> More seriously, it is not all so black and white. As we developed Unicode, we 
> considered whether to separate characters by function, eg, an END OF SENTENCE 
> PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or 
> DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed 
> the benefits.

Itʼs further proof of Unicodeʼs professionalism that it considered
distinguishing DIAERESIS and UMLAUT. Despite being a French-German bilingual
and knowing the diacritics, I first encountered that distinction in
Microsoftʼs kbd.h, where the one is called DIARESIS and is mapped to UMLAUT.
Iʼm not a friend of such distinctions (except in vocabulary and grammar),
because in writing practice they would be nothing but useless and
counterproductive complications. An abbreviation dot would have been much more
useful, but to deliver its benefits it would have needed an additional key
mapping. Against this background, Unicodeʼs choice to recommend disambiguating
the apostrophe is even more meritorious. I see it as proof that there really
is a good reason for people to mind the difference whenever they donʼt use the
ASCII apostrophe for all of them. What would have bothered Microsoft then was
that it would have had to implement this difference in its word processing and
desktop publishing software, and to tell users about it. Nothing easier for
Microsoft with all the Help and Info! “The new smart quotes help you to check
whether you need an apostrophe or a quote. This makes quotes conversion easy.”
Or the like.

> In practice, whenever characters are essentially identical—and by that I mean 
> that the overlap between the acceptable glyphs for each character is very 
> high—people will inevitably mix up the characters on entry. So any processing 
> that depends on that distinction is forced to correct the data anyway. And 
> separating them causes even simple things like searching for a character on a 
> page to get screwed up without having equivalence classes.

Given the Unicode principle of encoding characters, not glyphs, I doubt that
two characters may be called _essentially_ identical just because they look
the same. A large share of the Code Chartsʼ cross-references exists to help
font designers on this point. As for people mixing characters up, they are
most likely to do so when the keyboard offers only one of the two. This is not
the case for U+02BC and U+2019, neither of which is on standard keyboards.
Here itʼs the smart quotes algorithm that will mix them up! And this one is
easily helped not to do so, since itʼs embedded in high-end software with all
its display and shortcut capabilities. In the end, the only one who wanted to
keep mixing them up was, guess who, Microsoft.

The reason? Word processing that depends on the distinction between opening
and closing quotation marks, which needs only a very tiny algorithm, is much
easier to implement than processing that depends on the distinction between
the apostrophe and the single closing quotation mark, and between the
apostrophe and single quotation marks in general. Informal English word forms
are so rich and varied that some are ambiguous, and scarcely any software
dictionary can contain them all. But even formal English is not wholly
supported, since nested quotes often are not. Why would users not be
interested in improved software, even if it cost a little more?
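
To make the contrast concrete, here is a minimal sketch of such a "tiny
algorithm" in Python. It is not the code of any real word processor; the
function name smarten and the context rule (a quote opens after whitespace or
an opening bracket, and closes otherwise) are assumptions chosen for
illustration only.

```python
# Hypothetical sketch of a context-only smart quotes rule.
# Assumption: a quote is "opening" at the start of the text, after
# whitespace, or after an opening bracket; otherwise it is "closing".
OPENERS = {'"': "\u201C", "'": "\u2018"}
CLOSERS = {'"': "\u201D", "'": "\u2019"}

def smarten(text: str) -> str:
    """Replace ASCII quotes with curly quotes using only the previous character."""
    out = []
    for i, ch in enumerate(text):
        if ch in OPENERS:
            prev = text[i - 1] if i > 0 else " "
            opening = prev.isspace() or prev in "([{"
            out.append(OPENERS[ch] if opening else CLOSERS[ch])
        else:
            out.append(ch)
    return "".join(out)

# The rule handles ordinary double quotes well.
print(smarten('"Hi," she said.'))
# A word-internal apostrophe gets U+2019, the same code point as a closing
# single quote, so the two can no longer be told apart afterwards.
print(smarten("don't"))
# A leading apostrophe is taken for an opening single quote (U+2018).
print(smarten("'tis"))
```

Under these assumptions the rule never has to decide whether a mark is an
apostrophe at all, which is what keeps it tiny. Telling 'tis from the start of
a quotation, or handling nested quotes, would need word lists or wider
context.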

About searching and equivalence classes: there is already plenty of
equivalence implemented in even the simplest search algorithm: casing! One
more class, (U+0027, U+02BC, U+2019), wouldnʼt change much.
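
As a rough illustration of that point, the following Python sketch adds the
apostrophe class to an ordinary case-insensitive search. The names fold and
contains are hypothetical, and folding everything to U+0027 is just one
possible choice of class representative, not a Unicode recommendation.

```python
# Hypothetical sketch: one extra equivalence class on top of case folding.
# The class is the one named above: U+0027, U+02BC, U+2019.
APOSTROPHE_CLASS = "\u0027\u02BC\u2019"
_FOLD_TABLE = str.maketrans({ch: "\u0027" for ch in APOSTROPHE_CLASS})

def fold(text: str) -> str:
    """Build a search key: fold case first, then fold the apostrophe class."""
    return text.casefold().translate(_FOLD_TABLE)

def contains(haystack: str, needle: str) -> bool:
    """Case-insensitive, apostrophe-insensitive substring test."""
    return fold(needle) in fold(haystack)

# The same query matches text typed with any of the three characters.
print(contains("Don\u02BCt panic", "don't"))
print(contains("Don\u2019t panic", "DON\u02BCT"))
```

The sketch returns only a yes/no answer rather than match offsets, because
case folding can change the length of a string, so positions in the folded
key do not always map back to the original text.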

> So we only separated essentially identical characters in limited cases: such
> as letters from different scripts.

I repeat myself: calling lookalike glyphs “essentially identical characters”
is inconsistent with Unicodeʼs principle of encoding characters, not glyphs.
But be that as it may, I repeat myself again: under these circumstances,
Unicodeʼs recommendation to prefer U+02BC for the apostrophe weighs all the
heavier!

Best regards,
Marcel Schneider
