On Sun, 27 Jan 2019 14:09:31 -0500 James Tauber via Unicode <unicode@unicode.org> wrote:
> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode
> <unicode@unicode.org> wrote:
>
> > However LibreOffice treats "don't" as a single word for U+0027,
> > U+02BC and U+2019, but "dogs'" as a single word only for U+02BC.
> > This complies with TR27.  I'm not surprised, as LibreOffice does
> > use or has used ICU.
>
> This comes back to my original question that started this thread.

Yes.  I'm driving home the problem for those who somehow fail to
understand your opening post.

> Here's a concrete example from Smyth's Grammar:
>
> γένοιτ’ ἄν
>
> Double-clicking on the first word should select the U+2019 as well.
> Interestingly on macOS Mojave it does in Pages[1] but not in Notes,
> the Terminal or here in Gmail on Chrome.
>
> To be clear: when I say "should" I mean that that is the expectation
> classicists have and the failure to meet it is why some of them
> insist on using U+02BC.
>
> I'm happy if the answer is "use U+2019 and go get your text
> segmentation implementations fixed"[2] but am looking for
> confirmation of that.

The problem with that approach is that it assumes one can have a
language-sensitive implementation, and that that will suffice.

Smyth’s grammar gives the concrete example, “γένοιτ’ ἄν”.  It contains
the word ‘ἄν’.

Should double-clicking the first Greek word in the paragraph above
select it?  That's not going to work if the paragraph above is
considered to be in English.  And what about double clicking the third
Greek word?  What should that select?  Or is that paragraph
ungrammatical?

To fix the problem with possessive plural "dogs’" with U+2019 one has
to parse enough of the paragraph to distinguish an apostrophe from a
closing single inverted comma.  Moreover, it assumes that end-of-word
apostrophes will not be included in a span bounded by single inverted
commas.  I may observe such a rule, but I don't remember being taught
it.

In Unicode 2.0 the apostrophe was U+02BC; it was changed to U+2019 in
Unicode 2.1.
The justification I could find for the change is in the Unicore
thread (members only) starting at
https://www.unicode.org/mail-arch/unicore-ml/y1997-A/0185.html .  The
justification recorded there was merely that:

1) Windows and Mac Latin character sets had equivalents of U+0027, to
which the 'letter apostrophe' was mapped, and U+2019, which was used
for single quotes.

2) The 'punctuation apostrophe' was being mapped to the U+2019 by the
'smart quote' apparatus.

3) For consistency, the 'punctuation apostrophe' should therefore be
encoded by U+2019 instead of U+02BC.

This argument didn't persuade everyone even then, and it feels even
weaker now.

Perhaps I just have the problem that I don't see a sharp difference
between the letter apostrophe and the punctuation apostrophe.  For
example, when the pronunciation of English "letter" with a glottal
stop as the intervocalic consonant is represented in writing as
something like "le'er", is it a letter apostrophe because it's a
glottal stop, or a punctuation apostrophe because the 'tt' is dropped?

The issue arises in the orthography of Finnish.  The genitive singular
of _keko_ 'a pile' is _keon_ - the 'k' is 'dropped' because of
consonant gradation.  However, regularly, the genitive singular of
_raaka_ 'raw' is _raa'an_, where the U+0027 I wrote represents an
apostrophe and is pronounced as a glottal stop.  Is this a letter
apostrophe or a punctuation apostrophe?  The 'k' has been dropped by
the same rule, but because of the vowel pattern it is replaced by a
glottal stop and written with an apostrophe.  English Wiktionary
chooses U+2019; the Finnish Wiktionary ducks the issue and uses
U+0027.

Richard.
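
The double-click behaviour discussed in this message can be
illustrated with a minimal sketch.  It assumes a much-simplified
version of the default word-boundary rules of UAX #29: letters join
with letters; U+0027 and U+2019 join only when there is a letter on
both sides; U+02BC counts as a letter because its general category is
Lm.  The names word_at, is_letter and MID_LETTER are hypothetical and
are not taken from ICU or LibreOffice.

```python
import unicodedata

# Assumption: only these two characters are treated as "mid-letter"
# punctuation (UAX #29 Single_Quote and MidNumLet contain more).
MID_LETTER = {"\u0027", "\u2019"}

GREEK = "γένοιτ"  # first word of Smyth's example, without its apostrophe


def is_letter(ch):
    # Rough stand-in for Word_Break=ALetter plus combining marks.
    # U+02BC has general category Lm, so str.isalpha() is True for it.
    return ch.isalpha() or unicodedata.category(ch).startswith("M")


def word_at(text, index):
    """Return the span a double-click at text[index] would select
    under the simplified rules described above."""

    def joined(i):
        # True if there is no word boundary between text[i] and text[i + 1].
        a, b = text[i], text[i + 1]
        if is_letter(a) and is_letter(b):
            return True
        if is_letter(a) and b in MID_LETTER:
            # letter + apostrophe: only joined if a letter follows
            return i + 2 < len(text) and is_letter(text[i + 2])
        if a in MID_LETTER and is_letter(b):
            # apostrophe + letter: only joined if a letter precedes
            return i >= 1 and is_letter(text[i - 1])
        return False

    start = index
    while start > 0 and joined(start - 1):
        start -= 1
    end = index
    while end < len(text) - 1 and joined(end):
        end += 1
    return text[start:end + 1]


if __name__ == "__main__":
    for apostrophe in ("\u0027", "\u02bc", "\u2019"):
        label = "U+%04X" % ord(apostrophe)
        print(label, word_at("don" + apostrophe + "t", 0))
        print(label, word_at("dogs" + apostrophe + " bowls", 0))
        print(label, word_at(GREEK + apostrophe + " ἄν", 0))
```

Under these assumptions the word-internal apostrophe of "don't" is
selected for all three code points, whereas a word-final apostrophe is
selected only when it is U+02BC.  Selecting a final U+2019 would need
the extra context described above, namely deciding whether it is an
apostrophe or a closing quotation mark.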