Re: Another take on the English Apostrophe in Unicode
Dear Mr Ewell,

As I was very puzzled reading Mr Davis' last reply yesterday, I refrained from mailing you separately as I had intended to do. For the same reason, I forgot to remove an outdated remark that I would never have written after reading Mr Kolehmainen's, Mr Suignard's and Mr Constable's e-mails, which I found yesterday. I beg everybody's pardon.

On Wed, Jun 17, I wrote:

Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move. The best way of getting nothing to be done is to get everybody convinced itʼs all OK. Thatʼs what I sometimes feel reading this thread, or the one about ISO/IEC JTC1/SC2/WG2 that is on-going in the meantime! And the only way to get something change has always been to show itʼs wrong. From there on, the next step would be to find out who is responsible.

Please read instead:

| Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move.
| The best way of getting nothing to be done is to get everybody convinced itʼs all OK. Thatʼs what I sometimes feel reading this thread.
| And the only way to get something change has always been to show itʼs wrong.
| From there on, the next step would be to find out who is responsible.

Best regards,
Marcel S.

Message of 17/06/15 18:29
From: Marcel Schneider
To: MarkDavis☕️ , DougEwell
Cc: TedClancy , UnicodeMailingList
Subject: Re: Another take on the English Apostrophe in Unicode

On Tue, Jun 16, Mark Davis ☕️ wrote:

And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them.

Dear Mark, I understand (a little) that I'm tiresome. Please consider nevertheless that the Unicode Public Mailing List is AFAIK the only spot where people can communicate with Unicode decision makers. No other mailing list nor any forum on the internet can do this.
Even Microsoft's Community forum can do nothing at Microsoft, forum volunteers told me. I posted there in French and in English. In French my most useful post seems to be at http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b Since your message could not reach me yesterday, I prepared two replies I wanted to send today. It was exactly one to Doug and one to you. If you agree, I'll paste them both hereafter. On Tue, Jun 16, 2015, Doug Ewell wrote: You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO. You know I did, and if it were just for my ownʼs sake, Iʼd probably never started mailing in this thread. A big part of text to be processed on quotes originates from other people. So when I use U+02BC, I did a good work (if I were quoted :)). A essential condition is that all text handling software is updated to handle correctly the letter apostrophe. Without an official recommendation, this is not likely to be done. And this recommendation cannot be usefully issued unless Microsoft agrees. We remember that without Microsoft, the Unicode Consortium probably wouldnʼt have been founded, and character encoding wouldnʼt thrive as it does today. On Mon, Jun 15, 2015, 20:14, Doug Ewell wrote: Perhaps a UTC member can confirm whether this is fact or speculation. Markus Kuhn's comment from 1999 about couldn't Unicode follow Microsoft...? doesn't prove that Unicode was in fact strong-armed by Microsoft. I know that Markus Kuhnʼs concern was very valuable and he did a great job by showing how to eradicate the clumsy quotes simulation that was current by the time, due to the lack of characters. You remember, they used accents as quotes, and at that stage, the mixup was between apostrophe and acute! 
https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html The curly glyph for 0x27 in old ASCII fonts and its reversed counterpart mapped to 0x60 Mr Kuhn shows on this page and how to replace them properly, remind the U+201B—U+2019 quotes pair where the deprecated REVERSED SINGLE COMMA QUOTATION MARK was discussed on this List, the conclusion being: On Thu, Jun 15, 2006, Andreas Prilop wrote: http://www.unicode.org/mail-arch/unicode-ml/y2006-m06/0265.html Actually, I have seen such quotation marks in English-language books printed in Britain and the USA. But, as I wrote, they are certainly not preferred. *If* you want such quotation marks, then please use U+201B for them! At that time, the matter was correct rendering. Today, itʼs correct processing. Yes, fortunately U+02BC is *not deprecated* for English apostrophe, and looking closer, IMO there is *no recommendation* for U+2019 neither, just a stated preference. As I wrote sooner in this thread, Unicode logically and seemingly changed the preference against
Re: Another take on the English apostrophe in Unicode
On Mon, Jun 16, 2015, Richard Wordingham wrote:

I don't know if you have the wrong link for MSKLC, but that link claims it is only 'supported' up to Vista. That's not much of an invitation! I do know that MSKLC works on Windows 7, and its output there is appropriate for Windows 7, generate multiple versions of the DLL and its installer.

I'm sorry, I didn't think about the issue. The download link is not wrong; AFAIK it's the only available download page for the (most recent) 1.4 version. And this version works on Windows 8, too [and, I hope, on the coming Windows 10], as this thread on Microsoft Community shows:
http://answers.microsoft.com/en-us/windows/forum/windows_8-winapps/msklc-microsoft-keyboard-layout-creator-for/a54a4db0-94c0-4f08-8909-37a7c5b758bb

Marcel
Re: Another take on the English Apostrophe in Unicode
On Tue, Jun 16, Mark Davis ☕️ wrote:

And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them.

Dear Mark, I understand (a little) that I'm tiresome. Please consider nevertheless that the Unicode Public Mailing List is AFAIK the only spot where people can communicate with Unicode decision makers. No other mailing list nor any forum on the internet can do this. Even Microsoft's Community forum can do nothing at Microsoft, forum volunteers told me. I posted there in French and in English. In French my most useful post seems to be at http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b

Since your message could not reach me yesterday, I prepared two replies I wanted to send today: one to Doug and one to you. If you agree, I'll paste them both hereafter.

On Tue, Jun 16, 2015, Doug Ewell wrote:

You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.

You know I did, and if it were just for my own sake, Iʼd probably never have started mailing in this thread. A big part of the text to be processed for quotes originates from other people. So when I use U+02BC, I have done good work (if I am quoted :)). An essential condition is that all text handling software is updated to handle the letter apostrophe correctly. Without an official recommendation, this is not likely to be done. And this recommendation cannot be usefully issued unless Microsoft agrees. We remember that without Microsoft, the Unicode Consortium probably wouldnʼt have been founded, and character encoding wouldnʼt thrive as it does today.

On Mon, Jun 15, 2015, 20:14, Doug Ewell wrote:

Perhaps a UTC member can confirm whether this is fact or speculation.
Markus Kuhn's comment from 1999 about couldn't Unicode follow Microsoft...? doesn't prove that Unicode was in fact strong-armed by Microsoft. I know that Markus Kuhnʼs concern was very valuable and he did a great job by showing how to eradicate the clumsy quotes simulation that was current by the time, due to the lack of characters. You remember, they used accents as quotes, and at that stage, the mixup was between apostrophe and acute! https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html The curly glyph for 0x27 in old ASCII fonts and its reversed counterpart mapped to 0x60 Mr Kuhn shows on this page and how to replace them properly, remind the U+201B—U+2019 quotes pair where the deprecated REVERSED SINGLE COMMA QUOTATION MARK was discussed on this List, the conclusion being: On Thu, Jun 15, 2006, Andreas Prilop wrote: http://www.unicode.org/mail-arch/unicode-ml/y2006-m06/0265.html Actually, I have seen such quotation marks in English-language books printed in Britain and the USA. But, as I wrote, they are certainly not preferred. *If* you want such quotation marks, then please use U+201B for them! At that time, the matter was correct rendering. Today, itʼs correct processing. Yes, fortunately U+02BC is *not deprecated* for English apostrophe, and looking closer, IMO there is *no recommendation* for U+2019 neither, just a stated preference. As I wrote sooner in this thread, Unicode logically and seemingly changed the preference against its will. Logically, because the first recommendation (like the whole Standard) was consciously designed, Mr Davis recalled us the day before yesterday. Seemingly, because the U+0027 comment line in the Code Chart has been changed from preferred character for apostrophe is 2019 to 2019 is preferred for apostrophe between the 3.0.0 and 4.0.0 versions (while the line “preferred characters in English for paired quotation marks are 2018 2019” remained unchanged; see the complete comparison at http://charupdate.info#ambiguation). 
On Tue, Jun 16, 2015, Doug Ewell wrote: I do wish we could put an end to all the accusations of malfeasance. Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move. The best way of getting nothing to be done is to get everybody convinced itʼs all OK. Thatʼs what I sometimes feel reading this thread, or the one about ISO/IEC JTC1/SC2/WG2 that is on-going in the meantime! And the only way to get something change has always been to show itʼs wrong. From there on, the next step would be to find out who is responsible. About the apostrophe, weʼre all a bit responsible. Why to hide that British English usage does not much to disambiguate things, by preferring single quotes as current quotation marks, leading some authors to end up preferring chevrons even in English—see Chris Harvey (pleading for U+2019 as apostrophe) at http://www.languagegeek.com/typography/apostrophes.html#Anchor-Potentia-61409 But
Re: Another take on the English Apostrophe in Unicode
On Tue, Jun 16, 2015, Philippe Verdy wrote:

When ISO 8859-1 was designed (in fact in an early version by Digital for its own version of Unix), allowing a bijective compatibility with 8-bit EBCDIC and its C1 controls was still a priority. Microsoft abandoned its own development of Unix to develop DOS and extend it with Windows in parallel with its work with IBM that had wanted DOS to be a very lightweight version of CP/M, but without a scheduler in order to run software on personal computers that could be used in small organisations that could not buy its mainframes, but had to prepare documents and data that could be reused on IBM mainframes...

Thank you Philippe for the information. It was a very good idea to build a system without need of C1 and to remap the two ranges to completing characters, which are indispensable, notably in French, and to start with the single quotes.

Marcel

Message of 16/06/15 21:08
From: Philippe Verdy
To: Marcel Schneider
Cc: Doug Ewell , Unicode Mailing List
Subject: Re: Another take on the English Apostrophe in Unicode

2015-06-16 19:02 GMT+02:00 Marcel Schneider :

On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote:

Marcel Schneider wrote: [...] Microsoft’s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong.

Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space.

I replied: The problem is not about code pages [...]

I thank you for your answers and I'll come back upon some of them below. There's some new fact to bring first. I concede that my last reply yesterday in the evening was incorrect.
Additionally to Microsoftʼs action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is in my scope, indeed. Since I learned there are very good and outweighing reasons to use U+02BC in English, and that Unicodeʼs respective recommendation has been withdrawn with respect to a widespread practice founded on CP Windows-1252, I soon suspected there would have been means to get the apostrophe into this code page. Here I need to recall that I always liked Windows-1252 for its completing the ISO 8859-1 charset (which was so useless* it had to be replaced with ISO 8859-15). * Please read this paper (in French): http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf Now that I examined closely CP1252ʼs layout, I found five empty code points, five code points left out, in the C1 ranges that Microsoft allocated to complete ISO 8859−1. Further, in this range, I found two MODIFIER LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 0x98, U+02DC). Obviously these two were added to disambiguate the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on one side, and the diacritics on the other side. There is to say that when Windows was first released, the left and right single quotes were the only printable characters in these two ranges. All other characters plus × and ÷ came later. However, CP1252 remained stable since Windows 98, for which € and the žŽ pair were added. And five places were left empty. From this on I got convinced that it would have been very easy to place the letter apostrophe for example at code point 144 (0x90), near the single turned comma quotation mark 0x91 and the single comma quotation mark (right-single-quote) 0x92 which Microsoft recommended for use as apostrophe. About the “confusion” everybody refers to, there is to say that the only way to get people confused, is to do things and not to explain anything to anybody. 
The core problem would have been that code pages were designed with glyph-based *character* encoding in mind, not semantics-based *text* encoding. I repeat that others had done even worse. Others, that is some of the so-called expert members of the ISO WG designing 8859-1, as two of them not even aimed at encoding all needed characters, by refusing deliberately to encode the lower- and uppercase Œ digraph, and even the uppercase Ÿ. Microsoftʼs big merit has been to produce a ready remedy to this bungling, that as far as belongs to the OE digraph, was meant to match defective peripherics. Unfortunately, Microsoft visibly didnʼt finish this job, by aiming at encoding characters only, and thus not allocating more than one code point to that squiggle, whilst several places were left. Well, all that are errors of the past. If I donʼt see a need, I wonʼt
Re: Another take on the English Apostrophe in Unicode
On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote: Marcel Schneider wrote: [...] Microsoft’s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong. Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space. I replied: The problem is not about code pages [...] I thank you for your answers and I'll come back upon some of them below. There's some new fact to bring first. I concede that my last reply yesterday in the evening was incorrect. Additionally to Microsoftʼs action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is in my scope, indeed. Since I learned there are very good and outweighing reasons to use U+02BC in English, and that Unicodeʼs respective recommendation has been withdrawn with respect to a widespread practice founded on CP Windows-1252, I soon suspected there would have been means to get the apostrophe into this code page. Here I need to recall that I always liked Windows-1252 for its completing the ISO 8859-1 charset (which was so useless* it had to be replaced with ISO 8859-15). * Please read this paper (in French): http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf Now that I examined closely CP1252ʼs layout, I found five empty code points, five code points left out, in the C1 ranges that Microsoft allocated to complete ISO 8859−1. Further, in this range, I found two MODIFIER LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 0x98, U+02DC). 
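The layout described above can be checked mechanically. The following is only a minimal sketch, assuming that Python's built-in `cp1252` codec is an acceptable stand-in for the code page as Microsoft defines it (other implementations, for example the mapping used by web browsers, fill the gaps with C1 controls instead of leaving them undefined). The helper name `cp1252_c1_range` is hypothetical.

```python
import unicodedata


def cp1252_c1_range():
    """Split bytes 0x80..0x9F into assigned and unassigned, per Python's cp1252 codec."""
    assigned, unassigned = {}, []
    for byte in range(0x80, 0xA0):
        try:
            assigned[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's codec treats these positions as undefined.
            unassigned.append(byte)
    return assigned, unassigned


assigned, unassigned = cp1252_c1_range()
print("unassigned:", [f"0x{b:02X}" for b in unassigned])

# The positions mentioned in the message: the two modifier letters
# and the pair of single quotation marks.
for byte in (0x88, 0x91, 0x92, 0x98):
    ch = assigned[byte]
    print(f"0x{byte:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

With this codec, five positions in the range come out unassigned, and 0x90 is one of them.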
Obviously these two were added to disambiguate the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on one side, and the diacritics on the other side. There is to say that when Windows was first released, the left and right single quotes were the only printable characters in these two ranges. All other characters plus × and ÷ came later. However, CP1252 remained stable since Windows 98, for which € and the žŽ pair were added. And five places were left empty. From this on I got convinced that it would have been very easy to place the letter apostrophe for example at code point 144 (0x90), near the single turned comma quotation mark 0x91 and the single comma quotation mark (right-single-quote) 0x92 which Microsoft recommended for use as apostrophe. About the “confusion” everybody refers to, there is to say that the only way to get people confused, is to do things and not to explain anything to anybody. The core problem would have been that code pages were designed with glyph-based *character* encoding in mind, not semantics-based *text* encoding. I repeat that others had done even worse. Others, that is some of the so-called expert members of the ISO WG designing 8859-1, as two of them not even aimed at encoding all needed characters, by refusing deliberately to encode the lower- and uppercase Œ digraph, and even the uppercase Ÿ. Microsoftʼs big merit has been to produce a ready remedy to this bungling, that as far as belongs to the OE digraph, was meant to match defective peripherics. Unfortunately, Microsoft visibly didnʼt finish this job, by aiming at encoding characters only, and thus not allocating more than one code point to that squiggle, whilst several places were left. Well, all that are errors of the past. If I donʼt see a need, I wonʼt meet it. By leaving œ and Œ off the charset, they got × and ÷ in, at least. Where things ran really bad, was when Unicode was on, and code pages Procrustesʼ beds were out. At least, they should have been. 
Whence that survival of CP1252-based confusion? Briefly, todayʼs text processing is suffering from the apostrophe-close-quote confusion. This confusion is firstly out of date, and secondly it was unnecessary from the beginning on. Avoiding this confusion at a trivial level (by not getting users confused to have to use two similar squiggles), is shifting it at process level, where the damage it causes is far bigger. Trust me, users who find themselves unable to set apart the apostrophes when theyʼre going to replace single quotes, wonʼt bless Microsoft for the input simplicity! Ted Clancyʼs blog post is here to prove. https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ It was time to get rid of that confusion when Unicode recommended U+02BC for apostrophe. Microsoftʼs choice not to comply was wrong again. Very wrong. Let's come back to some of your replies. On Mon, Jun 15, 2015, 20:14, Doug Ewell
RE: Another take on the English Apostrophe in Unicode
Marcel Schneider charupdate at orange dot fr wrote: That's to despise people, that's to spit at their face. You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO. I do wish we could put an end to all the accusations of malfeasance. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Another take on the English Apostrophe in Unicode
On Mon, Jun 15, 2015, Doug Ewell wrote:

Marcel Schneider wrote: A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout

I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion: http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways.

Late yesterday evening, I looked up John Cowanʼs keyboard layouts. They are the best MSKLC-based keyboard layouts Iʼve ever seen. They are mnemonic. I note that the layout naturally uses AltGr (right-hand Alt or Alt+Ctrl). In my last reply yesterday I mentioned a multilingual layout from a research institute which really does not use more than two shift states. Itʼs not free.

Mr Cowan writes about some allocations being temporary until a new MSKLC version supporting chained dead keys is released. This MSKLC 2.0 has still not been born, and I fear it never will be. IMO this is the result of the disinterest of many people. You and others probably represent exceptions. This goes so far that MSKLC is declared “appears very rarely” in the Acronym Finder. Normally the release and update of MSKLC should have created a buzz on social media, and today nobody would complain about missing characters.

Well, I too complained for a whole year without knowing about MSKLC. One year ago today, I installed my copy of the MSKLC. Later I tried to define a universal Latin layout too, but when I was at 1,921 Unicode characters, I could never memorise it. I gave up on this approach; itʼs hard to get on one keyboard, among other Unicode characters, all 1,736 characters of 8.0.0 used in Latin script (if my subset is right).

Do you know Ilya Zakharevichʼs approach? http://search.cpan.org/~ilyaz/UI-KeyboardLayout-0.64/lib/UI/KeyboardLayout.pm

Best regards,
Marcel Schneider
Re: Another take on the English Apostrophe in Unicode
And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them. Mark Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Tue, Jun 16, 2015 at 7:33 PM, Doug Ewell d...@ewellic.org wrote: Marcel Schneider charupdate at orange dot fr wrote: That's to despise people, that's to spit at their face. You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO. I do wish we could put an end to all the accusations of malfeasance. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Another take on the English apostrophe in Unicode
On Sat, Jun 13, 2015, Mark Davis wrote:

In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious. (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️ wrote:

On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote: When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. [...]

Quite nice of you to inform me of the core mission of Unicode; I must have somehow missed that.

I was rather astonished and amused when I read that I could have aimed at informing you of Unicodeʼs core mission. The goal was to check that Iʼm at the right level. Well, there would have been another way to say it... which didnʼt come to my mind. However, what surprises me even more as I think about it is that, while knowing everything about Unicode, youʼve got just a weak opinion on which apostrophe recommendation is the right one...

More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits.

Itʼs another proof of Unicodeʼs professionalism to have thought about distinguishing DIAERESIS and UMLAUT. Despite being a French-German bilingual and knowing the diacritics, I first encountered that in Microsoftʼs kbd.h, where the one is called DIARESIS and is mapped to UMLAUT.
Iʼm not a friend of such distinctions (except in vocabulary and grammar), because in writing practice they would be nothing but useless and counterproductive complications. An abbreviation dot would have been much more useful, but to deploy its benefits, it would have needed a supplemental key mapping. Against this background, Unicodeʼs choice of recommending to disambiguate the apostrophe is even more meritorious. I see it as proof that there is really a good reason for people to mind the difference whenever they donʼt use the ASCII apostrophe for all of them. What would have bugged Microsoft then was that it would have had to implement this difference in its word processing and desktop publishing software, and to tell users about it. Nothing easier for Microsoft with all the Help and Info! “The new smart quotes help you to check whether you need an apostrophe or a quote. This makes quotes conversion easy.” Or the like.

In practice, whenever characters are essentially identical—and by that I mean that the overlap between the acceptable glyphs for each character is very high—people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.

Based on the Unicode principle to encode characters, not glyphs, I doubt whether two characters may be called _essentially_ identical when they look the same. A huge subset of the Code Chartsʼ xrefs is there to help font designers on this point. As for people mixing up, they are most likely to do so when the keyboard allows only one of two. This is not the case for U+02BC and U+2019, neither of which is on standard keyboards. Here itʼs the smart quotes algorithm which will mix up! And this one is easily helped not to do so, since itʼs embedded in high-end software with all its display and shortcut capabilities.
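To make the point about characters versus glyphs concrete, here is a minimal sketch in Python. It only reads the properties that the standard `unicodedata` module reports and uses the `re` module's `\w` class as a stand-in for "text processing"; real word processors use their own segmentation rules, so this is an illustration under that assumption, not a description of any particular product. The helper names `describe` and `words` are hypothetical.

```python
import re
import unicodedata

# The three candidates discussed in this thread.
CANDIDATES = {
    "U+0027": "\u0027",
    "U+02BC": "\u02BC",
    "U+2019": "\u2019",
}


def describe(ch):
    """Return the Unicode name and General_Category of a character."""
    return unicodedata.name(ch), unicodedata.category(ch)


def words(text):
    """Split text into runs of word characters, as a naive tokenizer would."""
    return re.findall(r"\w+", text)


for label, ch in CANDIDATES.items():
    name, category = describe(ch)
    print(label, category, name, words(f"don{ch}t"))
```

Only the modifier letter has a letter category, so only with U+02BC does the naive tokenizer keep the contraction in one piece; with the other two it is split at the mark.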
In the end, the only one who wanted to keep mixing up was (guess who?) Microsoft. The reason? Word processing that depends on the distinction between opening and closing quotation marks, which needs a very tiny algorithm, is much easier to implement than processing that depends on the distinction between apostrophe and single closing quotation mark, and between apostrophe and single quotation marks on the whole. Informal English word forms are so rich and varying that some are ambiguous and scarcely any software dictionary can contain them all. But even formal English is not wholly supported since nested quotes often are not. Why would users not be interested in improved software, even if it would cost a little more?

About searching and equivalence classes: There is already plenty of equivalence implemented in the simplest search algorithm: casing! One more class with (U+0027, U+02BC, U+2019) wouldnʼt change that a lot.

So we only separated essentially identical
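The equivalence class mentioned above, (U+0027, U+02BC, U+2019), can be sketched as follows. This is a minimal illustration in Python, assuming that case-insensitive regular-expression matching is an acceptable model of "the simplest search algorithm"; the names `APOSTROPHES`, `build_pattern` and `find_all` are hypothetical and not taken from any existing software.

```python
import re

# One equivalence class: typewriter apostrophe, modifier letter apostrophe,
# right single quotation mark.
APOSTROPHES = "\u0027\u02BC\u2019"


def build_pattern(needle):
    """Compile a pattern matching needle, ignoring case and apostrophe variant."""
    parts = []
    for ch in needle:
        if ch in APOSTROPHES:
            # Any member of the class matches any other member.
            parts.append("[" + APOSTROPHES + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def find_all(needle, text):
    """Return the start offsets of every match of needle in text."""
    return [m.start() for m in build_pattern(needle).finditer(text)]


sample = "Don't, don\u02BCt, DON\u2019T"
print(find_all("don't", sample))
```

Because the class is applied on the search side only, the stored text keeps whichever character its author entered.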
Re: Another take on the English Apostrophe in Unicode
On Mon, Jun 15, Philippe Verdy wrote:

But I think that keyboard should all have a dedicated Kana key to easily map additional groups without sacrificing other shift keys on the last row: keyboards really don't need two windows keys and so the space bar can remain with a comfortable width [...].

IMHO the space bar should not exceed five keys in width.

If a Kana key is present, in fact it should be to the right of the right control, or to the right of the right Shift

The best is always that the asymmetric modifiers be operated with the thumbs. If I had to choose between AltGr and Kana, I would prefer the latter because it does not interfere with Ctrl+Alt and does not disable dead keys in Word. But alternatively we could map the MODIFIER LETTER APOSTROPHE on the right-hand Alt key for a fluid input of high-quality text files.

[...] Keyboards on notebooks are extremely poorly designed, a complete nonsense.

Yes, there are many models from big manufacturers whose key arrangement I donʼt like. By contrast, my computer is a netbook, where I nevertheless find all the keys I need, in an ergonomic array. Iʼm not affiliated, and Iʼm not paid to advertise. Itʼs just advice. The manufacturer of my netbook shipped the same model for the United States *with* an Applications key, *with* a Pause key, *with* a second Function modifier key to the right, with up and down keys of the *same size* as left and right, and *with* an overlaid numpad: when you disable the numpad specials on a customised layout, you just press Fn while entering digits (or press the toggle before and after), the same as on MacBooks, as I have read and heard. Itʼs Asus.

Best regards,
Marcel Schneider
Re: Another take on the English Apostrophe in Unicode
When ISO 8859-1 was designed (in fact in an early version by Digital for its own version of Unix), allowing a bijective compatibility with 8-bit EBCDIC and its C1 controls was still a priority. Microsoft abandoned its own development of Unix to develop DOS and extend it with Windows in parallel with its work with IBM that had wanted DOS to be a very lightweight version of CP/M, but without a scheduler in order to run software on personal computers that could be used in small organisations that could not buy its mainframes, but had to prepare documents and data that could be reused on IBM mainframes...

2015-06-16 19:02 GMT+02:00 Marcel Schneider charupd...@orange.fr:

On Mon, Jun 15, 2015, 17:12, Doug Ewell d...@ewellic.org wrote:

Marcel Schneider wrote: [...] Microsoft’s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong.

Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space.

I replied: The problem is not about code pages [...]

I thank you for your answers and I'll come back upon some of them below. There's some new fact to bring first. I concede that my last reply yesterday in the evening was incorrect. Additionally to Microsoftʼs action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is in my scope, indeed.
Since I learned there are very good and outweighing reasons to use U+02BC in English, and that Unicodeʼs respective recommendation has been withdrawn with respect to a widespread practice founded on CP Windows-1252, I soon suspected there would have been means to get the apostrophe into this code page. Here I need to recall that I always liked Windows-1252 for its completing the ISO 8859-1 charset (which was so useless* it had to be replaced with ISO 8859-15). * Please read this paper (in French): http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf Now that I examined closely CP1252ʼs layout, I found five empty code points, five code points left out, in the C1 ranges that Microsoft allocated to complete ISO 8859−1. Further, in this range, I found two MODIFIER LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 0x98, U+02DC). Obviously these two were added to disambiguate the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on one side, and the diacritics on the other side. There is to say that when Windows was first released, the left and right single quotes were the only printable characters in these two ranges. All other characters plus × and ÷ came later. However, CP1252 remained stable since Windows 98, for which € and the žŽ pair were added. And five places were left empty. From this on I got convinced that it would have been very easy to place the letter apostrophe for example at code point 144 (0x90), near the single turned comma quotation mark 0x91 and the single comma quotation mark (right-single-quote) 0x92 which Microsoft recommended for use as apostrophe. About the “confusion” everybody refers to, there is to say that the only way to get people confused, is to do things and not to explain anything to anybody. The core problem would have been that code pages were designed with glyph-based *character* encoding in mind, not semantics-based *text* encoding. I repeat that others had done even worse. 
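A minimal sketch of the CP1252 inspection described above, assuming Python's built-in `cp1252` codec reflects the code page as discussed (it treats the unassigned positions as decoding errors, which is a property of that codec and not necessarily of every implementation). The helper name `cp1252_c1_map` is hypothetical:

```python
import unicodedata

def cp1252_c1_map():
    """Map each byte 0x80..0x9F to its character, or None if the codec leaves it unassigned."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            table[byte] = None  # position left empty in the code page
    return table

table = cp1252_c1_map()

# The positions that were left empty in the 0x80..0x9F range.
unassigned = [byte for byte, ch in table.items() if ch is None]
print("unassigned:", [hex(b) for b in unassigned])

# The two spacing modifier characters and the two single quotes mentioned above.
for byte in (0x88, 0x98, 0x91, 0x92):
    ch = table[byte]
    print(hex(byte), f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Listing the unassigned bytes this way makes it easy to check the claim that 0x90, next to the two single quotes at 0x91 and 0x92, was among the free positions.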
Others, that is some of the so-called expert members of the ISO WG designing 8859-1, as two of them not even aimed at encoding all needed characters, by refusing deliberately to encode the lower- and uppercase Œ digraph, and even the uppercase Ÿ. Microsoftʼs big merit has been to produce a ready remedy to this bungling, that as far as belongs to the OE digraph, was meant to match defective peripherics. Unfortunately, Microsoft visibly didnʼt finish this job, by aiming at encoding characters only, and thus not allocating more than one code point to that squiggle, whilst several places were left. Well, all that are errors of the past. If I donʼt see a need, I wonʼt meet it. By leaving œ and Œ off the charset, they got × and ÷ in, at least. Where things ran really bad, was when Unicode was on, and code pages Procrustesʼ beds were out. At least, they should have been. Whence that survival of CP1252-based confusion? Briefly, todayʼs text processing is suffering from the apostrophe-close-quote confusion. This confusion is firstly out of date, and secondly it was unnecessary from the beginning on. Avoiding this confusion at a trivial level (by not getting users confused
Re: Another take on the English apostrophe in Unicode
On Mon, 15 Jun 2015 08:40:57 +0200 (CEST) Marcel Schneider charupd...@orange.fr wrote: ...while in the meantime, in obliging anticipation, the worldʼs biggest software company stays inviting us to feel free to customise our keyboard with a free tool for free download at http://www.microsoft.com/en-us/download/details.aspx?id=22339 I don't know if you have the wrong link for MSKLC, but that link claims it is only 'supported' up to Vista. That's not much of an invitation! I do know that MSKLC works on Windows 7, and that its output there is appropriate for Windows 7: it generates multiple versions of the DLL and its installer. Richard.
Re: Another take on the English apostrophe in Unicode
On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️ wrote: On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote: When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. Actually, the choice is between perpetuating confusion in word processing, and get people confused for a little time when announcing that U+2019 for apostrophe was a mistake. Quite nice of you to inform me of the core mission of Unicode—I must have somehow missed that. More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. In practice, whenever characters are essentially identical—and by that I mean that the overlap between the acceptable glyphs for each character is very high—people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. So we only separated essentially identical characters in limited cases: such as letters from different scripts. It was a very good idea to disambiguate also apostrophe and single quote, and I feel it's not paid too much because it simplified greatly the processing of quotation marks in English. I mean, the replacement of each pair of one kind by a pair of another kind. 
When I search for quotes in a text, I don't want to be distracted by apostrophes. Don't worry about equivalence classes, they already present to us a word without apostrophe as equivalent to the same letters with an apostrophe/quote between. It's every time better the computer knows what a character is exactly, even when at output it doesn't need to let us know, than that it comes up with a useless mixup. You just brought up another good idea too: Period-terminated abbreviations are listed as exceptions in word processors. Another list could contain all words with leading apostrophe and all words with trailing apostrophe. This might allow to filter search results and to separate definitely apostrophes and single comma quotation marks. And at input, the smart quotes algorithms will become even smarter. Say, really smart. I don't believe working people would mix up letter apostrophe and close-quote if they were on keyboard. And even now that they aren't, people don't, because people just hit the apostrophe key, which without any dumb smart quotes algorithm leads always to visually satisfying results, as shown in the Unicode documentation. For good desktop publishing, people must work hard anyway, so it would be nice to give them the means, and not to overburden them with routine tasks due to deficient text encoding. The way things are working today is not satisfying concerning the English apostrophe. I still can't believe that the Unicode Committees were wrong when recommending U+02BC. Restoring this advantage today, will be at the honor of all involved parties, and we and future generations will thank you very much. If they'll exist. Best regards, Marcel Schneider
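The word-list idea above can be sketched as follows. This is only an illustration under stated assumptions: the exception lists `LEADING` and `TRAILING` are tiny hypothetical samples, the function name `retag_apostrophes` is invented here, and a trailing-apostrophe word that happens to end a quotation would still be retagged wrongly, which is exactly the ambiguity under discussion.

```python
import re

APOSTROPHE = "\u02BC"  # MODIFIER LETTER APOSTROPHE
RSQUO = "\u2019"       # RIGHT SINGLE QUOTATION MARK
LETTER = r"[^\W\d_]"   # any Unicode letter (U+02BC itself counts as one)

# Hypothetical sample lists; a real word processor would ship much longer ones.
LEADING = {"em", "tis", "twas"}   # words written with a leading apostrophe
TRAILING = {"peoples", "goin"}    # words written with a trailing apostrophe

def retag_apostrophes(text):
    """Replace U+2019 by U+02BC wherever it can be identified as an apostrophe."""
    # 1. Between two letters it is always an apostrophe (don’t, it’s).
    text = re.sub(rf"(?<={LETTER}){RSQUO}(?={LETTER})", APOSTROPHE, text)

    # 2. Before a word: apostrophe only if the word is in the leading list.
    def leading(match):
        word = match.group(1)
        return APOSTROPHE + word if word.lower() in LEADING else match.group(0)
    text = re.sub(rf"(?<!{LETTER}){RSQUO}({LETTER}+)", leading, text)

    # 3. After a word: apostrophe only if the word is in the trailing list.
    def trailing(match):
        word = match.group(1)
        return word + APOSTROPHE if word.lower() in TRAILING else match.group(0)
    text = re.sub(rf"({LETTER}+){RSQUO}(?!{LETTER})", trailing, text)
    return text

sample = "\u2018Don\u2019t tell \u2019em,\u2019 she said."
print(retag_apostrophes(sample))
```

After such a pass, a search for U+2019 returns only closing quotes, and converting one pair of quotation marks into another no longer touches apostrophes.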
Re: Another take on the English apostrophe in Unicode
On Tue Mar 26 2002 - 10:01:43 EST, Mark Davis ☕️ wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0598.html Apostrophe, hyphen, and various other punctuation by default continue a word, but this behavior may be overridden on a per-language basis. Heuristics or more sophisticated engines may be needed when the apostrophe is at the end of a word, as in “the peoples' choice”, since it is ambiguous. The modifier letter apostrophe, on the other hand, is always treated as a letter. [I replaced '' '' with '“' '”' to prevent confusion with a tag by the user agent.] On Tue Mar 26 2002 - 11:44:28 EST, Marco Cimarosti wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0604.html Mark Davis wrote: Apostrophe, hyphen, and various other punctuation by default continue a word, but this behavior may be overridden on a per-language basis. This may work for things such as finding word boundaries, but not for identifiers. According to the ID_Start and ID_Continue properties in , neither U+0027 (APOSTROPHE) nor U+2019 (RIGHT SINGLE QUOTATION MARK) are allowed in an identifier. And this is not surprising, since they are primarily quotation marks. On the other hand, U+02BC (MODIFIER LETTER APOSTROPHE) is allowed in any position within an identifier. Using U+02BC as the apostrophe would allow using words such as: , or 'em in identifiers. But this runs up against the fact that Unicode's own suggestion is to use U+2019 for the apostrophe. On Tue Mar 26 2002 - 12:08:41 EST, Marco Cimarosti wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0608.html But, as you say, the apostrophe is legitimate and sometimes mandatory in the orthography of English and many other languages. So, it seems to me that its preferred encoding should make it possible to use it in identifiers, filenames, URI(')s, and so on. Don't we fall back into the times of all-0x27, and keep facing ongoing confusion, when the English apostrophe is conflated with the closing quote?
As you told us, having both U+02BC and U+2019 in use will need some supplemental algorithms. But as you told in 2002, this is true when both are confused in only one character, too. I suspect that the cost of using MODIFIER LETTER APOSTROPHE for English apostrophe (and as apostrophe on the whole) today would mainly be the cost of updating implementations and text files. If this cost is too high, we would have to consider that text has not to be quoted nor to be converted between British and US English. I hope people will stay communicating and exchanging. Marcel Schneider
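The identifier and word-boundary behaviour quoted from 2002 can be checked with a short sketch. It assumes that Python's `str.isidentifier()` and the `\w` class of the `re` module are acceptable stand-ins for the ID_Start/ID_Continue properties and for default word segmentation; both are approximations, and the helper name `report` is hypothetical.

```python
import re
import unicodedata

CANDIDATES = {
    "U+0027 APOSTROPHE": "\u0027",
    "U+2019 RIGHT SINGLE QUOTATION MARK": "\u2019",
    "U+02BC MODIFIER LETTER APOSTROPHE": "\u02BC",
}

def report(mark):
    """Describe how one apostrophe candidate behaves inside the word 'don?t'."""
    word = "don" + mark + "t"
    return {
        "category": unicodedata.category(mark),     # punctuation or letter
        "is_identifier": word.isidentifier(),        # allowed in an identifier?
        "word_tokens": re.findall(r"\w+", word),     # one word or two pieces?
    }

for label, mark in CANDIDATES.items():
    print(label, report(mark))
```

Only the modifier letter keeps the word in one piece and valid as an identifier; the other two candidates are punctuation and split it.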
Re: Another take on the English apostrophe in Unicode
By the way, about smart quotes. I have been using them for a long time. My keyboard layout generates two characters on one key-press (so I have to enter [«»][←]{sth}[→] instead of [«]{sth}[»]). It's not that good, but I'm afraid neither of losing quotation marks or parentheses nor of becoming a victim of artificial intelligence :) About what one word is. Do you know the German prefixes? ... ... macht ... ... ... ... ... ... auf. Let me ask whether double quotes are parts of a word or not. For example, in this sentence, is "not" a noun, not a particle? Was "Titanic" titanic?
Re: Another take on the English apostrophe in Unicode
2015-06-15 15:20 GMT+02:00 QSJN 4 UKR qsjn4...@gmail.com: By the way, about smart quotes. I am using that for long time. My keyboard layout generates two characters on one key-press (so I have to enter [«»][←]{sth}[→] instead of [«]{sth}[»]). It's not that good, You could generate three keystrokes [«][»][←] from a single keypress to get the same effect. Various editors already do that when you press the first key for the opening quote, and all you have to type then is the [→] key (instead of the key for a closing quote) after typing the word. Such a system is used in many IDEs and text editors for programmers when they enter an opening parenthesis, square bracket, single or double quote, brace, block comment prefix, or any paired symbol or keyword used in the programming language (e.g. begin | end in Pascal, #if |\n#endif in C/C++ preprocessor directives: the pipe here marks the position of the cursor after typing what is just before it; what is after the pipe is inserted after the cursor position). If you disagree with those automatic insertions after the cursor, you can immediately press CTRL+Z to cancel the added suffix but keep what you just entered. Another CTRL+Z will undo your previous keypress(es) for the character(s) just before the cursor position. Some editors are even smarter: the cursor position is not just a single position but a selected range, and as long as you continue typing just before this range, the selection is preserved; when you press [→] the cursor skips over this whole selection, and you can also press the backspace key to delete the auto-inserted selected range. If you move your cursor elsewhere, the selection is dropped and you get back to the normal insertion cursor with an empty selection. Such a system is used for example in Notepad++ (for Windows), or Eclipse (you can disable this automatic insertion in your preferences).
This editor feature does not depend on the keyboard layout but on the language selected for matching pairs: it does not have to be limited to programming languages and can be used as well for natural human languages, including in advanced word processors. It can also be used to automatically insert some additional space when you just press an initial quote: entering only [«] when editing French text, what you would get is [«][NNBSP]|[NNBSP][»] (with the cursor selection over the last two characters). These editors normally have a way to edit their automatic insertion rules: the text to match before the cursor, the text to add just after it, the new cursor position, and the text to insert just after that position (hopefully preselected, so that when you continue entering text without moving the insertion position, it is not overwritten but simply preserved as selected text). Such rules can be part of the parameters for the spell checker.
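The auto-pairing behaviour described above might be modelled roughly like this. Everything here is a hypothetical sketch: the `Buffer` class, the `PAIR_RULES` table and its two sample languages are assumptions made for illustration, not the rule format of Notepad++, Eclipse or any other editor.

```python
from dataclasses import dataclass, field

NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

# Hypothetical per-language rules: typed key -> (text before cursor, text after cursor).
PAIR_RULES = {
    "fr": {"\u00AB": ("\u00AB" + NNBSP, NNBSP + "\u00BB"), "(": ("(", ")")},
    "en": {"\u201C": ("\u201C", "\u201D"), "(": ("(", ")")},
}

@dataclass
class Buffer:
    text: str = ""
    cursor: int = 0
    # Lengths of auto-inserted suffixes waiting just after the cursor (innermost last).
    pending: list = field(default_factory=list)

    def type_char(self, ch, lang):
        """Insert ch; if a rule matches, also insert its closing part after the cursor."""
        before, after = PAIR_RULES.get(lang, {}).get(ch, (ch, ""))
        self.text = self.text[:self.cursor] + before + after + self.text[self.cursor:]
        self.cursor += len(before)
        if after:
            self.pending.append(len(after))

    def skip_right(self):
        """Right-arrow: jump over the innermost auto-inserted suffix, else move one step."""
        step = self.pending.pop() if self.pending else 1
        self.cursor = min(len(self.text), self.cursor + step)

buf = Buffer()
buf.type_char("\u00AB", "fr")   # one key press inserts the quote, NNBSP, NNBSP, closing quote
for ch in "mot":
    buf.type_char(ch, "fr")
buf.skip_right()                # leaves the quotation without typing the closing quote
print(repr(buf.text), buf.cursor)
```

Keeping the rules in a table keyed by language matches the point that the feature belongs to the editor's language settings and not to the keyboard driver.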
Re: Another take on the English Apostrophe in Unicode
2015-06-15 16:49 GMT+02:00 Marcel Schneider charupd...@orange.fr: It's indeed very useful to keep two Control modifiers. Because the modifiers at the left and right border of the block are operated with the little finger and should thus be symmetrical. This does not apply to the Alt keys and other keys more or less centered around the space bar, which are operated with the thumbs. As Alt is less used than Kana (when there is a Kana key), Kana should be on left Alt, symmetrical to the (on many keyboards already implemented) AltGr key. The Alt key then moves to the Applications key, which is mnemonic because of the contextual menu icon. Internally, indeed, the Alt keys (left and right) are called Menu keys (Virtual key Left Menu or VK_LMENU, and VK_RMENU). This contextual menu is then invoked by pressing the right Windows key, which is consistently missing on laptops. Not just laptops. My desktop PC only has a single Windows key, on the left. Anyway there's little use for the Windows key, which was introduced late (and there are still a lot of keyboards that don't have this key). The same remark applies to the ScrollLock key (which is now frequently remapped to Fn+Pause/SysAttn or another similar combination using the single Windows key when there's no Fn key, which is typical of notebooks). However I disagree with your opinion about AltGr+Shift combinations: they work perfectly, including with the ISO 9995 definitions: the unshifted and shifted positions are in the same group. However ISO 9995 allows CapsLock to be used to create other groups instead of just reproducing the shifted/unshifted layout. It can be very useful for users in India to switch between Latin and local abugidas. It could be used as well by users writing in Arabic and Hebrew abjads, or with African (Ethiopic) or North American syllabary scripts that are complex to map on a usable keyboard.
But I think that keyboards should all have a dedicated Kana key to easily map additional groups without sacrificing other shift keys on the last row: keyboards really don't need two Windows keys, and so the space bar can keep a comfortable width (as can the Shift key or Backspace, which is too narrow on many keyboards). On the last row there should never be more than 7 keys on both sides of the space bar, and the outermost keys (Ctrl) have to remain wide. If a Kana key is present, in fact it should be to the right of the right Control, or to the right of the right Shift. AltGr needs to keep some extra width compared to letter keys, and in fact could be larger than the left Alt, because it is used for entering text. The Application key is too large for me, just like the left Windows key (its extra width would be better given to the left Control key to make it a bit more central). Those who design keyboards almost never test them for real usability: they prefer selling them with many packed multimedia functions (or buttons for Calc, Mail, Web or switching windows, which are rarely used). Only keyboards for gamers get some attention, but only to give them additional programmable function keys for specific games... Keyboards on notebooks are extremely poorly designed, a complete nonsense.
Re: Another take on the English Apostrophe in Unicode
On Fri, Jun 12, 2015, Philippe Verdy wrote: These are application shortcuts, but these modifier key combinations are used with base function keys (F1...F12), not with keys on the alphanumeric parts of the keyboard. So there's no conflict. Thank you for your advice. It'll be very useful. I was not precise enough: the upper row of the alphanumerical block is used with Ctrl, Shift+Ctrl, Shift+Alt by the language bar, but only optionally. It is normal then to not assign CTRL+keys or CONTROL+shift+keys (independently of the CapsLock state) to non-control characters if the same keys are used to type non-control ASCII characters in the range U+0040..U+005F. This means that 32 positions on the keyboard must not be used for any assignment. The same remark applies to ALT+digit and ALT+letter (otherwise keyboard shortcuts for application menus or navigation in web forms won't work correctly, or will take priority when you intended to type a valid character, forcing these application functions instead of accepting your character input). MSKLC performs these safety checks and will issue warnings if you do so. The Alt shift state is unassignable in the MSKLC. When used for shortcuts with Clavier+, these are prioritized and work fine. This is not just my advice but documented in the ISO standard. That depends on which ISO standard you refer to. If it's ISO/IEC 9995, then beware! IMHO this standard isn't to be taken seriously, otherwise you'll have to stay away from using the Shift + AltGr shift state, to take just one outstanding example. Assigning characters to positions defined for application shortcuts is a bad idea.
Keyboard layouts should map characters in positions that are independent of applications (but layouts may be specific to an OS if the OS interface defines some standard shortcuts: this is a problem when using virtualized OSes, as there's a conflict with shortcuts used to switch from the guest to the host: personally I have chosen the Application key for this instead of the right Control, because the Application key is rarely needed, but I frequently type Control with the right hand or two hands, notably CTRL+A, CTRL+C, CTRL+X, CTRL+V). It's indeed very useful to keep two Control modifiers. Because the modifiers at the left and right border of the block are operated with the little finger and should thus be symmetrical. This does not apply to the Alt keys and other keys more or less centered around the space bar, which are operated with the thumbs. As Alt is less used than Kana (when there is a Kana key), Kana should be on left Alt, symmetrical to the (on many keyboards already implemented) AltGr key. The Alt key then moves to the Applications key, which is mnemonic because of the contextual menu icon. Internally, indeed, the Alt keys (left and right) are called Menu keys (Virtual key Left Menu or VK_LMENU, and VK_RMENU). This contextual menu is then invoked by pressing the right Windows key, which is consistently missing on laptops. Laptops must however have an Applications key to prevent the AltGr key from being positioned too far to the right, beside an overlong space bar, because this hardware layout has a negative impact on ergonomics, specialists say. On the US keyboard layout at http://charupdate.info however, Applications is a Kana toggle, while Right Windows is a Compose key. For laptops this shifts rightwards to get Compose on Applications, and Kana toggle on, well, Right Control.
Because there are laptops with nothing between Right Alt and Right Control, I even thought of mapping the Kana toggle on Pause, but this turned out to be buggy, besides the fact that keyboards without Applications (Menu) often lack the Pause key too. On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 successive keys of the first row (5([, 6-|, 7è`, 8_\, 9ç^, 0à@, °)]), they are needed to get ASCII controls. However CONTROL+@ is extremely rarely needed in applications to enter a NULL control that will almost always be filtered out silently; only some editors that allow loading and editing binary files will use it, e.g. Emacs or Vim, which have a binary editing mode that avoids altering the encoding of newlines but displays all controls explicitly and does not limit the line length. (Personally I prefer not to use text editors to edit binary files: this is far too unsafe with their insertion working mode; it is highly preferable and much simpler to use a hexadecimal editor.) This means that CONTROL+0à@ may be assigned something else more useful (even if the MSKLC compiler warns about it). But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other keys of the first row (², 1, 2é~, 3#, 4'{ on the left side, and +=} on the last position to the right). I ended up assigning no characters on Control shift states at all any more. To get the most of a keyboard, the best is to use the
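The "32 positions" mentioned above follow from the conventional ASCII rule that Control plus a character in the range U+0040..U+005F yields the code with the upper bits cleared. A minimal sketch, assuming a layout follows that convention (actual keyboard drivers list their Control assignments explicitly and may differ); the function name `control_code` is hypothetical:

```python
def control_code(ch):
    """Return the C0 control code conventionally produced by Ctrl + ch."""
    if len(ch) != 1 or not ch.isascii():
        raise ValueError("expected a single ASCII character")
    code = ord(ch.upper())          # Ctrl+a and Ctrl+A give the same control
    if not 0x40 <= code <= 0x5F:
        raise ValueError(f"{ch!r} is outside U+0040..U+005F")
    return code & 0x1F              # keep the low five bits

# The 32 characters whose Control combinations are reserved under this convention.
reserved = [chr(c) for c in range(0x40, 0x60)]
print(len(reserved), "".join(reserved))

for key in "@[\\]":
    print(f"Ctrl+{key} -> 0x{control_code(key):02X}")
```

This is why keys carrying `@`, `[`, `\`, `]`, `^` or `_` on a national layout are the ones where a Control assignment collides with an ASCII control.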
Re: Another take on the English Apostrophe in Unicode
Marcel Schneider charupdate at orange dot fr wrote: A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion: http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways. Elsewhere: ISO stands for stability We wish. Several of us on this list have worked on standards and standard-like activities that correct for, and defend against, instability in ISO standards. Microsoft’s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong. Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Another take on the English Apostrophe in Unicode
On Fri, Jun 12, 2015, Philippe Verdy wrote: 2015-06-12 17:02 GMT+02:00 Marcel Schneider : Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC CONTROL and CONTROL+SHIFT cannot work on French keyboards, where the existing ASCII apostrophe is on the numeric row where there are also ASCII controls mapped, matching the ASCII open brace that is itself mapped on ALTGR (or CTRL+ALT), in order to generate instead the C0 control. In general it is a bad idea to map any printable character or combining character or dead key with the CTRL or CTRL+SHIFT modifiers associated to any position in the alphanumeric part of the keyboard: this should remain reserved to map function keys or C0/C1 controls only, which local applications will use to assign them application-specific functions. Even the Language bar uses the upper row to define shortcuts with Control, Shift+Control, Shift+Alt to switch between keyboard layouts, which are prioritized. So to test the shortcuts with Clavier+, I first had to remove the shortcuts in the Language bar. Then the way was clear to test Mr Overingtonʼs shortcuts for curly apostrophes (I will send the result just after). When I deleted the shortcuts in Clavier+ to test your advice, I found no application shortcuts for Ctrl+4, while the keys 1, 2, 5 and 0 are usually mapped as Word shortcuts with CONTROL, and the heading formatting is with ALT. But indeed among ASCII controls I found eight on the French keyboard:
    //VirtualKey |ScanCd |ISO_# |Ctrl
    {VK_ESCAPE   /*T01    */ ,0x001b
    {VK_CANCEL   /*X46    */ ,0x0003
    {VK_BACK     /*T0E E13*/ ,0x007f
    {VK_OEM_6    /*T1A D11*/ ,0x001b
    {VK_OEM_1    /*T1B D12*/ ,0x001d
    {VK_OEM_5    /*T2B C12*/ ,0x001c
    {VK_RETURN   /*T1C C13*/ ,'\n'
    {VK_OEM_102  /*T56 B00*/ ,0x001c
On the alphanumerical block, there are always the same five, three among them near the Enter key. The British-American Apostrophe key is exempt from Controls too.
This is probably why Mr Overington wants to use CONTROL and SHIFT+CONTROL for U+2019 and U+02BC, as custom application shortcuts. I once defined a universal Latin layout in the MSKLC, but as there are neither Kana nor chained dead keys, I allocated some dead keys (among a total of about 25) on CONTROL positions where I supposed there wouldnʼt be any shortcuts in any application, such as on ù, ^, and even high digits on the upper row. It must be at http://dispoclavier.monsite-orange.fr, and somebody has been very astonished, because precisely this may become buggy. Even more, this is disabled! Winwordc.exe did not process these dead keys. Other applications did, as I remember. But the layout was far too hard to memorise, as I filled up the double-diacritic letters at the next free positions in the alphabet. This way I could allocate 1,921 Unicode characters (by editing the KLC source in spreadsheets), but since I know and use the WDK, I wonʼt make such a layout again. Now Iʼm trying to fit in even more characters, but with chained dead keys, for double-diacritic letters and for easy-to-remember compose sequences. For example, you will enter U+01BF LATIN LETTER WYNN by typing simply COMPOSE, w, y, n, n, or fewer keys if they are not needed to disambiguate. Same for digraphs and ligatures. The test version I use is now adapted to type the letter apostrophe U+02BC (Iʼll send some news about it to the List afterwards). Best regards, Marcel Schneider
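The compose sequences with early disambiguation mentioned above could work along these lines. This is a sketch under stated assumptions: only the `wynn` entry comes from the message, the other table entries and the function name `resolve` are hypothetical, and a real driver built with chained dead keys would encode the same logic as state tables and not as a dictionary.

```python
# Hypothetical compose table: mnemonic key sequence -> character.
COMPOSE_TABLE = {
    "wynn": "\u01BF",  # LATIN LETTER WYNN (the example from the message)
    "oe": "\u0153",    # LATIN SMALL LIGATURE OE (illustrative entry)
    "ou": "\u0223",    # LATIN SMALL LETTER OU (illustrative entry)
}

def resolve(typed):
    """Return the character once the keys typed after COMPOSE identify one sequence.

    Returns None while the input is still ambiguous or matches nothing.
    """
    if not typed:
        return None
    if typed in COMPOSE_TABLE:            # a complete sequence always wins
        return COMPOSE_TABLE[typed]
    candidates = [seq for seq in COMPOSE_TABLE if seq.startswith(typed)]
    if len(candidates) == 1:              # unique prefix: no need to type the rest
        return COMPOSE_TABLE[candidates[0]]
    return None

for keys in ("w", "wynn", "o", "oe"):
    print(keys, "->", resolve(keys))
```

With this table a single `w` already selects the wynn, while `o` stays pending until a second key separates `oe` from `ou`.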
Re: Another take on the English apostrophe in Unicode
On Fri, June 5, William_J_G Overington wrote: I replied: Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input [...] I am wondering whether some existing software packages might be able to be used for the character inputting part using customized keyboard short cuts. There is a very good shortcut utility for Windows which doesnʼt modify the registry except to launch the app automatically: http://utilfr42.free.fr/util/Clavier.php Using this software (I tried it), you can define CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input. After defining the shortcut by typing it, you will have to paste the character into the text editing field. You can specify that these shortcuts work only in the word processing software you use, if you wish. To achieve this, pick the “target” icon, drag and drop it into an open window of the target application; its name will be added in the bar, and youʼll have to choose that the shortcut be enabled in this software. You may even define that the shortcuts work with LEFT CONTROL only, in order to keep RIGHT CONTROL for other shortcuts with APOSTROPHE. As CONTROL SHIFT is not easy enough to type for character input, Iʼd suggest defining CONTROL L for U+2019, and adding CONTROL SEMICOLON for U+2018. This is because on the square bracket keys, there are already control characters allocated on the CONTROL shift state. On these keys you may however choose LEFT ALT or RIGHT ALT for a shortcut. BTW: Clavier+ even allows you to command the pointer and to enter mouse clicks, so that a shortcut can execute an action on the graphic interface of the app. This is very useful to add app shortcuts in apps that donʼt allow customising. Itʼs free, and the interface can be switched to English.
To download your copy: http://utilfr42.free.fr/util/Clavier.php I have now thought of the alternative for now of being able to test what is in the text by using a special version of an open source font where there are distinctive glyphs one from the other for the two characters. I discovered that when U+02BC is input by autocorrect in replacement of U+0027, and the current font does not contain U+02BC (for example Lucida Console), then U+02BC is displayed in the fall-back font (Courier New) and the font-setting is *not* altered. This way, you have the MODIFIER LETTER APOSTROPHE displayed in a distinctive font at input. This is observed in Microsoft Word Starter, where every out-of-font character typed as such triggers the font-setting to fall-back, which is very annoying. Best regards, Marcel Schneider
Re: Another take on the English apostrophe in Unicode
On Wed, Jun 10, 2015, Ted Clancy wrote: The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. [...] I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. Thereʼs however one UI whose design is a matter of everybody, and every typist should be interested in, that is, we all, since everybody does at least partly a typistʼs work. Weʼre all typists, and weʼre all invited to help design that UI for ourselves and for our relations, friends, colleagues. This week-end I switched my current apostrophe from U+2019 to U+02BC by updating my (already customised, but still unfinished) French keyboard layout. As weʼve already one prominent dead key, Iʼd added two others on Base shift state. From now on, I type GRAVE – APOSTROPHE / QUOTATION MARK for a single or double opening quote, and get the closing one by using the ACUTE dead key. This recalls some legacy practice where spacing accents were used. The typographic apostrophe U+02BC is CIRCUMFLEX – APOSTROPHE. (Iʼd U+2019 on the apostrophe key when Kana was toggled off!) In addition, Iʼve added an autocorrect for U+0027 to be replaced with U+02BC when writing text on Microsoft Word Starter. The idea that we canʼt touch at our keyboard except on keycaps as theyʼre labeled, or that we can at most change for another predefined layout which often doesnʼt match these labels, is another regrettable widespread myth. 
As users, we confine ourselves in a receptive and waiting position, wishing and suggesting, and doing all imaginable and improbable things except adding a handful of characters on our keyboard straight before us, while in the meantime, in obliging anticipation, the worldʼs biggest software company stays inviting us to feel free to customise our keyboard with a free tool for free download at http://www.microsoft.com/en-us/download/details.aspx?id=22339 If this call were taken serious, all these discussions about keyboards would take another turn. Every corporate manager would make sure that his employees use appropriate keyboard layouts to save time and enhance output quality. To achieve this, he would not hesitate one minute to put himself at the place of a UI designer and to get that poor keyboard UI molted to a performative worktool. And to deploy the result at corporate level. The MSKLC is worth spending a day to get started with and to create a completed keyboard layout from oneʼs preferred one, because this will save much time and anger. You may design one where apostrophe and single quotes are far one from another (as on Saturdayʼs kbdenusw), to avoid mistyping and spelling errors without having to wait for any better on-screen UI. However, I wonʼt hide that the MSKLC does not allow to chain dead keys, nor does it support Kana shift states, things that are useful for a number of languages using latin or other scripts and to emulate a compose functionality. But all this plus a Kana toggle ends up to be rather simple with additional resources to program and compile the driver in C, all free of charge as well, namely a DDK or WDK https://www.microsoft.com/en-us/download/details.aspx?id=11800 The ‘kbdenukw’ and ‘kbdenusw’ of Saturday, no matter whether they were downloaded or not, are now available in their 2.0 version, which differs from the previous by including the two missing dashes. 
The goal of this exercise is to prove that at this funny speed, and with such a facility of adding characters on the keyboard, there is no more reason to deprive oneself of the Unicode non-ASCII characters one needs. You may open the included *.klc source—a file format which Microsoft designed for sharing—in the Microsoft Keyboard Layout Creator and in a text editor. For more information, please see my related previous mail. (The AltGr views of the US version show the dead key content.) kbdenukw: http://bit.ly/1dFMFb1 kbdenusw: http://bit.ly/1IWO8aJ Best regards, Marcel Schneider
Re: Another take on the English Apostrophe in Unicode
Message of 15/06/15 17:21
From: Doug Ewell
To: Unicode Mailing List
Cc:
Subject: Re: Another take on the English Apostrophe in Unicode

Marcel Schneider wrote:

A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout

I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion: http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways.

Elsewhere:

ISO stands for stability

We wish. Several of us on this list have worked on standards and standard-like activities that correct for, and defend against, instability in ISO standards.

Microsoft’s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong.

Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space.

-- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Another take on the English Apostrophe in Unicode
2015-06-15 8:23 GMT+02:00 Marcel Schneider charupd...@orange.fr: On Fri, Jun 12, 2015, Philippe Verdy verd...@wanadoo.fr wrote: Even the Language bar uses the upper row to define shortcuts with Control, Shift+Control, Shift+Alt to switch between keyboard layouts, which are prioritized. These are application shortcuts, but these modifier key combinations are used with base function keys (F1...F12), not with keys on the alphanumeric parts of the keyboard. So there's no conflict. It is normal then to not assign CTRL+keys or CONTROL+shift+keys (independently of the capslock state) with non-control characters if the same keys are used to type non-control ASCII characters in range U+0040..U+005F. This means that 32 positions on the keyboard must not be used for any assignment. The same remark applies to ALT+digit and ALT+letter (otherwise keyboard shortcuts for application menus or navigation in web forms won't work correctly, or will take the priority when you intended to type a valid character, forcing these application functions instead of accepting your character input). MSKLC performs these safety checks and will issue warnings if you do so. This is not just my advice but documented in the ISO standard. So to test the shortcuts with Clavier+, I must first remove shortcuts in the Language bar. Then the way was free to test Mr Overingtonʼs shortcuts for curly apostrophes (I will send the result just after). When I deleted the shortcuts in Clavier+ to test your advice, I found no application shortcuts for Ctrl+4 while the keys 1, 2, 5 and 0 are usually mapped as Word shortcuts with CONTROL, while the heading formatting is with ALT.
But indeed among ASCII controls I found eight on the French keyboard:

    //VirtualKey |ScanCd |ISO_# |Ctrl
    {VK_ESCAPE   /*T01 */     ,0x001b
    {VK_CANCEL   /*X46 */     ,0x0003
    {VK_BACK     /*T0E E13*/  ,0x007f
    {VK_OEM_6    /*T1A D11*/  ,0x001b
    {VK_OEM_1    /*T1B D12*/  ,0x001d
    {VK_OEM_5    /*T2B C12*/  ,0x001c
    {VK_RETURN   /*T1C C13*/  ,'\n'
    {VK_OEM_102  /*T56 B00*/  ,0x001c

On the alphanumerical block, there are always the same five, three among them near the Enter key. The British-American Apostrophe key is exempt from Controls too. This is probably why Mr Overington wants to use CONTROL and SHIFT+CONTROL for U+2019 and U+02BC, as custom application shortcuts. Assigning characters to positions defined for application shortcuts is a bad idea. Keyboard layouts should map characters in positions that are independent of applications (but layouts may be specific to an OS if the OS interface defines some standard shortcuts: this is a problem when using virtualized OSes, as there's a conflict with shortcuts used to switch from the guest to the host: personally I have chosen the Application key for this instead of the right control, because the Application key is rarely needed, but I frequently type control with the right hand or two hands, notably CTRL+A, CTRL+C, CTRL+X, CTRL+V). On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 successive keys of the first row (5([, 6-|, 7è`, 8_\, 9ç^, 0à@, °)]), they are needed to get ASCII controls. However CONTROL+@ is extremely rarely needed in applications to enter a NULL control that will be almost always filtered out silently, only some editors that allow loading and editing binary files will use it, e.g. Emacs or Vim which have a binary editing mode that avoids altering the encoding of newlines, but displays all controls explicitly, and that does not limit the line length.
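The Ctrl values in the table above follow the usual ASCII convention: a character in the range U+0040..U+005F combined with CONTROL gives the control code 0x40 lower. A minimal sketch, assuming that convention (the function name `control_code` is made up for the example), shows why keys carrying [, \, ], ^, _ and @ are the ones to keep free:

```python
def control_code(ch: str) -> int:
    """Return the ASCII control code produced by CONTROL + ch.

    Only characters in U+0040..U+005F have a control counterpart;
    the control code is the character's code point minus 0x40.
    """
    code_point = ord(ch)
    if not 0x40 <= code_point <= 0x5F:
        raise ValueError(f"U+{code_point:04X} has no ASCII control counterpart")
    return code_point - 0x40


if __name__ == "__main__":
    # The non-letter characters of the range, as found on the French first row.
    for ch in "@[\\]^_":
        print(f"CONTROL+{ch} -> 0x{control_code(ch):04x}")
```

The apostrophe U+0027 lies outside that range, which is consistent with the remark that the apostrophe key carries no control.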
(Personally I prefer not using text editors to edit binary files, this is far too unsafe with their insertion working mode, it is highly preferable and much simpler to use a hexadecimal editor). This means that CONTROL+0à@ may be assigned something else more useful (even if the MSKLC compiler warns about it). But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other keys of the first row (², 1, 2é~, 3#, 4'{ on the left side, and +=} on the last position to the right). This means that CONTROL+4 can be safely assigned to U+02BC for the apostrophe letter, but the most common encoding of the French apostrophe is U+2019 (the closing single quote) as French normally does not use single quotation marks, or if it does, it cannot be followed by a letter and cannot be confused with a French apostrophe that is always followed by a letter (or number 1). For now I've not seen any specific need of U+02BC in French (U+2019 is enough, even if it represents two distinct things in French, but in distinct non-colliding contexts). But of course U+02BC is needed for English that needs the distinction with single quotes, because the English apostrophes are used more permissively including at end of words just before a space or punctuation or end of line. In French it is not valid to use the apostrophe for elisions at end of words, you need to use instead some abbreviation mark or style, or no mark at all. The French abbreviation mark can
Re: Another take on the English apostrophe in Unicode
On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider charupd...@orange.fr wrote: When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. Actually, the choice is between perpetuating confusion in word processing, and get people confused for a little time when announcing that U+2019 for apostrophe was a mistake. Quite nice of you to inform me of the core mission of Unicode—I must have somehow missed that. More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. In practice, whenever characters are essentially identical—and by that I mean that the overlap between the acceptable glyphs for each character is very high—people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. So we only separated essentially identical characters in limited cases: such as letters from different scripts. Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
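The remark about searching can be made concrete with a small sketch. It assumes a deliberately tiny equivalence class (U+0027, U+2019, U+02BC) and folds all members to one representative before comparing; the names `APOSTROPHE_CLASS`, `fold` and `find_all` are invented for the example and do not come from any library.

```python
# Assumed equivalence class for the example: the three code points
# discussed in this thread as possible apostrophes.
APOSTROPHE_CLASS = ("\u0027", "\u2019", "\u02BC")

# Map every member of the class to U+0027 (one character to one character,
# so string offsets are preserved).
_FOLD_TABLE = str.maketrans({ch: "\u0027" for ch in APOSTROPHE_CLASS})


def fold(text: str) -> str:
    """Replace each apostrophe-like character with U+0027."""
    return text.translate(_FOLD_TABLE)


def find_all(haystack: str, needle: str) -> list[int]:
    """Return the offsets of needle in haystack, ignoring which apostrophe is used."""
    folded_haystack, folded_needle = fold(haystack), fold(needle)
    positions = []
    start = folded_haystack.find(folded_needle)
    while start != -1:
        positions.append(start)
        start = folded_haystack.find(folded_needle, start + 1)
    return positions


if __name__ == "__main__":
    text = "don\u2019t and don\u02BCt and don't"
    print(find_all(text, "don't"))   # all three spellings are found
    print(text.count("don't"))       # a plain search finds only one
```

Without the folding step, a plain search for the ASCII spelling finds only one of the three occurrences, which is the cost described above.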
Re: Another take on the English apostrophe in Unicode
On Thu, Jun 11, 2015, Philippe Verdy wrote: The ASCII punctuations have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. So the ASCII hyphen and apostrophe-quote can only be given a very weak category that just exhibits their visual role. Pd (dash) is then appropriate for the ASCII hyphen-minus. You can't really tell from the character alone if it is a punctuation or a minus sign. If it is a minus sign you can reencode it better using the more specific mathematical minus sign. Otherwise, even if it is not a minus sign, it can be:
- a connector between words in compound words (hyphen)
- a trailing mark at end of lines for indicating a word has been broken in the middle (but remember that I asked previously for another character for that role because this word-breaking hyphen is not necessarily a horizontal hyphen (in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in pocket books with very narrow columns and minimized spacing))
- a bullet leading items in a vertical list (this should be an en dash, followed by some spacing)
- a punctuation (not necessarily at beginning of line) marking the change of person speaking (very common in literature, notably in theatre).
As a connector between words, there's a demonstrated need of differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), long hyphens between two separate names that are joined (example in proper names, after marriage, there's an example in France, where INSEE encodes it for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...).
In most fonts, the glyph of the hyphen-minus U+002D is the same as the one of the hyphen U+2010, while the minus sign U+2212 is longer and higher, at half-height of digits, to match between or before, as opposed to the hyphen and hyphen-minus which are positioned at half height of lowercase letters. As a minus sign, these work well only with Elzevir digits. This is why, in most fonts, the hyphen-minus U+002D is very unpleasant when used as a minus sign, especially when the plus sign, equals sign and other operators are present too. In this, the hyphen differs from the apostrophe U+0027, whose differenciated characters (apostrophe U+02BC and single close-quote U+2019) have exactly the same glyph. But hyphen and apostrophe resemble in the fact that in many fonts, only the paired or assorted character is present, while the other is missing. So even in Arial, where the letter apostrophe U+02BC is present, the hyphen U+2010 is missing. The user is supposed to use U+002D as a hyphen and U+2212 as the minus sign. The system hyphen displayed in automatic word break at line end, is converted to U+002D for PDF. This isnʼt ideal, as you point out, because to reverse the word break, one canʼt simply replace all U+002D by nothing. Word processors allow to remove all instances of (U+002D, EOL), but this can delete some orthographic hyphens. The solution would be to use U+2010 for orthographic hyphens (with compatible fonts) and to let the system place its U+002D. The letter apostrophe U+02BC is indispensable because the glyph of U+0027 is unfit for typography. We are also told that U+0027 is unstable, but this is mainly due to the autocorrect smart quotes, which can be turned off at input. I use the autocorrect from now on to convert U+0027 to U+02BC. Another difference between apostrophes and hyphens, and perhaps the main difference, is that except if they are used for word break, hyphens generally donʼt need to be replaced at further stages. 
At input, the user will replace U+002D with U+2212 where appropriate, and the autocorrect may replace two hyphens with an en dash U+2013. In some fonts, U+002D will need to be replaced with U+2010 for glyphic reasons. By contrast, quotes are to be converted, Ted Clancy points out in his paper. Ambiguating one of them with the apostrophe was a very bad idea. Well, I still believe it was *not* the idea of any Unicode Committee, nor of any Standards Body at all. Marcel
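The word-break point made earlier in this message can be illustrated with a short sketch. It assumes the convention proposed there: orthographic hyphens are U+2010 and the system's line-end hyphen is U+002D. Under that assumption, undoing the line breaks becomes a mechanical replacement; the function name `rejoin_lines` is made up for the example.

```python
HYPHEN_MINUS = "\u002D"  # assumed to be placed only by the system at line end
HYPHEN = "\u2010"        # assumed to be typed for orthographic hyphens


def rejoin_lines(text: str) -> str:
    """Undo line-end word breaks.

    A U+002D followed by a newline is a system word break: drop both.
    A U+2010 followed by a newline is an orthographic hyphen that
    happened to fall at the line end: keep the hyphen, drop the newline.
    """
    text = text.replace(HYPHEN_MINUS + "\n", "")
    text = text.replace(HYPHEN + "\n", HYPHEN)
    return text


if __name__ == "__main__":
    wrapped = "inter-\nnational well\u2010\nknown"
    print(rejoin_lines(wrapped))
    # If the orthographic hyphen had been typed as U+002D, it would be lost:
    print(rejoin_lines("well-\nknown"))
```

The second call shows the failure case mentioned above: with U+002D in both roles, the compound loses its hyphen.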
Re: Another take on the English apostrophe in Unicode
At the following URL, a forum page illustrates the way users have struggled for a decade (and more) against the chaotic confusion Microsoft perpetuated in spite of Unicode, forcing the Committee to adopt its short views: http://painintheenglish.com/case/383 Please note Persephoneʼs workaround, which is a way to avoid the Apostrophe Catastrophe without turning off the “smart quotes”. This is the smartest thing Iʼve ever read about “smart quotes”. This workaround, which I was unaware of, might explain why Microsoft refused to reengineer the smart quotes algorithm: Users have just to type two quotes and to delete one! However, the problem of *handling* and *processing* such text stays unresolved. Users are conscious about a quote not being an apostrophe, this page shows. But they are compelled to use close-quotes for simulation of curly apostrophes. This works on the spot, but it brings bad quality text files. Regardless of whether this matches Microsoftʼs business model or not, there is no right of dissuading font-designers from publishing complete fonts! Allocating the same glyph (U+2019) to a supplemental code point (U+02BC) is very easy when creating a font, but as Microsoft compelled Unicode to tell everybody that there is no need of U+02BC in English and that our text files must not contain U+02BC, we lost sixteen years and thousands of fonts (including Arial Unicode MS, which surprisingly is lacking U+02BC!) are nearly unusable with correct text files because they donʼt include any typographical apostrophe. Except that U+0027 is curly in many ornamental fonts, to meet usersʼ expectations. A ready workaround would thus be to disable the smart quotes and keep U+0027 as apostrophe (only), while entering U+2018/U+2019 by any means, and to replace eventually all instances of U+0027 by U+02BC. Or by U+2019 but only just before printing, never to publish in PDF and even less to send as a file or to publish on the internet!
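The workaround just described amounts to a single replacement, provided its precondition holds: smart quotes are off and U+0027 has been typed only as an apostrophe, with real quotation marks entered as U+2018/U+2019. A minimal sketch under that assumption (the function name `finalize_apostrophes` is hypothetical):

```python
def finalize_apostrophes(text: str, target: str = "\u02BC") -> str:
    """Replace every U+0027 with the chosen apostrophe character.

    Assumes U+0027 was used only as an apostrophe; quotation marks
    already entered as U+2018/U+2019 are left untouched.
    """
    return text.replace("\u0027", target)


if __name__ == "__main__":
    draft = "\u2018I don't know,\u2019 she said."
    final = finalize_apostrophes(draft)
    print([f"U+{ord(c):04X}" for c in final if ord(c) > 0x7F])
```

Passing `target="\u2019"` gives the print-only variant mentioned at the end of the paragraph; that conversion cannot be reversed mechanically, which is why it is reserved for the last step before printing.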
As usual, the status quo which originated from legacy code pages (which were already considerably enriched compared to ISO 8859-1, be said to the honor of Microsoft) has been justified a posteriori with a lot of mostly biased arguments: – The approval of U+2019 as apostrophe is based on glyphs and rendering and on a static view of text, excluding from scope the further word processing across documents and languages. – Unicodeʼs principles are misapplied and even misinterpreted. The fact that different meanings across languages do not need different code points, is applied inside a given language to argue that distinction of semantics by different code points is not needed. – Some arguments are obsoleted since they were uttered, so the U+02BC being a “spacing clone of Greek smooth breathing mark” (removed in 5.1) and thus never slanted, while in most fonts it has same shape as U+2019, slanted or curly. – Another fallacy cites as a proof the use of U+2019 as apostrophe in some locales, while this is already based on CP1252-inspired practice against the spirit of Unicode. – Bluring the issue by enumerating the various values of English apostrophe, which leads sometimes to include the close-quote function as punctuation apostrophe... Whatever, there is nothing to save of the status quo. Unfortunately, the mass of wrongly encoded text goes on increasing while discussions follow one another. At least, that does not hinder publishing good books and newspapers and sending nice mails (on paper, where nobodyʼs asking whatʼs the code point, because thereʼs no need). About other media, thereʼs to say that hand-processing wrong text files increases the job volume— :( for managers, but :) for workers, at the condition that they are really paid for. Marcel Schneider
Re: Another take on the English apostrophe in Unicode
On Sat, Jun 13, 2015, Mark Davis wrote: In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious. (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)

When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. Actually, the choice is between perpetuating confusion in word processing, and getting people confused for a little time when announcing that U+2019 for apostrophe was a mistake. Marcel Schneider

Message of 13/06/15 17:36
From: Mark Davis ☕️
To: Peter Constable
Cc: verd...@wanadoo.fr , Kalvesmaki, Joel , Unicode Mailing List
Subject: Re: Another take on the English apostrophe in Unicode

On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable wrote: When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That’s because there isn’t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this:

In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019).
Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious. (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.) Mark — Il meglio è l’inimico del bene —
Another take on the English Apostrophe in Unicode
On Fri, Jun 5, 2015, David Starner wrote: On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis wrote: I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely than creating a new one. In fact this would be a return to the state until version 2.0.0. http://www.unicode.org/Public/2.0-Update/NamesList-1.txt Since version 3.0.0 (or more precisely, since update 2.1), U+2019 is preferred for apostrophe, not U+02BC any longer. http://www.unicode.org/Public/3.0-Update/NamesList-3.0.0.txt Prior to this discovery, I supposed it could have been later ISO prescriptions which triggered it the wrong way, but now it seems impossible that ISO initiated the move of preferred apostrophe from U+02BC to U+2019. This change took place not sooner than in update 2.1, whereas the merger was at 1.1 and ISO stands for stability. So ISO could never agree that the preferred character for English apostrophe stopped being U+02BC and started being U+2019, against the Stability Policy, and presumably using a gap in this policy which possibly doesn’t cover usage recommendations... I must do some more research in the Archives to find out more about why the apostrophe and the single close quote were ambiguated—a process that needs even a new word to put on it, as ordinarily everybody works for disambiguation... However, the 1999 Mail Archive already shows it was for simplification's sake, in word processing software. Could anybody tell us more about this issue? IMHO, the mischievous apostrophe that we use today, is due to a shortcut, narrowed design, and incomplete check-ups.
Briefly, the disconnect was between Unicode whose global approach led to complete solutions including all you need for text handling and word processing, and Microsoft whose industrial approach prioritized the ready make-up of output appearance, letting out of scope the subsequent lifestages of text. The Windows code page 1252 apostrophe-close-quote looks nice on screen and in the documents, but as soon as you need to convert quotes from British to American or from free to nested, the only way to prevent your text from becoming unusable is to hand-process the quotes one by one. The money you saved when purchasing the software is lost a thousandfold in use. Microsoft’s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong. Marcel Schneider
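The British-to-American conversion mentioned above can be sketched as a plain character swap, to show where the hybrid character hurts. The sketch assumes the simplest possible converter, one that exchanges single and double curly quotes and knows nothing about words; the name `swap_quote_style` is invented for the example.

```python
# Exchange single and double curly quotation marks (both directions).
_SWAP_TABLE = str.maketrans({
    "\u2018": "\u201C",
    "\u2019": "\u201D",
    "\u201C": "\u2018",
    "\u201D": "\u2019",
})


def swap_quote_style(text: str) -> str:
    """Convert British-style quoting to American-style and back."""
    return text.translate(_SWAP_TABLE)


if __name__ == "__main__":
    # Apostrophe encoded as U+02BC: the swap touches only the real quotes.
    clean = "\u2018I don\u02BCt know,\u2019 she said."
    print(swap_quote_style(clean))

    # Apostrophe encoded as U+2019: the contraction is turned into a quote.
    hybrid = "\u2018I don\u2019t know,\u2019 she said."
    print(swap_quote_style(hybrid))
```

With U+2019 doing double duty, a converter has to guess from context which occurrences are apostrophes, or the text has to be checked by hand as described above.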
Re: Another take on the English Apostrophe in Unicode
On June 3, 2015, Ted Clancy wrote: https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ I wish to thank you personally for having brought up this issue, as well as Mr Grosshans for having posted the URL launching this thread. However, your solution is not complete, and I don’t agree fully with all your statements. So let’s try to check up what’s the matter, and then look what might be done. First, the Unicode Technical Committee is *not* very wrong. A look in the Standard 2.0.0 or even simplier, a glance at the first NamesList in the UCD, that is the source code for the Version 2.0.0 Code Charts, shows that originally, the UTC recommended the use of U+02BC MODIFIER LETTER APOSTROPHE for the English apostrophe as well as for apostrophe on the whole, and to reserve the use of U+2019 RIGHT SINGLE QUOTATION MARK for what it is: close-quote. It wasn’t sooner than in the 2.1 update that the preferred character for apostrophe was shifted from U+02BC to U+2019, to conform with the usage (and presumably at the demand) of Microsoft, which did not comply to the Standard, despite of being a full member of the Unicode Consortium (and having thus agreed at the beginning that apostrophe should be U+02BC). I’m pretty sure that when they moved the apostrophe preference from U+02BC to U+2019, the Unicode Technical Committee and the Unicode Editorial Committee acted against their will. My opinion is induced from the original UTC position and from comparing two versioned NamesList extracts among those displayed at charupdate.info#ambiguation Second, your solution is *not* complete. Even if word-processors managed nested quotes, one single key for all occurring quotation marks of a given locale, as British English or US English, would scarcely be sufficient. Here’s why. 
Everybody knows that quotes are used not only to quote, but also to delimit, to warn or generally to flag otherwise than as a quotation. The latter occurs commonly when the writer (and by transposition, the speaker, making a quotes gesture) wants to flag a word or an expression as being controversial, not true, not in his belief, or ironical. From this they are sometimes called “irony quotes”. Languages that use angle quotation marks (chevrons) to quote, use comma quotation marks to flag. In English, I suppose that you need to use the “other” quotation marks to flag. So in US English you would flag using single quotes, while in British English you would use double quotes, the like as in French. However I don’t know how that works in quotations (while in languages as French and German this is no problem). Therefore, the user should always have means to type exactly the quotes he wishes to type. This will result in the need of at least one dead key or some supplemental dead list entries, and/or supplemental AltGr positions, or even supplemental shift states (Kana). Never one single key position can do all the job. Third (but this is an off-topic discussion in this thread and is set aside in your blog post), the close-quote as an apostrophe is not good for French neither, regardless of how many words are around. The use of U+2019 as apostrophe hasn’t lead in French to any “Apostrophe Catastrophe” only because in French, few people use single comma quotes (in rare cases or for special purposes), and because properly leading apostrophes are often placed otherwise, as in “Y’a” for “Il y a”, instead of “’Y a”. What shall we do? As you draw it, the so-called smart quotes algorithm must be reengineered and cannot stay working as it does, so users must be informed that to type “unexpected” quotes, they’ve to hit the key two times, or to type another character just after. 
But users must also make an effort by themselves instead of wishing to stay with the inherited keyboard layout regardless of what changes are on-going, and at the same time, to get more Unicode characters as reasonably supportable on this old keyboard. In other words, the gap between the expected rendering and the actually conceded input must be filled up whether by using a set of customised (or perhaps one day, standardised) autocorrect entries (see one suggestion at charupdate.info#curly) or by typing appropriate characters on extended keyboard layouts (which don’t lead to change for another hardware, except for special purposes). Thanks again, because without this discussion, I would have released more keyboard layouts with the wrong apostrophe! Marcel Schneider
Re: Another take on the English Apostrophe in Unicode
On Sun, Jul 18, 1999, Markus Kuhn wrote: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0557.html In addition, I feel that the current ISO 8859 oriented national keyboard standards are not adequate for modern Unicode-era word processing practices, as they put obsolete typewriter characters such as U+0027 on too prominent keys, while they have no key positions for the extremely frequently needed typesetting characters that are for instance supported by CP1252 (directional single and double quotes, en and em dashes, etc.). Software either has to use shaky algorithms to make educated guesses on which character the user might have meant (such as Word tries to do), or sequences of ASCII characters are interpreted with new semantics (such as both TeX and Word do), in order to give typists some compromise access to these characters. I think it is urgent time to revise national keyboard standards here. We really need standardized ways to easily enter say at least

    2018 LEFT SINGLE QUOTATION MARK
    2019 RIGHT SINGLE QUOTATION MARK
    201C LEFT DOUBLE QUOTATION MARK
    201D RIGHT DOUBLE QUOTATION MARK
    2013 EN DASH
    2014 EM DASH

on keyboards for English language users, and corresponding extensions on other national keyboard standards. This might be a good opportunity to introduce on US keyboards the Level 2 Select key (AltGr), while on European keyboards it is probably sufficient to just add appropriate labels to a number of new Level 2 Select positions.

On Sun, Jul 18, 1999, Mark Davis wrote: http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html However, I agree that having the curly quotes (single and double) on the standard keyboard would be handy. I switch back and forth between a Mac and Windows. On the Mac, the option key (a second level shift) has always made this easy. The installable Windows international keyboard is not nearly so useful, since you can't just leave it on all the time (it messes up your use of quotation marks).
On Thu, Jun 4, 2015 at 2:38 PM, Markus Scherer wrote:

How are normal users supposed to find both U+2019 and U+02BC on their keyboards,

Yes this may be the main issue, how to get at hand U+02BC, U+2019 and U+2018 as well, plus the actual U+0027, on keyboards that are derived from typewriters’ ones. Word processors are overtaxed by the management of all four, while many users wish to stay typing ‘apostrophe’ for all of them. And not to change for another keyboard driver(?). A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout, for example in the deadlist of apostrophe on the US International keyboard, a layout where U+2019 is already found, along with U+2018. You may choose a double stroke on Apostrophe to generate the modifier letter. But as this layout obviously is not so useful, you’ll prefer to get them on the US Standard layout, or depending on where you live, on the UK standard or extended or any other layout. A more complete solution is obtained with the Windows Driver Kit, a free development kit which makes it possible to implement a Kana toggle, to toggle Apostrophe on the US Standard keyboard between U+0027 and U+02BC *or* U+2019. The least used among all three will be put into the deadlist, when adding one dead key on this layout, say Grave. Then, [Grave] [Apostrophe] will result in the missing apostrophe character.

how are they supposed to deal with incorrect usage?

If the document is already incorrect, there will be nothing to do, IMHO, other than to check them one by one. Theoretically, word processors could integrate an exhaustive checking algorithm with an exhaustive dictionary. With such a tool, there would be no “Apostrophe Catastrophe” as it has been called: http://www.newrepublic.com/article/113101/smart-quotes-are-killing-apostrophe (found by a search engine).
So, on actual keyboard layouts, avoiding the Apostrophe Catastrophe would then have been unfeasible—the like as with actual consumption habits, avoiding a number of other catastrophes is unfeasible as well... Nevertheless, this morning I opened once more the Microsoft Keyboard Layout Creator. Ten minutes later I got the finished complete package of the US American keyboard layout with U+02BC MODIFIER LETTER APOSTROPHE and all English quotation marks in one dead key on ‘Grave’, that is key number E00 (ISO/IEC 9995-1). The same way I made up the keyboard layout for the United Kingdom, which uses AltGr, so the apostrophe and all quotes are also on AltGr. Ten minutes, again. If you don’t use the grave accent (or AltGr), there is strictly no change on these keyboard layouts, because I loaded the original Windows US and UK layouts into the MSKLC. If you use the grave accent, you must type a whitespace after hitting the grave key to get the grave accent (in conformance to the standard behavior of dead keys). – To get the modifier letter
Re: Another take on the English apostrophe in Unicode
I disagree: U+02BC already qualifies as a letter (even if it is not specific to the Latin script and is not dual-cased). It is perfectly integrable in language-specific alphabets and we don't need another character to encode it once again as a letter. So the only question is about choosing between: - on one side, U+02BC (the existing apostrophe letter), and other possible candidate letters for alternate forms (including U+02C8 for the vertical form, and the common fallback letter U+00B4 present in many legacy fonts for systems built before the UCS was standardized and using legacy 8-bit charsets such as ISO 8859-1). - and on the other side, U+2019 where it is encoded as a quotation punctuation mark (like also the legacy ASCII single quote) Note that U+00B4 (from ISO 8859-1) has also been used in association with U+0074 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by assigning an orientation: the exact shape of these two is variable, between a thin rectangle, or a wedge, or a curly comma (shaped like 6 and 9 digits), as well as the exact angle when it is a wedge or thin rectangle (these characters however have been used since long in overstriking mode to add accents over Latin capital letters, so the curly comma shapes are very uncommon and they are more horizontal than vertical and U+00B4 will be a very poor cantidate for the apostrophe that should have a narrow advance width. So there remains in practice U+02BC and U+02C8 for this apostrophe letter (which one you'll use is a matter of preference but U+02C8 will not be used if there are two distinct apostrophes in the language (e.g. 
in Polynesian languages where the distinction was made even more clearer by using right or left rings U+02BE/U+02BF, or glottal letters U+02C0/U+02C1 if that letter has a very distinctive phonetic realisation as a plain consonnant with two variants like in Arabic or even U+02B0 when this is just a breath without stop: the full range range U+02B0-U+02C1 offers much enough variations for this letter if you need slight phonetic distinctions). 2015-06-13 8:28 GMT+02:00 Peter Constable peter...@microsoft.com: Nice article, as I recall. (Been a long time.) Peter -Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Kalvesmaki, Joel Sent: Friday, June 5, 2015 7:27 AM To: Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I don't have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. Cliticization vs. Inflection: English N'T.Language59, no. 3 (1983): 502-513. It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435
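As a rough illustration of the letter-versus-punctuation distinction argued in this message, the general categories of some of the code points it names can be looked up with Python's standard unicodedata module. This is only a sketch: the selection of code points and the helper name describe are choices made for this example and do not come from the thread.

```python
import unicodedata

# Code points named in the message (the selection is made for this example).
CANDIDATES = [0x0027, 0x00B4, 0x02BC, 0x02C8, 0x2019]

def describe(cp):
    """Return (label, general category, character name) for a code point."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch))

for cp in CANDIDATES:
    label, category, name = describe(cp)
    print(label, category, name)
```

Categories starting with L are letters, P is punctuation and S is symbol. In the Unicode Character Database, U+02BC and U+02C8 are modifier letters (Lm), U+00B4 is a modifier symbol (Sk), and U+2019 is final-quote punctuation (Pf).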
Re: Another take on the English apostrophe in Unicode
I don't agree with this Grévisse definition (and I'm not alone: other grammarians and dictionaries don't follow Grévisse, and even the French Academy disagrees). Maybe this is a form of composition, but the correct description is not that it creates a new word; it just means that words take on new semantics in specific contexts (here, idiomatic expressions where the term pomme undergoes a minor shift of meaning, which also occurs in pomme de pin = pine cone, or chou pomme, as well as in the alternate sense of pomme, related only to its rough shape, designating a human head and by extension a person, also used in idiomatic expressions like c'est pour ma pomme). But the word itself is not different, and in fact the etymology is the same: this was only a progressive extension of meaning that finally created an idiomatic expression, not a new word. A compound word (mot composé) needs a clear gluing, by a hyphen, an apostrophe, or the absence of space and punctuation. Grévisse still records much good advice that is too frequently forgotten today, but here it went too far into details that were not needed to preserve the semantics of the language. Another proof is the culinary expression pomme frite, which does not mean a fried apple but a fried potato: pomme de terre has been abbreviated to just pomme, and this term even disappears now when the participle frite, used as an attributive adjective, is itself substantivized. The expression pomme de terre is not so much idiomatic; it is just an extended lemma added to the term pomme (apple). The composition has in fact never been clearly attested, but if it were, hyphens would have been used long ago (many hyphens are now starting to disappear in compound words, replaced by direct gluing, which is admitted in most cases).
2015-06-13 5:11 GMT+02:00 Eric Muller eric.mul...@efele.net: On 6/10/2015 9:37 PM, Philippe Verdy wrote: The French pomme de terre (potato in English, French vulgar synonym : patate) is a single lemma in dictionaries, but is still 3 separate words (only the first one takes the plural mark), it is not considered a nom composé (so there's no hyphens). Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1 Elements of the language, chapter 7 The words, section 3 Formation of new words, article 2, Composition, very first paragraph (179 overall): --- By *composition*, language creates new words, either by combining simple words with existing words, or by preceding these simple words with syllables that have no independent existence: *Chou-fleur, gendarme, pomme de terre, contredire, désunir, paratonnerre. * A word, despite being formed of graphically independent elements, is *composed* as soon at it brings to mind, not the distinct images of each of the words from which it is composed, but a single image. Thus the composites *hôtel de ville, pomme de terre, arc de triomphe* each remind of a unique image, and not of the distinct images of *hôtel* and of *ville*, of *pomme* and of *terre*, of *arc* and of *triomphe. * *---* *(hôtel de ville* = city hall; *pomme* = apple, *de* = of, *terre* = earth) Paragraph 181, 3rd remark: --- Sometimes the elements composing [the word] are welded in a simple word: *Bonheur**, contredire, entracte; *sometimes they are connected by an hyphen: *chou-fleur, coffre-fort;* sometimes they stay independent graphically: *Moyen âge, pomme de terre. --- *(“Le Grévisse” as we affectionately call it, or *Le bon usage / French Grammar with remarks on today’s french language*, is a must-have for the student of French. It is encyclopedic in its depth, and has tons of examples and counter-examples. 
Interestingly, the French wikipedia page says “a descriptive grammar of French”, while the English wikipedia page says “a prescriptive grammar”; it’s both!) I agree that we don’t need a new space coded character. I was just pointing out that some of the arguments for a new coded character for the apostrophe in *don’t* apply equally well to the spaces in the word *pomme de terre*. Eric.
RE: Another take on the English apostrophe in Unicode
Nice article, as I recall. (Been a long time.) Peter -Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Kalvesmaki, Joel Sent: Friday, June 5, 2015 7:27 AM To: Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I don't have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. Cliticization vs. Inflection: English N'T.Language59, no. 3 (1983): 502-513. It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435
RE: Another take on the English apostrophe in Unicode
I should qualify my statement. The Zwicky and Pullum article was a nice piece of linguistic analysis regarding the morphological characteristics of “n’t”. Their remark about apostrophe, however, was not so much about orthography — which was not the focus of their article — but was rather a way of putting an exclamation on their findings. When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That’s because there isn’t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this: Di Sciullo, Anna Maria, and Edwin Williams. 1987. On the definition of word. (Linguistic Inquiry monograph fourteen.) Cambridge, Massachusetts, USA: The MIT Press. Peter From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy Sent: Saturday, June 13, 2015 12:03 AM To: Peter Constable Cc: Kalvesmaki, Joel; Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I disagree: U+02BC already qualifies as a letter (even if it is not specific to the Latin script and is not dual-cased). It is perfectly integrable in language-specific alphabets and we don't need another character to encode it once again as a letter. So the only question is about choosing between: - on one side, U+02BC (the existing apostrophe letter), and other possible candidate letters for alternate forms (including U+02C8 for the vertical form, and the common fallback letter U+00B4 present in many legacy fonts for systems built before the UCS was standardized and using legacy 8-bit charsets such as ISO 8859-1). 
- and on the other side, U+2019 where it is encoded as a quotation punctuation mark (like also the legacy ASCII single quote) Note that U+00B4 (from ISO 8859-1) has also been used in association with U+0074 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by assigning an orientation: the exact shape of these two is variable, between a thin rectangle, or a wedge, or a curly comma (shaped like 6 and 9 digits), as well as the exact angle when it is a wedge or thin rectangle (these characters however have been used since long in overstriking mode to add accents over Latin capital letters, so the curly comma shapes are very uncommon and they are more horizontal than vertical and U+00B4 will be a very poor cantidate for the apostrophe that should have a narrow advance width. So there remains in practice U+02BC and U+02C8 for this apostrophe letter (which one you'll use is a matter of preference but U+02C8 will not be used if there are two distinct apostrophes in the language (e.g. in Polynesian languages where the distinction was made even more clearer by using right or left rings U+02BE/U+02BF, or glottal letters U+02C0/U+02C1 if that letter has a very distinctive phonetic realisation as a plain consonnant with two variants like in Arabic or even U+02B0 when this is just a breath without stop: the full range range U+02B0-U+02C1 offers much enough variations for this letter if you need slight phonetic distinctions). 2015-06-13 8:28 GMT+02:00 Peter Constable peter...@microsoft.commailto:peter...@microsoft.com: Nice article, as I recall. (Been a long time.) Peter -Original Message- From: Unicode [mailto:unicode-boun...@unicode.orgmailto:unicode-boun...@unicode.org] On Behalf Of Kalvesmaki, Joel Sent: Friday, June 5, 2015 7:27 AM To: Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I don't have a particular position staked out. 
But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. Cliticization vs. Inflection: English N'T.Language59, no. 3 (1983): 502-513. It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435
Re: Another take on the English apostrophe in Unicode
On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable peter...@microsoft.com wrote: When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That’s because there isn’t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this:

In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious. (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)

Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —*
Re: Another take on the English Apostrophe in Unicode
2015-06-12 17:02 GMT+02:00 Marcel Schneider charupd...@orange.fr: Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC

CONTROL and CONTROL+SHIFT cannot work on French keyboards, where the existing ASCII apostrophe is on the numeric row, on a key that also carries the ASCII open brace on ALTGR (or CTRL+ALT) and where ASCII controls are already mapped so that the key generates the matching C0 control instead. In general it is a bad idea to map any printable character, combining character or dead key to the CTRL or CTRL+SHIFT modifiers on any position in the alphanumeric part of the keyboard: these combinations should remain reserved for function keys or C0/C1 controls only, which local applications will use for their own application-specific functions.
Re: Another take on the English apostrophe in Unicode
On 6/10/2015 9:37 PM, Philippe Verdy wrote: The French "pomme de terre" ("potato" in English, French vulgar synonym : "patate") is a single lemma in dictionaries, but is still 3 separate words (only the first one takes the plural mark), it is not considered a "nom composé" (so there's no hyphens). Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1 Elements of the language, chapter 7 The words, section 3 Formation of new words, article 2, Composition, very first paragraph (179 overall): --- By composition, language creates new words, either by combining simple words with existing words, or by preceding these simple words with syllables that have no independent existence: Chou-fleur, gendarme, pomme de terre, contredire, désunir, paratonnerre. A word, despite being formed of graphically independent elements, is composed as soon at it brings to mind, not the distinct images of each of the words from which it is composed, but a single image. Thus the composites hôtel de ville, pomme de terre, arc de triomphe each remind of a unique image, and not of the distinct images of hôtel and of ville, of pomme and of terre, of arc and of triomphe. --- (hôtel de ville = city hall; pomme = apple, de = of, terre = earth) Paragraph 181, 3rd remark: --- Sometimes the elements composing [the word] are welded in a simple word: Bonheur, contredire, entracte; sometimes they are connected by an hyphen: chou-fleur, coffre-fort; sometimes they stay independent graphically: Moyen âge, pomme de terre. --- (“Le Grévisse” as we affectionately call it, or Le bon usage / French Grammar with remarks on today’s french language, is a must-have for the student of French. It is encyclopedic in its depth, and has tons of examples and counter-examples. Interestingly, the French wikipedia page says “a descriptive grammar of French”, while the English wikipedia page says “a prescriptive grammar”; it’s both!) I agree that we don’t need a new space coded character. 
I was just pointing out that some of the arguments for a new coded character for the apostrophe in don’t apply equally well to the spaces in the word pomme de terre. Eric.
Re: Another take on the English apostrophe in Unicode
On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote: Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. There’s a lot of confusion in writing, especially since this job was done on typewriters, where computer keyboards are derived from while the narrowing of the character sets shifted from mechanics to code pages. This is all over, thanks to Unicode and its principle defined in TUS §1.3: “The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered.” Unfortunately the new precision and differenciation has sometimes been refused by sticking with legacy practice and for backwards compatibility’s sake. The use of a paired quotation mark (U+2019) as an English apostrophe against the UTC’s initial successful attempt to disambiguate the two by recommending U+02BC (same glyph) for use as apostrophe, is a leading example of how the hard labor of ordering and clarification aiming at what in ancient Greek is called ‘Kosmos’, can at every time be thrown back to chaos by applying short views and doubtful considerations. There’s been a discussion on this Mailing List in July of 1999, that was before the release of the 3.0.0 version of the Standard: “Apostrophes, quotation marks, keyboards and typography”, when the demand for simplification was already addressed with the corrections published as version 2.1: Couldn't Unicode follow Microsoft and just remove the recommendation that U+02BC be the recommended apostrophe character and instead give U+2019 the dual meaning that it de-facto has already today? 
http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html [The quoted UTR#8 is now located at: http://www.unicode.org/reports/tr8/tr8-3.html] (The shift, as viewed at NamesList level, is now highlighted at http://charupdate.info#ambiguation On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote further: If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? I never believed it could have been a mistake, since we know that Unicode encodes semantics, not glyphs. Were there no modifier letters at all, Unicode had have to introduce an apostrophe character, because an apostrophe is not at all the same as a quotation mark and does not work the same way neither. By handling text, not theories, Ted Clancy at Mozilla clearly shows us that ambiguating the apostrophe with a close-quote brings up counterproductive complications that impact severely the productivity of the users. What, now, about “normal users”? To fix the issue, consider that wishing to stay all the life long with one and the same keyboard layout while at the same time, changing for a new smartphone every year or two, needs some explanation. I guess it is because keyboards don't display anyhing by themselves except keycap labels, so you're never pretty sure about them. We should consider, too, that before being a matter of finding on keyboard, the matter is about using. How are we supposed to choose the right one out of four apostrophe/quotes (U+0027, U+02BC, U+2019, U+2018) while many of us seem not to know or not to bother about where to place it? But supposed we do, it would effectively be much more useful to tell the machine whether we want to type an apostrophe or a quotation mark, and as about that, the existing key is enough (see T. Clancy’s blog). Is managing nested quotes already implemented in word processing? 
I never heard it is. Definitely, here’s a point where the simplification wished for a widespread word processing software worsened considerably the working conditions of all demanding people. The gap between word processing and desktop publishing is the smaller. Adding characters on your preferred keyboard on Windows is very easy using the Microsoft Keyboard Layout Creator, which has an end-user UI. As the compiled drivers are not even Windows-versioned (from NT-4 upwards), you can deploy them in your company and share among your friends without precautions. That is what users are supposed to do. If they don’t, Microsoft is not supposed to force upon. By contrast, if you want a Kana toggle to toggle the apostrophe key between U+0027 and U+02BC (and the quotation mark between U+0022 and a dead key for all quotation marks), you must use the Windows Driver Kit (along with some other resources) plus the MSKLC. If you wish to see it working, you may download an experimental keyboard layout on the unfinished webpage http://charupdate.info. It exemplifies also the Third level solution and the Compose key solution. I hope that helps. Marcel Schneider
Another take on the English Apostrophe in Unicode
On Fri, June 5, William_J_G Overington wrote: Markus Scherer wrote: How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? I replied: Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a show in colour mode where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? I am wondering whether some existing software packages might be able to be used for the character inputting part using customized keyboard short cuts. https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts I realize that the cyan and red colours cannot be done at present, yet I have now thought of the alternative for now of being able to test what is in the text by using a special version of an open source font where there are distinctive glyphs one from the other for the two characters. If your goal is to check right now what apostrophes are in a given text, an easy way is to do a search for U+02BC and to ask the software to highlight all. Of PagePlus I’ve got only an expired demo version, but I can assure that on Word, a side pane may even show you the pages with all instances highlighted, and allows you to browse them. To start, press Ctrl+F and type a modifier letter apostrophe into the search bar, or select one in your text and then press Ctrl+F. Getting the apostrophes colored and with a distinctive glyph is possible too. As you are talking about changing the font, I suppose you are in front of raw text. In this case you can do a search-and-replace which gives all U+02BC a red color and another font, say Tahoma when the text is in Arial. 
Again I speak for Word, where a Plus button shows a Formatting button for the replacing text (replace by the same but with a font formatting on typeface and color), but I suppose PagePlus allows the same proceeding. About suggesting options, one might think about a blinking markup which would allow to find the problematic apostrophes even faster. As a shortcut for U+02BC I’d prefer CONTROL APOSTROPHE because it may occur more often. However, adding something on your keyboard using Right Alt (that is AltGr) is much more efficient because: — You add whenever you want and nearly what you want (if no Kana and no chained dead keys, you get the needed characters on your keyboard the time you write to lists and fora). — You are not bound to a given high-end software (the driver works whenever you type on your keyboard). — You go on to be an active part of your communities (Unicode, Serif, ...) by sharing the resulting drivers with other people. Definitely, any shortcut for an apostrophe would slow down the writing speed, therefore Apostrophe is preferred on Base shift state. So you may design a variant keyboard layout with U+02BC instead of U+0027, even if that be the only change, and toggle between the new one and the usual one by means of your OS's facilities. Or you may choose to add a Kana toggle to toggle the apostrophe key directly inside the driver, but achieving this is somewhat longer. For an example, you may look at the unfinished page http://charupdate.info where there is already an experimental keyboard layout for download. With U+02BC MODIFIER LETTER APOSTROPHE. I hope that helps. Marcel Schneider
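Outside a word processor, the same "find every U+02BC and show which apostrophe is which" check can be sketched in a few lines of Python. The table APOSTROPHE_LIKE and the function find_apostrophes are names invented for this illustration; the four code points are the ones listed earlier in the thread.

```python
# The four apostrophe/quote characters discussed in the thread.
APOSTROPHE_LIKE = {
    "\u0027": "APOSTROPHE (ASCII)",
    "\u02BC": "MODIFIER LETTER APOSTROPHE",
    "\u2018": "LEFT SINGLE QUOTATION MARK",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
}

def find_apostrophes(text):
    """Return (index, code point label, description) for each occurrence."""
    return [
        (i, f"U+{ord(ch):04X}", APOSTROPHE_LIKE[ch])
        for i, ch in enumerate(text)
        if ch in APOSTROPHE_LIKE
    ]

sample = "It\u02BCs \u2018quoted\u2019 and don't"
for hit in find_apostrophes(sample):
    print(hit)
```

Each reported index could then be used to highlight or recolor the character in whatever display layer is at hand.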
Re: Another take on the English apostrophe in Unicode
To add a factor that I think hasn't been mentioned, there are languages in which apostrophe is used both as a letter by itself and as part of a complex letter. Most of the native languages of British Columbia write glottalized consonants as C+', e.g. t' for an ejective alveolar stop, and many use apostrophe by itself for the glottal stop. (Another common convention, which produces other difficulties, is to use the number 7 for glottal stop.) Bill On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy tcla...@mozilla.com wrote: On 4/Jun/2015 14:34 PM, Markus Scherer wrote: Looks all wrong to me. Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your points below. You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. And UAX #29 doesn't work for words which begin or end with apostrophes, whether represented by U+0027 or U+2019. It erroneously thinks there's a word boundary between the apostrophe and the rest of the word. But UAX #29 *would* work if the apostrophes were represented by U+02BC, which is what I'm suggesting. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. I'm not trying to blame anyone. I'm trying to fix the problem. I know this problem has a long history. English is taught as that squiggle being punctuation, not a letter. I think we need make a distinction between the colloquial usage of the word punctuation and the Unicode general category punctuation which has specific technical implications. I somewhat wish that Unicode had a separate category for Things that look like punctuation but behave like letters, which might clear up this taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are actually modifiers, into that category too.) But we don't. 
And the English apostrophe behaves like a letter, regardless of what your primary school teacher might have told you, so with the options available in Unicode, it needs to be classed as a letter. don’t is a contraction of two words, it is not one word. This is utter nonsense. Should my spell-checker recognise hasn't as a valid word? Or should it consider hasn't to be the word hasn followed by the word t, and then flag both of them as spelling errors? Is fo'c'sle the three separate words fo, c, and sle? The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? Yeah, and there are fonts where I can't tell the difference between capital I and lower-case l. But my spell-checker will underline a word where I erroneously use an I instead of an l, and I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. - Ted
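The spell-checker question above can be made concrete with a small Python sketch. Note the assumption: this uses the naive regular expression \w+ from the standard re module, not a UAX #29 word segmenter, so it splits even at mid-word apostrophes, which UAX #29 itself does not do. It only illustrates how a letter-category apostrophe stays inside the word while the punctuation characters do not. The function name words is made up for this example.

```python
import re

def words(text):
    # In Python 3, \w matches Unicode word characters (letters, digits,
    # underscore), so a run of \w is a crude stand-in for "a word".
    return re.findall(r"\w+", text)

for apostrophe in ("\u0027", "\u2019", "\u02BC"):
    sample = f"hasn{apostrophe}t fo{apostrophe}c{apostrophe}sle"
    print(f"U+{ord(apostrophe):04X}", words(sample))
```

With U+0027 or U+2019 the pattern yields the fragments hasn, t, fo, c and sle; with U+02BC, which has general category Lm, each word comes back whole.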
Re: Another take on the English apostrophe in Unicode
Also used in the Breton trigram c’h (considered as a single letter of the Breton alphabet, but actually entered as two letters with a diacritic-like apostrophe in the middle, which in this case is still not a letter of the alphabet): the trigram c’h is distinct from the digram ch. Breton **also** uses a regular apostrophe for elision. In fact what you note for the ejective in Native American languages is effectively a right-combining diacritic, and still not a letter by itself. However, given its position and the fact that it is spacing, it is the spacing form of the apostrophe diacritic that should be used, and that form is to be chosen from among:
* U+00B4 (acute, most often ugly, located too high, and too horizontal),
* U+02B9 (prime, nearly good, but still too high),
* U+02BC (apostrophe),
* U+02C8 (vertical high tick, but confusable with the IPA stress mark placed before a phonetic syllable), and
* U+02CA (acute/2nd tone, which for me is not distinct from U+00B4, only used with sinograms in Mandarin Chinese, with its metrics distinct from U+00B4 that match the Latin metrics).
In my opinion U+02BC is the best choice for the diacritic apostrophe. The other character, for the **elision** apostrophe, is the punctuation mark U+2019 (just as the full stop is also used as an abbreviation mark). There's no confusion with its alternate role as a right-side single quote, because U+2019 is used in languages that normally never use single quotes, but chevrons (or other punctuation signs in East-Asian scripts). But in English, where single quotes are used for short quotations, there's still a problem representing this elision apostrophe when it occurs not between two letters, where it also marks a gluing of two morphemes (as in don't or Peter's), but at the beginning or end of a word. Elision at the end of a word is also invalid when that word is the final word of a quoted sentence.
If you really want to cite a single English word terminated by an elision apostrophe, the single quotes won't be usable and you'll use chevrons like in this ‹demo’› and not single or double quotes which are difficult to discriminate. 2015-06-11 19:47 GMT+02:00 Bill Poser billpos...@gmail.com: To add a factor that I think hasn't been mentioned, there are languages in which apostrophe is used both as a letter by itself and as part of a complex letter. Most of the native languages of British Columbia write glottalized consonants as C+', e.g. t' for an ejective alveolar stop, and many use apostrophe by itself for the glottal stop. (Another common convention, which produces other difficulties, is to use the number 7 for glottal stop.) Bill On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy tcla...@mozilla.com wrote: On 4/Jun/2015 14:34 PM, Markus Scherer wrote: Looks all wrong to me. Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your points below. You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. And UAX #29 doesn't work for words which begin or end with apostrophes, whether represented by U+0027 or U+2019. It erroneously thinks there's a word boundary between the apostrophe and the rest of the word. But UAX #29 *would* work if the apostrophes were represented by U+02BC, which is what I'm suggesting. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. I'm not trying to blame anyone. I'm trying to fix the problem. I know this problem has a long history. English is taught as that squiggle being punctuation, not a letter. I think we need make a distinction between the colloquial usage of the word punctuation and the Unicode general category punctuation which has specific technical implications. 
I somewhat wish that Unicode had a separate category for Things that look like punctuation but behave like letters, which might clear up this taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are actually modifiers, into that category too.) But we don't. And the English apostrophe behaves like a letter, regardless of what your primary school teacher might have told you, so with the options available in Unicode, it needs to be classed as a letter. don’t is a contraction of two words, it is not one word. This is utter nonsense. Should my spell-checker recognise hasn't as a valid word? Or should it consider hasn't to be the word hasn followed by the word t, and then flag both of them as spelling errors? Is fo'c'sle the three separate words fo, c, and sle? The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. If anything, Unicode might have made a mistake in encoding two of these that look
Re: Another take on the English apostrophe in Unicode
I agree with the recommendation of U+02BC. However, it is in fact rarely used because most of the people who write these languages or create supporting infrastructure are unaware of such issues. A small point: it isn't always the spacing diacritic that is used. In some languages, e.g. Halkomelem, people use the spacing apostrophe if they have to but prefer the non-spacing version.

On Thu, Jun 11, 2015 at 11:39 AM, Philippe Verdy verd...@wanadoo.fr wrote: Also used in the Breton trigram c’h (considered as a single letter of the Breton alphabet, but actually entered as two letters with a diacritic-like apostrophe in the middle (which in this case is still not a letter of the alphabet...): the trigram c’h is distinct from the digram ch. Breton **also** uses a regular apostrophe for elision. In fact what you note for the ejective in native american languages is effectively a right-combining diacritic, and still not a letter by itself. However, given its position and the fact it is spacing, this is the spacing form of the apostrophe diacritic that should be used, and that form is then to choose between: * U+00B4 (acute, most often ugly, located too high, and too much horizontal), * U+02B9 (prime, nearly good, but still too high), * U+02BC (apostrophe), * U+02C8 (vertical high tick, but confusable with the mark of stress in IPA before a phonetic syllable), and * U+02CA (acute/2nd tone, which for me is not distinct from 00B4, only used with sinograms in Mandarin Chinese, with its metrics distinct from U+00B4 that match the Latin metrics). In my opinion 02BC is the best choice for the diacritic apostrophe. The other character for the **elision** apostrophe is a punctuation mark U+2019 (just like the full stop punctuation is also used as an abbreviation mark). There's no confusion with its alternate role as a right-side single quote because U+2019 is used in languages that normally never use the single quotes, but chevrons (or other punctuation signs in East-Asian scripts).
But in English where single quote are used for small quotations, there's still a problem to represent this elision apostrophe when it does not occur between two letters where it also marks a gluing of two morphemes (as in don't or Peter's), but at the begining or end of a word. But elisions at end of words is also invalid when this is the final word of a quoted sentence. If you really want to cite a single English word terminated by an elision apostrophe, the single quotes won't be usable and you'll use chevrons like in this ‹demo’› and not single or double quotes which are difficult to discriminate. 2015-06-11 19:47 GMT+02:00 Bill Poser billpos...@gmail.com: To add a factor that I think hasn't been mentioned, there are languages in which apostrophe is used both as a letter by itself and as part of a complex letter. Most of the native languages of British Columbia write glottalized consonants as C+', e.g. t' for an ejective alveolar stop, and many use apostrophe by itself for the glottal stop. (Another common convention, which produces other difficulties, is to use the number 7 for glottal stop.) Bill On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy tcla...@mozilla.com wrote: On 4/Jun/2015 14:34 PM, Markus Scherer wrote: Looks all wrong to me. Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your points below. You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. And UAX #29 doesn't work for words which begin or end with apostrophes, whether represented by U+0027 or U+2019. It erroneously thinks there's a word boundary between the apostrophe and the rest of the word. But UAX #29 *would* work if the apostrophes were represented by U+02BC, which is what I'm suggesting. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. I'm not trying to blame anyone. I'm trying to fix the problem. I know this problem has a long history. 
English is taught as that squiggle being punctuation, not a letter. I think we need make a distinction between the colloquial usage of the word punctuation and the Unicode general category punctuation which has specific technical implications. I somewhat wish that Unicode had a separate category for Things that look like punctuation but behave like letters, which might clear up this taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are actually modifiers, into that category too.) But we don't. And the English apostrophe behaves like a letter, regardless of what your primary school teacher might have told you, so with the options available in Unicode, it needs to be classed as a letter. don’t is a contraction of two words, it is not one word. This is utter nonsense. Should my spell-checker recognise hasn't as a
Re: Another take on the English apostrophe in Unicode
On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy verd...@wanadoo.fr wrote: The ASCII punctuation marks have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. [...] Pd (dash) is then appropriate for the ASCII hyphen-minus. I agree, but I wasn't talking about the ASCII hyphen, U+002D (HYPHEN-MINUS). I was talking about U+2010 (HYPHEN). I also wasn't talking about changing the properties of U+0027 (APOSTROPHE). in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words This is drifting off-topic, but I wanted to address the thing you just said above. Firstly, in the dictionaries I've seen, the slanted double hyphen is only used when a line break happens to occur at the same place as a true hyphen. It replaces the true hyphen. When a line is broken at a hyphenation point between letters, an ordinary-looking hyphen is displayed. Secondly, this character is encoded in Unicode at U+2E17 (DOUBLE OBLIQUE HYPHEN). - Ted On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy verd...@wanadoo.fr wrote: The ASCII punctuation marks have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. So the ASCII hyphen and apostrophe-quote can only be given a very weak category that just exhibits their visual role. Pd (dash) is then appropriate for the ASCII hyphen-minus. You can't really tell from the character alone whether it is a punctuation mark or a minus sign. If it is a minus sign you can re-encode it better using the more specific mathematical minus sign. 
Otherwise, even if it is not a minus sign, it can be: - a connector between words in compound words (hyphen) - a trailing mark at end of lines for indicating a word has been broken in the middle (but remember that I asked previously for another character for that role because this word-breaking hyphen is not necessarily a horizontal hyphen: in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in pocket books with very narrow columns and minimized spacing) - a bullet leading items in a vertical list (this should be an en dash, followed by some spacing) - a punctuation mark (not necessarily at the beginning of a line) marking the change of person speaking (very common in literature, notably in theatre). As a connector between words, there's a demonstrated need of differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), and long hyphens between two separate names that are joined (for example in proper names after marriage; in France, INSEE encodes this for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...). Still nobody replied to my past comment (about 1 month ago) about the various forms of the word-breaking hyphen / line-wrapping symbol: * I'm not speaking about the SHY control, but about the real character whose glyph appears when SHY is materialized at end of lines (and which should be neither a minus nor an en dash, but also not the same as the orthographic hyphen used between words in a compound word). * This character can also be found (and is needed) for breaking long mathematical formulas and must be clearly distinct from the regular minus. 
* This character is also needed for rendering long lines of programming code or textual data (it is something that must not be entered in programs but that must be rendered because these programs or codes have significant line breaks: the glyph indicates that the following rendered line break is to be discarded). Not all programming languages have a syntax allowing an escape before the line break (such escaping varies, it may be a backslash in C/C++, or an underscore in Basic, but in data dumps such as CSV files, it is impossible to note such an escape in the data language itself, and we need to render some specific glyph). * This character is absolutely needed when rendering on a static medium (i.e. printing or broadcasting); for dynamic media (such as personal displays with a personal UI) we could still use scrolling, but users don't like horizontal scrolls and highly prefer reading the text directly. So they expect to see a distinctive glyph (or icon) to tell line breaks that are significant from those that merely wrap overlong lines, and still see the distinction with other regular hyphens and minus (that are also significant and very frequently distinct). 2015-06-11 0:51 GMT+02:00 Ted Clancy tcla...@mozilla.com: On 4/Jun/2015 19:01, Leo Broukhis wrote: Along the same
Re: Another take on the English apostrophe in Unicode
2015-06-11 20:46 GMT+02:00 Bill Poser billpos...@gmail.com: I agree with the recommendation of U+02BC. However, it is in fact rarely used because most of the people who write these languages or create supporting infrastructure are unaware of such issues. A small point: it isn't always the spacing diacritic that is used. In some languages, e.g. Halkomelem, people use the spacing apostrophe if they have to but prefer the non-spacing version. True, but in the examples I gave, spacing is needed: the apostrophe is intended not to collide with the previous or next letter, including when writing capital letters. In the Breton trigram c’h, where it plays a diacritic role, and likewise in the English elision don’t, the collision would occur after the apostrophe with the ascenders. The only alternative would have been to use a diacritic above one of the two letters for the diacritic apostrophe (and the best diacritic to use for Breton or English would have been an acute accent over the first consonant). But such use of combining characters is non-conforming for an elision mark. 
An elision alone is not supposed to change the pronunciation of the remaining letters. So it would not have been appropriate for the elisions in English don’t, or in French j’ai or s’est (this is not a strict rule: French and English also have exceptions where some combinations are used and written that change the way the letters are effectively phonetically realized, including with elisions. don’t is a perfect example, where n loses its consonant value as it is glued with the previous vowel to nasalize it and slightly stress it, and in other contexts the following t is also muted, as in you don't have to do that in fast speech. This is still the same contraction/elision and it is justified to keep the elision mark separate without noting how the preceding or following letters are contextually realized, but in all cases the elision glues two syllables into only one and the apostrophe is written between the remaining letters of the morphemes on each side). If you use a non-spacing version, this can in fact only occur graphically when the following letter is a small letter without ascenders: I still think that this is the spacing version, but what happens is just the effect of some contextual typographic kerning (the same thing that happens in pairs like AV, fi, ij, To...). Also you claim that U+02BC is rarely used for the elision apostrophe. This is plain wrong for French at least, even if people only have an ASCII apostrophe on their native keyboard (there are many word processors that will correctly enter the appropriate curly apostrophe as U+02BC instead of the ugly ASCII vertical quote). Even in English, when you look at correctly typeset documents, the ASCII quote is replaced by U+02BC (look at large section headings, book titles). U+02BC is also preferred in English for the elision apostrophe. 
For English you may want to read this: http://www.creativebloq.com/typography/mistakes-everyone-makes-21514129 ASCII and the computer keyboards just perpetuate the limited charset that was supported by old mechanical typewriters. I don't understand why PC keyboards could be extended to add many multimedia control keys or function keys, but not the traditional quotes that are needed (and sometimes even letters still missing in all standard physical keyboard layouts for French, such as œ/Œ, æ/Æ, or frequent capitals with accents such as É, which is however present on virtual onscreen keyboards for smartphones and tablets). It's high time to restore these letters, and also to campaign so that manufacturers of physical keyboards will add a few more keys for national letters (they did it for Japanese only, why not for French or even English, to have more punctuation signs and missing letters or diacritics?). It is perfectly possible to find a place for them on physical keyboards just above the numeric keys (F1..F12 keys can be compacted if needed, and a couple of dead keys can also be mapped to the right of the Return key without reducing the size of the space bar or the Return/Backspace keys or other modifier keys). Some notebook manufacturers have used two additional preprogrammed keys (e.g. Acer, stupidly, for an unneeded additional Euro symbol whose location on AltGr+E or AltGr+4 in UK is standard, the second one being bound to the dollar symbol, also not needed!). What is needed is 5 standard keys with standard keycodes, different from keycodes used for user-programmable keys (generally labelled PF1, PF2... but sometimes unlabelled) and different from application-dependent function keys (e.g. 
generic color keys, like on TV remote controls for navigation in menus: red, green, yellow, blue). Note that this is different from the existing feature on some keyboards defining programmable keys, whose layout is not programmable by the driver itself but by individual settings of the user, independently of the selected
Re: Another take on the English apostrophe in Unicode
On 4/Jun/2015 14:34 PM, Markus Scherer wrote: Looks all wrong to me. Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your points below. You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. And UAX #29 doesn't work for words which begin or end with apostrophes, whether represented by U+0027 or U+2019. It erroneously thinks there's a word boundary between the apostrophe and the rest of the word. But UAX #29 *would* work if the apostrophes were represented by U+02BC, which is what I'm suggesting. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. I'm not trying to blame anyone. I'm trying to fix the problem. I know this problem has a long history. English is taught as that squiggle being punctuation, not a letter. I think we need make a distinction between the colloquial usage of the word punctuation and the Unicode general category punctuation which has specific technical implications. I somewhat wish that Unicode had a separate category for Things that look like punctuation but behave like letters, which might clear up this taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are actually modifiers, into that category too.) But we don't. And the English apostrophe behaves like a letter, regardless of what your primary school teacher might have told you, so with the options available in Unicode, it needs to be classed as a letter. don’t is a contraction of two words, it is not one word. This is utter nonsense. Should my spell-checker recognise hasn't as a valid word? Or should it consider hasn't to be the word hasn followed by the word t, and then flag both of them as spelling errors? Is fo'c'sle the three separate words fo, c, and sle? 
The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? Yeah, and there are fonts where I can't tell the difference between capital I and lower-case l. But my spell-checker will underline a word where I erroneously use an I instead of an l, and I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. - Ted
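For readers who want to check the category argument above, here is a minimal sketch using only Python's standard `unicodedata` and `re` modules. It is an illustration, not part of the original post: Python's `\w` is exactly the kind of simple regular expression Markus warns about and is not an implementation of UAX #29, so it only shows how the letter/punctuation distinction changes naive tokenisation. The helper names `describe` and `simple_words` are made up for this example.

```python
import re
import unicodedata

# The three candidate characters discussed in the thread.
CANDIDATES = {
    "U+0027": "\u0027",  # APOSTROPHE
    "U+2019": "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "U+02BC": "\u02bc",  # MODIFIER LETTER APOSTROPHE
}


def describe(ch):
    """Return (General_Category, character name) for one character."""
    return unicodedata.category(ch), unicodedata.name(ch)


def simple_words(text):
    """Split text with the naive \\w+ pattern (NOT the UAX #29 algorithm)."""
    return re.findall(r"\w+", text)


for label, ch in CANDIDATES.items():
    category, name = describe(ch)
    # Show a medial apostrophe (don?t) and a leading one (?tis).
    print(label, category, name,
          simple_words("don" + ch + "t"), simple_words(ch + "tis"))
```

Only U+02BC has a letter category (Lm), so only with U+02BC do the naive matches keep don?t and ?tis in one piece; the other two are punctuation and split the word.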
Re: Another take on the English apostrophe in Unicode
On 4/Jun/2015 19:01, Leo Broukhis wrote: Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the word ack-ack isn't decomposable into words, or even morphemes, ack and ack. I do think that U+2010 (HYPHEN) is miscategorised. I think it should have General Category = Pc, not Pd. (That is, hyphens are connectors, not dashes.) That would make it a word character. Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning it can occur in the middle of numbers or letters). UAX #29 says that U+2010 deliberately does *not* have Word Break = MidNumLet, though an implementation may treat it as if it did. (UAX #29 doesn't give any reasons for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have Word Break = MidNumLet, due to its history of being used as a dash or minus sign, but U+2010 should never be used as a dash or minus sign, so I don't see the problem.) But luckily, the miscategorisation of U+2010 hasn't led to any pressing practical problems, unlike the misuse of U+2019 for the apostrophe. - Ted
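The Pd versus Pc distinction can be inspected with Python's standard `unicodedata` module. This is only a sketch added for illustration: the standard library does not expose the Word_Break property, so MidNumLet cannot be checked this way, and `\w` is used merely as a rough stand-in for "word character". U+005F (LOW LINE) is included as an example of a character that does have category Pc.

```python
import re
import unicodedata

# Hyphen-like characters from the discussion, plus LOW LINE as a Pc example.
SAMPLES = ("\u002d", "\u2010", "\u005f")


def general_category(ch):
    """Return the two-letter Unicode General_Category of a character."""
    return unicodedata.category(ch)


def simple_words(text):
    """Naive \\w+ tokenisation; not the UAX #29 word-break algorithm."""
    return re.findall(r"\w+", text)


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), general_category(ch),
          simple_words("ack" + ch + "ack"))
```

Both hyphens report Pd and split ack-ack into two matches, while the Pc character keeps the naive match in one piece.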
Re: Another take on the English apostrophe in Unicode
The ASCII punctuation marks have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. So the ASCII hyphen and apostrophe-quote can only be given a very weak category that just exhibits their visual role. Pd (dash) is then appropriate for the ASCII hyphen-minus. You can't really tell from the character alone whether it is a punctuation mark or a minus sign. If it is a minus sign you can re-encode it better using the more specific mathematical minus sign. Otherwise, even if it is not a minus sign, it can be: - a connector between words in compound words (hyphen) - a trailing mark at end of lines for indicating a word has been broken in the middle (but remember that I asked previously for another character for that role because this word-breaking hyphen is not necessarily a horizontal hyphen: in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in pocket books with very narrow columns and minimized spacing) - a bullet leading items in a vertical list (this should be an en dash, followed by some spacing) - a punctuation mark (not necessarily at the beginning of a line) marking the change of person speaking (very common in literature, notably in theatre). As a connector between words, there's a demonstrated need of differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), and long hyphens between two separate names that are joined (for example in proper names after marriage; in France, INSEE encodes this for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...). 
Still nobody replied to my past comment (about 1 month ago) about the various forms of the word-breaking hyphen / line-wrapping symbol: * I'm not speaking about the SHY control, but about the real character whose glyph appears when SHY is materialized at end of lines (and which should be neither a minus nor an en dash, but also not the same as the orthographic hyphen used between words in a compound word). * This character can also be found (and is needed) for breaking long mathematical formulas and must be clearly distinct from the regular minus. * This character is also needed for rendering long lines of programming code or textual data (it is something that must not be entered in programs but that must be rendered because these programs or codes have significant line breaks: the glyph indicates that the following rendered line break is to be discarded). Not all programming languages have a syntax allowing an escape before the line break (such escaping varies, it may be a backslash in C/C++, or an underscore in Basic, but in data dumps such as CSV files, it is impossible to note such an escape in the data language itself, and we need to render some specific glyph). * This character is absolutely needed when rendering on a static medium (i.e. printing or broadcasting); for dynamic media (such as personal displays with a personal UI) we could still use scrolling, but users don't like horizontal scrolls and highly prefer reading the text directly. 
So they expect to see a distinctive glyph (or icon) to tell line breaks that are significant from those that merely wrap overlong lines, and still see the distinction with other regular hyphens and minus (that are also significant and very frequently distinct). 2015-06-11 0:51 GMT+02:00 Ted Clancy tcla...@mozilla.com: On 4/Jun/2015 19:01, Leo Broukhis wrote: Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the word ack-ack isn't decomposable into words, or even morphemes, ack and ack. I do think that U+2010 (HYPHEN) is miscategorised. I think it should have General Category = Pc, not Pd. (That is, hyphens are connectors, not dashes.) That would make it a word character. Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning it can occur in the middle of numbers or letters). UAX #29 says that U+2010 deliberately does *not* have Word Break = MidNumLet, though an implementation may treat it as if it did. (UAX #29 doesn't give any reasons for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have Word Break = MidNumLet, due to its history of being used as a dash or minus sign, but U+2010 should never be used as a dash or minus sign, so I don't see the problem.) But luckily, the miscategorisation of U+2010 hasn't led to any pressing practical problems, unlike the misuse of U+2019 for the apostrophe. - Ted
Re: Another take on the English apostrophe in Unicode
The French pomme de terre (potato in English, French vulgar synonym : patate) is a single lemma in dictionaries, but is still 3 separate words (only the first one takes the plural mark), it is not considered a nom composé (so there's no hyphens). And they are separated by standard spaces (that are breakable, and expansible/compressible like all others in case of justified text)... The lemma is still recognized if there are extra punctuation in the middle such as : « pomme » de terre. We don't need any new space character. What you want is to insert markup to exhibit the structure of sentences for grouping words semantically or grammaticaly. But nobody including grammarians will use this new space, what they'll use is in fact some additional symbols or presentation features (enclosing boxes, braces above or below, colors...) if they want to exhibit it on top of the standard text. 2015-06-06 3:08 GMT+02:00 Eric Muller eric.mul...@efele.net: On 6/5/2015 10:29 AM, John D. Burger wrote: Linguistically, don't and friends pass all the diagnostics that indicate they're single words. If I am not mistaken, the french pomme de terre also passes the diagnostics. So we need a new space character. Eric.
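The point that a multi-word lemma can be recognised without any special space character can be sketched with a small lexicon lookup. This is only an illustration under stated assumptions: the lexicon, the function name `group_lemmas` and the greedy longest-match strategy are all hypothetical, and the sketch does not handle intervening punctuation such as « pomme » de terre.

```python
# Hypothetical lexicon of multi-word lemmas, stored as tuples of words.
LEXICON = {("pomme", "de", "terre")}


def group_lemmas(words, lexicon, max_len=3):
    """Greedily group ordinary space-separated words into known lemmas."""
    groups = []
    i = 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + n])
            if n == 1 or candidate in lexicon:
                groups.append(candidate)
                i += n
                break
    return groups


print(group_lemmas("une pomme de terre chaude".split(), LEXICON))
```

The grouping lives entirely in the lexicon and the analysis step; the text itself keeps its standard, breakable spaces.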
Re: Another take on the English apostrophe in Unicode
On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. Leo On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis l...@mailcom.com wrote: Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the word ack-ack isn't decomposable into words, or even morphemes, ack and ack. Leo On Thu, Jun 4, 2015 at 6:31 PM, David Starner prosfil...@gmail.com wrote: On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com wrote: don’t is a contraction of two words, it is not one word. But as he points out, it's not a contraction of don and t; it is, at best, a contraction of do and n't. It's eliding, not punctuating. In the comments, he also brings up the examples of Don’t you mind? being okay but not *Do not you mind?, and fo’c’sle. You can't use simple regular expressions to find word boundaries. Who uses _simple_ regular expressions? You can't use any code to reliably find word boundaries in English, and that's a problem.
Re: Another take on the English apostrophe in Unicode
But the point was that treating hyphens as parts of words is not generally a wrong thing. That brings us back to my original question: where's MODIFIER LETTER HYPHEN, then? A word is a sequence of letters, isn't it? :) I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Leo On Thu, Jun 4, 2015 at 11:58 PM, David Starner prosfil...@gmail.com wrote: On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote: On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. But the point was that treating hyphens as parts of words is not generally a wrong thing. There is one generally consistent rule for hyphens. When apostrophes and quotes are conflated, there is no one generally acceptable rule.
Re: Another take on the English apostrophe in Unicode
On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote: On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. But the point was that treating hyphens as parts of words is not generally a wrong thing. There is one generally consistent rule for hyphens. When apostrophes and quotes are conflated, there is no one generally acceptable rule.
Re: Another take on the English apostrophe in Unicode
The conflict is between linguists and programmers. In plain text apostrophe is a punctuation used instead letters (unreadable, one or more) or as separator for avoid connecting letters into ligature or syllable, between parts of composite word as well as inside the simple word, or finally, as quotation mark. Yes it is ambiguous! It is. It just is! Linguists say It is. We see that. We know that. And programmers say That's wrong! We can't understand that. Just are you so stupid if you can't! Modifier letter apostrophe is a letter that used as itself and means itself (ejective sound e.g.) only. Don't use it else. It just make more confusion.
Re: Another take on the English apostrophe in Unicode
Markus Scherer wrote: How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a show in colour mode where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? That is, CONTROL U+0027 and CONTROL SHIFT U+0027 respectively. If people want this facility, maybe it could become published in a Unicode Technical Report so that standardization and interoperability could be achieved. William Overington 5 June 2015
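The "show in colour" idea can be approximated today in a terminal. The following is a rough sketch, assuming a terminal that understands the common ANSI SGR colour codes (36 for cyan, 31 for red); the function name `colourise` is invented for this example and nothing here reflects an existing word-processor feature.

```python
# ANSI escape sequences (assumed to be supported by the terminal in use).
CYAN = "\x1b[36m"
RED = "\x1b[31m"
RESET = "\x1b[0m"

# Colour assignment as proposed above: U+2019 in cyan, U+02BC in red.
COLOURS = {"\u2019": CYAN, "\u02bc": RED}


def colourise(text):
    """Wrap each U+2019 / U+02BC in a colour code; leave other text alone."""
    return "".join(
        COLOURS[ch] + ch + RESET if ch in COLOURS else ch
        for ch in text
    )


print(colourise("\u2018don\u02bct\u2019"))
```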
Re: Another take on the English apostrophe in Unicode
On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis l...@mailcom.com wrote: I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely then creating a new one.
Re: Another take on the English apostrophe in Unicode
On Fri, Jun 5, 2015 at 2:43 AM QSJN 4 UKR qsjn4...@gmail.com wrote: The conflict is between linguists and programmers. No, it's not. Yes it is ambiguous! It is. It just is! Linguists say It is. We see that. We know that. Now you programmers find some way to deal with that so you can produce useful corpuses for linguistic work. Which is what this is all about, is producing good linguistic interpretations of plain text, for, among others, linguists whose supply of scanned text has exceeded their ability to hand-process it. Modifier letter apostrophe is a letter that used as itself and means itself (ejective sound e.g.) only. Don't use it else. It just make more confusion. If you don't know what language a text is in, you can't tell what sounds letters make. Adding this character to English's repertoire won't change that.
Re: Another take on the English apostrophe in Unicode
I don’t have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See Zwicky and Pullum 1983: Zwicky, Arnold M., and Geoffrey K. Pullum. “Cliticization vs. Inflection: English N’T.” Language 59, no. 3 (1983): 502–513. It’s nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435
Re: Another take on the English apostrophe in Unicode
Markus Scherer wrote: How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? I replied: Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a show in colour mode where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? I am wondering whether some existing software packages might be able to be used for the character inputting part using customized keyboard short cuts. https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts I realize that the cyan and red colours cannot be done at present, yet I have now thought of the alternative for now of being able to test what is in the text by using a special version of an open source font where there are distinctive glyphs one from the other for the two characters. William Overington 5 June 2015
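As an alternative to a special font, what is in the text can also be tested programmatically. Here is a minimal sketch using Python's standard `unicodedata` module; the function name `find_apostrophes` and the sample string are made up for illustration.

```python
import unicodedata

# The two look-alike characters under discussion.
TRACKED = ("\u2019", "\u02bc")


def find_apostrophes(text):
    """Return (index, code point label, character name) for each tracked character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in TRACKED
    ]


# Sample: an opening quote, "don?t" written with U+02BC, then a closing U+2019.
sample = "\u2018don\u02bct\u2019"
for hit in find_apostrophes(sample):
    print(hit)
```

A report like this makes the two characters distinguishable even in fonts where their glyphs are identical.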
Re: Another take on the English apostrophe in Unicode
On Jun 4, 2015, at 17:34 , Markus Scherer markus@gmail.com wrote: Looks all wrong to me. don’t is a contraction of two words, it is not one word. Yes it is. Is keyboard two words? How about newspaper? If don't is two words, please tell me what two words make up won't? (Hint, neither of them is will.) Linguistically, don't and friends pass all the diagnostics that indicate they're single words. - John Burger English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawaiʻian ʻOkina.) You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? markus
Re: Another take on the English apostrophe in Unicode
QSJN 4 UKR qsjn4ukr at gmail dot com wrote:

> And programmers say That's wrong! We can't understand that. Just are you so stupid if you can't!

You know, we really aren't all like that. Some of us actually try to meet user needs.

--
Doug Ewell | http://ewellic.org | Thornton, CO
Re: Another take on the English apostrophe in Unicode
On 6/5/2015 10:29 AM, John D. Burger wrote:

> Linguistically, don't and friends pass all the diagnostics that indicate they're single words.

If I am not mistaken, the French pomme de terre also passes the diagnostics. So we need a new space character.

Eric.
Re: Another take on the English apostrophe in Unicode
On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com wrote:

> don’t is a contraction of two words, it is not one word.

But as he points out, it's not a contraction of don and t; it is, at best, a contraction of do and n't. It's eliding, not punctuating. In the comments, he also brings up the examples of Don’t you mind? being okay but not *Do not you mind?, and fo’c’sle.

> You can't use simple regular expressions to find word boundaries.

Who uses _simple_ regular expressions? You can't use any code to reliably find word boundaries in English, and that's a problem.
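For readers who want to see the word-boundary point concretely, here is a minimal sketch using only Python's standard `re` and `unicodedata` modules. The helper names `naive_words` and `general_category` are hypothetical, and the `\w+` pattern stands in for the "simple regular expressions" mentioned above; it is not an implementation of UAX #29.

```python
import re
import unicodedata

CURLY = "\u2019"     # RIGHT SINGLE QUOTATION MARK
MODIFIER = "\u02bc"  # MODIFIER LETTER APOSTROPHE


def naive_words(text):
    """Split text into runs of word characters, as a simple regex would."""
    return re.findall(r"\w+", text)


def general_category(ch):
    """Return the Unicode general category of a single character."""
    return unicodedata.category(ch)


if __name__ == "__main__":
    for mark in (CURLY, MODIFIER):
        word = f"don{mark}t"
        # Print the code point, its general category, and the naive split.
        print(f"U+{ord(mark):04X}", general_category(mark), naive_words(word))
```

U+2019 has general category Pf (punctuation), so the naive pattern splits don’t in two, while U+02BC has category Lm (a letter), so donʼt stays whole. That difference is the practical core of the disagreement in this thread.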
Re: Another take on the English apostrophe in Unicode
Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes.

On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis l...@mailcom.com wrote:

> Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the word ack-ack isn't decomposable into words, or even morphemes, ack and ack.
>
> Leo
Re: Another take on the English apostrophe in Unicode
Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the word ack-ack isn't decomposable into words, or even morphemes, ack and ack.

Leo
Re: Another take on the English apostrophe in Unicode
Looks all wrong to me. don’t is a contraction of two words, it is not one word. English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawaiʻian ʻOkina http://en.wikipedia.org/wiki/%CA%BBOkina.) You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? markus
Another take on the English apostrophe in Unicode
An interesting argument for U+02BC MODIFIER LETTER APOSTROPHE as English apostrophe : https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ Frédéric