Re: Another take on the English Apostrophe in Unicode

2015-06-18 Thread Marcel Schneider
Dear Mr Ewell,

As I was very puzzled on reading Mr Davis' last reply yesterday, I refrained 
from mailing you separately as I had intended to.
For the same reason, I forgot to remove an outdated phrase that I would never 
have written after reading the e-mails from Mr Kolehmainen, Mr Suignard 
and Mr Constable that I found yesterday. I beg everybody's pardon.

On Wed, Jun 17, I wrote:
 Experience proves that often a lot of mails, e-mails, blog posts, forum posts, 
 tweets and so on are needed to get things moving. 
 The best way of getting nothing done is to get everybody convinced itʼs 
 all OK. Thatʼs what I sometimes feel reading this thread, 
 or the one about ISO/IEC JTC1/SC2/WG2 that is on-going in the meantime! 
 And the only way to get something changed has always been to show itʼs wrong. 
 From there on, the next step would be to find out who is responsible. 

Please read instead:
 Experience proves that often a lot of mails, e-mails, blog posts, forum posts, 
 tweets and so on are needed to get things moving. 
 The best way of getting nothing done is to get everybody convinced itʼs 
 all OK. Thatʼs what I sometimes feel reading this thread. 
 And the only way to get something changed has always been to show itʼs wrong. 
 From there on, the next step would be to find out who is responsible. 
 
Best regards,

Marcel S. 


Re: Another take on the English apostrophe in Unicode

2015-06-17 Thread Marcel Schneider
On Mon, Jun 16, 2015, Richard Wordingham  wrote:

 I don't know if you have the wrong link for MSKLC, but that link
 claims it is only 'supported' up to Vista. That's not much of an
 invitation! I do know that MSKLC works on Windows 7, and its output
 there is appropriate for Windows 7, generate multiple versions of
 the DLL and its installer. 

I'm sorry, I didn't think about the issue. 

The download link is not wrong, AFAIK it's the only available download page for 
the (most recent) 1.4 version. 


And this version works for Windows 8, too [and, I hope, for the coming Windows 
10], as this thread on Microsoft Community shows:

http://answers.microsoft.com/en-us/windows/forum/windows_8-winapps/msklc-microsoft-keyboard-layout-creator-for/a54a4db0-94c0-4f08-8909-37a7c5b758bb

 

Marcel 


Re: Another take on the English Apostrophe in Unicode

2015-06-17 Thread Marcel Schneider
On Tue, Jun 16, Mark Davis ☕️  wrote:

 And, Marcel, while you are at it, this is getting tiresome.
 Please find some other place to vent about events you know very little about; 
 the internet is full of them.

Dear Mark,

I understand (a little) that I'm tiresome. Please consider nevertheless that 
the Unicode Public Mailing List is AFAIK the only place where people can 
communicate with Unicode decision makers. No other mailing list or forum 
on the internet can do this. Even Microsoft's Community forum can do nothing at 
Microsoft, forum volunteers told me. I posted there in French and in English. 
In French my most useful post seems to be at 
http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b

Since your message could not reach me yesterday, I had prepared two replies 
that I wanted to send today: one to Doug and one to you. 
If you agree, I'll paste them both below.

On Tue, Jun 16, 2015, Doug Ewell  wrote:

 You know what? If you want to use U+02BC as an English apostrophe, go ahead 
 and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.

You know I did, and if it were just for my own sake, Iʼd probably never have 
started mailing in this thread. A big part of the text to be processed for 
quotes originates from other people. So when I use U+02BC, I have done good 
work (if I am ever quoted :)). 

An essential condition is that all text handling software is updated to handle 
the letter apostrophe correctly. Without an official recommendation, this is 
not likely to be done. And this recommendation cannot be usefully issued unless 
Microsoft agrees. We remember that without Microsoft, the Unicode Consortium 
probably wouldnʼt have been founded, and character encoding wouldnʼt thrive as 
it does today.

On Mon, Jun 15, 2015, 20:14, Doug Ewell  wrote:

 Perhaps a UTC member can confirm whether this is fact or speculation. Markus 
 Kuhn's comment from 1999 about couldn't Unicode follow Microsoft...? 
 doesn't prove that Unicode was in fact strong-armed by Microsoft.

I know that Markus Kuhnʼs concern was very valuable and he did a great job by 
showing how to eradicate the clumsy simulation of quotes that was current at 
the time, due to the lack of characters. You remember, they used accents as 
quotes, and at that stage, the mixup was between apostrophe and acute!
https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html
The curly glyph for 0x27 in old ASCII fonts and its reversed counterpart mapped 
to 0x60, which Mr Kuhn shows on this page together with how to replace them 
properly, are reminiscent of the U+201B/U+2019 quote pair. The deprecated 
REVERSED SINGLE COMMA QUOTATION MARK was discussed on this List, the conclusion 
being:


On Thu, Jun 15, 2006, Andreas Prilop  wrote:
http://www.unicode.org/mail-arch/unicode-ml/y2006-m06/0265.html

 Actually, I have seen such quotation marks in English-language books printed 
 in Britain and the USA. But, as I wrote, they are certainly not preferred. 
 *If* you want such quotation marks, then please use U+201B for them!

At that time, the matter was correct rendering. Today, itʼs correct processing. 

Yes, fortunately U+02BC is *not deprecated* for English apostrophe, and looking 
closer, IMO there is *no recommendation* for U+2019 either, just a stated 
preference. As I wrote earlier in this thread, Unicode logically and seemingly 
changed the preference against its will. 
Logically, because the first recommendation (like the whole Standard) was 
consciously designed, as Mr Davis reminded us the day before yesterday.
Seemingly, because the U+0027 comment line in the Code Chart was changed 
from
 preferred character for apostrophe is 2019
to
 2019 is preferred for apostrophe
between the 3.0.0 and 4.0.0 versions (while the line “preferred characters in 
English for paired quotation marks are 2018 & 2019” remained unchanged; see the 
complete comparison at http://charupdate.info#ambiguation).

On Tue, Jun 16, 2015, Doug Ewell  wrote:

 I do wish we could put an end to all the accusations of malfeasance.

Experience proves that often a lot of mails, e-mails, blog posts, forum posts, 
tweets and so on are needed to get things moving. The best way of getting 
nothing done is to get everybody convinced itʼs all OK. Thatʼs what I sometimes 
feel reading this thread, or the one about ISO/IEC JTC1/SC2/WG2 that is 
on-going in the meantime! 
And the only way to get something changed has always been to show itʼs wrong. 
From there on, the next step would be to find out who is responsible. 


About the apostrophe, weʼre all a bit responsible. 
Why hide that British English usage does little to disambiguate things, by 
preferring single quotes as the primary quotation marks, which leads some 
authors to end up preferring chevrons even in English? See Chris Harvey 
(pleading for U+2019 as apostrophe) at 
http://www.languagegeek.com/typography/apostrophes.html#Anchor-Potentia-61409

But 

Re: Another take on the English Apostrophe in Unicode

2015-06-17 Thread Marcel Schneider
On Tue, Jun 16, 2015, Philippe Verdy  wrote:

 When ISO 8859-1 was designed (in fact in an early version by Digital for its 
 own version of Unix), allowing a bijective compatibility with 8-bit EBCDIC 
 and its C1 controls was still a priority.

 Microsoft abandoned its own development of Unix to develop DOS and extend it 
 with Windows in parallel with its work with IBM, which had wanted DOS to be a 
 very lightweight version of CP/M, but without a scheduler, in order to run 
 software on personal computers that could be used in small organisations 
 that could not buy its mainframes, but had to prepare documents and data that 
 could be reused on IBM mainframes...

 

Thank you Philippe for the information. It was a very good idea to build a 
system with no need for C1 and to remap the two ranges to complementary 
characters, which are indispensable, notably in French, and to start with the 
single quotes.


 

Marcel 


Re: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Marcel Schneider
On Mon, Jun 15, 2015, 17:12, Doug Ewell  wrote:

 Marcel Schneider wrote:
[...]
 Microsoft’s choice of mashing up apostrophe and close-quote to end up
 with an unprocessable hybrid was wrong. Very wrong.

 Windows-1252 and the other Windows code pages were developed during the
 1980s, before Unicode, when almost all non-Asian character sets were
 limited to 256 code points. The distinctions between apostrophe and
 right-single-quote, weighed against the confusion caused by encoding two
 identical-looking characters, would never have been sufficient back then
 to justify separate encoding in this limited space.

I replied:

 The problem is not about code pages [...]

I thank you for your answers and I'll come back to some of them below. 
There is a new fact to bring up first. 

I concede that my last reply yesterday evening was incorrect. 

In addition to Microsoftʼs action in the late nineties urging Unicode to give 
up its useful apostrophe recommendation (U+02BC), the design of code page 
Windows-1252 is indeed within my scope.

Since I learned there are very good, overriding reasons to use U+02BC in 
English, and that Unicodeʼs corresponding recommendation was withdrawn in 
deference to a widespread practice founded on CP Windows-1252, I soon suspected 
there would have been a way to get the apostrophe into this code page. Here I 
need to recall that I always liked Windows-1252 for completing the ISO 
8859-1 charset (which was so useless* it had to be replaced with ISO 8859-15).
* Please read this paper (in French):
http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf

Now that I have closely examined CP1252ʼs layout, I found five empty code 
points, five code points left unassigned, in the C1 ranges that Microsoft 
allocated to complete ISO 8859-1. Further, in this range, I found two MODIFIER 
LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 
0x98, U+02DC). Obviously these two were added to distinguish the extensively 
used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on one side from the 
diacritics on the other side. It should be said that when Windows was first 
released, the left and right single quotes were the only printable characters 
in these two ranges. All other characters plus × and ÷ came later. However, 
CP1252 has remained stable since Windows 98, for which € and the žŽ pair were 
added. And five places were left empty.

From this I became convinced that it would have been very easy to place the 
letter apostrophe for example at code point 144 (0x90), next to the single 
turned comma quotation mark 0x91 and the single comma quotation mark 
(right-single-quote) 0x92, which Microsoft recommended for use as apostrophe.
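
The count of empty positions described above can be checked mechanically. The 
following is only a minimal sketch in Python, assuming that Python's built-in 
cp1252 codec, which leaves these bytes undefined, is an acceptable stand-in for 
the code page; Windows itself and some other decoders may map them to C1 
controls instead. The function name cp1252_c1_report is made up for this 
illustration.

```python
import unicodedata


def cp1252_c1_report():
    """Probe bytes 0x80..0x9F with Python's cp1252 codec.

    Returns (undefined, mapping): the bytes the codec refuses to decode,
    and a dict from each decodable byte to its Unicode character.
    """
    undefined = []
    mapping = {}
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # The codec has no character assigned at this position.
            undefined.append(byte)
        else:
            mapping[byte] = char
    return undefined, mapping


undefined, mapping = cp1252_c1_report()
print("unassigned:", [f"0x{b:02X}" for b in undefined])
# Show the positions discussed in the message above.
for byte in (0x88, 0x91, 0x92, 0x98):
    char = mapping[byte]
    print(f"0x{byte:02X} -> U+{ord(char):04X} {unicodedata.name(char)}")
```

In this codec, 0x90 is one of the positions left without a character.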

About the “confusion” everybody refers to, it must be said that the only way to 
get people confused is to do things and not explain anything to anybody. 

The core problem would have been that code pages were designed with glyph-based 
*character* encoding in mind, not semantics-based *text* encoding. 

I repeat that others had done even worse. Others, that is some of the so-called 
expert members of the ISO WG designing 8859-1, as two of them did not even aim 
at encoding all needed characters, deliberately refusing to encode the 
lowercase and uppercase Œ digraph, and even the uppercase Ÿ. Microsoftʼs big 
merit has been to produce a ready remedy to this bungling, which, as far as the 
OE digraph is concerned, was meant to match defective peripherals.

Unfortunately, Microsoft visibly didnʼt finish this job, by aiming at encoding 
characters only, and thus not allocating more than one code point to that 
squiggle, while several places were left free.

Well, all those are errors of the past. If I donʼt see a need, I wonʼt meet it. 
By leaving œ and Œ out of the charset, they got × and ÷ in, at least. Where 
things really went wrong was when Unicode was in, and the Procrustean beds of 
code pages were out. At least, they should have been. Whence this survival of 
CP1252-based confusion?

Briefly, todayʼs text processing is suffering from the apostrophe-close-quote 
confusion. This confusion is firstly out of date, and secondly it was 
unnecessary from the beginning. Avoiding this confusion at a trivial level 
(by sparing users the choice between two similar squiggles) shifts it to the 
processing level, where the damage it causes is far bigger. Trust 
me, users who find themselves unable to set apart the apostrophes when theyʼre 
going to replace single quotes wonʼt bless Microsoft for the input simplicity! 
Ted Clancyʼs blog post is there to prove it.
https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/
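
To illustrate what the confusion costs at the processing level, here is a small 
sketch in Python. It relies only on the standard unicodedata and re modules; 
the helper names describe and words are hypothetical. It prints the General 
Category of the three candidate characters and shows how a plain \w+ word match 
treats a contraction written with each of them.

```python
import re
import unicodedata

# The three characters under discussion in this thread.
CANDIDATES = {"U+0027": "\u0027", "U+02BC": "\u02bc", "U+2019": "\u2019"}


def describe(char):
    """Return (General Category, character name) from the Unicode database."""
    return unicodedata.category(char), unicodedata.name(char)


def words(text):
    """Split text into runs of word characters, as a naive tokenizer would."""
    return re.findall(r"\w+", text)


for label, char in CANDIDATES.items():
    category, name = describe(char)
    print(label, category, name, words(f"don{char}t"))
```

Only U+02BC is classified as a letter, so only with it does the contraction 
stay in one piece; with the two punctuation characters this simple tokenizer 
cuts the word in two.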


It was time to get rid of that confusion when Unicode recommended U+02BC for 
apostrophe. Microsoftʼs choice not to comply was wrong again. Very wrong.

 

Let's come back to some of your replies.


 

On Mon, Jun 15, 2015, 20:14, Doug Ewell  

RE: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Doug Ewell
Marcel Schneider charupdate at orange dot fr wrote:

 That's to despise people, that's to spit at their face.

You know what? If you want to use U+02BC as an English apostrophe, go
ahead and use it. Nobody's stopping you really. Not Unicode, not
Microsoft, not ISO.

I do wish we could put an end to all the accusations of malfeasance.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Marcel Schneider
On Mon, Jun 15, 2015, Doug Ewell  wrote:

 Marcel Schneider wrote:

 A free tool, the Microsoft Keyboard Layout Creator, allows every user
 to add U+02BC on his preferred keyboard layout

 I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100%
 compatible with the AltGr-less US keyboard and supports almost 900 other
 characters, including all of the apostrophes and quotes and dashes and
 other characters under discussion:
 
 http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
 
 I spent years designing and updating my own keyboard layout and studying
 other layouts. I've ended this quest since I started using Moby Latin;
 it's the best I've seen in numerous ways.

Late yesterday evening, I looked up John Cowanʼs keyboard layouts. They 
are the best MSKLC-based keyboard layouts Iʼve ever seen. They are mnemonic. I 
note that they naturally use AltGr (right-hand Alt or Alt+Ctrl). In my last 
reply yesterday I mentioned a multilingual layout from a research institute 
which really does not use more than two shift states. Itʼs not free.

Mr Cowan writes about some allocations being temporary until a new MSKLC 
version supporting chained dead keys is released. This MSKLC 2.0 is still not 
born and I fear it never will be. IMO this is the result of the lack of 
interest of many people. You and others probably represent exceptions. This 
goes so far that MSKLC is declared “appears very rarely” in the Acronym Finder. 
Normally the release and update of MSKLC should have created a buzz on social 
media, and today nobody would complain about missing characters. Well, I too 
complained for a whole year without knowing about MSKLC. 


One year ago today, I installed my copy of MSKLC. Later I tried to define 
a universal Latin layout too, but when I was at 1,921 Unicode characters, I 
could never memorise it. I gave up on this approach; itʼs hard to get on one 
keyboard, among other Unicode characters, all 1,736 characters of 8.0.0 used in 
the Latin script (if my subset is right). Do you know Ilya Zakharevichʼs 
approach?
http://search.cpan.org/~ilyaz/UI-KeyboardLayout-0.64/lib/UI/KeyboardLayout.pm



Best regards,
Marcel Schneider


Re: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Mark Davis ☕️
And, Marcel, while you are at it, this is getting tiresome.

Please find some other place to vent about events you know very little
about; the internet is full of them.

Mark


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Tue, Jun 16, 2015 at 7:33 PM, Doug Ewell d...@ewellic.org wrote:

 Marcel Schneider charupdate at orange dot fr wrote:

  That's to despise people, that's to spit at their face.

 You know what? If you want to use U+02BC as an English apostrophe, go
 ahead and use it. Nobody's stopping you really. Not Unicode, not
 Microsoft, not ISO.

 I do wish we could put an end to all the accusations of malfeasance.

 --
 Doug Ewell | http://ewellic.org | Thornton, CO 





Re: Another take on the English apostrophe in Unicode

2015-06-16 Thread Marcel Schneider
On Sat, Jun 13, 2015, Mark Davis  wrote:

 In particular, I see no need to change our recommendation on the character 
 used in contractions for English and many other languages (U+2019). Similarly, 
 we wouldn't recommend use of anything but the colon for marking abbreviations 
 in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious.

 (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the 
 confusion.)

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️  wrote:

 On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider  wrote:

 When we take the topic down again from linguistics to the core mission of 
 Unicode, that is character encoding and text processing standardisation, 
 ellipsis and Swedish abbreviation colon differ from the single closing 
 quotation mark in this, that they are not to be processed.

 [...]


 Quite nice of you to inform me of the core mission of Unicode—I must have 
 somehow missed that.


I was rather astonished and amused when I read that I could have aimed at 
informing you of Unicodeʼs core mission. The goal was to check that Iʼm at the 
right level. Well, there would have been another way to say it... which didnʼt 
come to my mind.

However, what surprises me even more as I think about it is that, while knowing 
everything about Unicode, youʼve got just a weak opinion on which apostrophe 
recommendation is the right one...

 More seriously, it is not all so black and white. As we developed Unicode, we 
 considered whether to separate characters by function, eg, an END OF SENTENCE 
 PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or 
 DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed 
 the benefits.

Itʼs another proof of Unicodeʼs professionalism to have thought about 
distinguishing DIAERESIS and UMLAUT. Despite being a French-German bilingual 
and knowing the diacritics, I first encountered that in Microsoftʼs kbd.h, 
where the one is called DIARESIS and is mapped to UMLAUT. Iʼm not a friend of 
such distinctions (except in vocabulary and grammar), because in writing 
practice they would be nothing but useless and counterproductive complications. 
An abbreviation dot would have been much more useful, but to deploy its 
benefits, it would have needed a supplemental key mapping. Against this 
background, Unicodeʼs choice of recommending to disambiguate the apostrophe is 
even more meritorious. I see it as proof that there is really a good reason for 
people to mind the difference whenever they donʼt use the ASCII apostrophe for 
all of them. What would have bugged Microsoft then was that it would have had 
to implement this difference in its word processing and desktop publishing 
software, and to tell users about it. Nothing easier for Microsoft with all the 
Help and Info! “The new smart quotes help you to check whether you need an 
apostrophe or a quote. This makes quotes conversion easy.” Or the like.

 In practice, whenever characters are essentially identical—and by that I mean 
 that the overlap between the acceptable glyphs for each character is very 
 high—people will inevitably mix up the characters on entry. So any processing 
 that depends on that distinction is forced to correct the data anyway. And 
 separating them causes even simple things like searching for a character on a 
 page to get screwed up without having equivalence classes.

Based on the Unicode principle to encode characters, not glyphs, I doubt 
whether two characters may be called _essentially_ identical when they look the 
same. A huge subset of the Code Chartsʼ cross-references is there to help font 
designers on this point. As for people mixing up characters, they are most 
likely to do so when the keyboard offers only one of the two. This is not the 
case for U+02BC and U+2019, neither of which is on standard keyboards. Here 
itʼs the smart quotes algorithm which will mix them up! And this one is easily 
helped not to do so, since itʼs embedded in high-end software with all its 
display and shortcut capabilities. Eventually, the only one who wanted to keep 
mixing them up was (guess who?) Microsoft.

The reason? Word processing that depends on the distinction between opening and 
closing quotation marks, which needs a very tiny algorithm, is much easier to 
implement than processing that depends on the distinction between apostrophe 
and single closing quotation mark, and between apostrophe and single quotation 
marks on the whole. Informal English word forms are so rich and varied that 
some are ambiguous and scarcely any software dictionary can contain them all. 
But even formal English is not wholly supported since nested quotes often are 
not. Why would users not be interested in improved software, even if it would 
cost a little more?
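
How tiny the open/close algorithm can be, and where it breaks down, may be seen 
from the following sketch in Python. The rule used here is an assumption made 
for illustration only, not the algorithm of any actual word processor, and the 
function name naive_smart_single_quotes is hypothetical.

```python
OPEN, CLOSE = "\u2018", "\u2019"


def naive_smart_single_quotes(text):
    """Replace each ASCII ' by an opening or closing curly quote.

    Illustrative rule (an assumption): a quote at the start of the text,
    after whitespace or after an opening bracket opens; any other closes.
    The rule knows nothing about apostrophes.
    """
    out = []
    for i, ch in enumerate(text):
        if ch != "'":
            out.append(ch)
        elif i == 0 or text[i - 1].isspace() or text[i - 1] in "([{":
            out.append(OPEN)
        else:
            out.append(CLOSE)
    return "".join(out)


for sample in ("'hello'", "don't", "'tis the season"):
    print(sample, "->", naive_smart_single_quotes(sample))
```

The quoted word and the medial apostrophe come out as expected, but the 
word-initial apostrophe of 'tis is turned into an opening quote. Getting that 
case right needs knowledge of the word, which is the harder problem described 
above.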

About searching and equivalence classes: There is already plenty of equivalence 
implemented in the simplest search algorithm: casing! One more class with 
(U+0027, U+02BC, U+2019) wouldnʼt change that a lot.
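
A sketch in Python of such an equivalence class, folded together with case. The 
names APOSTROPHE_CLASS, fold and find_all are hypothetical, and real search 
code would more likely rely on a collation library than on this hand-made 
folding.

```python
# The three characters treated as one class for searching purposes.
APOSTROPHE_CLASS = "\u0027\u02bc\u2019"
_FOLD_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_CLASS})


def fold(text):
    """Map every apostrophe variant to U+0027, then fold case."""
    return text.translate(_FOLD_TABLE).casefold()


def find_all(needle, haystack):
    """Return the offsets (in the folded haystack) where needle matches."""
    n, h = fold(needle), fold(haystack)
    hits = []
    start = h.find(n)
    while start != -1:
        hits.append(start)
        start = h.find(n, start + 1)
    return hits


print(find_all("DON'T", "I don\u02bct and you don\u2019t"))
```

Both spellings of the contraction are found by a query typed with the ASCII 
apostrophe. Note that case folding may change the length of some strings, so 
the offsets refer to the folded text.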

So we only separated essentially identical 

Re: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Marcel Schneider
On Mon, Jun 15, Philippe Verdy  wrote:

 But I think that keyboards should all have a dedicated Kana key to easily map 
 additional groups without sacrificing other shift keys on the last row: 
 keyboards really don't need two Windows keys and so the space bar can remain 
 with a comfortable width [...]. 

IMHO the space bar should not exceed five keys in width.

 If a Kana key is present, in fact it should be to the right of the right 
 Control, or to the right of the right Shift

The best is always that the asymmetric modifiers be operated with the thumbs. 
If I had to choose between AltGr and Kana, I would prefer the latter because it 
does not interfere with Ctrl+Alt and does not disable dead keys in Word. But 
alternatively we could map the MODIFIER LETTER APOSTROPHE on the right-hand Alt 
key for a fluid input of high-quality text files.

 [...] Keyboards on notebooks are extremely poorly designed, a complete 
 nonsense.

Yes, there are many models from big manufacturers whose key arrangement I donʼt 
like. By contrast, my computer is a netbook, where I nevertheless find all the 
keys I need, in an ergonomic arrangement. Iʼm not bound to anyone, and Iʼm not 
paid to advertise. Itʼs just advice. The manufacturer of my netbook shipped the 
same model for the United States *with* an Applications key, *with* a Pause 
key, *with* a second Function modifier key to the right, with up and down keys 
of the *same size* as left and right, and *with* an overlaid numpad: When you 
disable the numpad specials on a customised layout, you just press Fn while 
entering digits (or press the toggle before and after), the same as on 
MacBooks, from what I have read and heard. Itʼs Asus.


Best regards,
Marcel Schneider


Re: Another take on the English Apostrophe in Unicode

2015-06-16 Thread Philippe Verdy
When ISO 8859-1 was designed (in fact in an early version by Digital for
its own version of Unix), allowing a bijective compatibility with 8-bit
EBCDIC and its C1 controls was still a priority.

Microsoft abandoned its own development of Unix to develop DOS and extend
it with Windows in parallel with its work with IBM, which had wanted DOS to be
a very lightweight version of CP/M, but without a scheduler, in order to run
software on personal computers that could be used in small organisations
that could not buy its mainframes, but had to prepare documents and data
that could be reused on IBM mainframes...


2015-06-16 19:02 GMT+02:00 Marcel Schneider charupd...@orange.fr:

 On Mon, Jun 15, 2015, 17:12, Doug Ewell d...@ewellic.org wrote:

  Marcel Schneider wrote:
 [...]
  Microsoft’s choice of mashing up apostrophe and close-quote to end up
  with an unprocessable hybrid was wrong. Very wrong.

  Windows-1252 and the other Windows code pages were developed during the
  1980s, before Unicode, when almost all non-Asian character sets were
  limited to 256 code points. The distinctions between apostrophe and
  right-single-quote, weighed against the confusion caused by encoding two
  identical-looking characters, would never have been sufficient back then
  to justify separate encoding in this limited space.

 I replied:

  The problem is not about code pages [...]

 I thank you for your answers and I'll come back upon some of them below.
 There's some new fact to bring first.

 I concede that my last reply yesterday in the evening was incorrect.

 In addition to Microsoftʼs action in the late nineties urging Unicode to
 give up its useful apostrophe recommendation (U+02BC), the design of code
 page Windows-1252 is indeed within my scope.

 Since I learned there are very good and outweighing reasons to use U+02BC
 in English, and that Unicodeʼs respective recommendation has been withdrawn
 with respect to a widespread practice founded on CP Windows-1252, I soon
 suspected there would have been means to get the apostrophe into this code
 page. Here I need to recall that I always liked Windows-1252 for its
 completing the ISO 8859-1 charset (which was so useless* it had to be
 replaced with ISO 8859-15).
 * Please read this paper (in French):
 http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf

 Now that I have closely examined CP1252ʼs layout, I found five empty code
 points, five code points left out, in the C1 ranges that Microsoft
 allocated to complete ISO 8859-1. Further, in these ranges, I found two
 MODIFIER LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL
 TILDE (152, 0x98, U+02DC). Obviously these two were added to disambiguate
 the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on
 one side, and the diacritics on the other side. It must be said that when
 Windows was first released, the left and right single quotes were the only
 printable characters in these two ranges. All other characters plus × and ÷
 came later. However, CP1252 has remained stable since Windows 98, for which €
 and the žŽ pair were added. And five places were left empty.

 From this I became convinced that it would have been very easy to place
 the letter apostrophe for example at code point 144 (0x90), near the single
 turned comma quotation mark 0x91 and the single comma quotation mark
 (right-single-quote) 0x92 which Microsoft recommended for use as apostrophe.
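
The layout of this range can be checked mechanically. What follows is only a
minimal sketch, assuming that Python's built-in "cp1252" codec reflects the
code page as described above; the helper name cp1252_c1_range is made up for
the illustration. It lists the bytes in 0x80..0x9F that the codec leaves
undefined and shows where the quotes and the two modifier letters sit.

```python
import unicodedata


def cp1252_c1_range():
    """Split bytes 0x80..0x9F into those cp1252 assigns and those it does not."""
    assigned, unassigned = {}, []
    for byte in range(0x80, 0xA0):
        try:
            assigned[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's codec treats this byte as undefined in cp1252.
            unassigned.append(byte)
    return assigned, unassigned


assigned, unassigned = cp1252_c1_range()
print("unassigned:", [hex(b) for b in unassigned])
for byte in (0x88, 0x91, 0x92, 0x98):
    ch = assigned[byte]
    print(hex(byte), f"U+{ord(ch):04X}", unicodedata.name(ch))
```

With this codec, 0x90 is among the undefined bytes, so the slot suggested
above was indeed free in that mapping.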

 About the “confusion” everybody refers to, it must be said that the only
 way to get people confused is to do things without explaining anything to
 anybody.

 The core problem would have been that code pages were designed with
 glyph-based *character* encoding in mind, not semantics-based *text*
 encoding.

 I repeat that others had done even worse. Others, that is, some of the
 so-called expert members of the ISO WG designing 8859-1, since two of them
 did not even aim at encoding all needed characters, deliberately refusing to
 encode the lower- and uppercase Œ digraph, and even the uppercase Ÿ.
 Microsoftʼs big merit has been to produce a ready remedy to this bungling,
 which, as far as the OE digraph is concerned, was meant to match defective
 peripherals.

 Unfortunately, Microsoft visibly didnʼt finish this job, by aiming at
 encoding characters only, and thus not allocating more than one code point
 to that squiggle, whilst several places were left.

 Well, all those are errors of the past. If I donʼt see a need, I wonʼt meet
 it. By leaving œ and Œ out of the charset, they got × and ÷ in, at least.
 Where things really went wrong was when Unicode was in and the code pagesʼ
 Procrustean beds were out. At least, they should have been. Whence the
 survival of the CP1252-based confusion?

 Briefly, todayʼs text processing is suffering from the
 apostrophe-close-quote confusion. This confusion is firstly out of date,
 and secondly it was unnecessary from the very beginning. Avoiding this
 confusion at a trivial level (by not getting users confused 

Re: Another take on the English apostrophe in Unicode

2015-06-16 Thread Richard Wordingham
On Mon, 15 Jun 2015 08:40:57 +0200 (CEST)
Marcel Schneider charupd...@orange.fr wrote:

 ...while in the meantime, in obliging
 anticipation, the worldʼs biggest software company stays inviting us
 to feel free to customise our keyboard with a free tool for free
 download at
 http://www.microsoft.com/en-us/download/details.aspx?id=22339

I don't know if you have the wrong link for MSKLC, but that link
claims it is only 'supported' up to Vista.  That's not much of an
invitation!  I do know that MSKLC works on Windows 7, and its output
there is appropriate for Windows 7, generating multiple versions of
the DLL and its installer.

Richard.



Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ☕️  wrote:

 On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider  wrote:

 When we take the topic down again from linguistics to the core mission of 
 Unicode, that is character encoding and text processing standardisation, 
 ellipsis and Swedish abbreviation colon differ from the single closing 
 quotation mark in this, that they are not to be processed.

 Linguistics, however, delivered the foundation on which Unicode issued its 
 first recommendation on what character to use for apostrophe. The result was 
 neither a matter of opinion, nor of probabilities.

 Actually, the choice is between perpetuating confusion in word processing, 
 and getting people confused for a little while by announcing that U+2019 for 
 apostrophe was a mistake.


 Quite nice of you to inform me of the core mission of Unicode—I must have 
 somehow missed that.

 More seriously, it is not all so black and white. As we developed Unicode, we 
 considered whether to separate characters by function, eg, an END OF SENTENCE 
 PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or 
 DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed 
 the benefits.

 In practice, whenever characters are essentially identical (and by that I mean 
 that the overlap between the acceptable glyphs for each character is very 
 high), people will inevitably mix up the characters on entry. So any processing 
 that depends on that distinction is forced to correct the data anyway. And 
 separating them causes even simple things like searching for a character on a 
 page to get screwed up without having equivalence classes.

 So we only separated essentially identical characters in limited cases: such 
 as letters from different scripts.
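
To make the "equivalence classes" point concrete, here is a minimal sketch of
a search that treats U+0027, U+2019 and U+02BC as one class. The class
membership and the helper names (fold_apostrophes, find_all) are assumptions
made for this illustration, not anything defined by Unicode.

```python
# Characters treated as interchangeable for searching (an assumed class).
APOSTROPHE_CLASS = {"\u0027", "\u2019", "\u02bc"}


def fold_apostrophes(text):
    """Map every member of the class to U+0027; the length is unchanged."""
    return "".join("'" if ch in APOSTROPHE_CLASS else ch for ch in text)


def find_all(haystack, needle):
    """Return the start offsets of needle in haystack, ignoring which
    apostrophe-like character was actually typed."""
    h, n = fold_apostrophes(haystack), fold_apostrophes(needle)
    positions, start = [], h.find(n)
    while start != -1:
        positions.append(start)
        start = h.find(n, start + 1)
    return positions


text = "don\u2019t and don\u02bct and don't"
print(find_all(text, "don't"))
```

Because the folding is one character to one character, the offsets found in
the folded text are valid in the original text as well.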

 

It was a very good idea to disambiguate apostrophe and single quote as well, 
and I feel the price is not too high, because it greatly simplified the 
processing of quotation marks in English, I mean the replacement of each pair 
of one kind by a pair of another kind. When I search for quotes in a text, I 
don't want to be distracted by apostrophes. Don't worry about equivalence 
classes: they already present to us a word without an apostrophe as equivalent 
to the same letters with an apostrophe/quote in between. It's always better 
that the computer knows exactly what a character is, even when at output it 
doesn't need to let us know, than that it comes up with a useless mix-up.


 

You just brought up another good idea too: period-terminated abbreviations are 
listed as exceptions in word processors. Another list could contain all words 
with a leading apostrophe and all words with a trailing apostrophe. This might 
make it possible to filter search results and to separate apostrophes and 
single comma quotation marks for good. And at input, the smart quotes 
algorithms would become even smarter. Say, really smart.
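
A rough sketch of how such exception lists could drive a smart-quotes pass.
Everything here is assumed for the illustration: the two word lists are tiny
made-up samples, the function name smarten is hypothetical, and only
straight U+0027 input is handled.

```python
import re

MLA, LSQ, RSQ = "\u02bc", "\u2018", "\u2019"

# Hypothetical sample entries; a real list would be far longer.
LEADING_WORDS = {"'em", "'tis", "'twas"}
TRAILING_WORDS = {"peoples'", "dogs'"}


def smarten(text):
    """Replace each U+0027 by a letter apostrophe or a curly quote."""

    def fix(match):
        word = match.group(0)
        key = word.lower()
        # An apostrophe between two letters is always a letter apostrophe.
        word = re.sub(r"(?<=\w)'(?=\w)", MLA, word)
        if word.startswith("'"):
            word = (MLA if key in LEADING_WORDS else LSQ) + word[1:]
        if word.endswith("'"):
            word = word[:-1] + (MLA if key in TRAILING_WORDS else RSQ)
        return word

    # A word, optionally with one apostrophe before and/or after it.
    return re.sub(r"'?\w+(?:'\w+)*'?", fix, text)


print(smarten("'tis the peoples' choice, 'quoted' don't"))
```

The sketch cannot resolve genuinely ambiguous cases, such as a listed plural
possessive that also happens to end a quotation; those would still need
heuristics or a human decision.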


 

I don't believe working people would mix up letter apostrophe and close-quote 
if both were on the keyboard. And even now that they aren't, people don't, 
because people just hit the apostrophe key, which without any dumb smart quotes 
algorithm always leads to visually satisfying results, as shown in the Unicode 
documentation. For good desktop publishing, people must work hard anyway, so it 
would be nice to give them the means, and not to overburden them with routine 
tasks due to deficient text encoding.


 

The way things work today is not satisfactory as far as the English 
apostrophe is concerned. I still can't believe that the Unicode Committees were 
wrong when recommending U+02BC. Restoring this advantage today will be to the 
credit of all involved parties, and we and future generations will thank you 
very much. 

If they exist.


 

Best regards,


Marcel Schneider




Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Tue Mar 26 2002 - 10:01:43 EST, Mark Davis ☕️  wrote:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0598.html

 Apostrophe, hyphen, and various other punctuation by default continue
 a word, but this behavior may be overridden on a per-language basis.
 Heuristics or more sophisticated engines may be needed when the
 apostrophe is at the end of a word, as in “the peoples' choice”, since
 it is ambiguous. The modifier letter apostrophe, on the other hand, is
 always treated as a letter.
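
The difference is easy to observe with a naive tokenizer. This is only a
sketch using Python's re module, whose \w class is based on alphanumeric
characters; it is not the Unicode word-boundary algorithm, and the function
name words is invented for the example.

```python
import re


def words(text):
    """Split text into runs of word characters, as a naive tokenizer would."""
    return re.findall(r"\w+", text)


for sample in ("don't", "don\u2019t", "don\u02bct"):
    print([f"U+{ord(c):04X}" for c in sample if not c.isascii() or c == "'"],
          words(sample))
```

U+0027 and U+2019 are punctuation, so the naive tokenizer cuts the word in
two; U+02BC is a letter (general category Lm) and keeps the word whole
without any per-language override.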

 

[I replaced '' '' with '“' '”' to prevent confusion with a tag by the user 
agent.]

 

On Tue Mar 26 2002 - 11:44:28 EST, Marco Cimarosti  wrote:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0604.html


 

 Mark Davis wrote: 
 Apostrophe, hyphen, and various other punctuation by default continue 
 a word, but this behavior may be overridden on a per-language basis. 

 This may work for things such as finding word boundaries, but not for 
 identifiers. 

 According to the ID_Start and ID_Continue properties in 
 , neither 
 U+0027 (APOSTROPHE) nor U+2019 (RIGHT SINGLE QUOTATION MARK) are allowed in 
 an identifier. And this is not surprising, since they are primarily 
 quotation marks. 

 On the other hand, U+02BC (MODIFIER LETTER APOSTROPHE) is allowed in any 
 position within an identifier. Using U+02BC as the apostrophe, would allow 
 to use words such as: ,  or 'em in identifiers. 

 But this hits against the fact that Unicode's own suggestion is to use 
 U+2019 for the apostrophe.
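
This can be tried out in any language whose identifiers follow the Unicode
properties. The sketch below uses Python, which checks XID_Start and
XID_Continue rather than ID_Start and ID_Continue; for the three characters
in question the result is the same. The names CANDIDATES and
identifier_report are made up for the example.

```python
import unicodedata

CANDIDATES = {
    "APOSTROPHE": "\u0027",
    "RIGHT SINGLE QUOTATION MARK": "\u2019",
    "MODIFIER LETTER APOSTROPHE": "\u02bc",
}


def identifier_report():
    """For each candidate, give its general category and whether a word
    containing it in the middle is accepted as an identifier."""
    return {
        name: (unicodedata.category(ch), f"don{ch}t".isidentifier())
        for name, ch in CANDIDATES.items()
    }


for name, (category, ok) in identifier_report().items():
    print(f"{name}: category {category}, usable in identifier: {ok}")

# U+02BC is also accepted in first position.
print("\u02bcem".isidentifier())
```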

 


On Tue Mar 26 2002 - 12:08:41 EST , Marco Cimarosti  wrote:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0608.html



 But, as you say, the apostrophe is legitimate and sometimes mandatory in the 
 orthography of English and many other languages. So, it seems to me that its 
 preferred encoding should make it possible to use it in identifiers, 
 filenames, URI(')s, and so on.

 

 

Don't we fall back into the times of all-0x27 and remain stuck in ongoing 
confusion when the English apostrophe is conflated with the closing quote? 

As you told us, having both U+02BC and U+2019 in use will need some 
supplemental algorithms. But as you said in 2002, this is also true when both 
are merged into a single character.

I suspect that the cost of using MODIFIER LETTER APOSTROPHE for the English 
apostrophe (and for the apostrophe as a whole) today would mainly be the cost 
of updating implementations and text files. 

If this cost is too high, we would have to accept that text is neither to be 
quoted nor to be converted between British and US English. I hope people will 
keep communicating and exchanging.


 

Marcel Schneider

Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread QSJN 4 UKR
By the way, about smart quotes. I have been using that for a long time. My
keyboard layout generates two characters on one key-press (so I have
to enter [«»][←]{sth}[→] instead of [«]{sth}[»]). It's not that good,
but I'm afraid neither of losing quotation marks or parentheses nor of
becoming a victim of artificial intelligence :)
About what is one word. Do you know the German prefixes? ... ...
macht ... ... ... ... ... ... auf.
Let me ask if double-quotes are parts of word or not?  For example, in
this sentence not is a noun, not particle? Was Titanic titanic?



Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Philippe Verdy
2015-06-15 15:20 GMT+02:00 QSJN 4 UKR qsjn4...@gmail.com:

 By the way, about smart quotes. I have been using that for a long time. My
 keyboard layout generates two characters on one key-press (so I have
 to enter [«»][←]{sth}[→] instead of [«]{sth}[»]). It's not that good,


You could generate three keystrokes [«][»][←] from a single keypress to get
the same effect.

Various editors already do that when you press the first key for the
opening quote, and all you have to type then is the [→] key (instead of the
key for a closing quote) after typing the word.

Such a system is used in many IDEs or text editors for programmers when they
enter an opening parenthesis, square bracket, single/double quote, brace,
block comment prefix, or any paired symbol or keyword used in the
programming language (e.g. begin | end in Pascal, #if |\n#endif in C/C++
preprocessor directives: the pipe here marks the position of the cursor
after typing what is just before it; what is after the pipe is inserted
after the cursor position).

If you disagree with those automatic insertions after the cursor, you can
immediately press CTRL+Z to cancel this added suffix but keep what you just
entered. Another CTRL+Z will undo your previous keypress(es) for the
character(s) just before the cursor position. Some editors are even smarter
because the cursor position is not just a single position but a selected
range: as long as you continue typing just before this range, the
selection is preserved, and when you press [→] it will skip over this whole
selection; you can also then press the backspace key to delete that
autoinserted selected range. If you move your cursor elsewhere, the
selection is unselected and you get back to the normal insertion cursor
with an empty selection.

Such a system is used for example in Notepad++ (for Windows), or Eclipse (you
can disable this automatic insertion in your preferences).

This editor feature does not depend on the character layout but depends on
the selected language for matching pairs: it does not have to be limited to
programming languages and can be used as well for natural human languages,
including in advanced word processors. It can also be used to automatically
insert some additional space when you just press an initial quote:
entering only [«] when editing French text, what you would get is
[«][NNBSP]|[NNBSP][»] (with the cursor selection over the last two
characters). These editors normally have a way to edit their automatic
insertion rules: the text to match before the cursor, the text to add just
after it, the new cursor position, and the text to insert just after that
position (hopefully preselected in such a way that when you continue entering
text without moving the insertion position, the selected text is not
overwritten but preserved). Such rules can be part of the parameters for
the spell checker.
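
As an illustration only, such an insertion rule can be modelled in a few
lines. The rule table, its format, and the function name type_char are all
assumptions made for this sketch; real editors store and apply their rules
differently.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE

# typed key -> (text kept before the cursor, text inserted after it and selected)
PAIR_RULES = {
    "\u00ab": ("\u00ab" + NNBSP, NNBSP + "\u00bb"),  # French guillemets
    "(": ("(", ")"),
}


def type_char(buffer, cursor, typed, rules=PAIR_RULES):
    """Insert typed at cursor, applying an auto-pair rule if one exists.

    Returns the new buffer, the new cursor position, and the (start, end)
    range of the auto-inserted suffix that the editor would keep selected.
    """
    before, after = rules.get(typed, (typed, ""))
    new_buffer = buffer[:cursor] + before + after + buffer[cursor:]
    new_cursor = cursor + len(before)
    return new_buffer, new_cursor, (new_cursor, new_cursor + len(after))


buffer, cursor, selection = type_char("", 0, "\u00ab")
print(repr(buffer), cursor, selection)
```

Pressing [→] would then move the cursor to the end of the selected range, and
undo would drop the range, which matches the behaviour described above.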


Re: Another take on the English Apostrophe in Unicode

2015-06-15 Thread Philippe Verdy
2015-06-15 16:49 GMT+02:00 Marcel Schneider charupd...@orange.fr:


 It's indeed very useful to keep two Control modifiers, because the
 modifiers at the left and right border of the block are operated with the
 little finger and should thus be symmetrical. This does not apply to the Alt
 keys and other keys more or less centered around the space bar, which are
 operated with the thumbs. As Alt is less used than Kana (when there is a Kana
 key), Kana should be on left Alt, symmetrical to the (on many keyboards
 already implemented) AltGr key. The Alt key then goes on the Applications
 key, which is mnemonic because of the contextual menu icon. Internally,
 indeed, the Alt keys (left and right) are called Menu keys (Virtual key
 Left Menu or VK_LMENU, and VK_RMENU). The contextual menu is then invoked by
 pressing the right Windows key, which is consistently missing on laptops.

Not just laptops. My desktop PC only has a single Windows key, on the left.
Anyway there's little use for the Windows key, which was introduced late
(and there are still a lot of keyboards that don't have this key). The same
remark applies to the ScrollLock key (which is now frequently remapped to
Fn+Pause/SysAttn or another similar combination using the single Windows key
when there's no Fn key, a key that is typical of notebooks).

However I disagree with your opinion about AltGr+Shift combinations: it
works perfectly including with the ISO 9995 definitions: the unshifted and
shifted position are in the same group.

However ISO 9995 allows CapsLock to be used to create other groups instead
of just reproducing the shifted/unshifted layout. It can be very useful for
users in India to switch between Latin and local abugidas. It could be used
as well by users writing in Arabic and Hebrew abjads, or with
African (Ethiopic) or North-American syllabary scripts that are complex to
map on a usable keyboard.

But I think that keyboards should all have a dedicated Kana key to easily
map additional groups without sacrificing other shift keys on the last row:
keyboards really don't need two Windows keys and so the space bar can
remain with a comfortable width (as well as the Shift key or Backspace,
which is too narrow on many keyboards).
On the last row there should never be more than 7 keys on both sides of
the space bar, and the outermost keys (Ctrl) have to remain wide. If a
Kana key is present, in fact it should be to the right of the right
control, or to the right of the right Shift.

AltGr needs to keep some width extension compared to letter keys, and in
fact could be larger than the left Alt, because it is used for entering
text. The Application key is too large for me, just like the left Windows
key (its extra width should be better given to the left Control key to make
it a bit more central).

Those who design keyboards almost never test them for real usability: they
prefer selling them with many packed multimedia functions (or buttons for
Calc, Mail, Web or switching windows, which are rarely used). Only
keyboards for gamers get some attention, but only to give them additional
programmable function keys for specific games... Keyboards on notebooks are
extremely poorly designed, a complete nonsense.


Re: Another take on the English Apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Fri, Jun 12, 2015, Philippe Verdy  wrote:

 These are application shortcuts, but these modifier keys combinations are 
 used with base function keys (F1...F12), not with keys on the alphanumeric 
 parts of the keyboard. So there's no conflict.

Thank you for your advice. It'll be very useful.
I was not precise enough, the upper row of the alphanumerical block is used 
with Ctrl, Shift+Ctrl, Shift+Alt by the language bar but optionally only.

 It is normal then not to assign CTRL+keys or CONTROL+shift+keys 
 (independently of the capslock state) to non-control characters if the same 
 keys are used to type non-control ASCII characters in range U+0040..U+005F. 
 This means that 32 positions on the keyboard must not be used for any 
 assignment.
 The same remark applies to ALT+digit and ALT+letter (otherwise keyboard 
 shortcuts for application menus or navigation in web forms won't work 
 correctly, or will take priority when you intended to type a valid 
 character, forcing these application functions instead of accepting your 
 character input).
MSKLC performs these safety checks and will issue warnings if you do so.
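
For reference, the 32 reserved positions follow from the usual ASCII
convention that Ctrl clears the two high bits of a character in
U+0040..U+005F. The sketch below only illustrates that arithmetic; the
function name control_code is made up, and actual keyboard drivers list
their control codes explicitly instead of computing them.

```python
def control_code(ch):
    """Return the C0 control code conventionally produced by Ctrl+ch."""
    code = ord(ch)
    if not 0x40 <= code <= 0x5F:
        raise ValueError(f"{ch!r} is outside U+0040..U+005F")
    return code & 0x1F  # same as code - 0x40 in this range


reserved = {chr(c): control_code(chr(c)) for c in range(0x40, 0x60)}
print(len(reserved))
for key in "@A[\\]_":
    print(repr(key), hex(reserved[key]))
```

Ctrl+[ gives ESC (0x1B), Ctrl+\ gives 0x1C and Ctrl+] gives 0x1D, while
Ctrl+@ gives NUL.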

The Alt shift state is unassignable in the MSKLC. When used for shortcuts with 
Clavier+, these are prioritized and work fine.

 This is not just my advice but documented in the ISO standard.

That depends on which ISO Standard you refer to. If it's ISO/IEC 9995, then 
beware! IMHO this standard isn't to be taken seriously, otherwise you'll have 
to stay away from using the Shift + AltGr shift state, to take just one 
outstanding example.

 Assigning characters to positions defined for application shortcuts is a bad 
 idea. Keyboard layouts should map characters in positions that are 
 independent of applications (but layouts may be specific to an OS if the OS 
 interface defines some standard shortcuts: this is a problem when using 
 virtualized OSes, as there's a conflict with shortcuts used to switch from 
 the guest to the host: personally I have chosen the Application key for this 
 instead of the right control, because the Application key is rarely needed, 
 but I frequently type control with the right hand or two hands, notably 
 CTRL+A, CTRL+C, CTRL+X, CTRL+V).

It's indeed very useful to keep two Control modifiers, because the modifiers at 
the left and right border of the block are operated with the little finger and 
should thus be symmetrical. This does not apply to the Alt keys and other keys 
more or less centered around the space bar, which are operated with the thumbs. 
As Alt is less used than Kana (when there is a Kana key), Kana should be on left 
Alt, symmetrical to the (on many keyboards already implemented) AltGr key. The 
Alt key then goes on the Applications key, which is mnemonic because of the 
contextual menu icon. Internally, indeed, the Alt keys (left and right) are 
called Menu keys (Virtual key Left Menu or VK_LMENU, and VK_RMENU). The 
contextual menu is then invoked by pressing the right Windows key, which is 
consistently missing on laptops. Laptops must however have an Applications key 
to prevent the AltGr key from being positioned too far to the right, beside an 
overly long space bar, because this hardware layout has a negative impact on 
ergonomics, specialists say.
On the US keyboard layout at http://charupdate.info however, Applications is a 
Kana toggle, while Right Windows is a Compose key. For laptops this shifts 
rightwards to get Compose on Applications, and Kana toggle on, well, Right 
Control. Because there are laptops with nothing between Right Alt and Right 
Control, I even thought of mapping the Kana toggle on Pause, but this turned 
out to be buggy, besides the fact that keyboards without Applications (Menu) 
often lack the Pause key too.

 On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 
 successive keys of the first row (5([, 6-|, 7è`, 8_\, 9ç^, 0à@, 
 °)]), they are needed to get ASCII controls
 However CONTROL+@ is extremely rarely needed in applications to enter a NULL 
 control that will almost always be filtered out silently; only some editors 
 that allow loading and editing binary files will use it, e.g. Emacs or Vim, 
 which have a binary editing mode that avoids altering the encoding of 
 newlines, displays all controls explicitly, and does not limit the 
 line length. (Personally I prefer not using text editors to edit binary 
 files: this is too unsafe with their insertion working mode, and it is 
 highly preferable and much simpler to use a hexadecimal editor.)
 This means that CONTROL+0à@ may be assigned something else more useful 
 (even if the MSKLC compiler warns about it).
 But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other 
 keys of the first row (², 1, 2é~, 3#, 4'{ on the left side, and 
 +=} on the last position to the right).

I ended up assigning no characters on Control shift states at all any more. To 
get the most of a keyboard, the best is to use the 

Re: Another take on the English Apostrophe in Unicode

2015-06-15 Thread Doug Ewell
Marcel Schneider charupdate at orange dot fr wrote:

 A free tool, the Microsoft Keyboard Layout Creator, allows every user
 to add U+02BC on his preferred keyboard layout

I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100%
compatible with the AltGr-less US keyboard and supports almost 900 other
characters, including all of the apostrophes and quotes and dashes and
other characters under discussion:

http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

I spent years designing and updating my own keyboard layout and studying
other layouts. I've ended this quest since I started using Moby Latin;
it's the best I've seen in numerous ways.

Elsewhere:

 ISO stands for stability

We wish. Several of us on this list have worked on standards and
standard-like activities that correct for, and defend against,
instability in ISO standards.

 Microsoft’s choice of mashing up apostrophe and close-quote to end up
 with an unprocessable hybrid was wrong. Very wrong.

Windows-1252 and the other Windows code pages were developed during the
1980s, before Unicode, when almost all non-Asian character sets were
limited to 256 code points. The distinctions between apostrophe and
right-single-quote, weighed against the confusion caused by encoding two
identical-looking characters, would never have been sufficient back then
to justify separate encoding in this limited space.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Another take on the English Apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Fri, Jun 12, 2015, Philippe Verdy  wrote:

 2015-06-12 17:02 GMT+02:00 Marcel Schneider :

 Would it be possible to have wordprocessing software where one uses 
 CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC 

 CONTROL and CONTROL+SHIFT cannot work on French keyboards where 
 the existing ASCII apostrophe is on the numeric row where there are 
 also ASCII controls mapped matching the ASCII open brace that is itself 
 mapped on ALTGR (or CTRL+ALT) in order to generate instead the C0 control.

 In general it is a bad idea to map any printable character or combining 
 character or dead key with the CTRL or CTRL+SHIFT modifiers associated to 
 any position in the alphanumeric part of the keyboard: this should remain 
 reserved to map function keys or C0/C1 controls only, that local 
 applications will use to assign them application-specific functions.


Even the Language bar uses the upper row to define shortcuts with Control, 
Shift+Control, Shift+Alt to switch between keyboard layouts, which are 
prioritized. So to test the shortcuts with Clavier+, I first had to remove the 
shortcuts in the Language bar. Then the way was free to test Mr Overingtonʼs 
shortcuts for curly apostrophes (I will send the result just after). When I 
deleted the shortcuts in Clavier+ to test your advice, I found no application 
shortcuts for Ctrl+4, while the keys 1, 2, 5 and 0 are usually mapped as Word 
shortcuts with CONTROL, and the heading formatting is with ALT. But indeed 
among ASCII controls I found eight on the French keyboard:

//VirtualKey  |ScanCd |ISO_# |Ctrl
{VK_ESCAPE    /*T01       */ ,0x001b
{VK_CANCEL    /*X46       */ ,0x0003
{VK_BACK      /*T0E E13   */ ,0x007f
{VK_OEM_6     /*T1A D11   */ ,0x001b
{VK_OEM_1     /*T1B D12   */ ,0x001d
{VK_OEM_5     /*T2B C12   */ ,0x001c
{VK_RETURN    /*T1C C13   */ ,'\n'
{VK_OEM_102   /*T56 B00   */ ,0x001c

On the alphanumerical block, there are always the same five, three among them 
near the Enter key. The British-American Apostrophe key is free of Controls 
too. This is probably why Mr Overington wants to use CONTROL and SHIFT+CONTROL 
for U+2019 and U+02BC, as custom application shortcuts. I had once defined a 
universal Latin layout in the MSKLC, but as there is neither Kana nor chained 
dead keys, I allocated some dead keys (among a total of about 25) on CONTROL 
positions where I supposed there wouldnʼt be any shortcuts in any application, 
such as on ù, ^, and even high digits on the upper row. It must be at 
http://dispoclavier.monsite-orange.fr, and somebody was very astonished, 
because precisely this may become buggy. Even more, this is disabled! 
Winwordc.exe did not process these dead keys. Other applications did, as far as 
I remember. But the layout was far too hard to remember, as I filled in the 
letters with double diacritics at the next free positions in the alphabet. This 
way I could allocate 1,921 Unicode characters (by editing the KLC source in 
spreadsheets), but since I know and use the WDK, I wonʼt make such a layout 
again. Now Iʼm trying to fit in even more characters, but with chained dead 
keys, for letters with double diacritics and for easy-to-remember compose 
sequences. For example, you will enter U+01BF LATIN LETTER WYNN by simply 
typing COMPOSE, w, y, n, n, or fewer letters if they are not needed to 
disambiguate. Same for digraphs and ligatures. The test version I use is now 
adapted to type the letter apostrophe U+02BC (Iʼll send some news about it to 
the List afterwards).
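
The "fewer letters if not needed to disambiguate" idea can be sketched as a
shortest-unique-prefix lookup. This is only an illustration under
assumptions: apart from the wynn sequence, the table entries are made up,
the function name resolve is hypothetical, and a real layout would implement
this with chained dead keys in the driver, not with a lookup like this one.

```python
COMPOSE_TABLE = {
    "wynn": "\u01bf",  # sequence given in the message above
    "oe": "\u0153",    # hypothetical entry
    "ou": "\u0223",    # hypothetical entry
}


def resolve(typed, table=COMPOSE_TABLE):
    """Return the composed character as soon as the letters typed after
    COMPOSE identify exactly one sequence; otherwise return None."""
    matches = [seq for seq in table if seq.startswith(typed)]
    if len(matches) == 1:
        return table[matches[0]]
    return None  # still ambiguous, or no such sequence


for keys in ("w", "wynn", "o", "oe", "x"):
    print(repr(keys), repr(resolve(keys)))
```

With this table a single w already selects the wynn, whereas o must be
followed by a second letter. The sketch assumes that no complete sequence is
itself a prefix of another one.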





Best regards,
Marcel Schneider


Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Fri, June 5, William_J_G Overington wrote:

 I replied:

 Would it be possible to have wordprocessing software where one 
 uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC 
 for input 
[...]

 I am wondering whether some existing software packages 
 might be able to be used for the character inputting part using customized 
 keyboard short cuts.

There is a very good shortcut utility for Windows which doesnʼt modify the 
registry except to launch the app automatically:
http://utilfr42.free.fr/util/Clavier.php

Using this software, as I verified, you can define CONTROL APOSTROPHE for 
U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input. After defining the 
shortcut by typing it, you will have to paste the character into the text 
editing field.
You can specify that these shortcuts work only in the word processing software 
you use, if you wish. To achieve this, pick the “target” icon and drag and drop 
it into an open window of the target application; its name will be added in the 
bar and youʼll have to choose that the shortcut be enabled in this software. 

You may even define that the shortcuts work with LEFT CONTROL only, in order to 
keep RIGHT CONTROL for other shortcuts with APOSTROPHE.
As CONTROL SHIFT is not easy enough to type for character input, Iʼd suggest 
defining CONTROL L for U+2019, and adding CONTROL SEMICOLON for U+2018. This is 
because on the square bracket keys, there are already control characters 
allocated on the CONTROL shift state. On these keys you may however choose LEFT 
ALT or RIGHT ALT for a shortcut.

BTW: Clavier+ even allows you to control the pointer and to enter mouse clicks, 
so that a shortcut can execute an action on the graphic interface of the app. 
This is very useful to add app shortcuts in apps that donʼt allow customising.
Itʼs free, and the interface can be switched to English. To download your copy:
http://utilfr42.free.fr/util/Clavier.php

 I have now thought of the alternative for now of being able to test what is 
 in the text by using a special version of an open source font where there are 
 distinctive glyphs one from the other for the two characters.

I discovered that when U+02BC is input by autocorrect in replacement of U+0027, 
and the current font does not contain U+02BC (for example Lucida Console), then 
U+02BC is displayed in the fall-back font (Courier New) and the font-setting is 
*not* altered. This way, you have the MODIFIER LETTER APOSTROPHE displayed in a 
distinctive font at input. This is observed in Microsoft Word Starter, where 
every out-of-font character typed as such triggers the font-setting to 
fall-back, which is very annoying.

Best regards,
Marcel Schneider


Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Wed, Jun 10, 2015, Ted Clancy  wrote:

 The idea that words with apostrophes aren't valid words is a regrettable myth 
 that exists in English, 
 which has repeatedly led to the apostrophe being an afterthought in 
 computing, leading to situations like this one.

[...]

 I imagine spell-checkers of the future could underline a word where I 
 erroneously use a closing quote instead of an apostrophe, or vice versa.

 There are other possible solutions too, but I don't want to get into a 
 discussion about UI design. I'll leave that to UI designers. 


There is however one UI whose design is everybodyʼs business and should 
interest every typist, that is, all of us, since everybody does at least part 
of a typistʼs work. Weʼre all typists, and weʼre all invited to help design that 
UI for ourselves and for our relations, friends, colleagues.

This week-end I switched my current apostrophe from U+2019 to U+02BC by 
updating my (already customised, but still unfinished) French keyboard layout. 
As weʼve already one prominent dead key, Iʼd added two others on Base shift 
state. From now on, I type GRAVE – APOSTROPHE / QUOTATION MARK for a single or 
double opening quote, and get the closing one by using the ACUTE dead key. This 
recalls some legacy practice where spacing accents were used. The typographic 
apostrophe U+02BC is CIRCUMFLEX – APOSTROPHE. (Iʼd U+2019 on the apostrophe key 
when Kana was toggled off!) In addition, Iʼve added an autocorrect for U+0027 
to be replaced with U+02BC when writing text on Microsoft Word Starter.

The idea that we canʼt touch at our keyboard except on keycaps as theyʼre 
labeled, or that we can at most change for another predefined layout which 
often doesnʼt match these labels, is another regrettable widespread myth. As 
users, we confine ourselves in a receptive and waiting position, wishing and 
suggesting, and doing all imaginable and improbable things except adding a 
handful of characters on our keyboard straight before us, while in the 
meantime, in obliging anticipation, the worldʼs biggest software company stays 
inviting us to feel free to customise our keyboard with a free tool for free 
download at 
http://www.microsoft.com/en-us/download/details.aspx?id=22339

If this call were taken seriously, all these discussions about keyboards would 
take another turn. Every corporate manager would make sure that his employees 
use appropriate keyboard layouts to save time and enhance output quality. To 
achieve this, he would not hesitate one minute to put himself in the place of a 
UI designer, turn that poor keyboard UI into a high-performance work tool, and 
deploy the result at corporate level.

The MSKLC is worth spending a day to get started with and to create a completed 
keyboard layout from oneʼs preferred one, because this will save much time and 
anger. You may design one where apostrophe and single quotes are far one from 
another (as on Saturdayʼs kbdenusw), to avoid mistyping and spelling errors 
without having to wait for any better on-screen UI.

However, I wonʼt hide that the MSKLC does not allow chaining dead keys, nor 
does it support Kana shift states, which are useful for a number of languages 
using Latin or other scripts and for emulating a compose functionality. 
But all this, plus a Kana toggle, turns out to be rather simple with additional 
resources to program and compile the driver in C, all free of charge as well, 
namely a DDK or WDK:
https://www.microsoft.com/en-us/download/details.aspx?id=11800

The ‘kbdenukw’ and ‘kbdenusw’ of Saturday, no matter whether they were 
downloaded or not, are now available in their 2.0 version, which differs from 
the previous by including the two missing dashes. The goal of this exercise is 
to prove that at this funny speed, and with such a facility of adding 
characters on the keyboard, there is no more reason to deprive oneself of the 
Unicode non-ASCII characters one needs. You may open the included *.klc 
source—a file format which Microsoft designed for sharing—in the Microsoft 
Keyboard Layout Creator and in a text editor. For more information, please see 
my related previous mail. (The AltGr views of the US version show the dead key 
content.)

kbdenukw: http://bit.ly/1dFMFb1
kbdenusw: http://bit.ly/1IWO8aJ


Best regards,
Marcel Schneider


Re: Another take on the English Apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
 
 


 

 Message of 15/06/15 17:21
 From: Doug Ewell 
 To: Unicode Mailing List 
 Cc: 
 Subject: Re: Another take on the English Apostrophe in Unicode
 
 Marcel Schneider wrote:
 
  A free tool, the Microsoft Keyboard Layout Creator, allows every user
  to add U+02BC on his preferred keyboard layout
 
 I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100%
 compatible with the AltGr-less US keyboard and supports almost 900 other
 characters, including all of the apostrophes and quotes and dashes and
 other characters under discussion:
 
 http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
 
 I spent years designing and updating my own keyboard layout and studying
 other layouts. I've ended this quest since I started using Moby Latin;
 it's the best I've seen in numerous ways.
 
 Elsewhere:
 
  ISO stands for stability
 
 We wish. Several of us on this list have worked on standards and
 standard-like activities that correct for, and defend against,
 instability in ISO standards.
 
  Microsoft’s choice of mashing up apostrophe and close-quote to end up
  with an unprocessable hybrid was wrong. Very wrong.
 
 Windows-1252 and the other Windows code pages were developed during the
 1980s, before Unicode, when almost all non-Asian character sets were
 limited to 256 code points. The distinctions between apostrophe and
 right-single-quote, weighed against the confusion caused by encoding two
 identical-looking characters, would never have been sufficient back then
 to justify separate encoding in this limited space.
 
 --
 Doug Ewell | http://ewellic.org | Thornton, CO 
 
 



Re: Another take on the English Apostrophe in Unicode

2015-06-15 Thread Philippe Verdy
2015-06-15 8:23 GMT+02:00 Marcel Schneider charupd...@orange.fr:

 On Fri, Jun 12, 2015, Philippe Verdy verd...@wanadoo.fr wrote:
 Even the Language bar uses the upper row to define shortcuts with Control,
 Shift+Control, Shift+Alt to switch between keyboard layouts, which are
 prioritized.

These are application shortcuts, but these modifier keys combinations are
used with base function keys (F1...F12), not with keys on the alphanumeric
parts of the keyboard. So there's no conflict.

It is therefore normal not to assign CTRL+key or CONTROL+SHIFT+key
(independently of the Caps Lock state) to non-control characters if the
same keys are used to type non-control ASCII characters in the range
U+0040..U+005F. This means that 32 positions on the keyboard must not be
used for any assignment.
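
As a rough illustration of why these 32 positions are reserved: on ASCII-derived systems, CONTROL combined with a character in U+0040..U+005F conventionally yields the control code obtained by keeping only the low five bits. The following is a minimal Python sketch of that convention, not the keyboard driver's actual logic; the function name `control_code` is invented for this example.

```python
def control_code(ch: str) -> int:
    """Return the C0 control code conventionally produced by CONTROL + ch.

    Assumption: the classic ASCII mapping, valid only for U+0040..U+005F.
    """
    code = ord(ch)
    if not 0x40 <= code <= 0x5F:
        raise ValueError(f"{ch!r} is outside U+0040..U+005F")
    return code & 0x1F  # keep the low five bits


# The 32 reserved positions: '@', 'A'..'Z', '[', '\', ']', '^', '_'
reserved = [chr(c) for c in range(0x40, 0x60)]

if __name__ == "__main__":
    print(len(reserved))
    for key in "@[\\]":
        print(repr(key), hex(control_code(key)))
```

Under this convention the bracket and backslash positions give 0x1B, 0x1C and 0x1D, the same values that appear in the Ctrl column of the driver listing quoted further down in this message.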

The same remark applies to ALT+digit and ALT+letter (otherwise keyboard
shortcut for application menus or navigation in web forms won't work
correctly, or will take the priority when you intended to type a valid
character, forcing these application functions instead of accepting your
character input).

MSKLC performs these safety checks and will issue warnings if you do so.

This is not just my advice but is documented in the ISO standard.

 So to test the shortcuts with Clavier+, I must first remove shortcuts in
 the Language bar. Then the way was free to test Mr Overingtonʼs shortcuts
 for curly apostrophes (I will send the result just after). When I deleted
 the shortcuts in Clavier+ to test your advice, I found no application
 shortcuts for Ctrl+4 while the keys 1, 2, 5 and 0 are usually mapped as
 Word shortcut with CONTROL, while the heading formatting is with ALT. But
 indeed among ASCII controls I found eight on the French keyboard:


 //VirtualKey |ScanCd |ISO_# |Ctrl
 {VK_ESCAPE /*T01 */ ,0x001b
 {VK_CANCEL /*X46 */ ,0x0003
 {VK_BACK /*T0E E13*/ ,0x007f
 {VK_OEM_6 /*T1A D11*/ ,0x001b
 {VK_OEM_1 /*T1B D12*/ ,0x001d
 {VK_OEM_5 /*T2B C12*/ ,0x001c
 {VK_RETURN /*T1C C13*/ ,'\n'
 {VK_OEM_102 /*T56 B00*/ ,0x001c

 On the alphanumerical block, there are always the same five, three among
 them near the Enter key. The British-American Apostrophe key is exempt of
 Controls too. This is probably why Mr Overington wants to use CONTROL and
 SHIFT+CONTROL for U+2019 and U+02BC, as custom applications shortcuts.

Assigning characters to positions defined for application shortcuts is a
bad idea. Keyboard layouts should map characters in positions that are
independent of applications (but layouts may be specific to an OS if the OS
interface defines some standard shortcuts: this is a problem when using
virtualized OSes, as there's a conflict with shortcuts used to switch from
the guest to the host: personally I have chosen the Application key for
this instead of the right control, because the Application key is rarely
needed, but I frequently type control with the right hand or two hands,
notably CTRL+A, CTRL+C, CTRL+X, CTRL+V).

On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7
successive keys of the first row (5([, 6-|, 7è`, 8_\, 9ç^, 0à@,
°)]), as they are needed to get ASCII controls.

However, CONTROL+@ is extremely rarely needed in applications to enter a
NULL control, which will almost always be filtered out silently. Only some
editors that allow loading and editing binary files will use it, e.g. Emacs
or Vim, which have a binary editing mode that avoids altering the encoding
of newlines, displays all controls explicitly, and does not limit
the line length. (Personally I prefer not to use text editors to edit
binary files: this is too unsafe with their insertion working mode, and
it is highly preferable and much simpler to use a hexadecimal editor.)
This means that CONTROL+0à@ may be assigned something else more useful
(even if the MSKLC compiler warns about it).

But you can assign characters with CONTROL and CONTROL+SHIFT for the 6
other keys of the first row (², 1, 2é~, 3#, 4'{ on the left
side, and +=} on the last position to the right).

This means that CONTROL+4 can be safely assigned to U+02BC for the
apostrophe letter, but the most common encoding of the French apostrophe is
U+2019 (the closing single quote), as French normally does not use single
quotation marks; if it does, the closing quote cannot be followed by a
letter and so cannot be confused with a French apostrophe, which is always
followed by a letter (or the number 1).



For now I've not seen any specific need of U+02BC in French (U+2019 is
enough, even if it represents two distinct things in French, but in
distinct non-colliding contexts).

But of course U+02BC is needed for English, which needs the distinction from
single quotes, because English apostrophes are used more permissively,
including at the end of words just before a space, punctuation or end of line.

In French it is not valid to use the apostrophe for elisions at the end of
words; you need to use some abbreviation mark or style instead, or no mark
at all.



The French abbreviation mark can 

Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Mark Davis ☕️
On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider charupd...@orange.fr
wrote:

 When we take the topic down again from linguistics to the core mission of
 Unicode, that is character encoding and text processing standardisation,
 ellipsis and Swedish abbreviation colon differ from the single closing
 quotation mark in this, that they are not to be processed.



 Linguistics, however, delivered the foundation on which Unicode issued its
 first recommendation on what character to use for apostrophe. The result
 was neither a matter of opinion, nor of probabilities.



 Actually, the choice is between perpetuating confusion in word processing,
 and get people confused for a little time when announcing that U+2019 for
 apostrophe was a mistake.


​Quite nice of you to inform me of the core mission of Unicode—I must have
somehow missed that.


More seriously, it is not all so black and white. As we developed Unicode,
we considered whether to separate characters by function, eg, an END OF
SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING
PERIOD, etc. Or DIAERESIS vs UMLAUT. We quickly concluded that the costs
far, far outweighed the benefits.

In practice, whenever characters are essentially identical—and by that I
mean that the overlap between the acceptable glyphs for each character is
very high—people will inevitably mix up the characters on entry. So any
processing that depends on that distinction is forced to correct the data
anyway. And separating them causes even simple things like searching for a
character on a page to get screwed up without having equivalence classes.
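
To make the point about equivalence classes concrete, here is a minimal Python sketch of a search that treats the three apostrophe-like characters as one class. It is an illustration only: the set `APOSTROPHE_CLASS` and the helper names are assumptions of this example, not part of any Unicode-defined folding.

```python
# Assumed equivalence class: typewriter apostrophe, right single quote,
# modifier letter apostrophe.
APOSTROPHE_CLASS = {"\u0027", "\u2019", "\u02BC"}


def fold(text: str) -> str:
    """Map every member of the class to U+0027 (one code point for one)."""
    return "".join("\u0027" if ch in APOSTROPHE_CLASS else ch for ch in text)


def find_all(needle: str, haystack: str) -> list[int]:
    """Return start offsets of needle in haystack, ignoring which
    apostrophe-like character was actually typed."""
    n, h = fold(needle), fold(haystack)
    positions = []
    start = h.find(n)
    while start != -1:
        positions.append(start)
        start = h.find(n, start + 1)
    return positions


if __name__ == "__main__":
    page = "I don\u2019t know why they don\u02BCt match."
    print(find_all("don't", page))
```

Because the folding replaces one code point with one code point, the offsets found in the folded text are also valid in the original text.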

So we only separated essentially identical characters in limited cases:
such as letters from different scripts.

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
On Thu, Jun 11, 2015, Philippe Verdy  wrote:

 The ASCII punctuations have been overridden for a lot of different roles. 
 There's simply no way to map them to a category that matches their semantic 
 role. So the ASCII hyphen and apostrophe-quote can only be given a very weak 
 category that just exhibits their visual role. Pd (dash) is then appropriate 
 for the ASCII hyphen-minus. You can't really tell from the character alone if 
 it is a punctuation or a minus sign.

 If it is a minus sign you can reencode it better using the more specific 
 mathematical minus sign. Otherwise, even if it is not a minus sign, it can be:
 - a connector between words in compound words (hyphen)
 - a trailing mark at end of lines for indicating a word has been broken in 
 the middle (but remember that I asked previously for another character for 
 that role because this word-breaking hyphen is not necessarily a horizontal 
 hyphen: in dictionaries I've seen small slanted tildes, or slanted small 
 equal signs, to make the distinction with true hyphens used in compound 
 words, also because sometimes these breaks are not necessarily between two 
 syllables in pocket books with very narrow columns and minimized spacing)
 - a bullet leading items in a vertical list (this should be an en dash, 
 followed by some spacing)
 - a punctuation (not necessarily at beginning of line) marking the change of 
 person speaking (very common in literature, notably in theatre).

 As a connector between words, there's a demonstrated need of differentiating 
 regular hyphens, longer hyphens (preferably surrounded by thin spaces) for 
 noting intervals (we can use the EN DASH for that), long hyphens between two 
 separate names that are joined (for example in proper names, after marriage; 
 there's an example in France, where INSEE encodes it for now using TWO 
 successive hyphens, which are also used in French identity cards, passports, 
 social security green cards...).



In most fonts, the glyph of the hyphen-minus U+002D is the same as the one of 
the hyphen U+2010, while the minus sign U+2212 is longer and higher, at 
half-height of digits, to match between or before, as opposed to the hyphen and 
hyphen-minus which are positioned at half height of lowercase letters. As a 
minus sign, these work well only with Elzevir digits. This is why, in most 
fonts, the hyphen-minus U+002D is very unpleasant when used as a minus sign, 
especially when the plus sign, equals sign and other operators are present too.

In this, the hyphen differs from the apostrophe U+0027, whose differentiated 
characters (apostrophe U+02BC and single close-quote U+2019) have exactly the 
same glyph. But hyphen and apostrophe are alike in that in many fonts, only 
one of the paired characters is present, while the other is missing. 
So even in Arial, where the letter apostrophe U+02BC is present, the hyphen 
U+2010 is missing. The user is supposed to use U+002D as a hyphen and U+2212 as 
the minus sign. The system hyphen displayed at automatic word breaks at line 
end is converted to U+002D for PDF. This isnʼt ideal, as you point out, 
because to reverse the word break, one canʼt simply replace all U+002D by 
nothing. Word processors allow removing all instances of (U+002D, EOL), but 
this can delete some orthographic hyphens. The solution would be to use U+2010 
for orthographic hyphens (with compatible fonts) and to let the system place 
its U+002D.
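
The proposal in the paragraph above can be sketched in a few lines of Python. This is only an illustration under the stated assumption that orthographic hyphens are encoded as U+2010 and that only the system-inserted line-break hyphen is U+002D; the function name `rejoin_broken_words` is invented for the example.

```python
import re

# U+002D directly followed by a line break, between two word characters:
# under the assumption above, this can only be a system-inserted hyphen.
_BREAK_HYPHEN = re.compile(r"(?<=\w)\u002D\n(?=\w)")


def rejoin_broken_words(text: str) -> str:
    """Remove line-break hyphens (U+002D + EOL) and leave U+2010 untouched."""
    return _BREAK_HYPHEN.sub("", text)


if __name__ == "__main__":
    extracted = "a customised key-\nboard layout for well\u2010\nknown scripts"
    print(rejoin_broken_words(extracted))
```

With text that uses U+002D for both roles, the same substitution would also glue together compounds that happen to be split at their hyphen, which is the loss described above.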

The letter apostrophe U+02BC is indispensable because the glyph of U+0027 is 
unfit for typography. We are also told that U+0027 is unstable, but this is 
mainly due to the autocorrect smart quotes, which can be turned off at input. I 
use the autocorrect from now on to convert U+0027 to U+02BC.

Another difference between apostrophes and hyphens, and perhaps the main 
difference, is that except if they are used for word break, hyphens generally 
donʼt need to be replaced at further stages. At input, the user will replace 
U+002D with U+2212 where appropriate, and the autocorrect may replace two 
hyphens with an en dash U+2013. In some fonts, U+002D will need to be replaced 
with U+2010 for glyphic reasons. 

By contrast, quotes are to be converted, Ted Clancy points out in his paper. 
Ambiguating one of them with the apostrophe was a very bad idea. 
Well, I still believe it was *not* the idea of any Unicode Committee, nor of 
any Standards Body at all.


Marcel


Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
At the following URL, a forum page illustrates how users have struggled for a 
decade (and more) against the chaotic confusion Microsoft perpetuated in spite 
of Unicode, forcing the Committee to adopt its short views:
http://painintheenglish.com/case/383

Please note Persephoneʼs workaround, which is a way to avoid the Apostrophe 
Catastrophe without turning off the “smart quotes”. This is the smartest thing 
Iʼve ever read about “smart quotes”.

This workaround, which I was unaware of, might explain why Microsoft refused to 
reengineer the smart quotes algorithm: users just have to type two quotes and 
delete one!

However, the problem of *handling* and *processing* such text stays unresolved. 
Users are conscious about a quote not being an apostrophe, this page shows. But 
they are compelled to use close-quotes for simulation of curly apostrophes. 
This works on the spot, but it brings bad quality text files.
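
A small Python sketch may show what "bad quality" means for processing. It relies only on the standard `unicodedata` and `re` modules; the sample word is an illustration, and Python's `\w` is used here as a stand-in for whatever word detection a given application actually applies.

```python
import re
import unicodedata

LETTER_APOSTROPHE = "\u02BC"   # MODIFIER LETTER APOSTROPHE
CLOSE_QUOTE = "\u2019"         # RIGHT SINGLE QUOTATION MARK


def words(text: str) -> list[str]:
    """Split text into runs of word characters as Python's regex sees them."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    for ch in (LETTER_APOSTROPHE, CLOSE_QUOTE):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
    print(words("don" + LETTER_APOSTROPHE + "t"))  # stays one word
    print(words("don" + CLOSE_QUOTE + "t"))        # splits at the punctuation
```

U+02BC has the general category Lm (a letter) and U+2019 has Pf (final punctuation), so a simple word matcher keeps the contraction whole only in the first case.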

Regardless of whether this matches Microsoftʼs business model or not, there is 
no right to dissuade font designers from publishing complete fonts! 
Allocating the same glyph (U+2019) to a supplemental code point (U+02BC) is 
very easy when creating a font, but as Microsoft compelled Unicode to tell 
everybody that there is no need of U+02BC in English and that our text files 
must not contain U+02BC, we lost sixteen years, and thousands of fonts 
(including Arial Unicode MS, which surprisingly is lacking U+02BC!) are nearly 
unusable with correct text files because they donʼt include any typographical 
apostrophe. Except that U+0027 is curly in many ornamental fonts, to meet 
usersʼ expectations.

A ready workaround would thus be to disable the smart quotes and keep U+0027 as 
apostrophe (only), while entering U+2018/U+2019 by any means, and eventually to 
replace all instances of U+0027 by U+02BC. Or by U+2019, but only just before 
printing, never to publish in PDF and even less to send as a file or to publish 
on the internet!
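
The last step of this workaround is a plain substitution. A minimal Python sketch, assuming the smart quotes were off while typing so that every U+0027 in the text really is an apostrophe; the function name is made up for this example.

```python
def convert_typewriter_apostrophes(text: str, target: str = "\u02BC") -> str:
    """Replace every U+0027 with the target apostrophe character.

    Assumption: quotation marks were typed as U+2018/U+2019, so no U+0027
    in the text stands for a quote.
    """
    return text.replace("\u0027", target)


if __name__ == "__main__":
    draft = "\u2018It's the typist's choice,\u2019 she said."
    print(convert_typewriter_apostrophes(draft))              # for text files
    print(convert_typewriter_apostrophes(draft, "\u2019"))    # print-only variant
```

The print-only variant is not reversible, since afterwards apostrophes and closing quotes can no longer be told apart; this matches the advice above to use it only just before printing.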


As usual, the status quo which originated from legacy code pages (which were 
already considerably enriched compared to ISO 8859-1, be said to the honor of 
Microsoft) has been justified a posteriori with a lot of mostly biased 
arguments:

– The approval of U+2019 as apostrophe is based on glyphs and rendering and on 
a static view of text, excluding from scope the further word processing across 
documents and languages.

– Unicodeʼs principles are misapplied and even misinterpreted. The fact that 
different meanings across languages do not need different code points, is 
applied inside a given language to argue that distinction of semantics by 
different code points is not needed. 

– Some arguments are obsoleted since they were uttered, so the U+02BC being a 
“spacing clone of Greek smooth breathing mark” (removed in 5.1) and thus never 
slanted, while in most fonts it has same shape as U+2019, slanted or curly.

– Another fallacy cites as a proof the use of U+2019 as apostrophe in some 
locales, while this is already based on CP1252-inspired practice against the 
spirit of Unicode.

– Blurring the issue by enumerating the various values of the English 
apostrophe, which sometimes leads to including the close-quote function as a 
punctuation apostrophe...

In any case, there is nothing worth saving in the status quo. Unfortunately, 
the mass of wrongly encoded text goes on increasing while discussions follow 
one another. At least, that does not hinder publishing good books and 
newspapers and sending nice mails (on paper, where nobodyʼs asking whatʼs the 
code point, because thereʼs no need). About other media, it must be said that 
hand-processing wrong text files increases the job volume, :( for managers but 
:) for workers, provided they are really paid for it.


Marcel Schneider


Re: Another take on the English apostrophe in Unicode

2015-06-15 Thread Marcel Schneider
 On Sat, Jun 13, 2015, Mark Davis  wrote:

 In particular, I see no need to change our recommendation on the character 
 used in contractions for English and many other languages (U+2019). Similarly, 
 we wouldn't recommend use of anything but the colon for marking abbreviations 
 in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious.

 (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the 
 confusion.)


When we take the topic down again from linguistics to the core mission of 
Unicode, that is character encoding and text processing standardisation, 
ellipsis and Swedish abbreviation colon differ from the single closing 
quotation mark in this, that they are not to be processed.

 

Linguistics, however, delivered the foundation on which Unicode issued its 
first recommendation on what character to use for apostrophe. The result was 
neither a matter of opinion, nor of probabilities.


 

Actually, the choice is between perpetuating confusion in word processing, and 
get people confused for a little time when announcing that U+2019 for 
apostrophe was a mistake.


 

 

Marcel Schneider



 

 

 Message of 13/06/15 17:36
 From: Mark Davis ☕️ 
 To: Peter Constable 
 Cc: verd...@wanadoo.fr , Kalvesmaki, Joel , Unicode Mailing List 
 Subject: Re: Another take on the English apostrophe in Unicode
 




On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable 
wrote:


When it comes to orthography, the notion of what comprise words of a language 
is generally pure convention. That’s because there isn’t any single 
_linguistic_ definition of word that gives the same answer when phonological 
vs. morphological or syntactic criteria are applied. There are book-length 
works on just this topic, such as this:

 





In particular, I see no need to change our recommendation on the character used 
in contractions for English and many other languages (U+2019). Similarly, we 
wouldn't recommend use of anything but the colon for marking abbreviations in 
Swedish, or propose a new MODIFIER LETTER ELLIPSIS for supercali...docious.

 
(IMO, U+02BC was probably just a mistake; the minor benefit is not worth the 
confusion.)



 


Mark

 
— Il meglio è l’inimico del bene —
 









Another take on the English Apostrophe in Unicode

2015-06-13 Thread Marcel Schneider
On Fri, Jun 5, 2015, David Starner 
wrote:

 On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis  wrote:

 I agree that conflating apostrophes and quotes is a source of
 problems, however, existence of the MODIFIER LETTER [same glyph as
 used for English contractions] in Unicode is a coincidence which
 should not have an effect on usage of apostrophes in English.

 Coincidence or not, the Unicode Consortium is not going to allocate a new 
 code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE 
 exists. Any change is pretty unlikely, but changing to an existing character 
 is vastly more likely then creating a new one.


In fact this would be a return to the state until version 2.0.0. 
http://www.unicode.org/Public/2.0-Update/NamesList-1.txt
Since version 3.0.0 (or more precisely, since update 2.1), U+2019 is preferred 
for apostrophe, not U+02BC any longer. 
http://www.unicode.org/Public/3.0-Update/NamesList-3.0.0.txt

Prior to this discovery, I supposed that later ISO prescriptions might have 
triggered it the wrong way, but it now seems impossible that ISO initiated the 
move of the preferred apostrophe from U+02BC to U+2019. This change took place 
no sooner than in update 2.1, whereas the merger was at 1.1 and ISO stands for 
stability. So ISO could never have agreed that the preferred character for the 
English apostrophe should stop being U+02BC and start being U+2019, against the 
Stability Policy, presumably through a gap in this policy, which possibly 
doesn’t cover usage recommendations...

I must do some more research in the Archives to find out more about why the 
apostrophe and the single close quote were ambiguated—a process that needs even 
a new word to put on it, as ordinarily everybody works for disambiguation...

However, the 1999 Mail Archive already shows it was for simplification's sake, 
in word processing software. 

Could anybody tell us more about this issue?


IMHO, the mischievous apostrophe that we use today is due to a shortcut, a 
narrowed design, and incomplete check-ups. Briefly, the disconnect was between 
Unicode, whose global approach led to complete solutions including all you need 
for text handling and word processing, and Microsoft, whose industrial approach 
prioritized the ready make-up of output appearance, leaving the subsequent life 
stages of text out of scope. The Windows code page 1252 
apostrophe-close-quote looks nice on screen and in documents, but as soon 
as you need to convert quotes from British to American or from free to nested, 
the only way to prevent your text from becoming unusable is to hand-process the 
quotes one by one. The money you saved when purchasing the software is lost 
a thousandfold in use. Microsoft’s choice of mashing up apostrophe and 
close-quote to end up with an unprocessable hybrid was wrong. Very wrong.


Marcel Schneider


Re: Another take on the English Apostrophe in Unicode

2015-06-13 Thread Marcel Schneider
On June 3, 2015, Ted Clancy wrote:

 https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/



I wish to thank you personally for having brought up this issue, 

as well as Mr Grosshans for having posted the URL launching this thread.

However, your solution is not complete, and I don’t agree fully with all your 
statements. 

So let’s try to check up what’s the matter, and then look what might be done.

First, the Unicode Technical Committee is *not* very wrong.
A look in the Standard 2.0.0 or, even simpler, a glance at the first NamesList 
in the UCD, that is the source code for the Version 2.0.0 Code Charts, shows 
that originally, the UTC recommended the use of U+02BC MODIFIER LETTER 
APOSTROPHE for the English apostrophe as well as for apostrophe on the whole, 
and to reserve the use of U+2019 RIGHT SINGLE QUOTATION MARK for what it is: 
close-quote. 
It was not until the 2.1 update that the preferred character for 
apostrophe was shifted from U+02BC to U+2019, to conform with the usage (and 
presumably at the demand) of Microsoft, which did not comply with the Standard, 
in spite of being a full member of the Unicode Consortium (and having thus 
agreed at the beginning that apostrophe should be U+02BC). I’m pretty sure that 
when they moved the apostrophe preference from U+02BC to U+2019, the Unicode 
Technical Committee and the Unicode Editorial Committee acted against their 
will. My opinion is inferred from the original UTC position and from comparing 
two versioned NamesList extracts among those displayed at 

charupdate.info#ambiguation

Second, your solution is *not* complete. Even if word-processors managed nested 
quotes, one single key for all occurring quotation marks of a given locale, as 
British English or US English, would scarcely be sufficient. Here’s why.
Everybody knows that quotes are used not only to quote, but also to delimit, to 
warn or generally to flag otherwise than as a quotation. The latter occurs 
commonly when the writer (and by transposition, the speaker, making a quotes 
gesture) wants to flag a word or an expression as being controversial, not 
true, not in his belief, or ironical. From this they are sometimes called 
“irony quotes”. Languages that use angle quotation marks (chevrons) to quote, 
use comma quotation marks to flag. In English, I suppose that you need to use 
the “other” quotation marks to flag. So in US English you would flag using 
single quotes, while in British English you would use double quotes, the like 
as in French. However I don’t know how that works in quotations (while in 
languages as French and German this is no problem).
Therefore, the user should always have means to type exactly the quotes he 
wishes to type. This will result in the need of at least one dead key or some 
supplemental dead list entries, and/or supplemental AltGr positions, or even 
supplemental shift states (Kana). Never one single key position can do all the 
job.

Third (but this is an off-topic discussion in this thread and is set aside in 
your blog post), the close-quote as an apostrophe is not good for French 
either, regardless of how many words are around. The use of U+2019 as 
apostrophe hasn’t led to any “Apostrophe Catastrophe” in French only because 
in French, few people use single comma quotes (in rare cases or for special 
purposes), and because properly leading apostrophes are often placed otherwise, 
as in “Y’a” for “Il y a”, instead of “’Y a”.

What shall we do? As you draw it, the so-called smart quotes algorithm must be 
reengineered and cannot stay working as it does, so users must be informed that 
to type “unexpected” quotes, they’ve to hit the key two times, or to type 
another character just after.
But users must also make an effort by themselves instead of wishing to stay 
with the inherited keyboard layout regardless of what changes are on-going, and 
at the same time, to get more Unicode characters as reasonably supportable on 
this old keyboard. In other words, the gap between the expected rendering and 
the actually conceded input must be filled up whether by using a set of 
customised (or perhaps one day, standardised) autocorrect entries (see one 
suggestion at charupdate.info#curly) or by typing appropriate characters on 
extended keyboard layouts (which don’t lead to change for another hardware, 
except for special purposes).


Thanks again, because without this discussion, I would have released more 
keyboard layouts with the wrong apostrophe!


Marcel Schneider


Re: Another take on the English Apostrophe in Unicode

2015-06-13 Thread Marcel Schneider
On Sun, Jul 18, 1999, Markus Kuhn  wrote:

 http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0557.html

 In addition, I feel that the current ISO 8859 oriented national keyboard
 standards are not adequate for modern Unicode-era word processing
 practices, as they put obsolete typewriter characters such as U+0027 on
 too prominent keys, while they have no key positions for the extremely
 frequently needed typesetting characters that are for instance supported
 by CP1252 (directional single and double quotes, en and em dashes,
 etc.). Software either has to use shaky algorithms to make educated
 guesses on which character the user might have meant (such as Word tries
 to do), or sequences of ASCII characters are interpreted with new
 semantics (such as both TeX and Word do), in order to give typists some
 compromise access to these characters.

 I think it is urgent time to revise national keyboard standards here. We
 really need standardized ways to easily enter say at least

 2018 LEFT SINGLE QUOTATION MARK
 2019 RIGHT SINGLE QUOTATION MARK
 201C LEFT DOUBLE QUOTATION MARK
 201D RIGHT DOUBLE QUOTATION MARK
 2013 EN DASH
 2014 EM DASH

 on keyboards for English language users, and corresponding extensions on
 other national keyboard standards. This might be a good opportunity to
 introduce on US keyboards the Level 2 Select key (AltGr), while on
 European keyboards it is probably sufficient to just add appropriate
 labels to a number of new Level 2 Select positions.

 


On Sun, Jul 18, 1999, Mark Davis  wrote:

 http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html

 However, I agree that having the curly quotes (single and double) on the
 standard keyboard would be handy. I switch back and forth between a Mac and
 Windows. On the Mac, the option key (a second level shift) has always made
 this easy. The installable Windows international keyboard is not nearly so
 useful, since you can't just leave it on all the time (it messes up your 
 use of quotation marks).



On Thu, Jun 4, 2015 at 2:38 PM, Markus Scherer  wrote:

 How are normal users supposed to 
 find both U+2019 and U+02BC on their keyboards, 

 

Yes, this may be the main issue: how to get U+02BC, U+2019 and U+2018 at hand 
as well, plus the actual U+0027, on keyboards that are derived from those of 
typewriters. Word processors are overtaxed with managing all four, while many 
users wish to keep typing ‘apostrophe’ for all of them, and not to change to 
another keyboard driver(?).

A free tool, the Microsoft Keyboard Layout Creator, allows every user to add 
U+02BC to their preferred keyboard layout, for example in the dead list of the 
apostrophe on the US International keyboard, a layout where U+2019 is already 
found, along with U+2018. You may choose a double stroke on Apostrophe to 
generate the modifier letter. But as this layout obviously is not so useful, 
you’ll prefer to get them on the US Standard layout or, depending on where you 
live, on the UK standard or extended or any other layout.

A more complete solution is obtained with the Windows Driver Kit, a free 
development kit which makes it possible to implement a Kana toggle, to toggle 
Apostrophe on the US Standard keyboard between U+0027 and U+02BC *or* U+2019. 
The least used among all three will be put into the dead list, when adding one 
dead key on this layout, say Grave. Then, [Grave] [Apostrophe] will result in 
the missing apostrophe character.

 how are they supposed to deal with incorrect usage?

If the document is already incorrect, there will be nothing to do IMHO but 
check them one by one. Theoretically, word processors could integrate an 
exhaustive checking algorithm with an exhaustive dictionary. With such a tool, 
there would be no “Apostrophe Catastrophe” as it has been called: 

 http://www.newrepublic.com/article/113101/smart-quotes-are-killing-apostrophe

 (found by a search engine).

So, on current keyboard layouts, avoiding the Apostrophe Catastrophe 
would have been unfeasible, just as, with current consumption 
habits, avoiding a number of other catastrophes is unfeasible as well...
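
To illustrate the mechanical part of such a check, here is a minimal sketch in 
Python 3. It is not the exhaustive dictionary-based checker described above. 
It rests on one assumption introduced here: a U+2019 standing directly between 
two word characters is an apostrophe, not a closing quote. The names 
retag_inner and remaining are hypothetical. Leading and trailing cases 
(’tis, dogs’) are only reported, since deciding them would need the dictionary.

```python
import re

# Heuristic (an assumption, not from the thread): U+2019 with a word
# character on both sides is treated as an apostrophe.
INNER = re.compile(r"(?<=\w)\u2019(?=\w)")


def retag_inner(text):
    """Replace word-internal U+2019 with U+02BC MODIFIER LETTER APOSTROPHE."""
    return INNER.sub("\u02bc", text)


def remaining(text):
    """Return the indices of U+2019 that the heuristic cannot decide."""
    retagged = retag_inner(text)  # one-for-one replacement, indices unchanged
    return [i for i, ch in enumerate(retagged) if ch == "\u2019"]


if __name__ == "__main__":
    sample = "don\u2019t say \u2018no\u2019"
    print(retag_inner(sample))
    print(remaining("\u2019tis the dogs\u2019 bowl"))
```

The undecided positions would still have to be checked one by one, as said 
above.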



Nevertheless, this morning I opened once more the Microsoft Keyboard Layout 
Creator. Ten minutes later I got the finished complete package of the US 
American keyboard layout with U+02BC MODIFIER LETTER APOSTROPHE and all English 
quotation marks in one dead key on ‘Grave’, that is key number E00 (ISO/IEC 
9995-1). The same way I made up the keyboard layout for the United Kingdom, 
which uses AltGr, so the apostrophe and all quotes are also on AltGr. Ten 
minutes, again. 

If you don’t use the grave accent (or AltGr), there is strictly no change on 
these keyboard layouts, because I loaded the original Windows US and UK layouts 
into the MSKLC. If you use the grave accent, you must type a space after 
hitting the grave key to get the grave accent (in conformance with the standard 
behavior of dead keys). 
– To get the modifier letter 

Re: Another take on the English apostrophe in Unicode

2015-06-13 Thread Philippe Verdy
I disagree: U+02BC already qualifies as a letter (even if it is not
specific to the Latin script and is not dual-cased). It is perfectly
integrable in language-specific alphabets and we don't need another
character to encode it once again as a letter.

So the only question is about choosing between:
- on one side, U+02BC (the existing apostrophe letter), and other possible
candidate letters for alternate forms (including U+02C8 for the vertical
form, and the common fallback letter U+00B4 present in many legacy fonts
for systems built before the UCS was standardized and using legacy 8-bit
charsets such as ISO 8859-1).
- and on the other side, U+2019 where it is encoded as a quotation
punctuation mark (like also the legacy ASCII single quote)

Note that U+00B4 (from ISO 8859-1) has also been used in association with
U+0060 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by
assigning an orientation: the exact shape of these two is variable, between
a thin rectangle, a wedge, or a curly comma (shaped like the digits 6 and
9), as is the exact angle when it is a wedge or thin rectangle. These
characters, however, have long been used in overstriking mode to add
accents over Latin capital letters, so the curly comma shapes are very
uncommon and they are more horizontal than vertical, and U+00B4 will be a
very poor candidate for the apostrophe, which should have a narrow advance
width.

So there remain in practice U+02BC and U+02C8 for this apostrophe letter.
Which one you use is a matter of preference, but U+02C8 will not be used if
there are two distinct apostrophes in the language (e.g. in Polynesian
languages, where the distinction was made even clearer by using right or
left half rings U+02BE/U+02BF, or glottal letters U+02C0/U+02C1 if that
letter has a very distinctive phonetic realisation as a plain consonant
with two variants, as in Arabic, or even U+02B0 when this is just a breath
without stop). The full range U+02B0-U+02C1 offers more than enough
variation for this letter if you need slight phonetic distinctions.

2015-06-13 8:28 GMT+02:00 Peter Constable peter...@microsoft.com:

 Nice article, as I recall. (Been a long time.)


 Peter

 -Original Message-
 From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of
 Kalvesmaki, Joel
 Sent: Friday, June 5, 2015 7:27 AM
 To: Unicode Mailing List
 Subject: Re: Another take on the English apostrophe in Unicode

 I don't have a particular position staked out. But to this discussion
 should be added the very interesting work done by Zwicky and Pullum arguing
 that the apostrophe is the 27th letter of the Latin alphabet. Neither
 U+2019 nor U+02BC would satisfy that position. See:

 Zwicky and Pullum 1983: Zwicky, Arnold M., and Geoffrey K. Pullum.
 Cliticization vs. Inflection: English N'T. Language 59, no. 3 (1983):
 502-513.

 It's nicely summarized and discussed here:
 http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/

 jk
 --
 Joel Kalvesmaki
 Editor in Byzantine Studies
 Dumbarton Oaks
 202 339 6435





Re: Another take on the English apostrophe in Unicode

2015-06-13 Thread Philippe Verdy
I don't agree with this Grévisse definition (and I'm not alone, other
grammarians and dictionaries don't follow Grévisse, and even the French
Academy disagrees).

Maybe this is a form of composition, but it is not correct to say that it
creates a new word; it just means that words take on new semantics in specific
contexts (here, idiomatic expressions where the term pomme undergoes a minor
shift of meaning, which also occurs in pomme de pin = pine cone, or
chou pomme, as well as in the alternate sense of pomme, related only
to its rough shape, to designate a human head and by extension a person,
also used in idiomatic expressions like c'est pour ma pomme). But the
word itself is not different and in fact the etymology is the same; this
was only a progressive extension of meaning that finally created an
idiomatic expression, not a new word.

A compound word (mot composé) needs clear gluing, by a hyphen, or an
apostrophe, or the absence of space and punctuation. Grévisse still records
much good advice that is too frequently forgotten today, but here it went
too far into details that were not needed to preserve the semantics of the
language.

Another proof is the culinary expression pomme frite, which does not mean a
fried apple, but a fried potato: pomme de terre has been abbreviated
to just pomme, and this term even disappears now when the participle
frite, used as an attributive adjective, is then nominalized. The
idiomatic expression pomme de terre is not so idiomatic; it is
just an extension lemma added to the term pomme (apple). The composition
has in fact never been clearly attested, but if it were, hyphens would long
have been used (many hyphens are now starting to disappear in
compound words, replaced by direct gluing, which is accepted in most cases).

2015-06-13 5:11 GMT+02:00 Eric Muller eric.mul...@efele.net:

  On 6/10/2015 9:37 PM, Philippe Verdy wrote:

 The French pomme de terre (potato in English, French vulgar synonym :
 patate) is a single lemma in dictionaries, but is still 3 separate words
 (only the first one takes the plural mark), it is not considered a nom
 composé (so there's no hyphens).



 Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1 Elements of
 the language, chapter 7 The words, section 3 Formation of new words,
 article 2, Composition, very first paragraph (179 overall):

 ---
 By *composition*, language creates new words, either by combining simple
 words with existing words, or by preceding these simple words  with
 syllables that have no independent existence:

 *Chou-fleur, gendarme, pomme de terre, contredire, désunir, paratonnerre. *

 A word, despite being formed of graphically independent elements, is
 *composed* as soon as it brings to mind, not the distinct images of each
 of the words from which it is composed, but a single image. Thus the
 composites *hôtel de ville, pomme de terre, arc de triomphe* each remind
 of a unique image, and not of the distinct images of *hôtel* and of
 *ville*, of *pomme* and of *terre*, of *arc* and of
 *triomphe. *

 *---*

 *(hôtel de ville* = city hall; *pomme* = apple, *de* = of, *terre* =
 earth)

 Paragraph 181, 3rd remark:

 ---
 Sometimes the elements composing [the word] are welded in a simple word:
 *Bonheur, contredire, entracte;* sometimes they are connected by a
 hyphen: *chou-fleur, coffre-fort;* sometimes they stay independent
 graphically: *Moyen âge, pomme de terre.*

 ---

 (“Le Grévisse” as we affectionately call it, or *Le bon usage / French
 Grammar with remarks on today’s French language*, is a must-have for the
 student of French. It is encyclopedic in its depth, and has tons of
 examples and counter-examples. Interestingly, the French Wikipedia page
 says “a descriptive grammar of French”, while the English Wikipedia page
 says “a prescriptive grammar”; it’s both!)

 I agree that we don’t need a new space coded character. I was just
 pointing out that some of the arguments for a new coded character for the
 apostrophe in *don’t* apply equally well to the spaces in the word *pomme
 de terre*.

 Eric.




RE: Another take on the English apostrophe in Unicode

2015-06-13 Thread Peter Constable
Nice article, as I recall. (Been a long time.)


Peter

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Kalvesmaki, Joel
Sent: Friday, June 5, 2015 7:27 AM
To: Unicode Mailing List
Subject: Re: Another take on the English apostrophe in Unicode

I don't have a particular position staked out. But to this discussion should be 
added the very interesting work done by Zwicky and Pullum arguing that the 
apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC 
would satisfy that position. See:

Zwicky and Pullum 1983: Zwicky, Arnold M., and Geoffrey K. Pullum. 
Cliticization vs. Inflection: English N'T. Language 59, no. 3 (1983): 502-513.

It's nicely summarized and discussed here:
http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435




RE: Another take on the English apostrophe in Unicode

2015-06-13 Thread Peter Constable
I should qualify my statement. The Zwicky and Pullum article was a nice piece 
of linguistic analysis regarding the morphological characteristics of “n’t”. 
Their remark about apostrophe, however, was not so much about orthography — 
which was not the focus of their article — but was rather a way of putting an 
exclamation on their findings.

When it comes to orthography, the notion of what comprise words of a language 
is generally pure convention. That’s because there isn’t any single 
_linguistic_ definition of word that gives the same answer when phonological 
vs. morphological or syntactic criteria are applied. There are book-length 
works on just this topic, such as this:

Di Sciullo, Anna Maria, and Edwin Williams. 1987. On the definition of word. 
(Linguistic Inquiry monograph fourteen.) Cambridge, Massachusetts, USA: The MIT 
Press.


Peter

From: ver...@gmail.com [mailto:ver...@gmail.com] On Behalf Of Philippe Verdy
Sent: Saturday, June 13, 2015 12:03 AM
To: Peter Constable
Cc: Kalvesmaki, Joel; Unicode Mailing List
Subject: Re: Another take on the English apostrophe in Unicode

I disagree: U+02BC already qualifies as a letter (even if it is not specific to 
the Latin script and is not dual-cased). It is perfectly integrable in 
language-specific alphabets and we don't need another character to encode it 
once again as a letter.

So the only question is about choosing between:
- on one side, U+02BC (the existing apostrophe letter), and other possible 
candidate letters for alternate forms (including U+02C8 for the vertical form, 
and the common fallback letter U+00B4 present in many legacy fonts for systems 
built before the UCS was standardized and using legacy 8-bit charsets such as 
ISO 8859-1).
- and on the other side, U+2019 where it is encoded as a quotation punctuation 
mark (like also the legacy ASCII single quote)

Note that U+00B4 (from ISO 8859-1) has also been used in association with 
U+0060 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by 
assigning an orientation: the exact shape of these two is variable, between a 
thin rectangle, a wedge, or a curly comma (shaped like the digits 6 and 9), as 
is the exact angle when it is a wedge or thin rectangle. These characters, 
however, have long been used in overstriking mode to add accents over Latin 
capital letters, so the curly comma shapes are very uncommon and they are more 
horizontal than vertical, and U+00B4 will be a very poor candidate for the 
apostrophe, which should have a narrow advance width.

So there remain in practice U+02BC and U+02C8 for this apostrophe letter. 
Which one you use is a matter of preference, but U+02C8 will not be used if 
there are two distinct apostrophes in the language (e.g. in Polynesian 
languages, where the distinction was made even clearer by using right or left 
half rings U+02BE/U+02BF, or glottal letters U+02C0/U+02C1 if that letter has a 
very distinctive phonetic realisation as a plain consonant with two variants, 
as in Arabic, or even U+02B0 when this is just a breath without stop). The full 
range U+02B0-U+02C1 offers more than enough variation for this letter if you 
need slight phonetic distinctions.

2015-06-13 8:28 GMT+02:00 Peter Constable peter...@microsoft.com:
Nice article, as I recall. (Been a long time.)


Peter

-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Kalvesmaki, Joel
Sent: Friday, June 5, 2015 7:27 AM
To: Unicode Mailing List
Subject: Re: Another take on the English apostrophe in Unicode

I don't have a particular position staked out. But to this discussion should be 
added the very interesting work done by Zwicky and Pullum arguing that the 
apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC 
would satisfy that position. See:

Zwicky and Pullum 1983: Zwicky, Arnold M., and Geoffrey K. Pullum. 
Cliticization vs. Inflection: English N'T. Language 59, no. 3 (1983): 502-513.

It's nicely summarized and discussed here:
http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435




Re: Another take on the English apostrophe in Unicode

2015-06-13 Thread Mark Davis ☕️
On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable peter...@microsoft.com
wrote:

 When it comes to orthography, the notion of what comprise words of a
 language is generally pure convention. That’s because there isn’t any
 single *_linguistic_ *definition of word that gives the same answer when
 phonological vs. morphological or syntactic criteria are applied. There are
 book-length works on just this topic, such as this:


In particular, I see no need to change our recommendation on the character
used in contractions for English and many other languages (U+2019).
Similarly, we wouldn't recommend use of anything but the colon for marking
abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for
supercali...docious.

(IMO, U+02BC was probably just a mistake; the minor benefit is not worth
the confusion.)

Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*


Re: Another take on the English Apostrophe in Unicode

2015-06-12 Thread Philippe Verdy
2015-06-12 17:02 GMT+02:00 Marcel Schneider charupd...@orange.fr:

  Would it be possible to have wordprocessing software where one uses
  CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC

 CONTROL and CONTROL+SHIFT cannot work on French keyboards where the
 existing ASCII apostrophe is on the numeric row where there are also ascii
 controls mapped matching the ASCII open brace that is itself mapped on
 ALTGR (or CTRL+ALT) in order to generate instead the C0 control.


In general it is a bad idea to map any printable character, combining
character or dead key with the CTRL or CTRL+SHIFT modifiers associated with
any position in the alphanumeric part of the keyboard: this should remain
reserved for mapping function keys or C0/C1 controls only, which local
applications will use to assign application-specific functions.


Re: Another take on the English apostrophe in Unicode

2015-06-12 Thread Eric Muller

  
  
On 6/10/2015 9:37 PM, Philippe Verdy
  wrote:


  The French "pomme de terre" ("potato" in English,
French vulgar synonym : "patate") is a single lemma in
dictionaries, but is still 3 separate words (only the first one
takes the plural mark), it is not considered a "nom composé" (so
there's no hyphens).



Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1
Elements of the language, chapter 7 The words, section 3 Formation
of new words, article 2, Composition, very first paragraph (179
overall):

---
By composition, language creates new words, either by
combining simple words with existing words, or by preceding these
simple words  with syllables that have no independent existence: Chou-fleur,
  gendarme, pomme de terre, contredire, désunir, paratonnerre.
  

A word, despite being formed of graphically independent
  elements, is composed as soon as it brings to mind, not
  the distinct images of each of the words from which it is
  composed, but a single image. Thus the composites hôtel
de ville, pomme de terre, arc de triomphe each remind of a
  unique image, and not of the distinct images of hôtel and
  of ville, of pomme and of terre, of arc
  and of triomphe. 
  
---

(hôtel de ville = city hall; pomme = apple, de
= of, terre = earth)

Paragraph 181, 3rd remark:

---
Sometimes the elements composing [the word] are welded in a simple
word: Bonheur, contredire, entracte; sometimes they
are connected by a hyphen: chou-fleur, coffre-fort;
sometimes they stay independent graphically: Moyen âge, pomme de
  terre.
  
  ---
  
(“Le Grévisse” as we affectionately call it, or Le bon usage
  / French Grammar with remarks on today’s French language, is a
must-have for the student of French. It is encyclopedic in its
depth, and has tons of examples and counter-examples. Interestingly,
the French Wikipedia page says “a descriptive grammar of French”,
while the English Wikipedia page says “a prescriptive grammar”; it’s
both!)

I agree that we don’t need a new space coded character. I was just
pointing out that some of the arguments for a new coded character
for the apostrophe in don’t apply equally well to the spaces
in the word pomme de terre.

Eric.

  



Re: Another take on the English apostrophe in Unicode

2015-06-12 Thread Marcel Schneider

On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer  wrote:

 Confusion between apostrophe and quoting -- 
 blame the scribe who came up with the ambiguous use, 
 not the people who gave it a number.

There’s a lot of confusion in writing, especially since this job was done on 
typewriters, from which computer keyboards are derived, while the narrowing of 
the character sets shifted from mechanics to code pages. This is all over, 
thanks to Unicode and its principle defined in TUS §1.3: 

 “The Unicode Standard does not define glyph images. That is, the standard 
 defines how characters are interpreted, not how glyphs are rendered.” 
 Unfortunately the new precision and differentiation have sometimes been 
 rejected by sticking to legacy practice and for backwards compatibility’s 
 sake.

The use of a paired quotation mark (U+2019) as an English apostrophe, against 
the UTC’s initially successful attempt to disambiguate the two by recommending 
U+02BC (same glyph) for use as apostrophe, is a leading example of how the hard 
labor of ordering and clarification, aiming at what in ancient Greek is called 
‘Kosmos’, can at any time be thrown back into chaos by short-sighted views and 
doubtful considerations. There was a discussion on this Mailing List in July 
1999, before the release of version 3.0.0 of the Standard: 
“Apostrophes, quotation marks, keyboards and typography”, when the demand for 
simplification had already been addressed with the corrections published as 
version 2.1:
 

 Couldn't Unicode follow Microsoft and just remove the
 recommendation that U+02BC be the recommended apostrophe character and
 instead give U+2019 the dual meaning that it de-facto has already today?

http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html
[The quoted UTR#8 is now located at:
http://www.unicode.org/reports/tr8/tr8-3.html]


(The shift, as viewed at NamesList level, is now highlighted at 
http://charupdate.info#ambiguation)

 

On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer  wrote further:

 If anything, Unicode might have made a mistake in  
 encoding two of these that look identical.  

 How are normal users supposed to 
 find both U+2019 and U+02BC on their keyboards,  
 and how are they supposed to deal with incorrect usage? 

I never believed it could have been a mistake, since we know that Unicode 
encodes semantics, not glyphs. Were there no modifier letters at all, Unicode 
would have had to introduce an apostrophe character, because an apostrophe is 
not at all the same as a quotation mark and does not work the same way either. 
By handling text, not theories, Ted Clancy at Mozilla clearly shows us that 
conflating the apostrophe with a close-quote brings counterproductive 
complications that severely impact the productivity of users. 
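
To illustrate how the encoded semantics differ, here is a minimal sketch in 
Python 3 using only the standard unicodedata and re modules. It prints the 
name and general category of each of the four characters, then splits a word 
with a naive \w+ pattern. Note that \w+ is not the UAX #29 word segmentation 
algorithm; it is used here only as a simple stand-in that, like many tools, 
follows the character properties. The helper names describe and words are 
hypothetical.

```python
import re
import unicodedata

CANDIDATES = ["\u0027", "\u2018", "\u2019", "\u02bc"]


def describe(ch):
    """Return (code point, character name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


def words(text):
    """Naive tokenizer: runs of word characters (NOT UAX #29)."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    for ch in CANDIDATES:
        print(*describe(ch))
    # U+2019 is punctuation (Pf), so the word falls apart;
    # U+02BC is a letter (Lm), so the word stays whole.
    print(words("don\u2019t"))
    print(words("don\u02bct"))
```

With U+2019 the naive tokenizer yields two fragments, with U+02BC a single 
word, which is the kind of complication referred to above.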

What, now, about “normal users”? To fix the issue, consider that wishing to 
stay one's whole life with one and the same keyboard layout while, at the same 
time, changing to a new smartphone every year or two needs some explanation. 
I guess it is because keyboards don't display anything by themselves except 
keycap labels, so you're never quite sure about them.

We should consider, too, that before being a matter of finding a character on 
the keyboard, the matter is about using it. How are we supposed to choose the 
right one out of four apostrophes/quotes (U+0027, U+02BC, U+2019, U+2018) while 
many of us seem not to know or not to care where to place it? But supposing we 
do, it would indeed be much more useful to tell the machine whether we want to 
type an apostrophe or a quotation mark, and for that, the existing key is 
enough (see T. Clancy’s blog). Is managing nested quotes already implemented in 
word processing? I never heard it is. Definitely, here’s a point where the 
simplification wished for in a widespread word processing software considerably 
worsened the working conditions of all demanding people. The gap between 
word processing and desktop publishing is the smaller. 

Adding characters to your preferred keyboard on Windows is very easy using the 
Microsoft Keyboard Layout Creator, which has an end-user UI. As the compiled 
drivers are not even Windows-versioned (from NT-4 upwards), you can deploy them 
in your company and share them among your friends without precautions. That is 
what users are supposed to do. If they don’t, Microsoft is not supposed to 
force it upon them. 

By contrast, if you want a Kana toggle to toggle the apostrophe key between 
U+0027 and U+02BC (and the quotation mark between U+0022 and a dead key for all 
quotation marks), you must use the Windows Driver Kit (along with some other 
resources) plus the MSKLC. If you wish to see it working, you may download an 
experimental keyboard layout on the unfinished webpage http://charupdate.info. 
It exemplifies also the Third level solution and the Compose key solution. 

I hope that helps. 

Marcel Schneider 


Another take on the English Apostrophe in Unicode

2015-06-12 Thread Marcel Schneider
On Fri, June 5, William_J_G Overington wrote:

 Markus Scherer wrote:

 How are normal users supposed to find both U+2019 and U+02BC on their 
 keyboards, and how are they supposed to deal with incorrect usage?

 I replied:

 Would it be possible to have wordprocessing software where one uses 
 CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC 
 for input and could there be a show in colour mode where U+2019 is 
 displayed in cyan and U+02BC is displayed in red, while 
 everything else is displayed in black?

 I am wondering whether some existing software packages 
 might be able to be used for the character inputting part using customized 
 keyboard short cuts.


 https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts

 I realize that the cyan and red colours cannot be done at present, 
 yet I have now thought of the alternative for now of being able to test what 
 is 
 in the text by using a special version of an open source font 
 where there are distinctive glyphs one from the other 
 for the two characters.


If your goal is to check right now which apostrophes are in a given text, an 
easy way is to search for U+02BC and to ask the software to highlight all. 
Of PagePlus I’ve got only an expired demo version, but I can assure you that in 
Word, a side pane may even show you the pages with all instances highlighted, 
and allows you to browse them. To start, press Ctrl+F and type a modifier 
letter apostrophe into the search bar, or select one in your text and then 
press Ctrl+F.
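
Outside a word processor, the same check can be sketched in a few lines of 
Python 3. This is only an illustration under the assumption that the text is 
available as a plain string; the names APOSTROPHE_LIKE, survey and positions 
are hypothetical.

```python
from collections import Counter

# The four characters discussed in this thread.
APOSTROPHE_LIKE = {
    "\u0027": "U+0027",
    "\u2018": "U+2018",
    "\u2019": "U+2019",
    "\u02bc": "U+02BC",
}


def survey(text):
    """Count how often each apostrophe-like character occurs in text."""
    return Counter(APOSTROPHE_LIKE[ch] for ch in text if ch in APOSTROPHE_LIKE)


def positions(text, ch="\u02bc"):
    """Return the indices at which ch (default U+02BC) occurs in text."""
    return [i for i, c in enumerate(text) if c == ch]


if __name__ == "__main__":
    sample = "It\u02bcs \u2018fine\u2019, isn\u2019t it?"
    print(survey(sample))
    print(positions(sample))
```

The counts tell you which characters are present at all, and the positions 
can be used to jump to each instance.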

Getting the apostrophes colored and with a distinctive glyph is possible too. 
As you are talking about changing the font, I suppose you are working with 
plain text. In this case you can do a search-and-replace which gives all U+02BC 
a red color and another font, say Tahoma when the text is in Arial. Again I 
speak for Word, where a Plus button shows a Formatting button for the replacing 
text (replace by the same but with font formatting for typeface and color), but 
I suppose PagePlus allows the same procedure.

About suggesting options, one might think about a blinking markup which would 
make it possible to find the problematic apostrophes even faster. As a shortcut 
for U+02BC I’d prefer CONTROL APOSTROPHE because it may occur more often. 
However, adding something to your keyboard using Right Alt (that is AltGr) is 
much more efficient because:
— You add whenever you want and nearly what you want (if no Kana and no chained 
dead keys, you get the needed characters on your keyboard the time you write to 
lists and fora).
— You are not bound to a given high-end software (the driver works whenever you 
type on your keyboard).
— You go on to be an active part of your communities (Unicode, Serif, ...) by 
sharing the resulting drivers with other people.

Definitely, any shortcut for an apostrophe would slow down the writing speed, 
therefore Apostrophe is preferred on Base shift state. So you may design a 
variant keyboard layout with U+02BC instead of U+0027, even if that be the only 
change, and toggle between the new one and the usual one by means of your OS's 
facilities. Or you may choose to add a Kana toggle to toggle the apostrophe key 
directly inside the driver, but achieving this is somewhat longer.

For an example, you may look at the unfinished page http://charupdate.info 
where there is already an experimental keyboard layout for download. With 
U+02BC MODIFIER LETTER APOSTROPHE.


I hope that helps.


Marcel Schneider


Re: Another take on the English apostrophe in Unicode

2015-06-11 Thread Bill Poser
To add a factor that I think hasn't been mentioned, there are languages in
which apostrophe is used both as a letter by itself and as part of a
complex letter. Most of the native languages of British Columbia write
glottalized consonants as C+', e.g. t' for an ejective alveolar stop, and
many use apostrophe by itself for the glottal stop. (Another common
convention, which produces other difficulties, is to use the number 7 for
glottal stop.)

Bill

On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy tcla...@mozilla.com wrote:

  On 4/Jun/2015 14:34 PM, Markus Scherer wrote:

 Looks all wrong to me.

 Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your
 points below.



 You can't use simple regular expressions to find word boundaries. That's
 why we have UAX #29.


 And UAX #29 doesn't work for words which begin or end with apostrophes,
 whether represented by U+0027 or U+2019. It erroneously thinks there's a
 word boundary between the apostrophe and the rest of the word.

 But UAX #29 *would* work if the apostrophes were represented by U+02BC,
 which is what I'm suggesting.

 Confusion between apostrophe and quoting -- blame the scribe who came up
 with the ambiguous use, not the people who gave it a number.

 I'm not trying to blame anyone. I'm trying to fix the problem.

 I know this problem has a long history.

 English is taught as that squiggle being punctuation, not a letter.

 I think we need to make a distinction between the colloquial usage of the
 word punctuation and the Unicode general category punctuation which has
 specific technical implications.

 I somewhat wish that Unicode had a separate category for Things that look
 like punctuation but behave like letters, which might clear up this
 taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF
 RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are
 actually modifiers, into that category too.) But we don't. And the English
 apostrophe behaves like a letter, regardless of what your primary school
 teacher might have told you, so with the options available in Unicode, it
 needs to be classed as a letter.

 don’t is a contraction of two words, it is not one word.

 This is utter nonsense. Should my spell-checker recognise hasn't as a
 valid word? Or should it consider hasn't to be the word hasn followed
 by the word t, and then flag both of them as spelling errors?

 Is fo'c'sle the three separate words fo, c, and sle?

 The idea that words with apostrophes aren't valid words is a regrettable
 myth that exists in English, which has repeatedly led to the apostrophe
 being an afterthought in computing, leading to situations like this one.

 If anything, Unicode might have made a mistake in encoding two of these
 that look identical. How are normal users supposed to find both U+2019
 and
 U+02BC on their keyboards, and how are they supposed to deal with
 incorrect
 usage?

 Yeah, and there are fonts where I can't tell the difference between
 capital I and lower-case l. But my spell-checker will underline a word
 where I erroneously use an I instead of an l, and I imagine spell-checkers
 of the future could underline a word where I erroneously use a closing
 quote instead of an apostrophe, or vice versa.

 There are other possible solutions too, but I don't want to get into a
 discussion about UI design. I'll leave that to UI designers.

 - Ted



Re: Another take on the English apostrophe in Unicode

2015-06-11 Thread Philippe Verdy
The apostrophe is also used in the Breton trigraph c’h, which is considered a
single letter of the Breton alphabet but is actually entered as two letters
with a diacritic-like apostrophe in the middle (which in this case is still
not a letter of the alphabet). The trigraph c’h is distinct from the digraph ch.
Breton **also** uses a regular apostrophe for elision.

In fact what you note for the ejective in Native American languages is
effectively a right-combining diacritic, and still not a letter by itself.
However, given its position and the fact that it is spacing, it is the
spacing form of the apostrophe diacritic that should be used, and the
candidates for that form are:

* U+00B4 (acute; most often ugly, located too high, and too horizontal),
* U+02B9 (prime; nearly good, but still too high),
* U+02BC (apostrophe),
* U+02C8 (vertical high tick, but confusable with the mark of stress in IPA
before a phonetic syllable), and
* U+02CA (acute/2nd tone, which for me is not distinct from U+00B4; used only
with sinograms in Mandarin Chinese, its metrics differ from those of U+00B4,
which match the Latin metrics).

In my opinion 02BC is the best choice for the diacritic apostrophe.
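
For reference, the names and general categories of the five candidates listed above can be looked up with Python's standard unicodedata module. This is only a sketch to make the comparison concrete; the choice of Python and the variable name `candidates` are illustrative assumptions, not part of the original message.

```python
import unicodedata

# The five candidate code points from the list above.
candidates = ["\u00b4", "\u02b9", "\u02bc", "\u02c8", "\u02ca"]

for ch in candidates:
    # category() returns the two-letter general category, e.g. "Lm" or "Sk".
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

In the Unicode Character Database, U+00B4 is a modifier symbol (Sk), while the other four are modifier letters (Lm), so only the latter are treated as letters by category-based software.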

The other character for the **elision** apostrophe is a punctuation mark
U+2019 (just like the full stop punctuation is also used as an abbreviation
mark). There's no confusion with its alternate role as a right-side single
quote because U+2019 is used in languages that normally never use the
single quotes, but chevrons (or other punctuation signs in East-Asian
scripts).

But in English, where single quotes are used for short quotations, there's
still a problem representing this elision apostrophe when it occurs not
between two letters, where it also marks the gluing of two morphemes (as in
don't or Peter's), but at the beginning or end of a word. An elision at the
end of a word is also problematic when this is the final word of a quoted
sentence. If you really want to cite a single English word terminated by an
elision apostrophe, single quotes won't be usable and you'll use
chevrons, as in ‹demo’›, rather than single or double quotes, which are
difficult to discriminate.


Re: Another take on the English apostrophe in Unicode

2015-06-11 Thread Bill Poser
I agree with the recommendation of U+02BC. However, it is in fact rarely
used because most of the people who write these languages or create
supporting infrastructure are unaware of such issues.

A small point: it isn't always the spacing diacritic that is used. In some
languages, e.g. Halkomelem, people use the spacing apostrophe if they have
to but prefer the non-spacing version.

Re: Another take on the English apostrophe in Unicode

2015-06-11 Thread Ted Clancy
On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 The ASCII punctuation characters have been overridden for a lot of different
 roles. There's simply no way to map them to a category that matches their
 semantic role. [...] Pd (dash) is then appropriate for the ASCII hyphen-minus.


I agree, but I wasn't talking about the ASCII hyphen, U+002D
(HYPHEN-MINUS). I was talking about U+2010 (HYPHEN).

I also wasn't talking about changing the properties of U+0027 (APOSTROPHE).


 in dictionaries I've seen small slanted tildes, or slanted small equal
 signs, to make the distinction with true hyphens used in compound words


This is drifting off-topic, but I wanted to address the thing you just said
above. Firstly, in the dictionaries I've seen, the slanted double hyphen is
only used when a line break happens to occur at the same place as a true
hyphen. It replaces the true hyphen. When a line is broken at a
hyphenation point between letters, an ordinary-looking hyphen is displayed.

Secondly, this character is encoded in Unicode at U+2E17 (DOUBLE OBLIQUE
HYPHEN).
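
The code point can be checked against the Unicode Character Database with Python's standard unicodedata module. This is a small sketch, with Python and the variable names chosen purely for illustration; it compares the double oblique hyphen with the two hyphens mentioned earlier in this message.

```python
import unicodedata

double_oblique = "\u2e17"
hyphen = "\u2010"
hyphen_minus = "\u002d"

for ch in (double_oblique, hyphen, hyphen_minus):
    # Print code point, general category and formal character name.
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")
```

All three share the general category Pd (dash punctuation), so the category alone does not distinguish a line-break hyphen from an orthographic one.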

- Ted


Re: Another take on the English apostrophe in Unicode

2015-06-11 Thread Philippe Verdy
2015-06-11 20:46 GMT+02:00 Bill Poser billpos...@gmail.com:

 I agree with the recommendation of U+02BC. However, it is in fact rarely
 used because most of the people who write these languages or create
 supporting infrastructure are unaware of such issues.

 A small point: it isn't always the spacing diacritic that is used. In some
 languages, e.g. Halkomelem, people use the spacing apostrophe if they have
 to but prefer the non-spacing version.


True, but in the examples I gave, spacing is needed: the apostrophe is
intended not to collide with the previous or next letter, including when
writing capital letters. In the Breton trigraph c’h, where it plays a
diacritic role, as well as in the English elision don’t, the collision
would occur after the apostrophe, with the ascenders.

The only alternative would have been to use a diacritic above one of the
two letters for the diacritic apostrophe (the best diacritic for Breton or
English would have been an acute accent over the first consonant). But such
use of combining characters is non-conforming for an elision mark.

An elision alone is not supposed to change the pronunciation of the
remaining letters, so a combining diacritic would not have been appropriate
for the elisions in English don’t, or in French j’ai or s’est. This is not a
strict rule: French and English also have exceptions, written combinations
that change how the letters are phonetically realized, including with
elisions. don’t is a perfect example: the n loses its consonant value as it
is glued to the previous vowel to nasalize it and slightly stress it, and in
other contexts the following t is also muted, as in you don't have to do that
in fast speech. This is still the same contraction/elision, and it is
justified to keep the elision mark separate without noting how the preceding
or following letters are contextually realized. In all cases the elision
glues two syllables into one, and the apostrophe is written between the
remaining letters of the morphemes on each side.

If you use a non-spacing version, this can in fact only occur graphically
when the following letter is a small letter without ascenders: I still
think that this is the spacing version, and that what happens is just the
effect of some contextual typographic kerning (the same thing that happens in
pairs like AV, fi, ij, To...).



Also, you say the proper character is rarely used. For the elision apostrophe
this is plain wrong, for French at least, even if people only have an ASCII
apostrophe on their native keyboard: many word processors will correctly
enter the appropriate curly apostrophe as U+2019 instead of the ugly ASCII
vertical quote. Even in English, when you look at correctly typeset
documents, the ASCII quote is replaced by U+2019 (look at large section
headings, book titles).

U+2019 is also preferred in English for the elision apostrophe. For English
you may want to read this:
http://www.creativebloq.com/typography/mistakes-everyone-makes-21514129

ASCII and computer keyboards just perpetuate the limited character set that
was supported by old mechanical typewriters. I don't understand why PC
keyboards could be extended with many multimedia control keys or function
keys, but not with the traditional quotes that are needed (or even with
letters still missing from all standard physical keyboard layouts for French,
such as œ/Œ, æ/Æ, or frequent accented capitals such as É, which is however
present on virtual onscreen keyboards for smartphones and tablets).

It's high time to restore these letters, and also to campaign for
manufacturers of physical keyboards to add a few more keys for national
letters (they did it for Japanese only; why not for French or even English,
to have more punctuation signs and missing letters or diacritics?). It is
perfectly possible to find a place for them on physical keyboards just
above the numeric keys (the F1..F12 keys can be compacted if needed, and a
couple of dead keys can also be mapped to the right of the Return key
without reducing the size of the space bar, the Return/Backspace keys, or
other modifier keys).

Some notebook manufacturers have used two additional preprogrammed keys
(e.g. Acer, stupidly, for an unneeded additional Euro symbol, whose location
on AltGr+E, or AltGr+4 in the UK, is standard, the second key being bound to
the dollar symbol, also not needed!). What is needed is 5 standard keys with
standard keycodes, different from the keycodes used for user-programmable keys
(generally labelled PF1, PF2... but sometimes unlabelled) and different
from application-dependent function keys (e.g. generic color keys, like on
TV remote controls for navigation in menus: red, green, yellow, blue).

Note that this is different from the existing feature on some keyboards
defining programmable keys, whose layout is not programmable by the driver
itself but by individual settings of the user, independently of the
selected 

Re: Another take on the English apostrophe in Unicode

2015-06-10 Thread Ted Clancy
 On 4/Jun/2015 14:34 PM, Markus Scherer wrote:

 Looks all wrong to me.

Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your
points below.



 You can't use simple regular expressions to find word boundaries. That's
 why we have UAX #29.


And UAX #29 doesn't work for words which begin or end with apostrophes,
whether represented by U+0027 or U+2019. It erroneously thinks there's a
word boundary between the apostrophe and the rest of the word.

But UAX #29 *would* work if the apostrophes were represented by U+02BC,
which is what I'm suggesting.
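
To make the difference concrete, here is a minimal Python sketch. It does not implement UAX #29; it uses a much cruder tokenizer, the `\w+` pattern of Python's standard `re` module, which decides purely on character properties. The helper name `words` and the sample phrase are illustrative assumptions.

```python
import re

def words(text):
    # In Python 3, \w+ matches runs of Unicode letters, digits and underscore.
    return re.findall(r"\w+", text)

# The same phrase with three different characters as the apostrophe.
samples = {
    "U+0027": "'tis the dogs' bowl",
    "U+2019": "\u2019tis the dogs\u2019 bowl",
    "U+02BC": "\u02bctis the dogs\u02bc bowl",
}

for label, text in samples.items():
    print(label, words(text))
```

With U+0027 and U+2019 the leading and trailing apostrophes are dropped from the tokens; with U+02BC they stay attached, because U+02BC is a letter as far as the regex engine is concerned.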

Confusion between apostrophe and quoting -- blame the scribe who came up
 with the ambiguous use, not the people who gave it a number.

I'm not trying to blame anyone. I'm trying to fix the problem.

I know this problem has a long history.

English is taught as that squiggle being punctuation, not a letter.

I think we need to make a distinction between the colloquial usage of the
word punctuation and the Unicode general category punctuation, which has
specific technical implications.

I somewhat wish that Unicode had a separate category for Things that look
like punctuation but behave like letters, which might clear up this
taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF
RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are
actually modifiers, into that category too.) But we don't. And the English
apostrophe behaves like a letter, regardless of what your primary school
teacher might have told you, so with the options available in Unicode, it
needs to be classed as a letter.
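
The technical implications can be seen by querying the general category directly. The following is a small sketch using Python's standard unicodedata module; the dictionary name `chars` and its labels are illustrative assumptions.

```python
import unicodedata

chars = {
    "ASCII apostrophe": "\u0027",
    "right single quotation mark": "\u2019",
    "modifier letter apostrophe": "\u02bc",
    "modifier letter right half ring": "\u02be",
    "modifier letter left half ring": "\u02bf",
}

for label, ch in chars.items():
    # isalpha() is true for characters in the letter categories (L*).
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  isalpha={ch.isalpha()}  {label}")
```

U+0027 (Po) and U+2019 (Pf) are punctuation, while U+02BC, U+02BE and U+02BF are all modifier letters (Lm) and therefore count as alphabetic.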

don’t is a contraction of two words, it is not one word.

This is utter nonsense. Should my spell-checker recognise hasn't as a
valid word? Or should it consider hasn't to be the word hasn followed
by the word t, and then flag both of them as spelling errors?

Is fo'c'sle the three separate words fo, c, and sle?

The idea that words with apostrophes aren't valid words is a regrettable
myth that exists in English, which has repeatedly led to the apostrophe
being an afterthought in computing, leading to situations like this one.

If anything, Unicode might have made a mistake in encoding two of these
 that look identical. How are normal users supposed to find both U+2019 and
 U+02BC on their keyboards, and how are they supposed to deal with
 incorrect usage?

Yeah, and there are fonts where I can't tell the difference between capital
I and lower-case l. But my spell-checker will underline a word where I
erroneously use an I instead of an l, and I imagine spell-checkers of the
future could underline a word where I erroneously use a closing quote
instead of an apostrophe, or vice versa.
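
As a rough idea of what such a check might look like, here is a toy Python sketch. The function name `suspect_closing_quotes` is hypothetical, and the heuristic (a U+2019 with a letter on both sides is probably an apostrophe) is an assumption for illustration only; it covers just the mid-word case, and word-initial or word-final apostrophes would need a dictionary.

```python
import re

# A U+2019 with a letter immediately before and after it.
# [^\W\d_] matches any word character that is neither a digit nor underscore.
MID_WORD_QUOTE = re.compile(r"(?<=[^\W\d_])\u2019(?=[^\W\d_])")

def suspect_closing_quotes(text):
    """Return the indices of U+2019 characters that look like apostrophes."""
    return [m.start() for m in MID_WORD_QUOTE.finditer(text)]

sample = "He said \u2018don\u2019t\u2019."
print(suspect_closing_quotes(sample))
```

In the sample, only the quote inside don’t is flagged; the closing quote before the full stop is left alone.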

There are other possible solutions too, but I don't want to get into a
discussion about UI design. I'll leave that to UI designers.

- Ted


Re: Another take on the English apostrophe in Unicode

2015-06-10 Thread Ted Clancy
On 4/Jun/2015 19:01, Leo Broukhis wrote:

 Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for
 example, the word ack-ack isn't decomposable into words, or even morphemes,
 ack and ack.

I do think that U+2010 (HYPHEN) is miscategorised. I think it should have
General Category = Pc, not Pd. (That is, hyphens are connectors, not
dashes.) That would make it a word character.

Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning
it can occur in the middle of numbers or letters). UAX #29 says that U+2010
deliberately does *not* have Word Break = MidNumLet, though an
implementation may treat it as if it did. (UAX #29 doesn't give any reasons
for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have
Word Break = MidNumLet, due to its history of being used as a dash or minus
sign, but U+2010 should never be used as a dash or minus sign, so I don't
see the problem.)
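
The current categories can be inspected with Python's standard unicodedata module. This sketch also tokenizes with the `\w+` pattern of the `re` module, which is a crude stand-in and not an implementation of UAX #29; the variable names are illustrative assumptions.

```python
import re
import unicodedata

hyphen_minus = "\u002d"
hyphen = "\u2010"
low_line = "\u005f"   # the underscore, a connector (Pc)

for ch in (hyphen_minus, hyphen, low_line):
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

# \w+ splits at either hyphen but keeps the underscore inside the token.
print(re.findall(r"\w+", "ack\u2010ack"))
print(re.findall(r"\w+", "ack_ack"))
```

Both hyphens are Pd, so this tokenizer splits ack‐ack in two, whereas the underscore, which is Pc, is kept inside the word.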

But luckily, the miscategorisation of U+2010 hasn't led to any pressing
practical problems, unlike the misuse of U+2019 for the apostrophe.

- Ted


Re: Another take on the English apostrophe in Unicode

2015-06-10 Thread Philippe Verdy
The ASCII punctuation characters have been overridden for a lot of different
roles. There's simply no way to map them to a category that matches their
semantic role. So the ASCII hyphen and apostrophe-quote can only be given a
very weak category that just exhibits their visual role. Pd (dash) is then
appropriate for the ASCII hyphen-minus. You can't really tell from the
character alone if it is a punctuation mark or a minus sign.

If it is a minus sign you can reencode it better using the more specific
mathematical minus sign. Otherwise, even if it is not a minus sign, it can
be:
- a connector between words in compound words (hyphen)
- a trailing mark at the end of a line indicating that a word has been broken
in the middle (but remember that I asked previously for another character for
that role, because this word-breaking hyphen is not necessarily a
horizontal hyphen: in dictionaries I've seen small slanted tildes, or
slanted small equal signs, to make the distinction with true hyphens used
in compound words, also because sometimes these breaks are not necessarily
between two syllables in pocket books with very narrow columns and
minimized spacing)
- a bullet leading items in a vertical list (this should be an en dash,
followed by some spacing)
- a punctuation mark (not necessarily at the beginning of a line) marking the
change of person speaking (very common in literature, notably in theatre).

As a connector between words, there's a demonstrated need to
differentiate regular hyphens, longer hyphens (preferably surrounded by
thin spaces) for noting intervals (we can use the EN DASH for that), and long
hyphens between two separate names that are joined (for example in proper
names after marriage; in France, INSEE encodes this for now using TWO
successive hyphens, which are also used on French identity cards, passports,
social security green cards...).




Still nobody has replied to my past comment (about 1 month ago) about the
various forms of the word-breaking hyphen / line-wrapping symbol:

* I'm not speaking about the SHY control, but about the real character
whose glyph appears when SHY is materialized at the end of a line (and which
should be neither a minus nor an en dash, but also not the same as the
orthographic hyphen used between words in a compound word).

* This character can also be found (and is needed) for breaking long
mathematical formulas, and must be clearly distinct from the regular minus.

* This character is also needed for rendering long lines of programming
code or textual data (it is something that must not be entered in programs
but that must be rendered, because these programs or codes have significant
line breaks: the glyph indicates that the following rendered line break is
to be discarded). Not all programming languages have a syntax allowing
an escape before the line break (such escaping varies: it may be a
backslash in C/C++, or an underscore in Basic, but in data dumps such as
CSV files it is impossible to note such an escape in the data language
itself, and we need to render some specific glyph).

* This character is absolutely needed when rendering on a static medium
(i.e. printing or broadcasting); for a dynamic medium (such as personal
displays with a personal UI) we could still use scrolling, but users don't
like horizontal scrolling and highly prefer reading the text directly. So
they expect to see a distinctive glyph (or icon) marking the distinction
between line breaks that are significant and those that just wrap too-long
lines, and still see the distinction with other regular hyphens and
minus signs (which are also significant and very frequently distinct).





Re: Another take on the English apostrophe in Unicode

2015-06-10 Thread Philippe Verdy
The French pomme de terre (potato in English; colloquial French synonym:
patate) is a single lemma in dictionaries, but is still 3 separate words
(only the first one takes the plural mark). It is not considered a nom
composé (so there are no hyphens).

And they are separated by standard spaces (which are breakable, and
expandable/compressible like all others in justified text)... The
lemma is still recognized if there is extra punctuation in the middle, such
as: « pomme » de terre. We don't need any new space character.

What you want is to insert markup to exhibit the structure of sentences for
grouping words semantically or grammatically. But nobody, including
grammarians, will use this new space; what they'll use is in fact some
additional symbols or presentation features (enclosing boxes, braces above
or below, colors...) if they want to exhibit it on top of the standard text.



2015-06-06 3:08 GMT+02:00 Eric Muller eric.mul...@efele.net:

 On 6/5/2015 10:29 AM, John D. Burger wrote:

 Linguistically, don't and friends pass all the diagnostics that
 indicate they're single words.


 If I am not mistaken, the French pomme de terre also passes the
 diagnostics. So we need a new space character.

 Eric.




Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Leo Broukhis
On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote:

 Hyphens generally make multiple words into one anyway. There's not really
 multiple hyphens the way there's separate quotes and apostrophes.


Generally, but not always, just as apostrophes aren't always at a
contracted word boundary. There is only one hyphen because no language
(AFAIK) claims it as part of its alphabet.

Leo

 On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis l...@mailcom.com wrote:

 Along the same lines, we might need a MODIFIER LETTER HYPHEN, because,
 for example, the word ack-ack isn't decomposable into words, or even
 morphemes, ack and ack.

 Leo

 On Thu, Jun 4, 2015 at 6:31 PM, David Starner prosfil...@gmail.com
 wrote:

 On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com
 wrote:

 don’t is a contraction of two words, it is not one word.


 But as he points out, it's not a contraction of don and t; it is, at
 best, a contraction of do and n't. It's eliding, not punctuating. In the
 comments, he also brings up the examples of Don’t you mind? being okay
 but not *Do not you mind?, and fo’c’sle.

  You can't use simple regular expressions to find word boundaries.

 Who uses _simple_ regular expressions? You can't use any code to
 reliably find word boundaries in English, and that's a problem.





Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Leo Broukhis
 But the point was that treating hyphens as parts of words is not generally a 
 wrong thing.

That brings us back to my original question: where's MODIFIER LETTER
HYPHEN, then? A word is a sequence of letters, isn't it? :)

I agree that conflating apostrophes and quotes is a source of
problems; however, the existence of the MODIFIER LETTER [same glyph as
used for English contractions] in Unicode is a coincidence which
should not have an effect on the usage of apostrophes in English.

Leo

On Thu, Jun 4, 2015 at 11:58 PM, David Starner prosfil...@gmail.com wrote:
 On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote:



On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote:

Hyphens generally make multiple words into one anyway. There's not really
 multiple hyphens the way there's separate quotes and apostrophes.

Generally, but not always, just as apostrophes aren't always at a
 contracted word boundary. There is only one hyphen because no language
 (AFAIK) claims it as part of its alphabet.

 But the point was that treating hyphens as parts of words is not generally a
 wrong thing. There is one generally consistent rule for hyphens. When
 apostrophes and quotes are conflated, there is no one generally acceptable
 rule.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread David Starner


On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote:



On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote:

Hyphens generally make multiple words into one anyway. There's not really 
multiple hyphens the way there's separate quotes and apostrophes.

Generally, but not always, just as apostrophes aren't always at a contracted 
word boundary. There is only one hyphen because no language (AFAIK) claims it 
as part of its alphabet. 

But the point was that treating hyphens as parts of words is not generally a 
wrong thing. There is one generally consistent rule for hyphens. When 
apostrophes and quotes are conflated, there is no one generally acceptable rule.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread QSJN 4 UKR
The conflict is between linguists and programmers. In plain text, the
apostrophe is a punctuation mark used instead of letters (one or more,
omitted), or as a separator to avoid joining letters into a ligature or
syllable, between parts of a compound word as well as inside a simple
word, or, finally, as a quotation mark. Yes, it is ambiguous!
It is. It just is! Linguists say: It is. We see that. We know that.
And programmers say: That's wrong! We can't understand that. Just are
you so stupid if you can't!
The modifier letter apostrophe is a letter that is used as itself and means
itself (an ejective sound, e.g.) only. Don't use it for anything else. It
just makes more confusion.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread William_J_G Overington
Markus Scherer wrote:
 How are normal users supposed to find both U+2019 and U+02BC on their 
 keyboards, and how are they supposed to deal with incorrect usage?
Would it be possible to have wordprocessing software where one uses CONTROL 
APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and 
could there be a show in colour mode where U+2019 is displayed in cyan and 
U+02BC is displayed in red, while everything else is displayed in black?
That is, CONTROL U+0027 and CONTROL SHIFT U+0027 respectively.
If people want this facility, maybe it could be published in a Unicode 
Technical Report so that standardization and interoperability could be achieved.
William Overington
5 June 2015


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread David Starner
On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis l...@mailcom.com wrote:

 I agree that conflating apostrophes and quotes is a source of
 problems, however, existence of the MODIFIER LETTER [same glyph as
 used for English contractions] in Unicode is a coincidence which
 should not have an effect on usage of apostrophes in English.


Coincidence or not, the Unicode Consortium is not going to allocate a new
code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE
exists. Any change is pretty unlikely, but changing to an existing
character is vastly more likely than creating a new one.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread David Starner
On Fri, Jun 5, 2015 at 2:43 AM QSJN 4 UKR qsjn4...@gmail.com wrote:

 The conflict is between linguists and programmers.


No, it's not.


 Yes it is ambiguous!
 It is. It just is! Linguists say It is. We see that. We know that.


Now you programmers find some way to deal with that so you can produce
useful corpora for linguistic work. That is what this is all about:
producing good linguistic interpretations of plain text for, among others,
linguists whose supply of scanned text has exceeded their ability to
hand-process it.


 Modifier letter apostrophe is a letter that used as itself and means
 itself (ejective sound e.g.) only. Don't use it else. It just make
 more confusion.


If you don't know what language a text is in, you can't tell what sounds
letters make. Adding this character to English's repertoire won't change
that.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Kalvesmaki, Joel
I don’t have a particular position staked out. But to this discussion should be 
added the very interesting work done by Zwicky and Pullum arguing that the 
apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC 
would satisfy that position. See:

Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: 
English N’T." Language 59, no. 3 (1983): 502–513.

It’s nicely summarized and discussed here:
http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435



Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread William_J_G Overington
Markus Scherer wrote:
 How are normal users supposed to find both U+2019 and U+02BC on their 
 keyboards, and how are they supposed to deal with incorrect usage?
I replied:
 Would it be possible to have wordprocessing software where one uses CONTROL 
 APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and 
 could there be a show in colour mode where U+2019 is displayed in cyan and 
 U+02BC is displayed in red, while everything else is displayed in black?
I am wondering whether some existing software packages might be able to be used 
for the character inputting part using customized keyboard short cuts.
https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts
I realize that the cyan and red colours cannot be done at present, yet I 
have now thought of an interim alternative: testing what is in the text by 
using a special version of an open source font in which the two characters 
have glyphs that are clearly distinct from one another.
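
Pending such a font or colour mode, the same check can be approximated in software. The following is a minimal sketch in Python, assuming plain-text input; the bracketed tags stand in for the cyan and red colouring, and the names MARKERS, mark_apostrophes and count_apostrophes are hypothetical, not part of any existing package.

```python
from collections import Counter

# Hypothetical visible tags standing in for the cyan/red colouring.
MARKERS = {
    "\u2019": "[U+2019]",  # RIGHT SINGLE QUOTATION MARK
    "\u02BC": "[U+02BC]",  # MODIFIER LETTER APOSTROPHE
}


def mark_apostrophes(text):
    """Append a visible tag after each U+2019 or U+02BC in text."""
    return "".join(ch + MARKERS.get(ch, "") for ch in text)


def count_apostrophes(text):
    """Count how often each of the two look-alike characters occurs."""
    counts = Counter(ch for ch in text if ch in MARKERS)
    return {f"U+{ord(ch):04X}": n for ch, n in counts.items()}


sample = "don\u2019t can\u02BCt won\u2019t"
print(mark_apostrophes(sample))
print(count_apostrophes(sample))
```

This only reports which character is present; it does not decide which one is correct in a given word.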
William Overington
5 June 2015


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread John D. Burger

 On Jun 4, 2015, at 17:34 , Markus Scherer markus@gmail.com wrote:
 
 Looks all wrong to me.
 
 don’t is a contraction of two words, it is not one word.

Yes it is. Is keyboard two words? How about newspaper?

If don't is two words, please tell me what two words make up won't? (Hint, 
neither of them is will.)

Linguistically, don't and friends pass all the diagnostics that indicate 
they're single words.

- John Burger

 English is taught as that squiggle being punctuation, not a letter. (Unlike, 
 say, the Hawaiʻian ʻOkina.)
 
 You can't use simple regular expressions to find word boundaries. That's why 
 we have UAX #29.
 
 Confusion between apostrophe and quoting -- blame the scribe who came up with 
 the ambiguous use, not the people who gave it a number.
 
 If anything, Unicode might have made a mistake in encoding two of these that 
 look identical. How are normal users supposed to find both U+2019 and U+02BC 
 on their keyboards, and how are they supposed to deal with incorrect usage?
 
 markus




Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Doug Ewell
QSJN 4 UKR qsjn4ukr at gmail dot com wrote:

 And programmers say That's wrong! We can't understand that. Just are
 you so stupid if you can't!

You know, we really aren't all like that. Some of us actually try to
meet user needs.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Eric Muller

On 6/5/2015 10:29 AM, John D. Burger wrote:

Linguistically, don't and friends pass all the diagnostics that indicate 
they're single words.


If I am not mistaken, the French pomme de terre also passes the 
diagnostics. So we need a new space character.


Eric.



Re: Another take on the English apostrophe in Unicode

2015-06-04 Thread David Starner
On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com wrote:

 don’t is a contraction of two words, it is not one word.


But as he points out, it's not a contraction of don and t; it is, at best,
a contraction of do and n't. It's eliding, not punctuating. In the
comments, he also brings up the examples of Don’t you mind? being okay
but not *Do not you mind?, and fo’c’sle.

 You can't use simple regular expressions to find word boundaries.

Who uses _simple_ regular expressions? You can't use any code to reliably
find word boundaries in English, and that's a problem.


Re: Another take on the English apostrophe in Unicode

2015-06-04 Thread David Starner
Hyphens generally make multiple words into one anyway. There's not really
multiple hyphens the way there's separate quotes and apostrophes.

On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis l...@mailcom.com wrote:

 Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for
 example, the word ack-ack isn't decomposable into words, or even morphemes,
 ack and ack.

 Leo

 On Thu, Jun 4, 2015 at 6:31 PM, David Starner prosfil...@gmail.com
 wrote:

 On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com
 wrote:

 don’t is a contraction of two words, it is not one word.


 But as he points out, it's not a contraction of don and t; it is, at
 best, a contraction of do and n't. It's eliding, not punctuating. In the
 comments, he also brings up the examples of Don’t you mind? being okay
 but not *Do not you mind?, and fo’c’sle.

  You can't use simple regular expressions to find word boundaries.

 Who uses _simple_ regular expressions? You can't use any code to reliably
 find word boundaries in English, and that's a problem.





Re: Another take on the English apostrophe in Unicode

2015-06-04 Thread Leo Broukhis
Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for
example, the word ack-ack isn't decomposable into words, or even morphemes,
ack and ack.

Leo

On Thu, Jun 4, 2015 at 6:31 PM, David Starner prosfil...@gmail.com wrote:

 On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com
 wrote:

 don’t is a contraction of two words, it is not one word.


 But as he points out, it's not a contraction of don and t; it is, at best,
 a contraction of do and n't. It's eliding, not punctuating. In the
 comments, he also brings up the examples of Don’t you mind? being okay
 but not *Do not you mind?, and fo’c’sle.

  You can't use simple regular expressions to find word boundaries.

 Who uses _simple_ regular expressions? You can't use any code to reliably
 find word boundaries in English, and that's a problem.



Re: Another take on the English apostrophe in Unicode

2015-06-04 Thread Markus Scherer
Looks all wrong to me.

don’t is a contraction of two words, it is not one word.

English is taught as that squiggle being punctuation, not a letter.
(Unlike, say, the Hawaiʻian ʻOkina; see
http://en.wikipedia.org/wiki/%CA%BBOkina .)

You can't use simple regular expressions to find word boundaries. That's
why we have UAX #29.
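
To illustrate the point about simple regular expressions, here is a minimal sketch using Python's standard re module. Its \w class is not an implementation of UAX #29; it is used here only as an example of a naive word pattern, and the helper name naive_words is an assumption made for this illustration.

```python
import re


def naive_words(text):
    """Split text into runs of \\w characters (not UAX #29 segmentation)."""
    return re.findall(r"\w+", text)


with_quote_mark = "don\u2019t"  # U+2019 RIGHT SINGLE QUOTATION MARK
with_modifier = "don\u02BCt"    # U+02BC MODIFIER LETTER APOSTROPHE

# U+2019 is punctuation (Pf), so the naive pattern splits the word in two.
print(naive_words(with_quote_mark))
# U+02BC is a letter (Lm), so the naive pattern keeps the word whole.
print(naive_words(with_modifier))
# A hyphenated form is split as well, whichever hyphen character is used.
print(naive_words("ack-ack"))
```

The different results for the two apostrophe characters follow from their General Category alone; a UAX #29 implementation applies additional Word_Break rules that this sketch does not model.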

Confusion between apostrophe and quoting -- blame the scribe who came up
with the ambiguous use, not the people who gave it a number.

If anything, Unicode might have made a mistake in encoding two of these
that look identical. How are normal users supposed to find both U+2019 and
U+02BC on their keyboards, and how are they supposed to deal with incorrect
usage?

markus


Another take on the English apostrophe in Unicode

2015-06-04 Thread Frédéric Grosshans
An interesting argument for U+02BC MODIFIER LETTER APOSTROPHE as English
apostrophe :

https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/

Frédéric