Re: Ancient Greek apostrophe marking elision

2019-02-05 Thread James Tauber via Unicode
On Tue, Feb 5, 2019 at 12:23 AM James Kass via Unicode 
wrote:

> Text a man has JOINED together, let not algorithm put asunder.
>

I was hoping so much that ὃ οὖν ὁ θεὸς συνέζευξεν ἄνθρωπος μὴ χωριζέτω
would have an apostrophe but alas no.


Re: Ancient Greek apostrophe marking elision

2019-02-04 Thread James Kass via Unicode



On 2019-01-28 8:58 PM, Richard Wordingham wrote:
> On Mon, 28 Jan 2019 03:48:52 +
> James Kass via Unicode  wrote:
>
>> It’s been said that the text segmentation rules seem over-complicated
>> and are probably non-trivial to implement properly.  I tried your
>> suggestion of WORD JOINER U+2060 after tau ( γένοιτ⁠’ ἄν ), but it
>> only added yet another word break in LibreOffice.
>
> I said we *don't* have a control that joins words.  The text of TUS
> used to say we had one in U+2060, but that was removed in 2015.  I
> pleaded for the retention of this functionality in document
> L2/2015/15-192, but my request was refused.  I pointed out in ICU
> ticket #11766 that ICU's Thai word breaker retained this facility. ...

Sorry for sounding obtuse there.  It was your *post* which suggested the 
use of WORD JOINER.  You did clearly assert that it would not work.  So, 
human nature, I /had/ to try it and see.


It. did. not. work.  (No surprise.)  But it /should/ have worked. It’s a 
JOINER, for goodness sake!


When the author/editor puts any kind of JOINER into a text string, 
what’s the intent?  What’s the point of having a JOINER that doesn’t?


Recently I put a ZWJ between the “c” and the “t” in the word 
“Respec‍tfully” as an experiment.  Spellchecker flagged both “respec” 
and “tfully” as being misspelt, which they probably are.  A ZWNJ would 
have been used if there had been any desire for the string to be *split* 
there, e.g., to forbid formation of a discretionary ligature.  Instead 
the ZWJ was inserted, signalling authorial intent that a ‘more joined’ 
form of the “c-t” substring was requested.


Text a man has JOINED together, let not algorithm put asunder.



Re: Ancient Greek apostrophe marking elision

2019-01-29 Thread Richard Wordingham via Unicode
On Mon, 28 Jan 2019 20:55:39 -0500
"Mark E. Shoulson via Unicode"  wrote:

> On 1/28/19 2:31 AM, Mark Davis ☕️ via Unicode wrote:
> >
> > But the question is how important those are in daily life. I'm not 
> > sure why the double-click selection behavior is so much more of a 
> > problem for Ancient Greek users than it is for the somewhat larger 
> > community of English users. Word selection is not normally as 
> > important an operation as line break, which does work as expected.  
> 
> This is a good point.  Bottom line is that word-selection, at least,
> is not going to be _exactly_ right.  Oh, and for another example,
> note that Esperanto also regularly (in poetry, anyway) uses a
> word-final apostrophe (of some kind) to indicate elision of the final
> -o of a nominative singular noun, or the -a of the article "la".
> What shall we say to Esperantists who can't correctly select the third word
> in «al la mond’ eterne militanta / Ĝi promesas sanktan harmonion»?  I
> guess "Suck it up and deal with it."  And that may indeed be the
> answer.

Who's going to punish them for using U+02BC?

I found some documentation of an Ancient Greek spell-checker for
OpenOffice.  It listed a problem with the apostrophe as one of its
shortcomings.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-29 Thread Richard Wordingham via Unicode
On Mon, 28 Jan 2019 21:10:19 -0500
"Mark E. Shoulson via Unicode"  wrote:

> On 1/28/19 3:58 PM, Richard Wordingham via Unicode wrote:
> > Interestingly, bringing this word breaker into line with TUS in the
> > UK may well be in breach of the Equality Act 2010.
> >
> > Richard.  
> 
> OK, I've got to ask: how would that be?  How would this impinge on 
> anyone's equality on the basis of "age, disability, gender
> reassignment, marriage and civil partnership, pregnancy and
> maternity, race, religion or belief, sex, and sexual orientation"?
> (quote from WP)

The most relevant clauses are 9(1), 9(4), 19(2), 29(5) and 29(7).

The change would restrict Thais' access to the provision of a service.
The service provided is to allow one to use a persistent, correctable
spell-checking system for one's native language.  Firefox and
LibreOffice provide this service.  Of course, one may have to supply
the spell-checking databases oneself.  Withdrawing this service for
some ethnic groups would be a breach of the law.

By persistent, I mean that corrections to the spell-checking remain
when the text is revisited.  For English plain-text, the easy
correction is to remove false positives by adding the word to
'personal dictionaries'.   The difficult correction, not always
possible, is to remove the word from the spell-checker's word list.

For scriptio continua scripts, line_break=complex_context in UCD terms,
there is the additional problem that word-breaking is not infrequently
wrong, even for Thai in Thai script.  (Recent loanwords into Thai can
be a nightmare.  So is Pali in Thai script, though Pali spell-checking
has its own issues.)  Line-breaking can be corrected with WJ and ZWSP.
At present, word-breaking can be corrected by inserting these
characters, and then spelling can be negotiated - the visible
characters are non-negotiable. The changes in the text will persist in
plain text. If WJ ceases to be treated as joining words, then the
service of a persistent, *correctable* spell-checking system is lost.

Now, one defence to the denial of the service would be that it would be
unreasonably difficult to allow users to solve the problem of
word-breaks in the wrong place.  However, if one is already providing
that service, that defence cannot be applied to subsequently denying
the service.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread James Tauber via Unicode
On Mon, Jan 28, 2019 at 10:58 PM James Kass via Unicode 
wrote:

>
> On 2019-01-29 1:55 AM, Mark E. Shoulson via Unicode wrote:
> > I guess "Suck it up and deal with it."  And that may indeed be the
> answer.
>
> It would certainly make for shorter and simpler FAQ pages, anyway.
>

Except people will just respond with "okay, I'll use U+02BC instead" which
is what started all this :-)

James
-- 
*James Tauber*
Eldarion | jktauber.com (Greek Linguistics) | Modelling Music | Digital Tolkien


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread James Kass via Unicode



On 2019-01-29 1:55 AM, Mark E. Shoulson via Unicode wrote:

> I guess "Suck it up and deal with it."  And that may indeed be the answer.


It would certainly make for shorter and simpler FAQ pages, anyway.



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark E. Shoulson via Unicode

On 1/28/19 3:58 PM, Richard Wordingham via Unicode wrote:

> Interestingly, bringing this word breaker into line with TUS in the UK
> may well be in breach of the Equality Act 2010.
>
> Richard.


OK, I've got to ask: how would that be?  How would this impinge on 
anyone's equality on the basis of "age, disability, gender reassignment, 
marriage and civil partnership, pregnancy and maternity, race, religion 
or belief, sex, and sexual orientation"? (quote from WP)



~mark



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark E. Shoulson via Unicode

On 1/28/19 2:31 AM, Mark Davis ☕️ via Unicode wrote:


> But the question is how important those are in daily life. I'm not
> sure why the double-click selection behavior is so much more of a
> problem for Ancient Greek users than it is for the somewhat larger
> community of English users. Word selection is not normally as
> important an operation as line break, which does work as expected.


This is a good point.  Bottom line is that word-selection, at least, is 
not going to be _exactly_ right.  Oh, and for another example, note that 
Esperanto also regularly (in poetry, anyway) uses a word-final 
apostrophe (of some kind) to indicate elision of the final -o of a 
nominative singular noun, or the -a of the article "la".  What shall we 
say to Esperantists who can't correctly select the third word in «al la mond’ 
eterne militanta / Ĝi promesas sanktan harmonion»?  I guess "Suck it up 
and deal with it."  And that may indeed be the answer.


~mark


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark E. Shoulson via Unicode

On 1/27/19 4:30 PM, Philippe Verdy via Unicode wrote:
> For Volapük, it looks much more like U+02BE (right half ring modifier
> letter) than like U+02BC (apostrophe "modifier" letter), according to
> the PDF on https://archive.org/details/cu31924027111453/page/n12



No, I don't think it's 02BE (especially since it goes in the other 
direction.  You mean 02BF.  But I don't think it's that either).  Note 
the thickness at the top.  That isn't a half-ring.  It's pretty clearly 
an 02BD on that page, whereas on the page before, it's just as clearly 
an 02BB.  Or I guess another lesson to be learned is they weren't 
terribly picky.  Which I guess is good, because I don't want to have to 
fret about "gee, we need a boldface 02BB for capitalized Volapük..."  
There's a reason they dropped that letter.


~mark


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Richard Wordingham via Unicode
On Mon, 28 Jan 2019 08:31:40 +0100
Mark Davis ☕️ via Unicode  wrote:
> But the question is how important those are in daily life. I'm not
> sure why the double-click selection behavior is so much more of a
> problem for Ancient Greek users than it is for the somewhat larger
> community of English users. Word selection is not normally as
> important an operation as line break, which does work as expected.

How does ancient Greek spell-checking work?  (Does it work?)  Stripping
a final apostrophe off a standard English word usually yields another
standard English word.  That isn't so with Ancient Greek.  One would
prefer to use a general spell-checking framework, such as provided by
many applications.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Richard Wordingham via Unicode
On Mon, 28 Jan 2019 03:48:52 +
James Kass via Unicode  wrote:

> It’s been said that the text segmentation rules seem over-complicated 
> and are probably non-trivial to implement properly.  I tried your 
> suggestion of WORD JOINER U+2060 after tau ( γένοιτ⁠’ ἄν ), but it
> only added yet another word break in LibreOffice.

I said we *don't* have a control that joins words.  The text of TUS
used to say we had one in U+2060, but that was removed in 2015.  I
pleaded for the retention of this functionality in document
L2/2015/15-192, but my request was refused.  I pointed out in ICU
ticket #11766 that ICU's Thai word breaker retained this facility. An
investigation was planned, but nothing seems to have come of it.
Interestingly, bringing this word breaker into line with TUS in the UK
may well be in breach of the Equality Act 2010.

Richard.




Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Kalvesmaki, Joel via Unicode
Yes, we use U+2019 in either case. We might do something different if we ever 
run across a case where the two different types are justifiably adjacent, but 
that would be a rare case indeed.


jk


From: James Tauber 
Sent: Monday, January 28, 2019 10:27:23 AM
To: Kalvesmaki, Joel
Cc: Mark Davis ☕️; unicode@unicode.org; Richard Wordingham
Subject: Re: Ancient Greek apostrophe marking elision

On Mon, Jan 28, 2019 at 10:21 AM Kalvesmaki, Joel wrote:

In publishing critical editions of ancient/medieval Greek texts, I regularly 
deal with editions that mix elision and closing single-quotation marks.

You have my sympathies :-)

But you use U+2019 for both, right? (just checking as another data point 
against U+02BC)

James

Disclaimer

The information contained in this communication from the sender is 
confidential. It is intended solely for use by the recipient and others 
authorized to receive it. If you are not the recipient, you are hereby notified 
that any disclosure, copying, distribution or taking action in relation of the 
contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been 
automatically archived by Mimecast Ltd, an innovator in Software as a Service 
(SaaS) for business, providing a safer and more useful place for your human 
generated data. Mimecast specializes in security, archiving and compliance. To 
find out more visit the Mimecast website.


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread James Tauber via Unicode
On Mon, Jan 28, 2019 at 10:21 AM Kalvesmaki, Joel 
wrote:

> In publishing critical editions of ancient/medieval Greek texts, I
> regularly deal with editions that mix elision and closing single-quotation
> marks.
>

You have my sympathies :-)

But you use U+2019 for both, right? (just checking as another data point
against U+02BC)

James


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Kalvesmaki, Joel via Unicode
In publishing critical editions of ancient/medieval Greek texts, I regularly 
deal with editions that mix elision and closing single-quotation marks. That 
is, I cannot assume without context that an instance of U+2019 represents 
either an ancient/medieval elision mark or modern editorial punctuation. I 
therefore have no expectations on ideal behavior when double-clicking a string 
with U+2019.


Best wishes,


jk

--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
1703 32nd St. NW
Washington, DC 20007
(202) 339-6435

From: Unicode  on behalf of Mark Davis ☕️ via 
Unicode 
Sent: Monday, January 28, 2019 3:37:54 AM
To: James Tauber
Cc: Richard Wordingham; Unicode Mailing List
Subject: Re: Ancient Greek apostrophe marking elision

It would certainly be possible (and relatively simple) to change ’ into a word 
character for languages that don't use ’ for any other purpose. And if no 
languages using a particular script use ’ for another purpose, then it is 
particularly easy. (If you depend on language tagging, then any software that 
doesn't maintain the language tagging will cause it to revert to the default 
behavior.)

So does modern Greek use ’ in trailing environments where people wouldn't 
expect it to be included in word selection?

Mark


On Mon, Jan 28, 2019 at 8:49 AM James Tauber wrote:
On Mon, Jan 28, 2019 at 2:31 AM Mark Davis ☕️ wrote:
But the question is how important those are in daily life. I'm not sure why the 
double-click selection behavior is so much more of a problem for Ancient Greek 
users than it is for the somewhat larger community of English users. Word 
selection is not normally as important an operation as line break, which does 
work as expected.

Even if they don't _really_ care about word selection, there are digital 
classicists who care even less about U+2019 being the preferred character which 
makes it harder for me to make my case :-)

What triggered the question in my original post about tailoring the Word 
Boundary Rules was the statement in TR29 "A further complication is the use of 
the same character as an apostrophe and as a quotation mark. Therefore leading 
or trailing apostrophes are best excluded from the default definition of a 
word." Because Ancient Greek does not have that ambiguity, there's no need for 
the exclusion in that case. Immediately following that quote is a suggestion 
about tailoring for French and Italian which made me wonder if the "right" 
thing to do is to tailor the WBRs for Ancient Greek.

I know you've said here (and in your original response to me) that you don't 
think it's worth it, but is WBR tailoring (the only|the best|a) technically 
correct way to achieve with U+2019 (in Ancient Greek) what people are abusing 
U+02BC for?

James



Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Tom Gewecke via Unicode


> On Jan 28, 2019, at 1:51 AM, James Tauber via Unicode  
> wrote:
> 
> when I'm entering U+2019 in a Greek context (via option-n)  the keyboard is 
> fully aware I'm in that Greek context. 

Could you explain what you mean by the keyboard being “aware” of the Greek 
context?  


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Michael Everson via Unicode
The hell I do, Julian. 

http://evertype.com/polynesian.html

> On 27 Jan 2019, at 21:00, Julian Bradfield via Unicode  
> wrote:
> 
> You have a very low opinion of Polynesian users. 




Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread James Tauber via Unicode
On Mon, Jan 28, 2019 at 2:54 AM James Kass via Unicode 
wrote:

> at the keyboard driver level.  It's a presumption that Greek classicists
> are already specifying fonts and using dedicated keyboard drivers.
> Based on the description provided by James Tauber, it should be
> relatively simple to make the keyboard insert some kind of joiner before
> U+2019 if it follows a Greek letter. This would not be visible to the
> end-user.
>

As a user of the Greek - Polytonic Input Source on macOS, I can confirm
that when I'm entering U+2019 in a Greek context (via option-n)  the
keyboard is fully aware I'm in that Greek context. The virtual keyboard
could easily map option-n in Greek - Polytonic to a sequence of joiner plus
U+2019.

James


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread James Tauber via Unicode
On Mon, Jan 28, 2019 at 3:38 AM Mark Davis ☕️  wrote:

> So does modern Greek use ’ in trailing environments where people
> wouldn't expect it to be included in word selection?
>
>
Unfortunately, I can't speak for Modern Greek at all.

James


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark Davis ☕️ via Unicode
That is a fair point; if you could get everyone to use keyboards that
inserted such a character, and also get people with current data (e.g.
Thesaurus Linguae Graecae) to process their text, then it would behave as
expected.

Mark


On Mon, Jan 28, 2019 at 8:55 AM James Kass via Unicode 
wrote:

>
> On 2019-01-28 7:31 AM, Mark Davis ☕️ via Unicode wrote:
> > Expecting people to type in hard-to-find invisible characters just to
> > correct double-click is not a realistic expectation.
>
> True, which is why such entries, when consistent, are properly handled
> at the keyboard driver level.  It's a presumption that Greek classicists
> are already specifying fonts and using dedicated keyboard drivers.
> Based on the description provided by James Tauber, it should be
> relatively simple to make the keyboard insert some kind of joiner before
> U+2019 if it follows a Greek letter. This would not be visible to the
> end-user.
>
> This approach would also mean that plain-text, which has no language
> tagging mechanism, would "get it right" cross-platform, cross-applications.
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-28 Thread Mark Davis ☕️ via Unicode
It would certainly be possible (and relatively simple) to change ’ into a
word character for languages that don't use ’ for any other purpose. And if
no languages using a particular script use ’ for another purpose, then it
is particularly easy. (If you depend on language tagging, then any software
that doesn't maintain the language tagging will cause it to revert to the
default behavior.)
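
A rough sketch of the script-based tailoring described above, in Python. It is not an implementation of the TR29 rules: the regular expressions only approximate the default treatment of apostrophes, and the names GREEK_LETTER, GENERIC_WORD, TAILORED and words() are made up for this illustration. The tailoring assumed here is simply "a run of Greek letters may absorb one trailing U+2019".

```python
import re
import unicodedata

# A letter restricted to the Greek and Greek Extended blocks (illustrative range).
GREEK_LETTER = r"(?=[\u0370-\u03ff\u1f00-\u1fff])[^\W\d_]"
# Untailored approximation: runs of letters, with ' or U+2019 kept only
# when it sits between two letters.
GENERIC_WORD = r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*"

DEFAULT = re.compile(GENERIC_WORD)
# Hypothetical tailoring: a run of Greek letters may absorb one trailing U+2019.
TAILORED = re.compile(rf"(?:{GREEK_LETTER})+\u2019?|{GENERIC_WORD}")


def words(text, tailored=False):
    """Return the word-like spans of text under either rule set."""
    # NFC keeps polytonic Greek letters as single precomposed code points.
    text = unicodedata.normalize("NFC", text)
    pattern = TAILORED if tailored else DEFAULT
    return pattern.findall(text)


sample = "\u03b3\u03ad\u03bd\u03bf\u03b9\u03c4\u2019 \u1f04\u03bd"  # γένοιτ’ ἄν
print(words(sample))                 # trailing U+2019 left out of the first word
print(words(sample, tailored=True))  # trailing U+2019 kept with the first word
```

Because the tailoring keys on script rather than language, it survives in plain text with no language tagging. The cost is that a Greek word followed by a closing single quotation mark, as in ‘ἄν’, also absorbs the U+2019, which is the ambiguity raised elsewhere in this thread.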

So does modern Greek use ’ in trailing environments where people
wouldn't expect it to be included in word selection?

Mark


On Mon, Jan 28, 2019 at 8:49 AM James Tauber  wrote:

> On Mon, Jan 28, 2019 at 2:31 AM Mark Davis ☕️  wrote:
>
>> But the question is how important those are in daily life. I'm not sure
>> why the double-click selection behavior is so much more of a problem for
>> Ancient Greek users than it is for the somewhat larger community of English
>> users. Word selection is not normally as important an operation as line
>> break, which does work as expected.
>>
>
> Even if they don't _really_ care about word selection, there are digital
> classicists who care even less about U+2019 being the preferred character
> which makes it harder for me to make my case :-)
>
> What triggered the question in my original post about tailoring the Word
> Boundary Rules was the statement in TR29 "A further complication is the use
> of the same character as an apostrophe and as a quotation mark. Therefore
> leading or trailing apostrophes are best excluded from the default
> definition of a word." Because Ancient Greek does not have that ambiguity,
> there's no need for the exclusion in that case. Immediately following that
> quote is a suggestion about tailoring for French and Italian which made me
> wonder if the "right" thing to do is to tailor the WBRs for Ancient Greek.
>
> I know you've said here (and in your original response to me) that you
> don't think it's worth it, but is WBR tailoring (the only|the best|a)
> technically correct way to achieve with U+2019 (in Ancient Greek) what
> people are abusing U+02BC for?
>
> James
>


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-28 7:31 AM, Mark Davis ☕️ via Unicode wrote:
> Expecting people to type in hard-to-find invisible characters just to
> correct double-click is not a realistic expectation.


True, which is why such entries, when consistent, are properly handled 
at the keyboard driver level.  It's a presumption that Greek classicists 
are already specifying fonts and using dedicated keyboard drivers.  
Based on the description provided by James Tauber, it should be 
relatively simple to make the keyboard insert some kind of joiner before 
U+2019 if it follows a Greek letter. This would not be visible to the 
end-user.


This approach would also mean that plain-text, which has no language 
tagging mechanism, would "get it right" cross-platform, cross-applications.
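
As a minimal sketch of what such a keyboard-level rule might do, here is the same transformation applied to a string in Python. The choice of U+2060 WORD JOINER as the "joiner" is only an assumption for illustration (the thread notes that it does not currently join words in practice), and insert_joiner() and is_greek_letter() are hypothetical helper names.

```python
import unicodedata

RIGHT_SINGLE_QUOTE = "\u2019"
JOINER = "\u2060"  # assumption: WORD JOINER stands in for "some kind of joiner"


def is_greek_letter(ch):
    """True for letters whose Unicode name starts with GREEK."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")


def insert_joiner(text, joiner=JOINER):
    """Insert joiner before each U+2019 that directly follows a Greek letter."""
    # NFC so that an accented Greek letter is one precomposed code point.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch == RIGHT_SINGLE_QUOTE and out and is_greek_letter(out[-1]):
            out.append(joiner)
        out.append(ch)
    return "".join(out)


sample = "\u03b3\u03ad\u03bd\u03bf\u03b9\u03c4\u2019 \u1f04\u03bd"  # γένοιτ’ ἄν
print([f"U+{ord(c):04X}" for c in insert_joiner(sample)])
```

Running the function twice adds nothing the second time, since the character before U+2019 is then the joiner rather than a Greek letter. Text in other scripts, such as "dogs’", is left unchanged.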




Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Tauber via Unicode
On Mon, Jan 28, 2019 at 2:31 AM Mark Davis ☕️  wrote:

> But the question is how important those are in daily life. I'm not sure
> why the double-click selection behavior is so much more of a problem for
> Ancient Greek users than it is for the somewhat larger community of English
> users. Word selection is not normally as important an operation as line
> break, which does work as expected.
>

Even if they don't _really_ care about word selection, there are digital
classicists who care even less about U+2019 being the preferred character
which makes it harder for me to make my case :-)

What triggered the question in my original post about tailoring the Word
Boundary Rules was the statement in TR29 "A further complication is the use
of the same character as an apostrophe and as a quotation mark. Therefore
leading or trailing apostrophes are best excluded from the default
definition of a word." Because Ancient Greek does not have that ambiguity,
there's no need for the exclusion in that case. Immediately following that
quote is a suggestion about tailoring for French and Italian which made me
wonder if the "right" thing to do is to tailor the WBRs for Ancient Greek.
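
To make the default behaviour concrete, here is a small Python approximation. It is not the TR29 algorithm, only a regular expression that mimics one aspect of it: U+0027 and U+2019 count as part of a word only between letters, while U+02BC counts as a letter wherever it appears. The names WORD and words() are invented for this sketch.

```python
import re

# Letters, with ' (U+0027) or ’ (U+2019) allowed only between two letters.
# U+02BC is general category Lm, so the letter class already matches it.
WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def words(text):
    """Return the word-like spans of text under the approximated default rules."""
    return WORD.findall(text)


for apostrophe in ("\u0027", "\u02bc", "\u2019"):
    medial = words("don" + apostrophe + "t")
    final = words("dogs" + apostrophe)
    print(f"U+{ord(apostrophe):04X}", medial, final)
```

Under this approximation "don’t" stays whole with all three characters, but a word-final apostrophe stays attached only when it is U+02BC, which is the behaviour that pushes people toward U+02BC for elision.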

I know you've said here (and in your original response to me) that you
don't think it's worth it, but is WBR tailoring (the only|the best|a)
technically correct way to achieve with U+2019 (in Ancient Greek) what
people are abusing U+02BC for?

James


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Mark Davis ☕️ via Unicode
Note that this is no different than the reasonably common cases in English
such as «the boys’ books».
(you can try various combinations in
http://unicode.org/cldr/utility/list-unicodeset.jsp)

There are certainly cases that are suboptimal in word selection. As another
example, «re-iterate» seems like it should not break around hyphens, but on
the other hand in «an out-of-the-box experience» it seems like they should.
Expecting people to type in hard-to-find invisible characters just to
correct double-click is not a realistic expectation. Short of a dictionary
or ML lookup, there is no good way to distinguish certain tricky cases.
(And that probably needs more context, to distinguish «Ted was lyin’ to her
mother.» from «She said ‘Ted was lyin’ to her mother.».)

But the question is how important those are in daily life. I'm not sure why
the double-click selection behavior is so much more of a problem for
Ancient Greek users than it is for the somewhat larger community of English
users. Word selection is not normally as important an operation as line
break, which does work as expected.

Mark



On Sun, Jan 27, 2019 at 8:13 PM James Tauber via Unicode <
unicode@unicode.org> wrote:

> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:
>
>> Except that Unicode-compliant processes aren't required to follow the
>> scheme of TR29 Unicode Text Segmentation.  However, it is only required
>> to select the whole word because the U+2019 is followed by a letter.
>> TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret
>> as two 'words') and U+02BC (interpret as one word).  The GTK-based
>> email client I'm using has that difference, but also fails with
>> "don't" unless one uses U+02BC.
>>
>> However LibreOffice treats "don't" as a single word for U+0027, U+02BC
>> and U+2019, but "dogs'" as a single word only for U+02BC.  This
>> complies with TR29.  I'm not surprised, as LibreOffice does use or has
>> used ICU.
>>
>
> This comes back to my original question that started this thread. Many
> people creating Ancient Greek digital resources use U+02BC seemingly
> because of incorrect word-breaking with *word-final* U+2019 (which is the
> only time it occurs in Ancient Greek and always marking elision, never as
> the end of a quotation).
>
> I am trying to write guidelines as to why they should use U+2019. I'm
> convinced it's technically the right code point to use but am wanting to
> get my facts straight about how to address the word-breaking issue
> (specifically for word-final U+2019 in Ancient Greek, to be clear). In my
> original post, I asked if a language-specific tailoring of the text
> segmentation algorithm was the solution but no one here has agreed so far.
>
> Here's a concrete example from Smyth's Grammar:
>
> γένοιτ’ ἄν
>
> Double-clicking on the first word should select the U+2019 as well.
> Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the
> Terminal or here in Gmail on Chrome.
>
> To be clear: when I say "should" I mean that that is the expectation
> classicists have and the failure to meet it is why some of them insist on
> using U+02BC.
>
> I'm happy if the answer is "use U+2019 and go get your text segmentation
> implementations fixed"[2] but am looking for confirmation of that.
>
> James
>
> [1] To be honest, I was impressed Pages got it right.
> [2] In the same spirit as "if certain combining character combinations
> don't work, the solution is not to add precomposed characters, it's to
> improve the fonts" or "tonos and oxia are the same and if they look
> different, it's the fault of your font".
>
>
>
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 11:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 27 Jan 2019 19:57:37 +
> James Kass via Unicode  wrote:
>
>> On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
>>> In my original post, I asked if a language-specific tailoring of
>>> the text segmentation algorithm was the solution but no one here
>>> has agreed so far.
>> If there are likely to be many languages requiring exceptions to the
>> segmentation algorithm wrt U+2019, then perhaps it would be better to
>> establish conventions using ZWJ/ZWNJ and adjust the algorithm
>> accordingly so that it would be cross-languages.  (Rather than
>> requiring additional and open ended language-specific tailorings.) (I
>> inserted several combinations of ZWJ/ZWNJ into James Tauber's
>> example, but couldn't improve the segmentation in LibreOffice,
>> although it was possible to make it worse.)
>
> If you look at TR29, you will see that ZWJ should only affect word
> boundaries for emoji.  ZWNJ shall have no effect.  What you want is a
> control that joins words, but we don't have that.
>
> Richard.



(https://unicode.org/reports/tr29/)

It’s been said that the text segmentation rules seem over-complicated 
and are probably non-trivial to implement properly.  I tried your 
suggestion of WORD JOINER U+2060 after tau ( γένοιτ⁠’ ἄν ), but it only 
added yet another word break in LibreOffice.


The problem may stem from the fact that WORD JOINER is supposed to be 
treated as though it were a zero-width no-break space.  IOW it is a 
*space*, and as a space it indicates a word break.  That doesn’t seem right.


Instead of treating WORD JOINER as a SPACE, why not treat it as a WORD 
JOINER?  It could save a lot of problems wrt undesirable string 
segmentation in addition to possibly minimizing future language-specific 
tailoring and easing the burden on implementers.
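
For anyone wanting to check how the characters mentioned in this thread are classified, the short Python sketch below prints their names and general categories from the standard unicodedata module. The list CODE_POINTS and the helper describe() are made up for the illustration. General category is not the same thing as the Word_Break property used by TR29, which the Python standard library does not expose, so this shows only part of the picture.

```python
import unicodedata

# Code points discussed in this thread.
CODE_POINTS = [0x0027, 0x02BC, 0x2019, 0x200C, 0x200D, 0x2060]


def describe(cp):
    """Return (character name, general category) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.category(ch)


for cp in CODE_POINTS:
    name, category = describe(cp)
    print(f"U+{cp:04X} {category} {name}")
```

In this data U+2060 WORD JOINER has general category Cf (format) rather than Zs (space separator), U+02BC is a modifier letter (Lm), and U+2019 is final-quote punctuation (Pf).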




Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Richard Wordingham via Unicode
On Sun, 27 Jan 2019 19:57:37 +
James Kass via Unicode  wrote:

> On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
> > In my original post, I asked if a language-specific tailoring of
> > the text segmentation algorithm was the solution but no one here
> > has agreed so far.  
> If there are likely to be many languages requiring exceptions to the 
> segmentation algorithm wrt U+2019, then perhaps it would be better to 
> establish conventions using ZWJ/ZWNJ and adjust the algorithm 
> accordingly so that it would be cross-languages.  (Rather than
> requiring additional and open ended language-specific tailorings.) (I
> inserted several combinations of ZWJ/ZWNJ into James Tauber's
> example, but couldn't improve the segmentation in LibreOffice,
> although it was possible to make it worse.)

If you look at TR29, you will see that ZWJ should only affect word
boundaries for emoji.  ZWNJ shall have no effect.  What you want is a
control that joins words, but we don't have that.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Richard Wordingham via Unicode
On Sun, 27 Jan 2019 14:09:31 -0500
James Tauber via Unicode  wrote:

> On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <
> unicode@unicode.org> wrote:  

> > However LibreOffice treats "don't" as a single word for U+0027,
> > U+02BC and U+2019, but "dogs'" as a single word only for U+02BC.
> > This complies with TR29.  I'm not surprised, as LibreOffice does
> > use or has used ICU.

> This comes back to my original question that started this thread.

Yes.  I'm driving home the problem for those who somehow fail to
understand your opening post.

> Here's a concrete example from Smyth's Grammar:
> 
> γένοιτ’ ἄν
> 
> Double-clicking on the first word should select the U+2019 as well.
> Interestingly on macOS Mojave it does in Pages[1] but not in Notes,
> the Terminal or here in Gmail on Chrome.
> 
> To be clear: when I say "should" I mean that that is the expectation
> classicists have and the failure to meet it is why some of them
> insist on using U+02BC.
> 
> I'm happy if the answer is "use U+2019 and go get your text
> segmentation implementations fixed"[2] but am looking for
> confirmation of that.

The problem with that approach is that it assumes one can have a
language-sensitive implementation, and that that will suffice.

Smyth’s grammar gives the concrete example, “γένοιτ’ ἄν”.  It contains
the word ‘ἄν’.

Should double-clicking the first Greek word in the paragraph above
select it?  That's not going to work if the paragraph above is
considered to be in English.  And what about double clicking the third
Greek word?  What should that select?  Or is that paragraph
ungrammatical?

To fix the problem with possessive plural "dogs’" with U+2019 one has
to parse enough of the paragraph to distinguish an apostrophe from a
closing single inverted comma. Moreover, it assumes that end-of-word
apostrophes will not be included in a span bounded by single inverted
commas.  I may observe such a rule, but I don't remember being taught
it.

In Unicode 2.0 the apostrophe was U+02BC; it was changed to U+2019 in
Unicode 2.1.  The justification I could find given for the change is in
the Unicore thread (members only) starting at
https://www.unicode.org/mail-arch/unicore-ml/y1997-A/0185.html .  The
justification recorded there was merely that:

1) Windows and Mac Latin character sets had equivalents of U+0027, to
which the 'letter apostrophe' was mapped, and U+2019, which was used
for single quotes.

2) The 'punctuation apostrophe' was being mapped to the U+2019 by the
'smart quote' apparatus.

3) For consistency, the 'punctuation apostrophe' should therefore be
encoded by U+2019 instead of U+02BC.

This argument didn't persuade everyone even then, and it feels even
weaker now.

Perhaps I just have the problem that I don't see a sharp difference
between the letter apostrophe and the punctuation apostrophe. For
example, when the pronunciation of English "letter" with a glottal stop
as the intervocalic consonant is represented in writing as something
like "le'er", is it a letter apostrophe because it's a glottal stop, or
a punctuation apostrophe because the 'tt' is dropped?

The issue arises in the orthography of Finnish.  The genitive singular
of _keko_ 'a pile' is _keon_ - the 'k' is 'dropped' because of
consonant gradation.  However, regularly, the genitive singular of
_raaka_ 'raw' is _raa'an_, where the U+0027 I wrote represent an
apostrophe and is pronounced as a glottal stop.  Is this a letter
apostrophe or a punctuation apostrophe?  The 'k' has been dropped by
the same rule, but because of the vowel pattern it is replaced by a
glottal stop and written with an apostrophe.  English Wiktionary
chooses U+2019: the Finnish Wiktionary ducks the issue and uses U+0027.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Philippe Verdy via Unicode
For Volapük, it looks much more like U+02BE (right half ring modifier
letter) than like U+02BC (apostrophe "modifier" letter), according to
the PDF at
https://archive.org/details/cu31924027111453/page/n12

The half ring makes a clear distinction from the regular apostrophe (for
elisions) or quotation marks. In this context it is really used as a
modifier after another consonant for borrowing words *phonetically* from
other languages, notably after 'c' and 'l'. Then U+02BD (left half ring
"modifier" letter) is a regular letter (for transliterating the aspirated
'h' from English). But I'm curious about the diacritic used above 'h' in
item (5) ("ta") of that page to transliterate the English soft "th". But
this was describing the "Labas" orthography.

In the next chapter ("Noms Tonabas"), another convention is used for the
apostrophe-like letters, and U+02BE (right half ring modifier letter) is
used instead of U+02BD for the aspirated 'h' (see paragraph 18), but it is
said to use the "Greek mark" (not sure if the author meant the coronis
U+1FBD or the smooth breathing U+1FBF).

So it looks like these were various early adaptations of the basic Volapük
orthography to borrow foreign names (notably proper names for people,
trademarks, toponyms and other place names), and these were part of several
competing proposals. I'm curious to know whether there was finally a wide
enough consensus to standardize these.

So it seems that for Volapük the apostrophe-like letters are not
formally assigned: authors will use whichever they want when they
transliterate foreign words, or will simply avoid transliterating them
completely if they exist natively in a Latin form. (I bet English is not
transliterated at all, and French or German accents are preserved as is if
they are already part of the basic alphabet; the only standard diacritic
is then the "diaeresis", as used in the German umlaut. Volapük does not
need any true diaeresis to avoid the formation of diphthongs and digrams;
all its orthography uses a single base letter as a foundation principle.)

If so, the 1st convention using the apostrophe-like modifier to create
digrams is probably not favored and the Tonabas convention is probably more
convenient and more compliant with the principles. I don't think they will
ever directly use the Greek signs or letters (like the one used for
transliterating the English 'ng') and would now prefer using the Latin Eng
letter.

The right half ring, being rarely supported, is now most probably
represented using U+02BC (for both letter cases, ignoring the bolder style
for the capital variant), which uses a curved comma shape (with a filled
bowl at top). If there is a case distinction, the same glyph would be used
but at different heights instead of using bold distinctions, or the
distinction would be made using the alternate forms of the comma (probably
the wedge for lowercase, and the bowl with curl for capitals).

Note: Are the different shapes of the comma (and similar apostrophe-like
letters, or even the semicolon) distinguished with encoded variation
selectors?


Le dim. 27 janv. 2019 à 18:42, Mark E. Shoulson via Unicode <
unicode@unicode.org> a écrit :

> Well, sure; some languages work better with some fonts.  There's nothing
> wrong with saying that 02BC might look the same as 2019... but it's
> nice, when writing Hawaiian (or Klingon for that matter) to use a bigger
> glyph. That's why they pay typesetters the big bucks (you wish): to make
> things look good on the page.
>
> I recall in early Volapük, ʼ was a letter (presumably 02BC), with value
> /h/.  And the "capital" ʼ was the same, except bolder: see
> https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the
> left-hand page).
>
> ~mark
>
> On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
> > On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
> > the 02BC’s need to be bigger or the text can’t be read easily. In our
> > work we found that a vertical height of 140% bigger than the quotation
> > mark improved legibility hugely. Fine typography asks for some other
> > alterations to the glyph, but those are cosmetic.
> >> If the recommended glyph for 02BC were to be changed, it would in no
> case impact adversely on scientific linguistics texts. It would just make
> the mark a bit bigger. But for practical use in Polynesian languages where
> the character has to be found alongside the quotation marks, a glyph
> distinction must be made between this and punctuation.
> >
> > It somehow seems to me that an evolution of the glyph shape of 02BC in
> > a direction of increased distinction from U+2019 is something that
> > Unicode has indeed made possible by a separate encoding. However, that
> > evolution is a matter of ALL the language communities that use U+02BC
> > as part of their orthography, and definitely NOT something where
> > Unicode can be permitted to take a lead. Unicode does not *recommend*
> > glyphs for letters.
> >
> > However, as a publisher, you ar

Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Julian Bradfield via Unicode
On 2019-01-27, Michael Everson via Unicode  wrote:
> On 27 Jan 2019, at 05:21, Richard Wordingham 
>  wrote:
>> The closing single inverted comma has a different origin to the apostrophe.
> No, it doesn’t, but you are welcome to try to prove your assertion. 

As far as I can tell from the easily accessible literature, the
apostrophe derives from an in-line manuscript mark that is a point
with a tail, while the quotation marks derive from a marginal mark
shaped like an arrowhead (like modern guillemets). What is your story
about them?

>> Is someone going to tell me there is an advantage in treating "men's” as one 
>> word but "dogs'" as two?  As I've said, the argument for encoding English 
>> apostrophes as U+2019 is that even with adequate keyboards, users cannot be 
>> relied upon to distinguish U+02BC and U+2019 - especially with no feedback. 
>> A writing system should choose one and stick with it.  User unreliability 
>> forces a compromise.
>
> Polynesian users need 02BC to be visually distinguished from 2019. 
> European users don’t need the apostrophe to be visually distinguished from 
> 2019. The edge case of “dogs’” doesn’t convince me. In all my years of 
> typesetting I have never once noticed this, much less considered it a problem 
> that needed fixing.

You have a very low opinion of Polynesian users. People (as opposed to
computers) use context to remove ambiguity. Before we had to interact
with pedantic computers, we were rarely confused by the typewriter-induced
confusion of 1 and l and 0 and O (or, indeed, the use of symmetrical
quotation marks).
Now a sensible orthographic choice for a language using comma-like
letters would be to use guillemets for quotation, and while I don't
know (there being precious few modern Polynesian materials online), I
would guess that the languages of French Polynesia do that.
If, like Hawaiian, you're stuck with English-style quotation marks for
historical reasons, an obvious typographic solution is to thin-space
them, French-style. (See previous thread!). That seems visually
preferable to relying on a small difference in size of what is already
a small letter compared to everything else on the page.


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Tom Gewecke via Unicode


> On Jan 27, 2019, at 12:09 PM, James Tauber via Unicode  
> wrote:
> 
> γένοιτ’ ἄν
> 
> Double-clicking on the first word should select the U+2019 as well. 
> Interestingly on macOS Mojave it does in Pages[1] but not in Notes

On my ipad/iphone, Word does it correctly but Pages and Notes do not.




Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
> In my original post, I asked if a language-specific tailoring of the 
> text segmentation algorithm was the solution but no one here has 
> agreed so far.

If there are likely to be many languages requiring exceptions to the 
segmentation algorithm wrt U+2019, then perhaps it would be better to 
establish conventions using ZWJ/ZWNJ and adjust the algorithm 
accordingly so that it would be cross-languages.  (Rather than requiring 
additional and open ended language-specific tailorings.) (I inserted 
several combinations of ZWJ/ZWNJ into James Tauber's example, but 
couldn't improve the segmentation in LibreOffice, although it was 
possible to make it worse.)


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Tauber via Unicode
On Sun, Jan 27, 2019 at 1:22 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> Except that Unicode-compliant processes aren't required to follow the
> scheme of TR29 Unicode Text Segmentation.  However, it is only required
> to select the whole word because the U+2019 is followed by a letter.
> TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret
> as two 'words') and U+02BC (interpret as one word).  The GTK-based
> email client I'm using has that difference, but also fails with
> "don't" unless one uses U+02BC.
>
> However LibreOffice treats "don't" as a single word for U+0027, U+02BC
> and U+2019, but "dogs'" as a single word only for U+02BC.  This
> complies with TR29.  I'm not surprised, as LibreOffice does use or has
> used ICU.
>

This comes back to my original question that started this thread. Many
people creating Ancient Greek digital resources use U+02BC seemingly
because of incorrect word-breaking with *word-final* U+2019 (which is the
only time it occurs in Ancient Greek and always marking elision, never as
the end of a quotation).

I am trying to write guidelines as to why they should use U+2019. I'm
convinced it's technically the right code point to use but am wanting to
get my facts straight about how to address the word-breaking issue
(specifically for word-final U+2019 in Ancient Greek, to be clear). In my
original post, I asked if a language-specific tailoring of the text
segmentation algorithm was the solution but no one here has agreed so far.
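
For resources that already use U+02BC for elision, the recommendation
above could be applied mechanically.  What follows is only a minimal
sketch under stated assumptions: the names is_greek_letter and
normalize_elision are hypothetical, the input is assumed to be in NFC,
and the rule (U+02BC directly after a Greek letter and not followed by
a letter) is an illustration of "word-final", not an established
convention.

```python
import unicodedata


def is_greek_letter(ch):
    """True for characters that are letters and whose name starts with GREEK."""
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith("GREEK"))


def normalize_elision(text):
    """Replace word-final U+02BC after a Greek letter with U+2019.

    Assumption: text is NFC, so the character before the apostrophe is a
    precomposed letter rather than a combining mark.
    """
    out = []
    for i, ch in enumerate(text):
        prev = text[i - 1] if i > 0 else ""
        nxt = text[i + 1] if i + 1 < len(text) else ""
        word_final = not (nxt and unicodedata.category(nxt).startswith("L"))
        if ch == "\u02bc" and prev and is_greek_letter(prev) and word_final:
            out.append("\u2019")
        else:
            out.append(ch)
    return "".join(out)


# Smyth's example typed with U+02BC; the escapes spell the two Greek words.
sample = "\u03b3\u03ad\u03bd\u03bf\u03b9\u03c4\u02bc \u1f04\u03bd"
print(normalize_elision(sample))
```

Non-Greek text is left alone, so a U+02BC used as a letter in another
orthography in the same file is not touched by this sketch.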

Here's a concrete example from Smyth's Grammar:

γένοιτ’ ἄν

Double-clicking on the first word should select the U+2019 as well.
Interestingly on macOS Mojave it does in Pages[1] but not in Notes, the
Terminal or here in Gmail on Chrome.

To be clear: when I say "should" I mean that that is the expectation
classicists have and the failure to meet it is why some of them insist on
using U+02BC.

I'm happy if the answer is "use U+2019 and go get your text segmentation
implementations fixed"[2] but am looking for confirmation of that.

James

[1] To be honest, I was impressed Pages got it right.
[2] In the same spirit as "if certain combining character combinations
don't work, the solution is not to add precomposed characters, it's to
improve the fonts" or "tonos and oxia are the same and if they look
different, it's the fault of your font".


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Richard Wordingham via Unicode
On Sun, 27 Jan 2019 16:11:12 +
Michael Everson via Unicode  wrote:

> Yes, yes. It doesn’t matter. The discussion applies to both the two
> quotation marks and the two modifier letters.

Actually, there is a difference.  As the ʻokina doesnʹt occur at the
end of a word in Hawaiian, one only strictly needs a contrast at the
beginning of a word - unless Hawaiian makes significant use of the
apostrophe for abbreviation.  Unfortunately, U+02BB is worse than
U+02BC from this perspective.  

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Richard Wordingham via Unicode
On Sun, 27 Jan 2019 12:38:39 -0500
"Mark E. Shoulson via Unicode"  wrote:

> On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:
> > It is a letter. In “can’t” the apostrophe isn’t a letter. It’s a
> > mark of elision.  I can double-click on the three words in this
> > paragraph which have the apostrophe in them, and they are all
> > whole-word selected.  
> 
> That doesn't work when I try it: I double-click on the "a" in "can’t" 
> and get only the "can" selected.
> 
> This does not necessarily prove anything; my software (Thunderbird)
> is arguably doing it wrong.

Except that Unicode-compliant processes aren't required to follow the
scheme of TR29 Unicode Text Segmentation.  However, it is only required
to select the whole word because the U+2019 is followed by a letter.
TR29 prescribes different behaviour for "dogs'" with U+2019 (interpret
as two 'words') and U+02BC (interpret as one word).  The GTK-based
email client I'm using has that difference, but also fails with
"don't" unless one uses U+02BC.

However LibreOffice treats "don't" as a single word for U+0027, U+02BC
and U+2019, but "dogs'" as a single word only for U+02BC.  This
complies with TR29.  I'm not surprised, as LibreOffice does use or has
used ICU.
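
The difference described above can be illustrated with a rough sketch.
This is not an implementation of the Text Segmentation rules: it only
mimics the "letter, quote, letter" joining rule, treats any character
in a letter category as a word letter, and lists just the two quote
characters discussed here.  The names split_words and MID_QUOTES are
made up for the example; a real implementation would use the full
Word_Break property data, for instance through ICU.

```python
import unicodedata

# Assumption: only the two apostrophe-like punctuation marks from this
# thread are listed, not the full set used by the real algorithm.
MID_QUOTES = {"\u0027", "\u2019"}  # APOSTROPHE, RIGHT SINGLE QUOTATION MARK


def is_letter(ch):
    """Rough stand-in for a word letter: any general category L*."""
    return unicodedata.category(ch).startswith("L")


def split_words(text):
    """Split text into segments; a quote joins only when letters flank it."""
    text = unicodedata.normalize("NFC", text)
    segments = []
    i = 0
    while i < len(text):
        if not is_letter(text[i]):
            segments.append(text[i])  # every non-letter stands alone
            i += 1
            continue
        j = i + 1
        while j < len(text):
            if is_letter(text[j]):
                j += 1
            elif (text[j] in MID_QUOTES and j + 1 < len(text)
                  and is_letter(text[j + 1])):
                j += 2  # letter, quote, letter: keep as one word
            else:
                break
        segments.append(text[i:j])
        i = j
    return segments


print(split_words("don\u2019t"))   # U+2019 between letters
print(split_words("dogs\u2019"))   # word-final U+2019
print(split_words("dogs\u02bc"))   # word-final U+02BC (a modifier letter)
```

Because U+02BC has a letter category it never reaches the quote branch,
which is why it stays attached at the end of a word while U+2019 does
not.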

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Mark E. Shoulson via Unicode

On 1/27/19 11:08 AM, Michael Everson via Unicode wrote:

> It is a letter. In “can’t” the apostrophe isn’t a letter. It’s a mark of 
> elision.  I can double-click on the three words in this paragraph which have 
> the apostrophe in them, and they are all whole-word selected.


That doesn't work when I try it: I double-click on the "a" in "can’t" 
and get only the "can" selected.


This does not necessarily prove anything; my software (Thunderbird) is 
arguably doing it wrong.


~mark


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Mark E. Shoulson via Unicode
Well, sure; some languages work better with some fonts.  There's nothing 
wrong with saying that 02BC might look the same as 2019... but it's 
nice, when writing Hawaiian (or Klingon for that matter) to use a bigger 
glyph. That's why they pay typesetters the big bucks (you wish): to make 
things look good on the page.


I recall in early Volapük, ʼ was a letter (presumably 02BC), with value 
/h/.  And the "capital" ʼ was the same, except bolder: see 
https://archive.org/details/cu31924027111453/page/n11 (entry 4, on the 
left-hand page).


~mark

On 1/27/19 12:23 AM, Asmus Freytag via Unicode wrote:
> On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
>> the 02BC’s need to be bigger or the text can’t be read easily. In our
>> work we found that a vertical height of 140% bigger than the quotation
>> mark improved legibility hugely. Fine typography asks for some other
>> alterations to the glyph, but those are cosmetic.
>>
>> If the recommended glyph for 02BC were to be changed, it would in no
>> case impact adversely on scientific linguistics texts. It would just
>> make the mark a bit bigger. But for practical use in Polynesian
>> languages where the character has to be found alongside the quotation
>> marks, a glyph distinction must be made between this and punctuation.
>
> It somehow seems to me that an evolution of the glyph shape of 02BC in
> a direction of increased distinction from U+2019 is something that
> Unicode has indeed made possible by a separate encoding. However, that
> evolution is a matter of ALL the language communities that use U+02BC
> as part of their orthography, and definitely NOT something where
> Unicode can be permitted to take a lead. Unicode does not *recommend*
> glyphs for letters.
>
> However, as a publisher, you are of course free to experiment and to
> see whether your style becomes popular.
>
> There is a concern though, that your choice may appeal only to some
> languages that use this code point and not become universally accepted.
>
> A./






Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Michael Everson via Unicode
Yes, yes. It doesn’t matter. The discussion applies to both the two quotation 
marks and the two modifier letters.

> On 27 Jan 2019, at 15:08, Tom Gewecke via Unicode  wrote:
> 
> 
>> On Jan 26, 2019, at 11:08 PM, Richard Wordingham via Unicode 
>>  wrote:
>> 
>> It may be a matter of literacy in Hawaiian.  If the test readership
>> doesn't use ʼokina, 
> 
> I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead of 
> U+02BC).
> 




Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Michael Everson via Unicode
On 27 Jan 2019, at 05:21, Richard Wordingham  
wrote:

>>> I’ll be publishing a translation of Alice into Ancient Greek in due
>>> course. I will absolutely only use U+2019 for the apostrophe. It
>>> would be wrong for lots of reasons to use U+02BC for this.  
>>> 
>>> Please list them.  
>> 
>> The Greek use is of an apostrophe. Often a mark of elision (as here),
>> that’s what 2019 is for.
>> 
>> 02BC is a letter. Usually a glottal stop. 
> 
> So it would seem that the 'lots of reasons' is just that it goes against the 
> *recommendation* of TUS.

I have no idea what TUS says about this. I did not look it up. I know a lot 
about characters, though. 

> Incidentally, I believe the principal use of U+2019 RIGHT SINGLE QUOTATION 
> MARK is as a quotation mark.

You can believe what you like, but that isn’t likely true. In books which 
prefer “this kind” of quotation marks for primary quotations and ’this kind’ 
for nested quotations, 2019 is primarily used for the apostrophe in words like 
I’m, can’t, isn’t, don’t etc. In books which prefer ’this kind’ for primary 
quotations 2019 the statistics will be different. But 2019 is still the correct 
character for both.

> As you have noted in the text left in below, U+02BC started out as the 
> apostrophe.

Lead-type typesetters used that sort, yes. And that sort was used for both 
apostrophe and single quotation marks. 

> The closing single inverted comma has a different origin to the apostrophe.

No, it doesn’t, but you are welcome to try to prove your assertion. 

> My argument for U+02BC is that this apostrophe is an integral part of the 
> word.

It is a letter. In “can’t” the apostrophe isn’t a letter. It’s a mark of 
elision.  I can double-click on the three words in this paragraph which have 
the apostrophe in them, and they are all whole-word selected. 

> The main constituent of a prototypical word are letters and their attendant 
> marks. Now, the word-breaking algorithm in TR29 allows for various generally 
> overloaded elements to join elements of a word. However, this apostrophe does 
> not mark the boundary of constituents. Accordingly it makes sense to treat it 
> as a letter.

The behaviour of 2019 is not broken. I use it every day. I’ve typeset many many 
books in English and Cornish and Irish, all of which use single quotation marks 
and double quotation marks and lots and lots of apostrophes, and I have no 
trouble with them. 2019 has for decades been treated correctly in software that 
I use. 

> Treating the Greek apostrophe as a letter (U+02BC) gives better word-breaking.

Why do you claim this? I did not read the beginning of this thread and I am not 
going to try to find it. What is the problem you claim to have? In what 
software? On what platform?

> I don't see any downside in treating it like a Polynesian glottal stop.

I do. And to try to replace the apostrophe in English can’t and don’t and all 
is doomed to fail. Doomed. 

Moreover there are good practical reasons to change the glyph for the 
Polynesian letter.

When I typeset Greek, I will use 2019 for the apostrophe. 

> Is someone going to tell me there is an advantage in treating "men's” as one 
> word but "dogs'" as two?  As I've said, the argument for encoding English 
> apostrophes as U+2019 is that even with adequate keyboards, users cannot be 
> relied upon to distinguish U+02BC and U+2019 - especially with no feedback. A 
> writing system should choose one and stick with it.  User unreliability 
> forces a compromise.

Polynesian users need 02BC to be visually distinguished from 2019. European 
users don’t need the apostrophe to be visually distinguished from 2019. The 
edge case of “dogs’” doesn’t convince me. In all my years of typesetting I have 
never once noticed this, much less considered it a problem that needed fixing.

> Now, if text processors were to enable a difference, then the arguments would 
> change.  I for one find it helpful that Microsoft Word is willing to display 
> visible symbols for spaces and tab characters so that I know what white space 
> is composed of.

Most word-processing typesetting programs will do this. Quark and InDesign do. 
Word and LibreOffice and Apple Pages do. 

>> I didn’t follow the beginning of this. Evidently it has something to do with 
>> word selection of d’ + a space + what follows. If that’s so, then there’s no 
>> argument at all for 02BC. It’s a question of the space, and that’s got 
>> nothing to do with the identity of the apostrophe.
> 
> The word selection issue is that except before a letter, the standard 
> word-breaking algorithm says that there is a word boundary between the delta 
> and apostrophe.

Well, that’s the expected behaviour for a character which is polyvalent. If you 
have problems double-clicking “d’ Artagnan” you should probably just write 
“d’Artagnan”. 

> 
>>> Will your coding decision be machine readable for the readership?  
>> 
>> I don’t know what you mean by “readable”.
> 
> Will the d

Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread James Kass via Unicode



On 2019-01-27 3:08 PM, Tom Gewecke via Unicode wrote:
> I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead 
> of U+02BC).

notes for U+02BB
* typographical alternate for 02BD or 02BF
* used in Hawai'ian orthography as 'okina (glottal stop)
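
For anyone who wants to check the characters being compared in this
thread, a small sketch using Python's standard unicodedata module
prints the name and general category of each.  The helper name
describe and the CANDIDATES list are only for this example.

```python
import unicodedata

# The four apostrophe-like characters discussed in this thread.
CANDIDATES = ["\u0027", "\u02bb", "\u02bc", "\u2019"]


def describe(ch):
    """Return (code point, character name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


for ch in CANDIDATES:
    print(*describe(ch))
```

The two modifier letters report category Lm (a letter), while U+0027
and U+2019 report punctuation categories, which is the property
difference behind the different word-selection behaviour.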


Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Tom Gewecke via Unicode


> On Jan 26, 2019, at 11:08 PM, Richard Wordingham via Unicode 
>  wrote:
> 
> It may be a matter of literacy in Hawaiian.  If the test readership
> doesn't use ʼokina, 

I think the Unicode Hawaiian ʻokina is supposed to be U+02BB (instead of 
U+02BC).



Re: Ancient Greek apostrophe marking elision

2019-01-27 Thread Andrew Cunningham via Unicode
On Sunday, 27 January 2019, Asmus Freytag via Unicode 
wrote:

>
> Choice of quotation marks is language-based and for novels, many times
> there are
> additional conventions that may differ by publisher.
>
> Wonder why the publisher is forcing single quotes on them
>

In theory quotation marks are language based but many languages have had
the punctuation and typographic conventions of colonial languages imposed,
even when it isn't the best choice.

And publishers are following established patterns. The publishers that care
about the language do try to distinguish or refine these characters
typographically.

Andrew


-- 
Andrew Cunningham
lang.supp...@gmail.com


Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 10:08 PM, Richard Wordingham via Unicode wrote:
> On Sat, 26 Jan 2019 21:11:36 -0800
> Asmus Freytag via Unicode  wrote:
>
>> On 1/26/2019 5:43 PM, Richard Wordingham via Unicode wrote:
>>> That appears to contradict Michael Everson's remark about a
>>> Polynesian need to distinguish the two visually.
>>
>> Why do you need to distinguish them? To code text correctly (so the
>> invisible properties are what the software expects) or because a
>> human reader needs the disambiguation in order to follow the text?
>>
>> The latter phenomenon is so common throughout many writing systems,
>> that I have difficulties buying it.
>
> It may be a matter of literacy in Hawaiian.  If the test readership
> doesn't use ʼokina, it could be confusing to have to resolve the
> difference between a sentence(?) starting with one from a sentence in
> single quotes. Otherwise, one does wonder why the issue should only
> arise now.

one does.

> One other possibility is that single quote punctuation is being used on
> a readership used to double quote punctuation.  Double quotes would
> avoid the confusion.

Choice of quotation marks is language-based and for novels, many times
there are additional conventions that may differ by publisher.
Wonder why the publisher is forcing single quotes on them?

>> PS: I wasn't talking about what the Polynesians do; different part of
>> the world.
>
> Why should the Polynesians be different?

I am simply stating that my evidence does not come from them. I
have no special insight into what Polynesians do or do not do.

A./

  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Richard Wordingham via Unicode
On Sat, 26 Jan 2019 21:11:36 -0800
Asmus Freytag via Unicode  wrote:

> On 1/26/2019 5:43 PM, Richard Wordingham via Unicode wrote:

>> That appears to contradict Michael Everson's remark about a
>> Polynesian
>> need to distinguish the two visually.

> Why do you need to distinguish them? To code text correctly (so the
> invisible properties are what the software expects) or because a
> human reader needs the disambiguation in order to follow the text?

> The latter phenomenon is so common throughout many writing systems,
> that I have difficulties buying it.

It may be a matter of literacy in Hawaiian.  If the test readership
doesn't use ʼokina, it could be confusing to have to resolve the
difference between a sentence(?) starting with one from a sentence in
single quotes. Otherwise, one does wonder why the issue should only
arise now.

One other possibility is that single quote punctuation is being used on
a readership used to double quote punctuation.  Double quotes would
avoid the confusion.

> PS: I wasn't talking about what the Polynesians do; different part of
> the world.

Why should the Polynesians be different?

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 7:53 PM, Richard Wordingham via Unicode wrote:
> On Sun, 27 Jan 2019 01:55:29 +
> James Kass via Unicode  wrote:
>
>> Richard Wordingham replied to Asmus Freytag,
>>
>> >> To make matters worse, users for languages that "should" use
>> >> U+02BC aren't actually consistent; much data uses U+2019 or
>> >> U+0027. Ordinary users can't tell the difference (and spell
>> >> checkers seem not successful in enforcing the practice).
>> >
>> > That appears to contradict Michael Everson's remark about a
>> > Polynesian need to distinguish the two visually.
>>
>> Does it?
>>
>> U+02BC /should/ be used but ordinary users can't tell the difference
>> because the glyphs in their displays are identical, resulting in much
>> data which uses U+2019 or U+0027.  I don't see any contradiction.
>
> I had assumed that Polynesians would be writing with paper and ink.  It
> depends on what 'tell the difference' means.  In normal parlance it
> means that they are unaware of the difference in the symbols; you are
> assuming that it means that printed material doesn't show the
> difference.
>
> In general, handwritten differences can show up in various ways.  For
> example, one can find a slight, unreliable difference in the relative
> positioning of characters that reflects the difference in the usage of
> characters.
>
> Of course, Asmus's facts have to be unreliable.  It's like someone
> typing U+1142A NEWA LETTER MHA for Sanskrit , which we've been
> assured would never happen.  There must be something wrong with reality.

There usually is :)
Our leaders tell us so.

Anyway, most of us don't use U+2019 where proper unless we happen to use
software that makes the translation from U+0027 for us . . .

When it picks the left single quote by mistake, that's something we can
spot and nudge it. When the difference is invisible people will type the
wrong thing - like typesetting whole books with the wrong Arabic
character because it happens to share the same shape in that position
with another one.

A./

  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 6:25 PM, Michael Everson via Unicode wrote:
> On 27 Jan 2019, at 01:37, Richard Wordingham via Unicode  wrote:
>
>>> I’ll be publishing a translation of Alice into Ancient Greek in due
>>> course. I will absolutely only use U+2019 for the apostrophe. It
>>> would be wrong for lots of reasons to use U+02BC for this.
>>
>> Please list them.
>
> The Greek use is of an apostrophe. Often a mark of elision (as here),
> that’s what 2019 is for.
>
> 02BC is a letter. Usually a glottal stop.
>
> I didn’t follow the beginning of this. Evidently it has something to do
> with word selection of d’ + a space + what follows. If that’s so, then
> there’s no argument at all for 02BC. It’s a question of the space, and
> that’s got nothing to do with the identity of the apostrophe.
>
>> Will your coding decision be machine readable for the readership?
>
> I don’t know what you mean by “readable”.
>
>>> Moreover, implementations of U+02BC need to be revised. In the
>>> context of Polynesian languages, it is impossible to use U+02BC if it
>>> is _identical_ to U+2019. Readers cannot work out what is what. I
>>> will prepare documentation on this in due course.
>>
>> It looks as though you've found a new character - or a revived
>> distinction.
>
> It may not be “revived”. In origin, linguists took the lead-type 2019
> and used it as a consonant letter. Now, in the 21st century, where Harry
> Potter is translated into Hawaiian, and where Harry Potter has glottals
> alongside both single and double quotation marks,

The use of quotation marks is language dependent. There is no cast in
stone requirement to use single quotation marks with languages where it
causes difficulties.

English uses apostrophe and single quotation marks - the former are a
bit more rare compared to when that symbol is used in some languages,
but in principle the same confusion applies and so far hasn't prompted
anyone to follow the lead of the French in choice of quotation
marks . . .

> the 02BC’s need to be bigger or the text can’t be read easily. In our
> work we found that a vertical height of 140% bigger than the quotation
> mark improved legibility hugely. Fine typography asks for some other
> alterations to the glyph, but those are cosmetic.
>
> If the recommended glyph for 02BC were to be changed, it would in no
> case impact adversely on scientific linguistics texts. It would just
> make the mark a bit bigger. But for practical use in Polynesian
> languages where the character has to be found alongside the quotation
> marks, a glyph distinction must be made between this and punctuation.

It somehow seems to me that an evolution of the glyph shape of 02BC in
a direction of increased distinction from U+2019 is something that
Unicode has indeed made possible by a separate encoding. However, that
evolution is a matter of ALL the language communities that use U+02BC
as part of their orthography, and definitely NOT something where
Unicode can be permitted to take a lead. Unicode does not *recommend*
glyphs for letters.

However, as a publisher, you are of course free to experiment and
to see whether your style becomes popular.

There is a concern though, that your choice may appeal only to
  some languages that use this code point and not become universally
  accepted.

A./



  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 5:43 PM, Richard Wordingham via Unicode wrote:

> On Sat, 26 Jan 2019 17:11:49 -0800
> Asmus Freytag via Unicode wrote:
>
>> To make matters worse, users for languages that "should" use U+02BC
>> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
>> users can't tell the difference (and spell checkers seem not
>> successful in enforcing the practice).
>
> That appears to contradict Michael Everson's remark about a Polynesian
> need to distinguish the two visually.
>
> Richard.

Why do you need to distinguish them? To code text correctly (so the
invisible properties are what the software expects) or because a human
reader needs the disambiguation in order to follow the text?

The former is like first coding a different character for a decimal
point from an ordinary period, then deciding to make it look different
so you know you typed the right one. The latter is like saying people
can't handle using the same symbol (dot on the baseline) for two
different functions.

The latter phenomenon is so common throughout many writing systems that
I have difficulties buying it.

A./

PS: I wasn't talking about what the Polynesians do; different part of
the world.
  


  



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Richard Wordingham via Unicode
On Sun, 27 Jan 2019 01:55:29 +
James Kass via Unicode  wrote:

> Richard Wordingham replied to Asmus Freytag,
> 
>  >> To make matters worse, users for languages that "should" use
>  >> U+02BC aren't actually consistent; much data uses U+2019 or
>  >> U+0027. Ordinary users can't tell the difference (and spell
>  >> checkers seem not successful in enforcing the practice).  
>  >
>  > That appears to contradict Michael Everson's remark about a
>  > Polynesian need to distinguish the two visually.  
> 
> Does it?
> 
> U+02BC /should/ be used but ordinary users can't tell the difference 
> because the glyphs in their displays are identical, resulting in much 
> data which uses U+2019 or U+0027.  I don't see any contradiction.

I had assumed that Polynesians would be writing with paper and ink.  It
depends on what 'tell the difference' means.  In normal parlance it
means that they are unaware of the difference in the symbols; you are
assuming that it means that printed material doesn't show the
difference.

In general, handwritten differences can show up in various ways.  For
example, one can find a slight, unreliable difference in the relative
positioning of characters that reflects the difference in the usage of
characters.

Of course, Asmus's facts have to be unreliable.  It's like someone
typing U+1142A NEWA LETTER MHA for Sanskrit , which we've been
assured would never happen.  There must be something wrong with reality.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Michael Everson via Unicode
Fair enough, but I didn’t wait.

> On 27 Jan 2019, at 01:59, James Kass via Unicode  wrote:
> 
> 
> Richard Wordingham responded to Michael Everson,
> 
> >> I’ll be publishing a translation of Alice into Ancient Greek in due
> >> course. I will absolutely only use U+2019 for the apostrophe. It
> >> would be wrong for lots of reasons to use U+02BC for this.
> >
> > Please list them.
> 
> Let's see the list of reasons why U+02BC should be used first.
> 




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Michael Everson via Unicode
On 27 Jan 2019, at 01:37, Richard Wordingham via Unicode  
wrote:
> 
>> I’ll be publishing a translation of Alice into Ancient Greek in due
>> course. I will absolutely only use U+2019 for the apostrophe. It
>> would be wrong for lots of reasons to use U+02BC for this.
> 
> Please list them.

The Greek use is of an apostrophe. Often it marks elision (as here), and
that’s what 2019 is for.

02BC is a letter. Usually a glottal stop. 

I didn’t follow the beginning of this. Evidently it has something to do with 
word selection of d’ + a space + what follows. If that’s so, then there’s no 
argument at all for 02BC. It’s a question of the space, and that’s got nothing 
to do with the identity of the apostrophe.

> Will your coding decision be machine readable for the readership?

I don’t know what you mean by “readable”.

>> Moreover, implementations of U+02BC need to be revised. In the
>> context of Polynesian languages, it is impossible to use U+02BC if it
>> is _identical_ to U+2019. Readers cannot work out what is what. I
>> will prepare documentation on this in due course.
> 
> It looks as though you've found a new character - or a revived
> distinction.

It may not be “revived”. In origin, linguists took the lead-type 2019 and used 
it as a consonant letter. Now, in the 21st century, where Harry Potter is 
translated into Hawaiian, and where Harry Potter has glottals alongside both 
single and double quotation marks, the 02BC’s need to be bigger or the text 
can’t be read easily. In our work we found that a vertical height of 140% 
bigger than the quotation mark improved legibility hugely. Fine typography asks 
for some other alterations to the glyph, but those are cosmetic.

If the recommended glyph for 02BC were to be changed, it would in no case 
impact adversely on scientific linguistics texts. It would just make the mark a 
bit bigger. But for practical use in Polynesian languages where the character 
has to be found alongside the quotation marks, a glyph distinction must be made 
between this and punctuation.

Michael Everson





Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Michael Everson via Unicode
Polynesians are using 0027 as a fallback, and this has to do with education, 
keyboarding, and training.

The typography of the fallback is of no consequence. It’s a fallback.

> On 27 Jan 2019, at 01:43, Richard Wordingham via Unicode 
>  wrote:
> 
> On Sat, 26 Jan 2019 17:11:49 -0800
> Asmus Freytag via Unicode  wrote:
> 
>> To make matters worse, users for languages that "should" use U+02BC
>> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
>> users can't tell the difference (and spell checkers seem not
>> successful in enforcing the practice).
> 
> That appears to contradict Michael Everson's remark about a Polynesian
> need to distinguish the two visually.
> 
> Richard.




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Richard Wordingham responded to Michael Everson,

>> I’ll be publishing a translation of Alice into Ancient Greek in due
>> course. I will absolutely only use U+2019 for the apostrophe. It
>> would be wrong for lots of reasons to use U+02BC for this.
>
> Please list them.

Let's see the list of reasons why U+02BC should be used first.



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Richard Wordingham replied to Asmus Freytag,

>> To make matters worse, users for languages that "should" use U+02BC
>> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
>> users can't tell the difference (and spell checkers seem not
>> successful in enforcing the practice).
>
> That appears to contradict Michael Everson's remark about a Polynesian
> need to distinguish the two visually.

Does it?

U+02BC /should/ be used but ordinary users can't tell the difference 
because the glyphs in their displays are identical, resulting in much 
data which uses U+2019 or U+0027.  I don't see any contradiction.




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Richard Wordingham via Unicode
On Sat, 26 Jan 2019 17:11:49 -0800
Asmus Freytag via Unicode  wrote:

> To make matters worse, users for languages that "should" use U+02BC
> aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
> users can't tell the difference (and spell checkers seem not
> successful in enforcing the practice).

That appears to contradict Michael Everson's remark about a Polynesian
need to distinguish the two visually.

Richard.


Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Richard Wordingham via Unicode
On Sun, 27 Jan 2019 00:32:43 +
Michael Everson via Unicode  wrote:

> I’ll be publishing a translation of Alice into Ancient Greek in due
> course. I will absolutely only use U+2019 for the apostrophe. It
> would be wrong for lots of reasons to use U+02BC for this.

Please list them.

Will your coding decision be machine readable for the readership?

> Moreover, implementations of U+02BC need to be revised. In the
> context of Polynesian languages, it is impossible to use U+02BC if it
> is _identical_ to U+2019. Readers cannot work out what is what. I
> will prepare documentation on this in due course.

It looks as though you've found a new character - or a revived
distinction.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Richard Wordingham via Unicode
On Sat, 26 Jan 2019 15:45:54 +
James Kass via Unicode  wrote:

> Perhaps I'm not understanding, but if the desired behavior is to 
> prohibit both line and word breaks in the example string, then...
> 
> In Notepad, replacing U+0020 with U+00A0 removes the line-break.

I believe the problem is that "δ’ αρχαια" should have non-blank
*words*.  With U+2019, one gets 3.  Line-break suppressing spaces don't
help with word-breaking, because they are not treated as letters.

A clunky solution would be to have a sequence .  However, there is no such
thing as a 'control-joining-words' if one complies with the TUS
injunction in Section 23.3, "The word joiner should be ignored in
contexts other than line breaking".  A robust, trainable spell-checker
will treat this institutionally racist injunction with the contempt it
deserves.

It's interesting that the spellings "'bus" and "'phone" have died.
They would once have hit the word-boundary problems when "bus" and
"phone" were rejected.

Richard.



Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Asmus Freytag via Unicode

  
  
On 1/26/2019 3:02 AM, Mark Davis ☕️ via Unicode wrote:

>> breaking selection for "d'Artagnan" or "can't" into two is overly
>> fussy.
>
> True, and that is not what U+2019 does; it does not break medially.

Not everyone seems to have got the word . . . but that's not Unicode's
fault. But it shows that picking specific character codes from among a
set that are identical except for (invisible) properties could be a
losing game if widely deployed software can't be relied on to honor such
finesse.

A./

PS: btw, the Root Zone of the DNS will not support U+02BC as a "letter".
The "invisible" distinction in property is irrelevant when it comes to
identifiers that are identified visually by users, and further, we don't
really want to encourage people to use it to register words intended to
contain apostrophes. Since we can't have ordinary apostrophes or U+2019,
we can't have U+02BC looking like it might be one of the others.

To make matters worse, users for languages that "should" use U+02BC
aren't actually consistent; much data uses U+2019 or U+0027. Ordinary
users can't tell the difference (and spell checkers seem not successful
in enforcing the practice).
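
Since real data mixes U+0027, U+2019 and U+02BC, software that searches or compares such text may want a comparison key in which the three collapse together. The following is only a minimal sketch of that idea: the table `APOSTROPHE_FOLD` and the function `match_key` are made-up names, and folding to U+02BC is an arbitrary choice for illustration. It is meant for matching only, not for rewriting stored text, because U+2019 may be a genuine closing quotation mark.

```python
# Illustrative only: this mapping and these names are assumptions,
# not anything defined by Unicode or proposed in this thread.
APOSTROPHE_FOLD = {
    0x0027: "\u02BC",  # APOSTROPHE (keyboard fallback)
    0x2019: "\u02BC",  # RIGHT SINGLE QUOTATION MARK
}


def match_key(text):
    """Return a comparison key in which the look-alikes collapse to U+02BC."""
    return text.translate(APOSTROPHE_FOLD)


# Three spellings of the same made-up word, differing only in the mark used.
variants = ["a\u0027a", "a\u2019a", "a\u02BCa"]
keys = {match_key(v) for v in variants}
```

With this key all three variants compare equal, while the stored text keeps whatever code point the author typed.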


  

  


  
  

  

  

  

  

  
> Mark
>
> On Fri, Jan 25, 2019 at 11:07 PM Asmus Freytag via Unicode wrote:
>
>> On 1/25/2019 9:39 AM, James Tauber via Unicode wrote:
>>
>>> Thank you, although the word break does still affect things like
>>> double-clicking to select.
>>>
>>> And people do seem to want to use U+02BC for this reason (and I'm
>>> trying to articulate why that isn't what U+02BC is meant for).
>>
>> For normal editing operations, breaking selection for "d'Artagnan" or
>> "can't" into two is overly fussy.
>>
>> No wonder people get frustrated.
>>
>> A./
>>
>>> James
>>>
>>> On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ☕️ wrote:
>>>
>>>> U+2019 is normally the character used, except where the ’ is
>>>> considered a letter. When it is between letters it doesn't cause a
>>>> word break, but because it is also a right single quote, at the end
>>>> of words there is a break. Thus in a phrase like «tryin’ to go»
>>>> there is a word break after the n, because one can't tell.
>>>>
>>>> So something like "δ’ αρχαια" (picking a phrase at random) would
>>>> have a word break after the delta.
>>>>
>>>> Word break:
>>>> δ’ αρχαια
>>>>
>>>> However, there is no line break between them (which is the more
>>>> important operation in normal usage). Probably not worth tailoring
>>>> the word break.
>>>>
>>>> Line break:
>>>> δ’ αρχαια
   

Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Michael Everson via Unicode
I’ll be publishing a translation of Alice into Ancient Greek in due course. I 
will absolutely only use U+2019 for the apostrophe. It would be wrong for lots 
of reasons to use U+02BC for this.

Moreover, implementations of U+02BC need to be revised. In the context of 
Polynesian languages, it is impossible to use U+02BC if it is _identical_ to 
U+2019. Readers cannot work out what is what. I will prepare documentation on 
this in due course.

> On 26 Jan 2019, at 23:52, James Tauber via Unicode  
> wrote:
> 
> Well, my desire is simply to know whether to tell people doing digital 
> editions of Ancient Greek texts to use U+2019 or U+02BC for the 
> apostrophe marking elision (or at least to describe accurately the 
> trade-offs of each).




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Tauber via Unicode
Well, *my* desire is simply to know whether to tell people doing digital
editions of Ancient Greek texts to use U+2019 or U+02BC for the
apostrophe marking elision (or at least to describe accurately the
trade-offs of each).
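
One of the options on the table is to keep U+2019 and tailor tokenization for Ancient Greek so that the apostrophe stays with its word. A rough sketch of what such a tailoring could look like follows. It is a plain regular expression, not the UAX #29 algorithm, and the names `GREEK_WORD` and `greek_words` are invented for illustration. It assumes U+2019 is never used as a closing quotation mark in the text, and it does not handle an apostrophe at the start of a word.

```python
import re

# A "word" here is a run of letters/digits, optionally continued by
# U+2019 plus more letters (medial apostrophe) or ending in U+2019
# (elision). This is an illustrative tailoring, not the UAX #29 rules.
GREEK_WORD = re.compile(r"\w+(?:\u2019\w*)*")


def greek_words(text):
    """Tokenize text, keeping U+2019 attached to the word it follows."""
    return GREEK_WORD.findall(text)


elided = greek_words("δ\u2019 αρχαια")
medial = greek_words("d\u2019Artagnan")
```

Under these assumptions the elided form comes out as one token with its apostrophe, and a medial apostrophe does not split the word. The trade-off is that the tailoring lives in each piece of software rather than in the text.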



On Sat, Jan 26, 2019 at 10:50 AM James Kass via Unicode 
wrote:

>
> Perhaps I'm not understanding, but if the desired behavior is to
> prohibit both line and word breaks in the example string, then...
>
> In Notepad, replacing U+0020 with U+00A0 removes the line-break.
> U+0020 ( δ’ αρχαια )
> U+00A0 ( δ’ αρχαια )
> U+202F ( δ’ αρχαια )
> It also changes the advancement of the text cursor (Ctrl + arrows),
> suggesting that word/string selection would be as desired.  (U+202F also
> does this and may offer a more pleasing appearance to classicists by
> default.)
>
> Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the
> input method / keyboard driver level where appropriate, so that
> preferred apostrophe U+2019 can be used?
>
>

-- 
*James Tauber*
Eldarion | jktauber.com (Greek Linguistics) | Modelling Music | Digital Tolkien


Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Perhaps I'm not understanding, but if the desired behavior is to 
prohibit both line and word breaks in the example string, then...


In Notepad, replacing U+0020 with U+00A0 removes the line-break.
U+0020 ( δ’ αρχαια )
U+00A0 ( δ’ αρχαια )
U+202F ( δ’ αρχαια )
It also changes the advancement of the text cursor (Ctrl + arrows), 
suggesting that word/string selection would be as desired.  (U+202F also 
does this and may offer a more pleasing appearance to classicists by 
default.)


Wouldn't it be best to handle substitution of U+00A0 for U+0020 at the 
input method / keyboard driver level where appropriate, so that 
preferred apostrophe U+2019 can be used?
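
The substitution could also be done as a clean-up pass over existing text rather than in the keyboard driver. Here is a minimal sketch, assuming that an elision is any letter followed by U+2019, one ordinary space and another letter. The name `glue_elisions` is hypothetical. The pattern would also catch a closing quotation mark followed by a word, so it is only safe where U+2019 is not used as a quote. It changes line breaking only; whether word selection changes depends on the application.

```python
import re

NBSP = "\u00A0"    # NO-BREAK SPACE
NNBSP = "\u202F"   # NARROW NO-BREAK SPACE

# One U+0020 that sits between "<letter>U+2019" and another letter.
_ELISION_SPACE = re.compile(r"(?<=\w\u2019) (?=\w)")


def glue_elisions(text, space=NBSP):
    """Replace the space after an elision apostrophe with a no-break space."""
    return _ELISION_SPACE.sub(space, text)


glued = glue_elisions("δ\u2019 αρχαια")
glued_narrow = glue_elisions("δ\u2019 αρχαια", NNBSP)
```

Passing `NNBSP` gives the narrower space mentioned above; ordinary spaces elsewhere in the text are left alone.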




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread James Kass via Unicode



Mark Davis responded to Asmus Freytag,

>> breaking selection for "d'Artagnan" or "can't" into two is overly fussy.
>
> True, and that is not what U+2019 does; it does not break medially.

Mark Davis earlier posted this example,
> So something like "δ’ αρχαια" (picking a phrase at random) would have
> a word break after the delta.
If the user wanted to use the preferred character, U+2019, would using 
the no break space (U+00A0) after it resolve the word or line break 
issues?  Or possibly NNBSP (U+202F)?


It's a shame if users choose suboptimal characters over preferred 
characters because of what are essentially rendering/text selection 
issues.  IMO, it's better to use preferred characters in the long run.


(Users should file bug reports on applications which improperly medially 
break strings which include U+2019.)




Re: Ancient Greek apostrophe marking elision

2019-01-26 Thread Mark Davis ☕️ via Unicode
> breaking selection for "d'Artagnan" or "can't" into two is overly fussy.

True, and that is not what U+2019 does; it does not break medially.

Mark


On Fri, Jan 25, 2019 at 11:07 PM Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 1/25/2019 9:39 AM, James Tauber via Unicode wrote:
>
> Thank you, although the word break does still affect things like
> double-clicking to select.
>
> And people do seem to want to use U+02BC for this reason (and I'm trying
> to articulate why that isn't what U+02BC is meant for).
>
> For normal editing operations, breaking selection for "d'Artagnan" or
> "can't" into two is overly fussy.
>
> No wonder people get frustrated.
>
> A./
>
> James
>
> On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ☕️  wrote:
>
>> U+2019 is normally the character used, except where the ’ is considered a
>> letter. When it is between letters it doesn't cause a word break, but
>> because it is also a right single quote, at the end of words there is a
>> break. Thus in a phrase like «tryin’ to go» there is a word break after the
>> n, because one can't tell.
>>
>> So something like "δ’ αρχαια" (picking a phrase at random) would have a
>> word break after the delta.
>>
>> Word break:
>> δ’ αρχαια
>>
>> However, there is no *line break* between them (which is the more
>> important operation in normal usage). Probably not worth tailoring the word
>> break.
>>
>> Line break:
>> δ’ αρχαια
>>
>> Mark
>>
>>
>> On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode <
>> unicode@unicode.org> wrote:
>>
>>> There seems to be some debate amongst digital classicists about whether
>>> to use U+2019 or U+02BC to represent the apostrophe in Ancient Greek when
>>> marking elision (e.g. δ’ for δέ preceding a word starting with a vowel).
>>>
>>> It seems to me that U+2019 is the technically correct choice per the
>>> Unicode Standard but it is not without at least one problem: default word
>>> breaking rules.
>>>
>>> I'm trying to provide guidelines for digital classicists in this regard.
>>>
>>> Is it correct to say the following:
>>>
>>> 1) U+2019 is the correct character to use for the apostrophe in Ancient
>>> Greek when marking elision.
>>> 2) U+02BC is a misuse of a modifier for this purpose
>>> 3) However, use of U+2019 (unlike U+02BC) means the default Word
>>> Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe from the
>>> word token
>>> 4) And use of U+02BC (unlike U+2019) means Grapheme Cluster Boundary
>>> Rules in UAX#29 will (incorrectly) include the apostrophe as part of a
>>> grapheme cluster with the previous letter
>>> 5) The correct solution is to tailor the Word Boundary Rules in the case
>>> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't
>>> have the same ambiguity problems with the single quotation mark as in
>>> English as it should not be used as a quotation mark in Ancient Greek)
>>>
>>> Many thanks in advance.
>>>
>>> James
>>>
>>
>
> --
> *James Tauber*
> Greek Linguistics: https://jktauber.com/
> Music Theory: https://modelling-music.com/
> Digital Tolkien: https://digitaltolkien.com/
>
> Twitter: @jtauber
>
>
>


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Kass via Unicode



On 2019-01-25 10:06 PM, Asmus Freytag via Unicode wrote:

> James, by now it's unclear whether your ' is 2019 or 02BC.

The example word "aren't" in previous message used U+2019.  Sorry if I 
was unclear.


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Tauber via Unicode
On Fri, Jan 25, 2019 at 9:41 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> To quote TUS:
>
> "A few may modify the following letter, and some may serve as
> independent letters".
>
> Bear in mind that one of the uses of U+02BC is the scholarly
> representation of a glottal stop, especially in Arabic names.
>

Okay, so this legitimises the use of U+02BC (with its better
word-breaking properties) for the apostrophe marking elision in Ancient
Greek even though U+2019 is stated as the preferred character _in
general_ for the apostrophe.

On balance, this would seem to suggest U+02BC can (and perhaps
should) be used for the specific purpose in Ancient Greek.

(Of course, the other character that comes up is U+1FBD, but there
the consensus seems strong that this is just plain wrong.)

Thank you all.

James


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Richard Wordingham via Unicode
On Fri, 25 Jan 2019 17:02:25 -0500
James Tauber via Unicode  wrote:

> I guess U+02BC is category Lm not Mn, but doesn't that still mean it
> modifies the previous character (i.e. is really part of the same
> grapheme cluster) and so isn't appropriate as either a vowel or an
> indication of an omitted vowel?

To quote TUS:

"A few may modify the following letter, and some may serve as
independent letters".

Bear in mind that one of the uses of U+02BC is the scholarly
representation of a glottal stop, especially in Arabic names.
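
For anyone who wants to check the properties under discussion, the general category and combining class can be read from the Unicode Character Database. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is just for illustration. A modifier letter (Lm) has combining class 0 and is a base character in its own right, unlike a nonspacing mark (Mn), which does attach to the preceding character.

```python
import unicodedata


def describe(ch):
    """Return (code point, name, General_Category, combining class)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )


# The three candidates: typewriter apostrophe, right single quote,
# and modifier letter apostrophe.
for ch in ("\u0027", "\u2019", "\u02BC"):
    print(describe(ch))
```

U+0027 and U+2019 come out as punctuation (Po and Pf), and U+02BC as a letter (Lm) with combining class 0.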

Richard.


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Asmus Freytag via Unicode

  
  
On 1/25/2019 10:05 AM, James Kass via Unicode wrote:

> For U+2019, there's a note saying 'this is the preferred character to
> use for apostrophe'.
>
> Mark Davis wrote,
>
>> When it is between letters it doesn't cause a word break, ...
>
> Some applications don't seem to get that.  For instance, the
> spellchecker for Mozilla Thunderbird flags the string "aren" for
> correction in the word "aren’t", which suggests that users trying to use
> preferred characters may face uphill battles.

James, by now it's unclear whether your ' is 2019 or 02BC.

Spellcheckers are truly dumb sometimes when "user perceived words" don't
match what the fussy prescriptionistas ordain.

And then you get parts of perfectly valid "words" rejected, and can't
even fix them with overrides, because the override doesn't accept the
whole expression.

A./

  



Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Asmus Freytag via Unicode

  
  
On 1/25/2019 9:39 AM, James Tauber via Unicode wrote:

> Thank you, although the word break does still affect things like
> double-clicking to select.
>
> And people do seem to want to use U+02BC for this reason (and I'm
> trying to articulate why that isn't what U+02BC is meant for).

For normal editing operations, breaking selection for "d'Artagnan" or
"can't" into two is overly fussy.

No wonder people get frustrated.

A./

> James
>
> On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ☕️ wrote:
>
>> U+2019 is normally the character used, except where the ’ is
>> considered a letter. When it is between letters it doesn't cause a
>> word break, but because it is also a right single quote, at the end of
>> words there is a break. Thus in a phrase like «tryin’ to go» there is
>> a word break after the n, because one can't tell.
>>
>> So something like "δ’ αρχαια" (picking a phrase at random) would have
>> a word break after the delta.
>>
>> Word break:
>> δ’ αρχαια
>>
>> However, there is no line break between them (which is the more
>> important operation in normal usage). Probably not worth tailoring the
>> word break.
>>
>> Line break:
>> δ’ αρχαια
>>
>> Mark
>>
>> On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode wrote:
>>
>>> There seems to be some debate amongst digital classicists about
>>> whether to use U+2019 or U+02BC to represent the apostrophe in
>>> Ancient Greek when marking elision (e.g. δ’ for δέ preceding a word
>>> starting with a vowel).
>>>
>>> It seems to me that U+2019 is the technically correct choice per the
>>> Unicode Standard but it is not without at least one problem: default
>>> word breaking rules.
>>>
>>> I'm trying to provide guidelines for digital classicists in this
>>> regard.
>>>
>>> Is it correct to say the following:
>>>
>>> 1) U+2019 is the correct character to use for the apostrophe in
>>> Ancient Greek when marking elision.
>>> 2) U+02BC is a misuse of a modifier for this purpose
>>> 3) However, use of U+2019 (unlike U+02BC) means the default Word
>>> Boundary Rules in UAX#29 will (incorrectly) exclude the apostrophe
>>> from the word token
>>> 4) And use of U+02BC (unlike U+2019) means Grapheme Cluster Boundary
>>> Rules in UAX#29 will (incorrectly) include the apostrophe as part of
>>> a grapheme cluster with the previous letter
>>> 5) The correct solution is to tailor the Word Boundary Rules in the
>>> case of Ancient Greek to treat U+2019 as not breaking a word (which sho

Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Tauber via Unicode
I guess U+02BC is category Lm not Mn, but doesn't that still mean it
modifies the previous character (i.e. is really part of the same grapheme
cluster) and so isn't appropriate as either a vowel or an indication of an
omitted vowel?



On Fri, Jan 25, 2019 at 4:30 PM Richard Wordingham via Unicode <
unicode@unicode.org> wrote:

> On Fri, 25 Jan 2019 12:39:47 -0500
> James Tauber via Unicode  wrote:
>
> > Thank you, although the word break does still affect things like
> > double-clicking to select.
> >
> > And people do seem to want to use U+02BC for this reason (and I'm
> > trying to articulate why that isn't what U+02BC is meant for).
>
> It's a bit tricky when the reason is that it was too hard to get users
> of English to make a distinction between U+02BC and U+2019.  And for
> Larry Niven's elephant-like aliens in _Footfall_, is _fi'_, the
> singular of _fithp_, better written with U+02BC or U+2019?  And does
> the phonetically faithful spelling of Estuarine English _fi'_ for
> _fit_ depend on whether the glottal stop is dropped?
>
> The science-fiction ethnonym _Vl'harg_ is also tricky.  Does its elegant
> encoding depend on whether the apostrophe is a vowel symbol (so
> U+02BC) or the indication of an omitted vowel (so U+2019)?
>
> Richard.
>


-- 
*James Tauber*
Greek Linguistics: https://jktauber.com/
Music Theory: https://modelling-music.com/
Digital Tolkien: https://digitaltolkien.com/

Twitter: @jtauber


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Richard Wordingham via Unicode
On Fri, 25 Jan 2019 12:39:47 -0500
James Tauber via Unicode  wrote:

> Thank you, although the word break does still affect things like
> double-clicking to select.
> 
> And people do seem to want to use U+02BC for this reason (and I'm
> trying to articulate why that isn't what U+02BC is meant for).

It's a bit tricky when the reason is that it was too hard to get users
of English to make a distinction between U+02BC and U+2019.  And for
Larry Niven's elephant-like aliens in _Footfall_, is _fi'_, the
singular of _fithp_, better written with U+02BC or U+2019?  And does
the phonetically faithful spelling of Estuarine English _fi'_ for
_fit_ depend on whether the glottal stop is dropped?

The science-fiction ethnonym _Vl'harg_ is also tricky.  Does its elegant
encoding depend on whether the apostrophe is a vowel symbol (so
U+02BC) or the indication of an omitted vowel (so U+2019)?

Richard.


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Kass via Unicode



For U+2019, there's a note saying 'this is the preferred character to 
use for apostrophe'.


Mark Davis wrote,

> When it is between letters it doesn't cause a word break, ...

Some applications don't seem to get that.  For instance, the 
spellchecker for Mozilla Thunderbird flags the string "aren" for 
correction in the word "aren’t", which suggests that users trying to use 
preferred characters may face uphill battles.




Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread James Tauber via Unicode
Thank you, although the word break does still affect things like
double-clicking to select.

And people do seem to want to use U+02BC for this reason (and I'm trying to
articulate why that isn't what U+02BC is meant for).

James

On Fri, Jan 25, 2019 at 12:34 PM Mark Davis ☕️  wrote:

> U+2019 is normally the character used, except where the ’ is considered a
> letter. When it is between letters it doesn't cause a word break, but
> because it is also a right single quote, at the end of words there is a
> break. Thus in a phrase like «tryin’ to go» there is a word break after the
> n, because one can't tell.
>
> So something like "δ’ αρχαια" (picking a phrase at random) would have a
> word break after the delta.
>
> Word break:
> δ’ αρχαια
>
> However, there is no *line break* between them (which is the more
> important operation in normal usage). Probably not worth tailoring the word
> break.
>
> Line break:
> δ’ αρχαια
>
> Mark
>
>
> On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode <
> unicode@unicode.org> wrote:
>
>> There seems to be some debate amongst digital classicists about whether
>> to use U+2019 or U+02BC to represent the apostrophe in Ancient Greek when
>> marking elision (e.g. δ’ for δέ preceding a word starting with a vowel).
>>
>> It seems to me that U+2019 is the technically correct choice per the
>> Unicode Standard but it is not without at least one problem: default word
>> breaking rules.
>>
>> I'm trying to provide guidelines for digital classicists in this regard.
>>
>> Is it correct to say the following:
>>
>> 1) U+2019 is the correct character to use for the apostrophe in Ancient
>> Greek when marking elision.
>> 2) U+02BC is a misuse of a modifier for this purpose
>> 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary
>> Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word
>> token
>> 4) And use of U+02BC (unlike U+2019) means Grapheme Cluster Boundary Rules
>> in UAX#29 will (incorrectly) include the apostrophe as part of a grapheme
>> cluster with the previous letter
>> 5) The correct solution is to tailor the Word Boundary Rules in the case
>> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't
>> have the same ambiguity problems with the single quotation mark as in
>> English as it should not be used as a quotation mark in Ancient Greek)
>>
>> Many thanks in advance.
>>
>> James
>>
>

-- 
*James Tauber*
Greek Linguistics: https://jktauber.com/
Music Theory: https://modelling-music.com/
Digital Tolkien: https://digitaltolkien.com/

Twitter: @jtauber


Re: Ancient Greek apostrophe marking elision

2019-01-25 Thread Mark Davis ☕️ via Unicode
U+2019 is normally the character used, except where the ’ is considered a
letter. When it is between letters it doesn't cause a word break, but
because it is also a right single quote, at the end of words there is a
break. Thus in a phrase like «tryin’ to go» there is a word break after the
n, because one can't tell whether the ’ is an apostrophe or a closing quote.

So something like "δ’ αρχαια" (picking a phrase at random) would have a
word break after the delta.

Word break:
δ’ αρχαια

However, there is no *line break* between them (which is the more important
operation in normal usage). Probably not worth tailoring the word break.

Line break:
δ’ αρχαια
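
As a rough illustration of the default behaviour described above, the sketch below approximates the relevant part of the UAX#29 word rules with a regular expression: U+2019 is kept when letters follow it and left out when it ends the word. The name `default_like_words` is hypothetical, and this is a simplification under that one assumption, not the full rule set.

```python
import re

APOSTROPHE = "\u2019"

# A run of word characters, continued across U+2019 only when more word
# characters follow it (medial apostrophe). A trailing U+2019 is left out.
DEFAULT_LIKE = re.compile(rf"\w+(?:{APOSTROPHE}\w+)*")


def default_like_words(text):
    """Return word tokens, keeping medial but not final U+2019."""
    return DEFAULT_LIKE.findall(text)


print(default_like_words("don\u2019t stop"))      # ['don’t', 'stop']
print(default_like_words("tryin\u2019 to go"))    # ['tryin', 'to', 'go']
print(default_like_words("δ\u2019 αρχαια"))       # ['δ', 'αρχαια']
```
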

Mark


On Fri, Jan 25, 2019 at 1:10 PM James Tauber via Unicode <
unicode@unicode.org> wrote:

> There seems to be some debate amongst digital classicists about whether to
> use U+2019 or U+02BC to represent the apostrophe in Ancient Greek when
> marking elision (e.g. δ’ for δέ preceding a word starting with a vowel).
>
> It seems to me that U+2019 is the technically correct choice per the
> Unicode Standard but it is not without at least one problem: default word
> breaking rules.
>
> I'm trying to provide guidelines for digital classicists in this regard.
>
> Is it correct to say the following:
>
> 1) U+2019 is the correct character to use for the apostrophe in Ancient
> Greek when marking elision.
> 2) U+02BC is a misuse of a modifier for this purpose
> 3) However, use of U+2019 (unlike U+02BC) means the default Word Boundary
> Rules in UAX#29 will (incorrectly) exclude the apostrophe from the word
> token
> 4) And use of U+02BC (unlike U+2019) means Grapheme Cluster Boundary Rules in
> UAX#29 will (incorrectly) include the apostrophe as part of a grapheme
> cluster with the previous letter
> 5) The correct solution is to tailor the Word Boundary Rules in the case
> of Ancient Greek to treat U+2019 as not breaking a word (which shouldn't
> have the same ambiguity problems with the single quotation mark as in
> English as it should not be used as a quotation mark in Ancient Greek)
>
> Many thanks in advance.
>
> James
>
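
For what point 5 in the quoted question might look like in practice, here is a minimal sketch of a tailored tokenizer that lets a word-final U+2019 stay attached to the word. It is a plain regular expression, not a tailoring of UAX#29 in any real library; `greek_words` is a hypothetical name, and the sketch assumes (as the question does) that U+2019 never appears as a closing quotation mark in the text.

```python
import re

APOSTROPHE = "\u2019"

# Word characters, optionally continued across a medial U+2019, plus an
# optional trailing U+2019 so that an elided form keeps its apostrophe.
TAILORED = re.compile(rf"\w+(?:{APOSTROPHE}\w+)*{APOSTROPHE}?")


def greek_words(text):
    """Return word tokens, treating a final U+2019 as part of the word."""
    return TAILORED.findall(text)


print(greek_words("δ\u2019 αρχαια"))  # ['δ’', 'αρχαια']
```

The same pattern applied to English would swallow closing quotes, which is the ambiguity behind the default rules. It also does not handle an apostrophe at the start of a word.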