Re: Suppressing Ligation of Spacing Marks

2016-11-08 Thread Philippe Verdy
Inserting some zero-width word joiner or disjoiner should work for this...
But if you see a dotted circle, you need to encode some zero-width space as
the base holder for the combining vowel sign following it.

However, I wonder whether fonts accept zero-width holders for combining
vowels; they could still assume that there's no matching base consonant and
thus insert another dotted circle as the base.

There's no consensus across scripts for using the same null-base holder
acting as a pseudo-consonant for vowels encoded after it (e.g. Hangul
has its own jamo holder for this because of its specific algorithmic
composition, but some other scripts also use such null holders for their
own orthography). In alphabetic scripts, the ZWNJ should work.

But in Indic scripts we all depend on the ability of renderers to support
specific scripts with only specific subsets of base letters; every other
character outside this subset will trigger the insertion of a dotted circle
glyph. ZWJ/ZWNJ also already have script-specific uses inside clusters for
some distinctions (notably to control how parts of clusters are
subgrouped...).

You'll need to "bug" the maintainers of the renderer if they forgot
necessary cases described earlier for the script when it was initially
approved for encoding.
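
As a rough aid for the discussion above, the following Python sketch lists the Unicode properties of a few characters that could be tried as a "null base" before U+1A63 TAI THAM VOWEL SIGN AA. The choice of candidates is an assumption made for illustration. The character properties say nothing about what a given renderer (such as the Universal Shaping Engine) will accept as a base, so this only shows what a shaper gets to see.

```python
import unicodedata

# Candidate "null base" characters (an illustrative selection, not a
# recommendation), keyed by a short label.
CANDIDATES = {
    "ZWNJ": "\u200C",
    "ZWJ": "\u200D",
    "ZWSP": "\u200B",
    "WORD JOINER": "\u2060",
    "NBSP": "\u00A0",
    "DOTTED CIRCLE": "\u25CC",
}
SIGN_AA = "\u1A63"  # TAI THAM VOWEL SIGN AA


def describe(ch):
    """Return (code point, general category, canonical combining class)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))


for label, ch in CANDIDATES.items():
    print(label, describe(ch))
print("SIGN AA", describe(SIGN_AA))
```

The format controls are category Cf, NBSP is a space (Zs) and the dotted circle is a symbol (So); none of them is a letter, which is one plausible reason a script-specific shaper may refuse to treat them as a base.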

2016-11-08 10:09 GMT+01:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> Should it be possible to suppress the ligation of a base character and
> a visually following spacing mark in plain text?
>
> The example I have in mind is the sequence  U+1A63 TAI THAM VOWEL SIGN AA>.  It may be desirable to suppress the
> ligation because both ligands have subscript consonants.  However, if
> I write , the Universal Shaping Engine
> decides that the ZWNJ triggers a new syllable, and inserts a dotted
> circle before SIGN AA.  (The dotted circle after SIGN AA results from a
> failure to read the proposal for the Lanna script as it was then
> called.)
>
> Richard.
>
>


Re: Multiple Preposed Marks

2016-11-08 Thread Philippe Verdy
2016-11-09 0:42 GMT+01:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> On Wed, 9 Nov 2016 00:00:01 +0100
> Philippe Verdy  wrote:
>
> > 2016-11-08 9:30 GMT+01:00 Richard Wordingham <
> > richard.wording...@ntlworld.com>:
> >
> > > TUS Section 2.11 says, "If the combining characters can interact
> > > typographically—for example, U+0304 combining macron and  U+0308
> > > combining  diaeresis — then  the  order  of  graphic  display  is
> > > determined  by  the  order  of  coded  characters  (see Table 2-5).
> > > By  default,  the  diacritics  or other combining characters are
> > > positioned from the base character’s glyph outward".
>
> > The interpretation of   "If the combining characters can interact
> > typographically" should be better read as "If the combining
> > characters have the same non-zero combining class or any one of them
> > has a zero combining class".
>
> The combining marks in question both have canonical combining class 0.
>
> > But now normalization is everywhere and causes the pairs using the
> > condition above to be freely reordered (or decomposed and recomposed,
> > meaning that the encoding order is NOT significant at all).
>
> I believe a renderer is permitted to treat canonically equivalent
> sequence differently so long as it does not believe it should treat
> them differently.  However, that is irrelevant to this case.
>

This is DIRECTLY relevant to the sentence in TUS you quoted, which is all
about combining characters that are encoded after the base letter, often
have non-zero combining classes, and are reorderable.

But evidently this sentence in TUS is not relevant to "prepended" combining
marks that all have combining class 0. Here "prepended" means encoded
before the base character, as opposed to marks that are encoded after it
even if they visually combine before it (as is the case for well-known Indic
vowels, which now have non-zero combining classes that allow them to be
reordered before other combining marks when normalizing, while still
remaining encoded after the base consonant).

What I want to say is that this sentence in TUS is quite ambiguous: it
speaks about graphic interaction, but this is not really encoded in text
sequences, and it forgets the effect of combining classes on combining
sequences, which NEVER considers any actual graphic interaction. That is
simply because it is not specified, and the actual graphic interactions may
depend on font styles (notably in honorific Arabic typography using very
complex layouts, but even within the Latin script when using decorated font
styles or custom ligatures where complex interactions also occur, including
on larger spans than clusters, such as full words).
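
To make the combining-class-0 case concrete, here is a small Python sketch. The base letter and the two vowel signs are assumptions chosen for illustration (the thread does not say which characters E and O stand for). It checks that two preposed Tai Tham vowel signs have canonical combining class 0, and that normalization therefore leaves both coded orders untouched.

```python
import unicodedata

BASE = "\u1A20"     # TAI THAM LETTER HIGH KA (arbitrary base, for illustration)
SIGN_E = "\u1A6E"   # TAI THAM VOWEL SIGN E
SIGN_OO = "\u1A70"  # TAI THAM VOWEL SIGN OO


def ccc_list(text):
    """Canonical combining class of every character in text."""
    return [unicodedata.combining(ch) for ch in text]


eo = BASE + SIGN_E + SIGN_OO
oe = BASE + SIGN_OO + SIGN_E

print(ccc_list(eo))  # every class is 0, so nothing can be reordered
for form in ("NFC", "NFD"):
    # Both orders survive normalization unchanged, so they stay distinct.
    print(form,
          unicodedata.normalize(form, eo) == eo,
          unicodedata.normalize(form, oe) == oe)
```

Since the two orders are not canonically equivalent, any difference in how they are displayed is a matter for the renderer and the font, not for normalization.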


Re: Multiple Preposed Marks

2016-11-08 Thread Philippe Verdy
2016-11-08 9:30 GMT+01:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> TUS Section 2.11 says, "If the combining characters can interact
> typographically—for example, U+0304 combining macron and  U+0308
> combining  diaeresis — then  the  order  of  graphic  display  is
> determined  by  the  order  of  coded  characters  (see Table 2-5).
> By  default,  the  diacritics  or other combining characters are
> positioned from the base character’s glyph outward".
>

The interpretation of   "If the combining characters can interact
typographically" should be better read as "If the combining characters have
the same non-zero combining class or any one of them has a zero combining
class".

Effectively the combining classes were historically intended to track these
possible graphic interactions, in order to allow or disable reordering and
detect canonical equivalences.
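
As a hedged illustration of how combining classes decide canonical equivalence, the Python sketch below compares two pairs of marks on a Latin base. The helper name `canonically_equivalent` is made up for this example. Marks with the same non-zero class keep their coded order, while marks with different non-zero classes are sorted by class during normalization.

```python
import unicodedata

MACRON = "\u0304"     # COMBINING MACRON, class 230
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, class 220


def nfd(text):
    return unicodedata.normalize("NFD", text)


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return nfd(a) == nfd(b)


# Same class (230 and 230): the coded order is significant.
same_class = canonically_equivalent("a" + MACRON + DIAERESIS,
                                    "a" + DIAERESIS + MACRON)
# Different classes (230 and 220): the two orders are equivalent.
diff_class = canonically_equivalent("a" + MACRON + DOT_BELOW,
                                    "a" + DOT_BELOW + MACRON)
print(same_class, diff_class)  # False True
```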

But now normalization is everywhere and causes the pairs using the
condition above to be freely reordered (or decomposed and recomposed,
meaning that the encoding order is NOT significant at all).

But it turned out that some diacritics may be positioned differently
according to their base character. E.g., the cedilla may interact below,
where no interaction is supposed to occur with other combining characters
that normally interact above (so that reordering to canonical equivalents is
permitted, and in fact is done automatically during the encoding/decoding
of documents), but with some Latin letters these interactions do occur. The
only way then to block the reordering (if you don't want the positions
inferred from the encoding order of normalized strings) is to block it using
a joiner with combining class zero (CGJ).
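
The blocking effect of CGJ can be seen with a few lines of Python. This is only a sketch using the standard `unicodedata` module; the base letter "c" and the acute/cedilla pair are chosen for illustration.

```python
import unicodedata

CGJ = "\u034F"      # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"    # COMBINING ACUTE ACCENT, class 230
CEDILLA = "\u0327"  # COMBINING CEDILLA, class 202


def nfd(text):
    return unicodedata.normalize("NFD", text)


plain = "c" + ACUTE + CEDILLA
guarded = "c" + ACUTE + CGJ + CEDILLA

# Without CGJ the cedilla (202) is sorted before the acute (230).
print(nfd(plain) == "c" + CEDILLA + ACUTE)  # True
# With CGJ between them, the class-0 character stops the reordering.
print(nfd(guarded) == guarded)              # True
```

The cost is that the guarded string is no longer canonically equivalent to the unguarded one, so searching and collation have to be prepared to ignore CGJ.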

This sentence in TUS should have been updated long ago, because TUS does
not really know how characters will be positioned, and Unicode permits
reordering of pairs of diacritics if they are not blocking each other for
normalization.

This is important for the cedilla, but even more important for Hebrew
diacritics, whose combining classes do not really track their relative
positioning correctly (as discussed on this list years ago, and known as the
"Hebrew points bug"). This will never change: the combining classes are
assigned permanently and continue to work for simple cases, but will cause
problems with some pairs needing insertions of CGJ.
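
A commonly cited instance of this is a lamed carrying both a patah and a hiriq. The Python sketch below (character choice is for illustration only) shows that each Hebrew point has its own fixed combining class, so normalization imposes one order on the pair regardless of what the writer typed.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, class 14

written = LAMED + PATAH + HIRIQ
normalized = unicodedata.normalize("NFC", written)

for ch in (PATAH, HIRIQ):
    print(unicodedata.name(ch), unicodedata.combining(ch))

# The hiriq (14) is moved in front of the patah (17) by normalization.
print(normalized == LAMED + HIRIQ + PATAH)  # True
```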

This is also important for several Indic scripts that have complex
positioning rules if you use combining characters with non-zero combining
classes (initially intended for simple cases in Latin/Greek/Cyrillic).
Thankfully, the most critical diacritics in Indic scripts for such complex
cases have a combining class set to zero (meaning that they block each other
and their relative encoding order is not affected by normalization), but
there are many cases where CGJ is needed.
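
One way to check which marks in an Indic block are exposed to reordering is to group them by combining class. The following Python sketch does this for the Devanagari block; the helper `marks_by_class` is a name invented for this example, and the block range is the only input assumed.

```python
import unicodedata
from collections import defaultdict


def marks_by_class(first, last):
    """Group the combining marks (Mn/Mc) in a code point range by class."""
    groups = defaultdict(list)
    for cp in range(first, last + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in ("Mn", "Mc"):
            groups[unicodedata.combining(ch)].append(ch)
    return dict(groups)


devanagari = marks_by_class(0x0900, 0x097F)
for ccc in sorted(devanagari):
    names = [unicodedata.name(ch) for ch in devanagari[ccc]]
    print(ccc, len(names), names[:3])
```

In this block the vowel signs fall in class 0, while the nukta (7), the virama (9) and the Vedic accents (220/230) are the marks that normalization may move relative to each other.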


Re: Multiple Preposed Marks

2016-11-08 Thread Marcel Schneider
On Tue, 8 Nov 2016 21:36, Richard Wordingham wrote:
> 
> On Tue, 8 Nov 2016 08:30:25 +
> Richard Wordingham  wrote:
> 
> > and the need for an OpenType feature (probably a cvXX)
> > for inconsistent handling of U+1A58 MAI KANG LAI. The latter may be a
> > challenge - I couldn't persuade MS Edge to use the font's Lao shaping
> 
> General features (e.g. 'ss01') for Tai Tham work a treat in MS Edge, and
> seem to be executed at the same time as the 'standard typographical
> presentation', e.g. feature 'psts'. Thank you! That makes things much
> easier. […]

“Where thereʼs a will, thereʼs a way!”

Marcel



Re: Multiple Preposed Marks

2016-11-08 Thread Richard Wordingham
On Tue, 8 Nov 2016 08:30:25 +
Richard Wordingham  wrote:

> and the need for an OpenType feature (probably a cvXX)
> for inconsistent handling of U+1A58 MAI KANG LAI.  The latter may be a
> challenge - I couldn't persuade MS Edge to use the font's Lao shaping

General features (e.g. 'ss01') for Tai Tham work a treat in MS Edge, and
seem to be executed at the same time as the 'standard typographical
presentation', e.g. feature 'psts'.  Thank you!  That makes things much
easier.  (There seems to be quite a bit of variation in layout in Chiang
Mai province, never mind the rest of the region.)

Richard.


Re: The (Klingon) Empire Strikes Back

2016-11-08 Thread gfb hjjhjh
I believe there's already a court ruling that says languages and words are
not copyrightable, in the case about Loglan, although the trademarkability
of a language is another matter.

On 2016-11-05 01:42, "David Faulks"  wrote:

> > On Thu, 11/3/16, Mark Shoulson  wrote:
> > Subject: The (Klingon) Empire Strikes Back
>
> > At the time of writing this letter it has not yet hit the UTC
> > Document Register, but I have recently submitted a document
> > revisiting the ever-popular issue of the encoding of Klingon
> > "pIqaD".  The reason always given why it could not be
> > encoded was that it did not enjoy enough usage, and so I've
> > collected a bunch of examples to demonstrate that this is not
> > true (scans and also web pages, etc.)  So the issue comes
> > back up, and time to talk about it again.
>
> There is another issue of course, which I think could be a huge obstacle:
> the Trademark/Copyright issue. Paramount claims copyright over the entire
> Klingon language (presumably including the script). The issue has recently
> gone to court. Encoding criteria for symbols (and this likely extends to
> letters) is against encoding them without the permission of the
> Copyright/Trademark holder.
>
> Is Paramount endorsing your proposal?
>
> 
>
> > ~mark
>
> David Faulks


Re: The (Klingon) Empire Strikes Back

2016-11-08 Thread Julian Bradfield
On 2016-11-08, Mark E. Shoulson  wrote:
> I've heard that there are similar questions regarding tengwar and cirth, 
> but it is notable that UTC *did* see fit to consider this question for 
> them and determine that they were worthy of encoding (they are on the 
> roadmap), even though they have not actually followed through on that 
> yet, perhaps because of these very IP concerns.  Notably, pIqaD is not 

The Tolkien Estate considers that the tengwar constitute a work of
art, and it's not willing to see them in Unicode, because this would
hinder its ability to pursue people using tengwar for what it
considers inappropriate purposes. (I finally asked them a couple of
years ago for permission to encode, based on Michael Everson's draft
proposal from yonks ago, and that's the summary of their reply.)

Several years ago, I was told on this list that it would be up to the
proposers to deal with this, and that the Unicode Consortium would
have no interest in taking on the 800lb legal gorilla that is the
Tolkien Estate. (Now a 24M£ gorilla with what it got from New Line
Cinema.)

If some wealthy Unicode Consortium member feels like paying for an
American counsel's opinion that the Estate is just trying it on, feel
free!




Suppressing Ligation of Spacing Marks

2016-11-08 Thread Richard Wordingham
Should it be possible to suppress the ligation of a base character and
a visually following spacing mark in plain text?

The example I have in mind is the sequence .  It may be desirable to suppress the
ligation because both ligands have subscript consonants.  However, if
I write , the Universal Shaping Engine
decides that the ZWNJ triggers a new syllable, and inserts a dotted
circle before SIGN AA.  (The dotted circle after SIGN AA results from a
failure to read the proposal for the Lanna script as it was then
called.)

Richard.



Multiple Preposed Marks

2016-11-08 Thread Richard Wordingham
TUS Section 2.11 says, "If the combining characters can interact
typographically—for example, U+0304 combining macron and  U+0308  
combining  diaeresis — then  the  order  of  graphic  display  is
determined  by  the  order  of  coded  characters  (see Table 2-5).
By  default,  the  diacritics  or other combining characters are
positioned from the base character’s glyph outward".

So, if I have two spacing combining marks E and O that are each
positioned to the left of the base (say X) in a left-to-right script,
so that the encodings  and  appear with the glyph orders
 and , and codings  and , if not
total gibberish, represent a horizontal sequence of the glyphs with
gX on the right, should  render as  or ?  The phonetics and collation (in so far as it is meaningful) of
the words provide no cue as to the order of the encoded characters.  I
have encountered both renderings.
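
The two renderings in question can be written down as two layout policies. The Python sketch below is purely illustrative: the function names are invented, glyphs are plain strings, and it assumes the coded sequence is X followed by E and then O. It is not a description of what any particular shaping engine does.

```python
def inside_out(base_glyph, left_marks):
    """Place each successive left-side mark farther from the base.

    This follows the TUS default of positioning marks "from the base
    character's glyph outward", applied to marks that sit on the left.
    """
    glyphs = [base_glyph]
    for mark in left_marks:
        glyphs.insert(0, mark)  # each new mark goes outside the previous one
    return glyphs


def coded_order(base_glyph, left_marks):
    """Lay the left-side marks out left to right in coded order."""
    return list(left_marks) + [base_glyph]


marks = ["gE", "gO"]  # glyphs for E then O, in coded order after X
print(inside_out("gX", marks))   # ['gO', 'gE', 'gX']
print(coded_order("gX", marks))  # ['gE', 'gO', 'gX']
```

With a single preposed mark the two policies agree, which is why the difference only shows up in words that combine two such marks.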

The issue came up when I was checking, in both the Firefox and MS Edge
browsers, that my OpenType Tai Tham font Da Lekh could handle all the
headwords of two Northern Thai dictionaries. (Sparing dotted circle
deletion and orthographic syllable reunification are tricky.)  One
of the dictionaries spells a few words with a combination of the Tai and
Pali notations for the vowel /o:/ in open syllables where one might
expect to see an independent vowel.

I'm down to two other rendering engine issues - a combination of tone
mark and then vowel in 4 words, where the dictionary probably has a
misspelling, and the need for an OpenType feature (probably a cvXX) for
inconsistent handling of U+1A58 MAI KANG LAI.  The latter may be a
challenge - I couldn't persuade MS Edge to use the font's Lao shaping
for the Tai Tham script or for the Latin script in a transliteration
mode.  (That mode is triggered by feature ss02 for the Latin script, and
that works well enough in browsers.)

Richard.