Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-28 Thread Philippe Verdy

 Message of 26/07/10 18:45
 From: Markus Scherer markus@gmail.com
 To: verd...@wanadoo.fr
 Cc: Unicode Mailing List unicode@unicode.org
 Subject: Re: Using Combining Double Breve and expressing characters perhaps as 
 if struck out.

 There are 857 combining marks with combining class of 0:
 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:M:]%26[:ccc%3D0:]]abb=ong=

 On Sat, Jul 24, 2010 at 11:25 AM, Philippe Verdy verd...@wanadoo.fr wrote:

  Kent Karlsson kent.karlsso...@telia.com wrote:
   On 2010-07-24 10.07, Philippe Verdy verd...@wanadoo.fr wrote:
  
Double diacritics have a combining property equal to zero, so they
  
   No, they don't. The above ones have combining class 234 and the below
   ones have combining class 233 (other characters with the word DOUBLE
   in them are 'double' in some other way):
  
   035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;N;
   ...
 
  Aren't they using the maximum value of the combining class ?


 No.

  If so,
  you can still use double diacritics between two sequences containing a
  base character and any simple diacritic, and be sure that the double
  diacritic will be rendered about them, as it will remain in the last
  position of the normalized form.
 

 No. The order of combining marks only determines their rendering order if
 they have the same combining class value. If they have different values,
 then their rendering is supposed to be independent of their order in the
 text. The canonical ordering in normalization only serves processing such as
 string comparisons.

You've not understood what I wanted to say.

I know what you explain, but double diacritics can only be reordered
in one case: if there's an upper double diacritic occurring before a
lower diacritic (in which case the normalization will reorder it; as
there's no visible difference in the result, this reordering is safe,
and CGJ is not required to protect it).

But given the way they will be encoded only between base graphemes,
there's no risk that they can be swapped by normalization or that they
could be ordered BEFORE non-double diacritics.

We can perfectly expect that sequences encoded with double diacritics
will only be in that order:

- prependers for base 1, base 1, other simple diacritics or extenders
for base 1 only, then
- lower double diacritics, upper double diacritics, then
- prependers for base 2, base 2, other simple diacritics or extenders
for base 2 only

That's what I meant in saying that they have the MAXIMUM combining class
value. There's also NO risk that stacked double diacritics will be
reordered within the same position, so for that use you will never
need CGJ.
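
Whether 233 and 234 really are the highest combining classes can be checked against the Unicode Character Database. The following is a minimal sketch, assuming Python's standard `unicodedata` module; its data reflects whatever Unicode version the interpreter ships with, not necessarily the 5.2 data discussed in this thread.

```python
import sys
import unicodedata

# Collect every canonical combining class value above 234, the class of
# the "above" double diacritics such as U+035D.
higher = {}
for cp in range(sys.maxunicode + 1):
    ccc = unicodedata.combining(chr(cp))
    if ccc > 234:
        higher.setdefault(ccc, []).append(cp)

# List each such class with the code points that carry it.
for ccc, cps in sorted(higher.items()):
    print(ccc, [f"U+{cp:04X}" for cp in cps])
```

Any class reported here sorts after a double diacritic in canonical order. Class 240, carried by U+0345 COMBINING GREEK YPOGEGRAMMENI, is one such class.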

CGJ will only be needed if you want to append a non-double diacritic
on top of a double, but given that this additional diacritic should not
apply to the double diacritic itself, but to the whole group of base
graphemes joined by the double diacritics, these additional
non-double diacritics should be encoded AFTER this whole group, i.e.
just after:

- prependers for base 2, base 2, other simple diacritics or extenders
for base 2 only,

if we really want to respect the logical encoding order.

And for this use, CGJ will be incorrect (because the additional
diacritics will STILL be part of the base grapheme cluster 2).

We need something else, and that's where we will need ZWJ instead, as the
holder of additional diacritics that should stack on the whole group.

OK you may avoid this problem by using CGJ immediately after the
double diacritics (i.e. also before base grapheme cluster 2), but this
will remain illogical.

Well, even the double diacritics themselves are a hack in Unicode.
Ideally we should not even need them, and instead of using:

- o, DOUBLE BREVE, o

This should be:

- o, ZWJ, o, ZWJ, BREVE

Now you can see the problem: ZWJ has never been designed to create
structured layout groups, when used alone.
If layout structure grouping is needed however, we could use variation
selectors to qualify the ZWJ:

- o, ZWJ, VS1, o, ZWJ, VS1, BREVE

where the variation sequence ZWJ,VS1 would mean here : horizontal
group level 1.

And so, we could encode the logical layout structures of Hieroglyphs
(that require multiple levels, both horizontally, and vertically) by
defining these variation sequences:

HGROUP1 = ZWJ,VS1
VGROUP1 = ZWJ,VS2
HGROUP2 = ZWJ,VS3
VGROUP2 = ZWJ,VS4
HGROUP3 = ZWJ,VS5
VGROUP3 = ZWJ,VS6
and so on...

With this definition, then we no longer need ANY double diacritic
variants, we just use the standard diacritics:

- o, HGROUP1, o, HGROUP1, BREVE

instead of the deprecated method using :

- o, DOUBLE BREVE, o

(which won't be canonically equivalent, but does it matter?). And we
gain a consistent encoding for triple diacritics or longer:

- o, HGROUP1, o, HGROUP1, o, HGROUP1, BREVE

which represents a single BREVE over a horizontal grouping of three o.
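
The proposed sequences can be sketched in code. This is purely illustrative: HGROUPn and VGROUPn are the proposal made in this message and are not defined by Unicode, and the helper names `variation_selector`, `hgroup`, `vgroup` and `breve_over` are hypothetical. The only standard facts used are the code points of ZWJ, COMBINING BREVE and the variation selectors.

```python
ZWJ = "\u200D"    # ZERO WIDTH JOINER
BREVE = "\u0306"  # COMBINING BREVE

def variation_selector(n):
    """Return VS-n: VS1..VS16 are U+FE00..U+FE0F, VS17..VS256 are U+E0100..U+E01EF."""
    if 1 <= n <= 16:
        return chr(0xFE00 + n - 1)
    if 17 <= n <= 256:
        return chr(0xE0100 + n - 17)
    raise ValueError("variation selectors are numbered 1..256")

def hgroup(level):
    # Proposed (non-standard): HGROUP1 = ZWJ,VS1; HGROUP2 = ZWJ,VS3; ...
    return ZWJ + variation_selector(2 * level - 1)

def vgroup(level):
    # Proposed (non-standard): VGROUP1 = ZWJ,VS2; VGROUP2 = ZWJ,VS4; ...
    return ZWJ + variation_selector(2 * level)

def breve_over(bases):
    # Each base is followed by HGROUP1, then a single BREVE closes the group.
    return "".join(base + hgroup(1) for base in bases) + BREVE

triple = breve_over(["o", "o", "o"])
print(" ".join(f"{ord(c):04X}" for c in triple))
```

No current renderer is expected to interpret these sequences as groups; the sketch only shows how the proposed code point sequences would be assembled.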

And with the same tool, we can almost completely encode as well the
Egyptian hieroglyphs. This could even be part

Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-28 Thread Andrew West

On 28 July 2010 22:09, Philippe Verdy verd...@wanadoo.fr wrote:

 You've not understood what I wanted to say.

Maybe if you said less, people would understand more.

I don't know how much free time you must have on your hands to write
hundreds of lines in reply to almost every message on this list (3,879
lines in 40 messages this month alone according to the somewhat broken
stats at http://www.unicode.org/mail-arch/unicode-ml/post-stats.html),
but I am certain that no-one else on this list has the time to read
all of it.

Andrew

Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-28 Thread verdy_p

Markus Scherer 
 
 There are 857 combining marks with combining class of 0:
 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:M:]%26[:ccc%3D0:]]abb=ong=

So what? I know perfectly well that there are a lot of diacritics with a
combining class of 0.

It's DEFINITELY NOT me who contested that on this list (and I already posted a
reply to someone who claimed that this was a contradiction). So you're
probably making a false assumption here, as I have NEVER said the opposite.

And those 857 combining marks are definitely not a problem for the generality
of double diacritics. You can use them with EVEN FEWER problems if the base
grapheme clusters contain any combination of those combining characters with
combining class 0.

Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-28 Thread Mark Davis ☕

I agree; when the nuggets of useful information are so overwhelmed by the
volume of rubble, you just can't afford the time to sift them out.

Mark

*— Il meglio è l’inimico del bene —*


On Wed, Jul 28, 2010 at 14:46, Andrew West andrewcw...@gmail.com wrote:

 On 28 July 2010 22:09, Philippe Verdy verd...@wanadoo.fr wrote:
 
  You've not understood what I wanted to say.

 Maybe if you said less people would understand more .

 I don't know how much free time you must have on your hands to write
 hundreds of lines in reply to almost every message on this list (3,879
 lines in 40 messages this month alone according to the somewhat broken
 stats at http://www.unicode.org/mail-arch/unicode-ml/post-stats.html),
 but I am certain that no-one else on this list has the time to read
 all of it.

 Andrew

Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-26 Thread Markus Scherer

There are 857 combining marks with combining class of 0:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[[:M:]%26[:ccc%3D0:]]&abb=on&g=
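
The same set, `[[:M:]&[:ccc=0:]]`, can be counted locally. A minimal sketch, assuming Python's standard `unicodedata` module; the count depends on the Unicode version bundled with the interpreter, so it will generally differ from the 857 quoted above.

```python
import sys
import unicodedata

# Count code points whose General_Category is a mark (Mn, Mc or Me)
# and whose canonical combining class is 0.
count = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("M")
    and unicodedata.combining(chr(cp)) == 0
)

print("Unicode", unicodedata.unidata_version, "marks with ccc=0:", count)
```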

On Sat, Jul 24, 2010 at 11:25 AM, Philippe Verdy verd...@wanadoo.fr wrote:

 Kent Karlsson kent.karlsso...@telia.com wrote:
  On 2010-07-24 10.07, Philippe Verdy verd...@wanadoo.fr wrote:
 
   Double diacritics have a combining property equal to zero, so they
 
  No, they don't. The above ones have combining class 234 and the below
  ones have combining class 233 (other characters with the word DOUBLE
  in them are 'double' in some other way):
 
  035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;N;
  ...

 Aren't they using the maximum value of the combining class ?


No.

If so,
 you can still use double diacritics between two sequences containing a
 base character and any simple diacritic, and be sure that the double
 diacritic will be rendered about them, as it will remain in the last
 position of the normalized form.


No. The order of combining marks only determines their rendering order if
they have the same combining class value. If they have different values,
then their rendering is supposed to be independent of their order in the
text. The canonical ordering in normalization only serves processing such as
string comparisons.
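
The canonical ordering described here can be observed directly. A minimal sketch, assuming Python's standard `unicodedata` module: marks with different non-zero combining classes are sorted by ascending class during normalization, while marks of the same class keep their relative order.

```python
import unicodedata

def show(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

# o + COMBINING DOUBLE BREVE (ccc 234) + COMBINING DOT BELOW (ccc 220):
# different classes, so NFD moves the lower class first.
mixed = "o\u035D\u0323"
mixed_nfd = unicodedata.normalize("NFD", mixed)
print(show(mixed), "->", show(mixed_nfd))

# o + COMBINING ACUTE + COMBINING GRAVE (both ccc 230):
# same class, so their order is significant and is preserved.
same_class = "o\u0301\u0300"
same_nfd = unicodedata.normalize("NFD", same_class)
print(show(same_class), "->", show(same_nfd))
```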

markus

re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread Philippe Verdy

 Message of 24/07/10 09:02
 From: William_J_G Overington wjgo_10...@btinternet.com
 To: unicode@unicode.org
 Cc: wjgo_10...@btinternet.com
 Subject: Using Combining Double Breve and expressing characters perhaps as if 
 struck out.


 I have been looking at the following thread, which is entitled Making Fonts 
 with Diacritical Marks for Phonetics.

 http://forum.high-logic.com/viewtopic.php?f=3&t=3169

 I am writing here to ask two questions please in relation to the Unicode 
 aspects of the problem.

 I have looked at http://www.unicode.org/versions/Unicode5.2.0/ch02.pdf in 
 section 2.11 Combining Characters (page 36 of the pdf) and at 
 http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf in section 3.6 
 Combination (page 24 of the pdf).

 In http://www.unicode.org/charts/PDF/U0300.pdf there is U+035D COMBINING 
 DOUBLE BREVE and there is U+035E COMBINING DOUBLE MACRON.

 In http://www.unicode.org/charts/PDF/U.pdf there is U+006F LATIN SMALL 
 LETTER O.

 How does one express two letters LATIN SMALL LETTER O with a combining double 
 breve in a Unicode plain text document please?

First encode each base (unjoined) extended grapheme cluster
separately (possibly with its own diacritics or extenders or
prependers, including ZWJ and ZWNJ, according to their definition in
the UAX defining text segmentation).

Then encode the double diacritic between them.

So for your examples you get 006F, 035D, 006F (double breve) or
006F, 035E, 006F (double macron).
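
As a quick check of these sequences, the sketch below builds both strings and prints the character names, assuming Python's standard `unicodedata` module. Per the code chart cited in the question, U+035D is COMBINING DOUBLE BREVE and U+035E is COMBINING DOUBLE MACRON.

```python
import unicodedata

double_breve = "o\u035Do"   # 006F, 035D, 006F
double_macron = "o\u035Eo"  # 006F, 035E, 006F

# Print each code point with its character name, one sequence per group.
for s in (double_breve, double_macron):
    for ch in s:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    print()
```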

Double diacritics have a combining property equal to zero, so they
block the reordering for canonical equivalences and the relative order
and independence for the encoding of base grapheme clusters will be
preserved during normalizations.

As a consequence, if there's another diacritic added on top of the
double diacritic, it can only be added at end of this sequence, but
the bad thing is that it will appear just after the encoding of the
second base grapheme cluster, and so it is subject to reordering, as
it will be interpreted as being part itself of the second grapheme
clusters.

Currently you cannot add another diacritic on top of a double
diacritic, we lack something for blocking such interpretation in the
second cluster.

To do that, we would need another base character with combining
property 0 (blocking canonical reorderings), and that would have the
same grouping semantic as other double diacritics. This character
would be abstract (and invisible by itself) and could be something
like:

  U+xyzt DOUBLE DIACRITIC HOLDER

For example to add an acute accent above the double breve joining the
two letters 'o', we would encode:

  006F, 035D, 006F, xyzt, 0301

instead of just 006F, 035D, 006F, 0301 which is canonically
equivalent to 006F, 035D, 00F3 and which encodes the letter 'o' and
the letter 'o' with an acute accent (centered on this second o) joined
with the double breve *above* the acute accent of the second 'o'.
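
The canonical equivalence stated here can be confirmed with a normalizer. A minimal sketch, assuming Python's standard `unicodedata` module:

```python
import unicodedata

decomposed = "o\u035Do\u0301"  # 006F, 035D, 006F, 0301
composed = "o\u035D\u00F3"     # 006F, 035D, 00F3

# Two strings are canonically equivalent when their normalized forms match.
equivalent = (
    unicodedata.normalize("NFD", decomposed)
    == unicodedata.normalize("NFD", composed)
)
print("canonically equivalent:", equivalent)
```

The acute accent therefore belongs to the second 'o' in both spellings, which is why it cannot express an accent placed on the double breve itself.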

My opinion is that such a double diacritic holder exists: it's ZWJ,
which could be safely used as the needed invisible base for additional
diacritics occurring on top (and centered) of a double diacritic. But
others may have other preferences about the choice of this character.

I don't know if ZWJ has been specified so that it could occur only
before a defective combining sequence containing only combining
diacritics. In this case, this would mean that the semantics of the
combining diacritics encoded after it must apply to the full part of
the extended grapheme cluster encoded before it.

This use of ZWJ effectively allows the interpretation of the encoded
sequence as if it was in TeX syntax:

  \acute{ \breve{oo} }

Without the ZWJ, it would be interpreted as:

  \breve{ o\acute{o} }

The double diacritics are just intended to be used between each of the base
grapheme clusters to join. And they could possibly be used to group more
than 2 base graphemes, for example with 3 'o' as:

  006F, 035D, 006F, 035D, 006F

interpreted in TeX syntax as: \breve{ooo}

But even in this case, you won't be able to encode with the ZWJ trick
in plain text such groupings as are expressed this way in TeX:

  \breve{ \breve{oo} x \breve{ o\acute{o} } }

Because double diacritics encoded in Unicode can't be safely stacked
together (for such application you'll need a rich-text layer on top of
Unicode, such as TeX here).

Philippe.

re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread verdy_p

Philippe Verdy  wrote:
 But even with this case, you wont be able to encode with the ZWJ trick
 in plain text, such groupings that are expressed this way in TeX:
 
 \breve{ \breve{oo} x \breve{ o\acute{o} } }
 
 Because double diacritics encoded in Unicode can't be safely stacked
 together (for such application you'll need a rich-text layer on top of
 Unicode, such as TeX here).

I just thought about a solution to allow stacking of double-diacritics: we 
could use variation selectors after them, 
to specify a higher level of grouping.

So in the example above:
- \breve{oo} remains encoded as: 
- x remains encoded as: 
- o \acute{o} remains encoded as:  followed by  or 
- \breve{o \acute{o}} remains encoded as: 

And to stack a second level of breves, we could use  between those three groups:



Even software that does not know how to create the layout would still consider
this long sequence as an unbreakable extended grapheme cluster, and its
important relative ordering will be preserved by normalizations. Here also
you'll be able to add other single diacritics on top of the double breves...

This way, you may stack up to 256 additional levels of double diacritics in a 
structured layer that will be 
preserved as a single extended grapheme cluster.

Software that doesn't know what to do with the variation selectors will ignore
them, and will treat all double breves above as equal, so it will render
something like this in TeX:

\breve{ oo x o \acute{o} }

in a single grouping (not so bad after all...)

BUT! Such variation sequences have NOT been allocated in the Unicode registry
for this purpose. I think that such an application should use something other
than variation selectors, which are intended to represent glyphic variants of
the individual double diacritics.

And I think that this could be done by allocating instead, in the special plane
15, a block for STACKING selectors (or more generally GROUPING LEVELS), with
exactly the same properties as variation selectors, except that they won't
require a prior registration for their use in association with double
diacritics.

Such selectors could eventually be used to encode bidimensional structures like 
those used in Egyptian hieroglyphs, 
and that already use the default horizontal layout and would require a single 
additional vertical stacking. For 
example:

-  generates the TeX equivalent of: \hiero{1} \hiero{2} : this is the normal 
horizontal reading


-  generates the TeX-like equivalent of: \vstack{ \hiero{1} \hiero{2} } : 
this is the 
vertical stacking behavior, and needs a joiner-like character to preserve the 
unbreakable extended grapheme 
cluster.

But when both horizontal and vertical layout are used, the direction of 
stacking in complex groupings must be 
disambiguated, and would require two distinct characters. We could use ZWJ for 
grouping with horizontal layout 
(within a larger vertically stacked compound), and ZWNJ for grouping with 
vertical layout. So we would encode here 
for this second case.

Now if the structure is more complex, we'll need several levels of grouping, 
both for the horizontal and the 
vertical joiners. Adding a GROUPING LEVEL (acting exactly like a variation 
selector), encoded just after ZWJ or ZWNJ 
(using the special codepoint in plane 15, encoded as a combining character with 
combining class 0) would solve the 
representation problem.

For example (HIERO1-HIERO2:HIERO3)-HIERO4:HIERO5 (using the WikiHiero
notation), whose layout is similar to:

+--------+--------+--------+
| HIERO1 | HIERO2 |        |
+--------+--------+ HIERO4 |
|     HIERO3      |        |
+-----------------+--------+
|          HIERO5          |
+--------------------------+

could be encoded as:



And it will still match the definition of extended grapheme clusters, while 
also fully preserving the semantic 
composition and structure of the cluster :

* The absence of a grouping level selector means that the horizontal or
vertical joiners are acting at level 0.
* Sequences encoded at the same grouping level using ZWJ separators assume
the horizontal layout for hieroglyphs.
* Those encoded at the same grouping level with ZWNJ assume the vertical
layout.
* ZWJ (horizontal layout) has a higher grouping priority than ZWNJ if they
occur simultaneously at the same level.

If the grouping level selectors are not supported by the layout engine, it will 
just try to honor ZWJ and ZWNJ 
(ignoring the specified grouping levels) as if it was only encoded as:



which is the actual encoding (in WikiHiero syntax) of 
(HIERO1-HIERO2:HIERO3-HIERO4:HIERO5)

+--------+--------+
| HIERO1 | HIERO2 |
+--------+--------+
| HIERO3 | HIERO4 |
+--------+--------+
|     HIERO5      |
+-----------------+

And if the vertical stacking is not supported by the layout engine, it will
also ignore the ZWJ and ZWNJ, and so will render the five hieroglyphs linearly,
ignoring in fact only the vertical layers by drawing them in three
successive spans as:

re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread vanisaac

Guys, does nobody read the bloody Standard anymore!?

You CAN currently add a diacritic on top of a double diacritic. The other 
base character is called the Combining Grapheme Joiner (U+034F).

From V. 5.0, ch 7.9:

Occasionally one runs across orthographic conventions that use a dot, an acute 
accent, or other simple diacritic above a ligature tie - that is, U+0361 
Combining Double Inverted Breve. Because of the considerations of canonical 
ordering [...], one cannot represent such text simply by putting a combining 
dot above or combining acute directly after U+0361 in the text. Instead, the 
recommended way of representing such text is to place U+034F Combining Grapheme 
Joiner (CGJ) between the ligature tie and the combining mark that follows it, as

0075 + 0361 + 034F + 0301 + 0069 .

Because CGJ has a combining class of zero, it blocks reordering of the double 
diacritic to follow the second combining mark in canonical order. The sequence 
of CGJ, acute is then rendered with default stacking, placing it centered 
above the ligature tie. This convention can be used to create similar effects 
with combining marks above other double diacritics (or below double diacritics 
that render below base characters).
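
The blocking effect described in the quoted passage can be observed with a normalizer. A minimal sketch, assuming Python's standard `unicodedata` module: without CGJ the acute (class 230) is reordered before the ligature tie (class 234), while with CGJ the sequence is left untouched.

```python
import unicodedata

def show(s):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

without_cgj = "u\u0361\u0301i"     # acute directly after the ligature tie
with_cgj = "u\u0361\u034F\u0301i"  # CGJ between the tie and the acute

# Normalize both spellings to NFD and compare the resulting order of marks.
for s in (without_cgj, with_cgj):
    nfd = unicodedata.normalize("NFD", s)
    print(show(s), "-> NFD", show(nfd))
```

Whether a given font and renderer then center the acute above the tie is a separate rendering question that normalization does not answer.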



Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread Kent Karlsson


On 2010-07-24 10.07, Philippe Verdy verd...@wanadoo.fr wrote:

 Double diacritics have a combining property equal to zero, so they

No, they don't. The above ones have combining class 234 and the below
ones have combining class 233 (other characters with the word DOUBLE
in them are 'double' in some other way):

035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;N;
035F;COMBINING DOUBLE MACRON BELOW;Mn;233;NSM;N;
0362;COMBINING DOUBLE RIGHTWARDS ARROW BELOW;Mn;233;NSM;N;
1DFC;COMBINING DOUBLE INVERTED BREVE BELOW;Mn;233;NSM;N;

035D;COMBINING DOUBLE BREVE;Mn;234;NSM;N;
035E;COMBINING DOUBLE MACRON;Mn;234;NSM;N;
0360;COMBINING DOUBLE TILDE;Mn;234;NSM;N;
0361;COMBINING DOUBLE INVERTED BREVE;Mn;234;NSM;N;
1DCD;COMBINING DOUBLE CIRCUMFLEX ABOVE;Mn;234;NSM;N;

So everything you write based on your false premise is unjustified
(and most is false).
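
The values listed above can be reproduced from the Unicode Character Database. A minimal sketch, assuming Python's standard `unicodedata` module, printing the code point, name, general category and canonical combining class in a layout similar to UnicodeData.txt:

```python
import unicodedata

BELOW = (0x035C, 0x035F, 0x0362, 0x1DFC)          # expected class 233
ABOVE = (0x035D, 0x035E, 0x0360, 0x0361, 0x1DCD)  # expected class 234

# One semicolon-separated record per double diacritic.
for cp in BELOW + ABOVE:
    ch = chr(cp)
    print(
        f"{cp:04X};{unicodedata.name(ch)};"
        f"{unicodedata.category(ch)};{unicodedata.combining(ch)}"
    )
```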

 block the reordering for canonical equivalences and the relative order
 and independance for the encoding of base grapheme clusters will be
 preserved during normalizations.
...
...
...

/kent k

re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread verdy_p

 From: vanis...@boil.afraid.org
 Guys, does nobody read the bloody Standard anymore!?
 
 You CAN currently add a diacritic on top of a double diacritic. The other 
 base character is called the Combining Grapheme Joiner (U+034F).

Sorry, I had forgotten this one. Note that I was not sure about the character 
to use as the base for additional 
diacritics (so I indicated « U+xyzt »).

And I did not ask for encoding a new character, as I was nearly certain that 
such a solution existed using a base 
character with combining class 0.

OK, ZWJ was a bad guess, but *does* CGJ enter into the definition of « default
grapheme clusters », or at least into the definition of « extended grapheme
clusters »? I hope it does, and that software is ready to support it as
documented.

Very little software has been updated to support the still « recent » Unicode
5.0 specifically; there is a lot that still knows and implements only
Unicode 4.0. When rendering texts containing a CGJ, such software will try to
map it into fonts (where it will most often not be found, because old renderers
typically also use old fonts), so it will display a .notdef rectangle before
the diacritic displayed with a dotted circle (as if it was starting a
« defective sequence ») instead of being smarter and trying to place the
diacritic on top of the previously seen cluster, or at least on top of the last
character of the sequence containing the double diacritic...

I'm so used to seeing all the defects in software based on Unicode 3.2 or 4.0
that I often forget that there may exist newer solutions.

Note that even Windows 7 does not include this CGJ control in its IME for rich
text input controls, where it currently allows selecting and entering ZWJ and
ZWNJ, or the various selector controls for « national » digit shapes, or the
non-recommended BiDi embedding controls (or the really deprecated RS and US
ASCII controls that could be more easily typed directly on the standard
keyboard layout using the Ctrl key in terminal emulators that still need these
controls, but that are absolutely not needed in texts...).

So I'm not alone in having forgotten it: Microsoft also forgot it for the
standard text input controls in Windows 7, and browsers also completely forgot
to include such a selector facility for input elements.

Philippe.

Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread Philippe Verdy

Clark S. Cox III clarkc...@me.com
 How can *any* combining character have a combining class of zero? Isn't that 
 a contradiction in terms?

 The U+035D in your example, for instance, has a combining class of 234.

No contradiction. Not all combining characters have a non-zero
combining class. The combining class is not fully discriminant to
determine all combining characters.

Yes, U+035D has a non-zero combining class, but not all characters with
combining class 0 are base characters. For example, there exist valid
sequences starting with a character with combining class 0 that are
still defective, because that character is itself a combining character.
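
A few concrete cases illustrate the point that combining class 0 does not imply a base character. A minimal sketch, assuming Python's standard `unicodedata` module; the three code points are examples chosen here, not ones named in the message.

```python
import unicodedata

# Three combining marks (general categories Mn, Mc, Me) whose canonical
# combining class is nevertheless 0.
EXAMPLES = (0x034F, 0x0903, 0x20DD)

for cp in EXAMPLES:
    ch = chr(cp)
    print(
        f"U+{cp:04X} {unicodedata.name(ch)}: "
        f"category={unicodedata.category(ch)} ccc={unicodedata.combining(ch)}"
    )
```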

The non-zero combining classes that are assigned to *some* (not all)
combining characters are just a convenient technical tool used to
allow compatibility with various « legacy » encodings that require
the concept of canonical equivalences and of normalized forms.

If there had not existed such legacy encodings (that are officially
supported by Unicode and by ISO, with their standardized mappings to
the UCS), all the numeric combining classes, the concept of canonical
equivalences and the standard Unicode normalized forms C and D would
not even be needed at all and should probably have never existed,
because all characters including those combining with another base
characters, would be entered and encoded ONLY in their logical order
(as determined from the language-specific semantics).

Combining classes are even completely avoided (i.e. assigned with a 0
value) when new scripts get encoded directly in the UCS without any
previous encoding to support.

Re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread Philippe Verdy

Kent Karlsson kent.karlsso...@telia.com wrote:
 On 2010-07-24 10.07, Philippe Verdy verd...@wanadoo.fr wrote:

  Double diacritics have a combining property equal to zero, so they

 No, they don't. The above ones have combining class 234 and the below
 ones have combining class 233 (other characters with the word DOUBLE
 in them are 'double' in some other way):

 035C;COMBINING DOUBLE BREVE BELOW;Mn;233;NSM;N;
 ...

Aren't they using the maximum value of the combining class? If so,
you can still use double diacritics between two sequences containing a
base character and any simple diacritic, and be sure that the double
diacritic will be rendered above them, as it will remain in the last
position of the normalized form.

Anyway I also said that a character with combining class 0 was needed
to add other diacritics on top of double diacritics, after encoding
the two sequences joined with the double diacritic.

Why you assigned such a bogus non-zero combining class to double
diacritics is a mystery to me, as it was really not needed for
compatibility with legacy encodings.

These combining classes 233 and 234 have absolutely no interest except
that it complicated things for absolutely no benefit (including the
fact that now an additional character with combining class 0, such as
CGJ or other, is always needed to stack anything else on top of double
diacritics).

I did not realize that before (yes I should have looked in the UCD to
verify). And given their existing behavior, this has prevented other
simpler encodings of texts.

Also I have NEVER found any occurrence where the fact that they
have combining class 233/234 instead of 0 makes any difference,
because double diacritics were ALWAYS encoded between the two base
graphemes encoded separately, and the canonical order preserves this
encoding position in all cases between the two base graphemes encoded
completely.


Note that I'm not even sure that CGJ is the right choice for stacking
more diacritics on top of double diacritics, because it would mean
that the additional diacritic will need to be encoded just after the
double diacritic and CGJ, but before the second grapheme, and this
does not really match with double diacritics used between triplets of
graphemes: where the additional diacritics need to be placed, on the
first or on the second double diacritic ?

For me the logical ordering would require encoding first the base
graphemes, separated by the double diacritic, then encoding the
additional diacritics applicable to the whole previous group (and so
it requires adding a new virtual base to block the reordering).

(1) If using CGJ at end of the sequence containing the two bases and
the double diacritic, it will still attach logically and visually the
additional diacritics to the last base grapheme, and so they will
still stack on them, below the double macron for example, even if
their relative order is preserved.

It's needless (or logically wrong), in this order, to use CGJ instead
of ZWJ, in a sequence like:

  base-1, double-diacritic, base-2, CGJ, additional-diacritics

because in that position, CGJ has no other effect to block the
reordering of additional-diacritics as they are already blocked by
base-2, so it would be still interpreted as:

  base-1, double-diacritic, base-2, additional-diacritics

and so the additional diacritic will be linked to base-2, and the
double diacritic will cover the full group containing base-1 and
base-2, additional diacritics


(2) The only way is to encode the additional diacritics in the middle of
the group, linked by CGJ, in this order:

  base-1, double-diacritic, CGJ, additional diacritics..., base-2

and it will be impossible to have longer groups applying the double
diacritic to more than 2 bases. This encoding using CGJ clearly breaks
the logical assumption that the additional diacritic applying to a
group should be all encoded AFTER the full group has been encoded.

Here the additional diacritics need to be inserted at a specific
position in the middle of the sequence (and in practice, input
editors would have to scan back before base-2 through the
additional diacritics and CGJ just to find the double diacritic and
see that any further diacritics need to be inserted there...)

CGJ was not intended to apply to more than one character, but only as
a way to block some normalized reordering of combining characters
occurring after a single base character (which always has combining
class 0). In that position, it should only occur between two
combining characters with non-0 combining class, and only if the
second one has a lower combining class than the first one, and only
if this creates a semantic or visual difference in rendered documents
(for example because of the variable positions of the cedilla, which
the combining classes unify as if it were unique).


(3) Using ZWJ, this terminates the last base grapheme so you can
safely append other diacritics applying to the whole group
