Re: FW:transform a (UNICODE) accented character to its equivalent (UNICODE) non-accented character

2003-08-06 Thread John Cowan
Magda Danish (Unicode) scripsit:

  I'm looking for the easiest and most stable way to transform 
  a (UNICODE) accented character to its equivalent (UNICODE) 
  non-accented character. 

The following mapping table is an approximation to that.

00C0;0041
00C1;0041
00C2;0041
00C3;0041
00C4;0041
00C5;0041
00C7;0043
00C8;0045
00C9;0045
00CA;0045
00CB;0045
00CC;0049
00CD;0049
00CE;0049
00CF;0049
00D1;004E
00D2;004F
00D3;004F
00D4;004F
00D5;004F
00D6;004F
00D9;0055
00DA;0055
00DB;0055
00DC;0055
00DD;0059
00E0;0061
00E1;0061
00E2;0061
00E3;0061
00E4;0061
00E5;0061
00E7;0063
00E8;0065
00E9;0065
00EA;0065
00EB;0065
00EC;0069
00ED;0069
00EE;0069
00EF;0069
00F1;006E
00F2;006F
00F3;006F
00F4;006F
00F5;006F
00F6;006F
00F9;0075
00FA;0075
00FB;0075
00FC;0075
00FD;0079
00FF;0079
0100;0041
0101;0061
0102;0041
0103;0061
0104;0041
0105;0061
0106;0043
0107;0063
0108;0043
0109;0063
010A;0043
010B;0063
010C;0043
010D;0063
010E;0044
010F;0064
0112;0045
0113;0065
0114;0045
0115;0065
0116;0045
0117;0065
0118;0045
0119;0065
011A;0045
011B;0065
011C;0047
011D;0067
011E;0047
011F;0067
0120;0047
0121;0067
0122;0047
0123;0067
0124;0048
0125;0068
0128;0049
0129;0069
012A;0049
012B;0069
012C;0049
012D;0069
012E;0049
012F;0069
0130;0049
0134;004A
0135;006A
0136;004B
0137;006B
0139;004C
013A;006C
013B;004C
013C;006C
013D;004C
013E;006C
0143;004E
0144;006E
0145;004E
0146;006E
0147;004E
0148;006E
014C;004F
014D;006F
014E;004F
014F;006F
0150;004F
0151;006F
0154;0052
0155;0072
0156;0052
0157;0072
0158;0052
0159;0072
015A;0053
015B;0073
015C;0053
015D;0073
015E;0053
015F;0073
0160;0053
0161;0073
0162;0054
0163;0074
0164;0054
0165;0074
0168;0055
0169;0075
016A;0055
016B;0075
016C;0055
016D;0075
016E;0055
016F;0075
0170;0055
0171;0075
0172;0055
0173;0075
0174;0057
0175;0077
0176;0059
0177;0079
0178;0059
0179;005A
017A;007A
017B;005A
017C;007A
017D;005A
017E;007A
01A0;004F
01A1;006F
01AF;0055
01B0;0075
01CD;0041
01CE;0061
01CF;0049
01D0;0069
01D1;004F
01D2;006F
01D3;0055
01D4;0075
01D5;0055
01D6;0075
01D7;0055
01D8;0075
01D9;0055
01DA;0075
01DB;0055
01DC;0075
01DE;0041
01DF;0061
01E0;0041
01E1;0061
01E2;00C6
01E3;00E6
01E6;0047
01E7;0067
01E8;004B
01E9;006B
01EA;004F
01EB;006F
01EC;004F
01ED;006F
01EE;01B7
01EF;0292
01F0;006A
01F4;0047
01F5;0067
01F8;004E
01F9;006E
01FA;0041
01FB;0061
01FC;00C6
01FD;00E6
01FE;00D8
01FF;00F8
0200;0041
0201;0061
0202;0041
0203;0061
0204;0045
0205;0065
0206;0045
0207;0065
0208;0049
0209;0069
020A;0049
020B;0069
020C;004F
020D;006F
020E;004F
020F;006F
0210;0052
0211;0072
0212;0052
0213;0072
0214;0055
0215;0075
0216;0055
0217;0075
0218;0053
0219;0073
021A;0054
021B;0074
021E;0048
021F;0068
0226;0041
0227;0061
0228;0045
0229;0065
022A;004F
022B;006F
022C;004F
022D;006F
022E;004F
022F;006F
0230;004F
0231;006F
0232;0059
0233;0079
0385;00A8
0386;0391
0388;0395
0389;0397
038A;0399
038C;039F
038E;03A5
038F;03A9
0390;03B9
03AA;0399
03AB;03A5
03AC;03B1
03AD;03B5
03AE;03B7
03AF;03B9
03B0;03C5
03CA;03B9
03CB;03C5
03CC;03BF
03CD;03C5
03CE;03C9
03D3;03D2
03D4;03D2
0400;0415
0401;0415
0403;0413
0407;0406
040C;041A
040D;0418
040E;0423
0419;0418
0439;0438
0450;0435
0451;0435
0453;0433
0457;0456
045C;043A
045D;0438
045E;0443
0476;0474
0477;0475
04C1;0416
04C2;0436
04D0;0410
04D1;0430
04D2;0410
04D3;0430
04D6;0415
04D7;0435
04DA;04D8
04DB;04D9
04DC;0416
04DD;0436
04DE;0417
04DF;0437
04E2;0418
04E3;0438
04E4;0418
04E5;0438
04E6;041E
04E7;043E
04EA;04E8
04EB;04E9
04EC;042D
04ED;044D
04EE;0423
04EF;0443
04F0;0423
04F1;0443
04F2;0423
04F3;0443
04F4;0427
04F5;0447
04F8;042B
04F9;044B
0622;0627
0623;0627
0624;0648
0625;0627
0626;064A
06C0;06D5
06C2;06C1
06D3;06D2
0929;0928
0931;0930
0934;0933
0958;0915
0959;0916
095A;0917
095B;091C
095C;0921
095D;0922
095E;092B
095F;092F
09CB;09C7
09CC;09C7
09DC;09A1
09DD;09A2
09DF;09AF
0A33;0A32
0A36;0A38
0A59;0A16
0A5A;0A17
0A5B;0A1C
0A5E;0A2B
0B48;0B47
0B4B;0B47
0B4C;0B47
0B5C;0B21
0B5D;0B22
0B94;0B92
0BCA;0BC6
0BCB;0BC7
0BCC;0BC6
0C48;0C46
0CC0;0CBF
0CC7;0CC6
0CC8;0CC6
0CCA;0CC6
0CCB;0CC6
0D4A;0D46
0D4B;0D47
0D4C;0D46
0DDA;0DD9
0DDC;0DD9
0DDD;0DD9
0DDE;0DD9
0F43;0F42
0F4D;0F4C
0F52;0F51
0F57;0F56
0F5C;0F5B
0F69;0F40
0F73;0F71
0F75;0F71
0F76;0FB2
0F78;0FB3
0F81;0F71
0F93;0F92
0F9D;0F9C
0FA2;0FA1
0FA7;0FA6
0FAC;0FAB
0FB9;0F90
1026;1025
1E00;0041
1E01;0061
1E02;0042
1E03;0062
1E04;0042
1E05;0062
1E06;0042
1E07;0062
1E08;0043
1E09;0063
1E0A;0044
1E0B;0064
1E0C;0044
1E0D;0064
1E0E;0044
1E0F;0064
1E10;0044
1E11;0064
1E12;0044
1E13;0064
1E14;0045
1E15;0065
1E16;0045
1E17;0065
1E18;0045
1E19;0065
1E1A;0045
1E1B;0065
1E1C;0045
1E1D;0065
1E1E;0046
1E1F;0066
1E20;0047
1E21;0067
1E22;0048
1E23;0068
1E24;0048
1E25;0068
1E26;0048
1E27;0068
1E28;0048
1E29;0068
1E2A;0048
1E2B;0068
1E2C;0049
1E2D;0069
1E2E;0049
1E2F;0069
1E30;004B
1E31;006B
1E32;004B
1E33;006B
1E34;004B
1E35;006B
1E36;004C
1E37;006C
1E38;004C
1E39;006C
1E3A;004C
1E3B;006C
1E3C;004C
1E3D;006C
1E3E;004D
1E3F;006D
1E40;004D
1E41;006D
1E42;004D
1E43;006D
1E44;004E
1E45;006E
1E46;004E
1E47;006E
1E48;004E
1E49;006E
1E4A;004E
1E4B;006E
1E4C;004F
1E4D;006F
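
The entries above appear to pair each precomposed character with the first code point of its full canonical decomposition. A minimal Python sketch of that idea follows, assuming the standard `unicodedata` module. The function names `base_of`, `table_line` and `strip_marks` are illustrative and not part of the original message, and the results depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

def base_of(ch: str) -> str:
    """Return the first code point of the canonical decomposition (NFD) of ch."""
    return unicodedata.normalize("NFD", ch)[0]

def table_line(ch: str) -> str:
    """Format one mapping entry in the same SOURCE;TARGET style as the table."""
    return f"{ord(ch):04X};{ord(base_of(ch)):04X}"

def strip_marks(text: str) -> str:
    """Decompose text and drop nonspacing marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
```

Like the table, this is only an approximation: letters such as U+00F8 have no decomposition, so they pass through unchanged.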

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Peter Kirk
On 04/08/2003 17:36, Kenneth Whistler wrote:

Peter Kirk asked:

 

A similar issue which is not Hebrew related would be a (mythical) 
requirement to display a diacritic like 0315, 031B or 0322 in isolation. 
It would not always be appropriate to use a space or NBSP as a base 
character as this would indent the glyph from the beginning of a line in 
a way which might not be wanted. What would be the recommended encoding 
if one wanted to display one of these characters with no leading white 
space?
   

If you just want to display a nonspacing mark in isolation, then
you apply it to a SPACE (or NO-BREAK SPACE) and typically let the
metrics of the font then handle how the mark is going to appear
floating in space as it were.
If you want to display some character like U+0315 COMBINING COMMA
ABOVE RIGHT *and* you want to do it in isolation *and* you want
it to occur at the beginning of a line *and* you want there to
be no display width between the margin and the left edge of the
display bits of the glyph, then you have stepped over the boundaries
of what is reasonable to expect plain text to convey. Feel free
to make use of the higher-level capabilities of your word
processor or page layout program to individually adjust the
positioning of particular glyphs displayed in particular fonts.
 

Thank you. Understood.

More generally, however, when the issue of the relative
position of a non-spacing mark with respect to its base
glyph is what is in question, the standard recommends
(and uses) the convention of displaying the non-spacing
mark on a dotted circle as a base. This makes it clear that
we are talking about the non-spacing mark itself, but also
makes clear the positional differences between left, centered,
and right forms, for example.
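
For concreteness, the three carriers mentioned in this exchange (SPACE, NO-BREAK SPACE and the dotted circle) can be written out as code point sequences. This is only a sketch in Python: the helper name `isolated` is made up here, and how each sequence is actually drawn depends entirely on the font and the renderer.

```python
import unicodedata

SPACE = "\u0020"
NBSP = "\u00A0"
DOTTED_CIRCLE = "\u25CC"
COMMA_ABOVE_RIGHT = "\u0315"  # one of the marks named in the question

def isolated(mark: str, base: str = SPACE) -> str:
    """Return base + mark, refusing anything that is not a combining mark."""
    if not unicodedata.category(mark).startswith("M"):
        raise ValueError("expected a combining mark")
    return base + mark

def describe(sequence: str) -> list:
    """List the character names in a sequence, for inspection."""
    return [unicodedata.name(ch) for ch in sequence]
```

Calling `describe(isolated(COMMA_ABOVE_RIGHT, DOTTED_CIRCLE))` lists the two names, which makes it easy to check what has actually been encoded. The sketch does not settle whether the dotted circle should be encoded explicitly or left to the font.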
 

If I want to do this, should I explicitly encode a dotted circle, or 
should I encode nothing and expect the font to generate the dotted 
circle, as it often does?

--Ken

 



--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Philippe Verdy
On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote:

 on 2003-08-05 15:31 Peter Kirk wrote:
  Thank you, Mark. This helps to clarify things, but still doesn't
  explicitly answer my question of how to encode a sentence like "In
  this language the diacritic ^ may appear above the letters ...",
  but instead of "^" I want to use a combining character and want to
  display exactly one space before the combining character - do I
  encode two spaces or one? 
 
 In this language the diacritic   may appear above the letters...
 
 Two spaces, at least in Thunderbird Mail.

The NFD decomposition of spacing marks is already defined as a SPACE
plus a non-spacing combining character. This officially documents the
usage of SPACE as a base character, and its use in combining sequences.
In the context of XML processing, where strings should (must?) be
presented in NFC form, this extra SPACE will be invisible, hidden within the
precomposed sequence, so this space does not have the line-breaking
property.

Breaking properties apply only to combining sequences, not to isolated
encoded characters. It's illegal to break in the middle of a combining
sequence. So as soon as a SPACE is followed by a combining character,
it loses its breaking properties, as those properties are only defined for
the combining sequence containing only a SPACE. So I don't think there's
any ambiguity: parsers and renderers must correctly identify combining
sequences before applying any algorithm.

This means that an algorithm like normalization of whitespace sequences
in XML or HTML should not include SPACEs that are used as base
characters in a combining sequence, and so it should keep two spaces
if the intent is to encode a logical space followed by a logical spacing
diacritic. (This is not a problem for XML which processes strings in their
NFC form).
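
As a rough illustration of the rule proposed here, the following Python sketch collapses runs of SPACE but leaves alone any SPACE that is immediately followed by a combining mark. The function name `collapse_spaces` and the test "general category starts with M" are assumptions made for this example. It is not how any particular XML or HTML processor behaves.

```python
import unicodedata

def collapse_spaces(text: str) -> str:
    """Collapse runs of plain SPACE, keeping any SPACE that carries a combining mark."""
    out = []
    prev_plain_space = False
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        # A SPACE followed by a mark is treated as the base of a combining sequence.
        carries_mark = bool(nxt) and unicodedata.category(nxt).startswith("M")
        if ch == " " and not carries_mark:
            if prev_plain_space:
                continue  # drop the extra plain space
            prev_plain_space = True
        else:
            prev_plain_space = False
        out.append(ch)
    return "".join(out)
```

With this sketch, a word space followed by a SPACE carrying U+0302 survives as two spaces, while an ordinary run of spaces is reduced to one.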

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.




Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Peter Kirk
On 06/08/2003 03:54, Philippe Verdy wrote:

On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote:

 

on 2003-08-05 15:31 Peter Kirk wrote:
   

Thank you, Mark. This helps to clarify things, but still doesn't
explicitly answer my question of how to encode a sentence like "In
this language the diacritic ^ may appear above the letters ...",
but instead of "^" I want to use a combining character and want to
display exactly one space before the combining character - do I
encode two spaces or one? 
 

In this language the diacritic   may appear above the letters...

Two spaces, at least in Thunderbird Mail.
   

The NFD decomposition of spacing marks is already defined as a SPACE
plus a non-spacing combining character. ...

Really? It looks to me as if U+00B4 and U+02D8 to U+02DD have only
compatibility equivalences to space plus diacritic, and U+005E and
U+0060 don't even have compatibility equivalences.
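
This can be checked against the character database. The sketch below uses Python's standard `unicodedata` module; the helper name `decomposition_kind` is invented for the example, and the data reflects whatever Unicode version ships with the interpreter.

```python
import unicodedata

def decomposition_kind(ch: str) -> str:
    """Classify the decomposition mapping of ch as canonical, compatibility or none."""
    raw = unicodedata.decomposition(ch)
    if not raw:
        return "none"
    # Compatibility mappings carry a tag such as <compat> in front of the code points.
    return "compatibility" if raw.startswith("<") else "canonical"

def report(chars: str) -> dict:
    """Map each character (as U+XXXX) to its decomposition kind."""
    return {f"U+{ord(c):04X}": decomposition_kind(c) for c in chars}
```

Because the mappings for U+00B4 and U+02D8 to U+02DD are compatibility mappings, NFD leaves those characters untouched; only NFKD (or NFKC) turns them into space plus diacritic.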

... 
This means that an algorithm like normalization of whitespace sequences
in XML or HTML should not include SPACEs that are used as base
characters in a combining sequence, and so it should keep two spaces
if the intent is to encode a logical space followed by a logical spacing
diacritic. (This is not a problem for XML which processes strings in their
NFC form).

 

It is, because there are very many combining marks which do not have 
spacing equivalents (even for compatibility), and so with these the NFC 
form will certainly be space plus diacritic.
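
A small check of this point, again as a Python sketch with an invented helper name (`nfc_keeps_space`): for the marks mentioned earlier in the thread, NFC leaves the space-plus-mark sequence exactly as it was encoded.

```python
import unicodedata

def nfc_keeps_space(mark: str) -> bool:
    """True if NFC leaves SPACE + mark unchanged, i.e. no precomposed form absorbs the space."""
    sequence = " " + mark
    return unicodedata.normalize("NFC", sequence) == sequence

# Marks named in this thread, plus COMBINING ACUTE ACCENT for comparison.
SAMPLE_MARKS = ["\u0315", "\u031B", "\u0322", "\u0301"]
```

Note that U+0301 is included on purpose: U+00B4 is only compatibility-equivalent to space plus U+0301, so NFC does not compose the pair.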

--
Peter Kirk
[EMAIL PROTECTED]
http://web.onetel.net.uk/~peterkirk/




RE: Does Unicode 3.1 take care of all characters of 'Hong Kong Supplementary Character Set - 2001' (HKSCS-2001) ?

2003-08-06 Thread John McConnell
Sourav,

However, I could not map the block you mentioned to the block names
provided in Unicode site (http://www.unicode.org/charts/). I tried to
map them based on the similarity of names and specified the actual block
down below. Could you please once verify it?

The block names are the ones used by the HKSCS web site. Specifically

http://www.info.gov.hk/digital21/eng/hkscs/document.html


Section 3 page 2 describes the mapping in detail with the ranges.


John
GIFT





Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Kenneth Whistler
Peter Kirk said:

  From what Ken says, it sounds like it will be wrong from whenever 
 Unicode 4.0 is officially issued 

Actually Unicode 4.0 was officially issued on April 17, 2003.

What we are waiting on now is for the publication of the text
of the book to catch up to that fact. ;-)

 because this paragraph  has been 
 excised from that standard. But until then it seems to be correct, SPACE 
 is indeed considered a format character.

Nope. It is incorrect to try to mix and match between versions
of the standard.

In Unicode 3.0 this was an ambiguity in the meaning and usage
of the term format character, and for Unicode 3.0, we can
all see how people who ran into section 4.5 of the standard
could be a little confused about the status of SPACE.

The actual intent of that offending paragraph was to attempt to
explain the somewhat procrustean nature of the General Category
classes, which may not do justice to the complicated behavior
of some of the characters in Unicode, rather than to explain the
status of SPACE in particular. 

 I was misled by Jim's 
 reference to the URL of the final draft (as clearly stamped on the first 
 page) of 4.0, but since in fact he was quoting from 3.0 what he says can 
 hardly be considered obsolete yet.

Actually it can. And that would have been obvious to everyone if
a preview version of Chapter 4 had also been posted.

Once again, I appeal to people to stop trying to second-guess
the text of the standard. The final pdf for the online version
is in preparation even as I write this. The final final
proofs for the book itself have already been produced by
the printer -- all they need to do now is turn on the press
and start the binder.

If everyone would just go off for a week or two on their
August vacation, like they should be, we could all come back
about Labor Day and we wouldn't have to be having these
discussions. ;-)

--Ken




RE: Questions on ZWNBS - for line initial holam plus alef

2003-08-06 Thread Kenneth Whistler
Kent Karlsson responded:

   I see no particular *technical* problem with using WJ, though.  In
   contrast to the suggestion of using CGJ (re. another problem)
   anywhere else but at the end of a combining sequence. CGJ has
   combining class 0, despite being invisible and not (visually)
   interfering with any other combining mark. Using CGJ at a non-final
   position in a combining sequence puts in doubt the entire idea with
   combining classes and normal forms.
  
  Why? 
 
 See above (I DID write the motivation!). 

I guess that I did not (and still do not) see the motivation for
your final statement.

 Combining classes are generally
 assigned according to typographic placement. Combining characters
 (except those that are really letters) that have the same placement,
 and interfere typographically are assigned the same combining class,
 while those that don't get different classes, and the relative order is
 then considered unimportant (canonically equivalent). How is then,
 e.g. a, ring above, cgj, dot below supposed to be different from
 a, dot below, cgj, ring above (supposing all involved characters
 are fully supported), when a, ring above, dot below is NOT
 supposed to be much different from a, dot below, ring above
 (them being canonically equivalent)? An invisible combining character
 does not interfere typographically with anything, it being invisible!

The same thing can be said about any inserted invisible character,
combining or not.

How is: a, ring above, null, dot below supposed to be different from
a, dot below, null, ring above

How is: a, ring above, LRM, dot below supposed to be different from
a, dot below, LRM, ring above

In display, they might not be distinct, unless you were doing some kind of
show-hidden display. Yet these sequences are not canonically
equivalent, and the presence of an embedded control character or an
embedded format control character would block canonical reordering.
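
The blocking effect is easy to observe. The following Python sketch compares the two orderings with and without an intervening character of combining class 0; the helper names `canonically_equivalent` and `pair` are made up for the illustration.

```python
import unicodedata

RING_ABOVE = "\u030A"  # canonical combining class 230
DOT_BELOW = "\u0323"   # canonical combining class 220

def canonically_equivalent(s: str, t: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

def pair(separator: str = "") -> tuple:
    """Return both orderings of a + ring above + dot below around the separator."""
    return ("a" + RING_ABOVE + separator + DOT_BELOW,
            "a" + DOT_BELOW + separator + RING_ABOVE)
```

With an empty separator the two strings normalize to the same thing. With NULL (`"\x00"`) or LRM (`"\u200E"`) between the marks they do not, because canonical reordering never moves a mark across a character of class 0.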

Of course, they *might* be distinct in rendering, depending on
what assumptions the renderer makes about default ignorable
characters and their interaction with combining character sequences.
But you cannot depend on them being distinct in display -- the
standard doesn't mandate the particulars here.

Whether you think it is *reasonable* or not that there should be
non-canonically equivalent ways of representing the same
visual display, sequences such as those above, including sequences
with CGJ, are possible and allowed by the standard. They are:

   a. well-formed sequences, conformantly interpretable
   b. could be displayed by reasonable renderers, making reasonable
  assumptions, as visually identical
  
I have been pointing out use of the CGJ, which *exists* as an encoded
character, and which has a particular set of properties defined,
would result in the kinds of non-canonically equivalent ordering
distinctions required in Hebrew, if inserted into vowel sequences.
Those are facts about the current standard, as currently
defined. And unless you or someone else convinces the UTC to
establish cooccurrence constraints on CGJ or to change its
properties, they will continue to be current facts about the
standard. 
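
To make the Hebrew case concrete, here is a sketch in Python. The particular letters chosen (lamed with patah followed by hiriq) are an illustrative assumption and are not taken from this message; the helper name `survives_normalization` is likewise invented.

```python
import unicodedata

CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0
LAMED = "\u05DC"
PATAH = "\u05B7"  # combining class 17
HIRIQ = "\u05B4"  # combining class 14

def survives_normalization(text: str) -> bool:
    """True if both NFC and NFD leave the text exactly as encoded."""
    return all(unicodedata.normalize(form, text) == text for form in ("NFC", "NFD"))

plain = LAMED + PATAH + HIRIQ        # vowel order is not preserved
joined = LAMED + PATAH + CGJ + HIRIQ  # CGJ keeps patah before hiriq
```

Without CGJ, normalization moves hiriq in front of patah because of their fixed combining classes; with CGJ between them the encoded order is kept.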

 The other invisible (per se!) combining characters with combining
 class 0, the variation selectors, are ok, since their *conforming* use
 is very highly constrained. Maybe I've been wrong, but I have taken
 CGJ as similarly constrained as it was given a semantics only when
 followed by a base character (but now it seems to have no semantics
 at all).

There was no such constraint defined for CGJ. The current statement
about CGJ is merely that it should be ignored in language-sensitive
sorting and searching unless it specifically occurs within
a tailored collation element mapping. There is no constraint
on what particular sequences involving CGJ could be tailored
that way, and hence no constraint on what particular sequences
CGJ might occur in, in Unicode plain text.

  A combining character sequence is a base character followed
  by any number of combining characters. There is no constraint
  in that definition that the combining characters have to
  have non-zero combining class.
 
 Well, you cannot *conformantly* place a VS anywhere in a combining
 sequence! Only certain combinations of base+vs are allowed in
 any given version of Unicode. (Breaking that does not make the
 combining sequence ill-formed, or illegal, but would make it
 non-conformant, just like using an unassigned code point.)

Actually, it is not non-conformant like using an unassigned
code point would be. The latter is directly subject to conformance
clause C6:

C6 A process shall not interpret an unassigned code point as an
   abstract character.
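
One way a process might detect the situation C6 talks about is to look at the general category. This is only a sketch in Python with an invented helper name, `is_unassigned`. It also reports noncharacters, which share category Cn, and the answer depends on the Unicode version the interpreter was built with (`unicodedata.unidata_version`).

```python
import unicodedata

def is_unassigned(ch: str) -> bool:
    """True if the bundled Unicode data gives ch the general category Cn."""
    return unicodedata.category(ch) == "Cn"

def unassigned_positions(text: str) -> list:
    """Return the indices of code points a process should not interpret."""
    return [i for i, ch in enumerate(text) if is_unassigned(ch)]
```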
   
The case for variation sequences is subtly different. Suppose
I encounter a variation sequence X, VS1, where X could be
any Unicode character. X itself is conformantly interpretable.
VS1 itself is conformantly 

Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)

2003-08-06 Thread Doug Ewell
Peter Kirk peter dot r dot kirk at ntlworld dot com wrote:

 Or it may not.  It may be a deficiency in the level of Unicode
 support afforded by the fonts and rendering engines. ...

 If there are such deficiencies in fonts and rendering engines which
 purport to be Unicode compliant, that suggests a lack of clarity in
 the standard which should be rectified.

I wish I had a dollar for every Unicode-compliant font, rendering
engine, or other software that was in some way less compliant than
advertised.  Only a fraction of the non-compliances are traceable to
ambiguities or deficiencies in the Unicode Standard.

 ... It may simply reflect a difference between your requirements
 and what the standard promises, and doesn't promise.

 If Unicode doesn't promise what I require, surely it is at least
 reasonable for me to ask on this list whether it ought to be extended
 or clarified to do so. The UTC may choose not to make any changes, but
 I don't see why they shouldn't even be asked to.

Absolutely, you are allowed to ask.  Go ahead.  I wasn't trying to
prevent questions from being asked, only trying to state why I think the
problem is out of scope for Unicode.

 The standard doesn't say anything about width in this case.  It
 leaves it up to the display engine, which is as it should be.

 The standard does say, section 2.10 of 4.0, that "In rendering, the
 combination of a base character and a nonspacing character may have a
 different advance width than the base character itself."

I apologize for missing this reference.

 And any intelligent typographer will realise that this "may" is a
 "must", with regular character designs but not of course in monospace,
 in some cases like the example given of i with circumflex. This
 sentence applies to spaces with diacritics as space is a base
 character, as we have been informed. The subsection of 2.10 entitled
 "Spacing Clones of European Diacritical Marks" (by the way, why
 "European" when the text appears to apply to all diacritical marks?)
 should suggest to any intelligent typographer that the sequence space,
 diacritic is intended to be spaced as the diacritic and not as a
 space, but it would help for this to be clarified as not all
 typographers are very intelligent and some may not be aware that this
 space has actually lost most of the properties of a space e.g. line
 breaking and is being used only "By convention".

Like Freud's cigar, sometimes a "may" is just a "may".  And I suspect
the phrase "any intelligent typographer" MAY generate some flak from
typographers on this list who consider themselves intelligent enough
yet have a different opinion.

I'm not a typographer (intelligent or otherwise), but I'm having a tough
time seeing how Section 2.10 *requires* fonts and rendering engines to
give a space-plus-combining-diacritic combination the exact minimum
width of the diacritic alone, or to leave equal space before and after
such a combination.  All I think it is saying is that, for example, the
combination i-plus-tilde may be wider than i alone, because tilde is
wider than i.

 When the specific alignment of isolated glyphs is important to me, I
 use markup.  I'm a big supporter of plain text, as many members of
 this list know, but the exact spacing of isolated combining marks
 seems like a layout issue to me.

 OK, what kind of markup should I use, in any well-known markup
 language, to ensure that an isolated diacritic is centred in the space
 between the words before and after it?

All right, you've got me there.  I'll have to think about it.  But I
still think this is a layout problem, a problem having to do with glyphs
and not characters.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Fw: Questions on ZWNBS - for line initial holam plus alef

2003-08-06 Thread Philippe Verdy
On Thursday, August 07, 2003 1:13 AM, Kenneth Whistler
[EMAIL PROTECTED] wrote: 

 Well, yes, which is why I have been advocating it as the
 solution to the Biblical Hebrew text representation problem.
 I agree with you about that. But it need not be characterized
 as legal in opposition to the other examples I cited above.
 All of these sequences are legal and allowed by the
 standard.

Once again, sorry if I used the terms ill-formed or well-formed
instead of defective or non-defective (normal?). Such a distinction
in the standard does not help understanding when discussing
interoperability of text processing, where neither ill-formed
nor defective sequences should be used if interoperability is the
main focus (and it is also normally the design focus of Unicode).

The canonical equivalences (NFC, NFD, canonical ordering) are
needed now for XML processing, and in fact they greatly reduce
the number of ill-formed, invalid, or defective sequences, or
whatever bad encodings of actual text, which simplifies processing.
Still, these equivalences don't solve all the issues and create their
own (and this is now a good reason to use CGJ to override the
canonical ordering of combining diacritics).

Of course there may be a lot of strings created with Unicode
which are not ill-formed and not canonically equivalent (per
NFC, NFD, canonical ordering), but I won't go into that area.
For XML, what is relevant is that it processes strings in NFC
form and thus implies only canonical equivalences, but XML
will still process defective sequences by correctly
processing characters according to their canonical combining sequences.

I'd like to see a more formal rule for defective uses of CGJ used
to fix canonical ordering. What I suggested was to specify that
only some sequences with CGJ would be non defective, if
the CGJ appears before a base character or between two
combining characters. The character model needs then to be
refined to be more precise to document which uses are
considered non defective, and which ones are not.

So a sequence ..., ring above, CGJ, cedilla, ... would
not be defective as it fixes the canonical ordering, even if
in this case it does not interact graphically (note that this
statement supposes that the cedilla effectively appears
below, something which is wrong with some languages,
where the cedilla appears in fact like an acute accent
above right...).

The example of the effective rendering of diacritics at the
presupposed placement indicated by their combining class
is significant: it shows that combining classes just handle
some common placement rules, but not every case, and
a particular language or renderer may need to place
diacritics on other positions, in which case the canonical
ordering would have an impact on the renderer. That's a
good enough reason to justify and document the use of
CGJ as a combining class override for diacritics, whose
usage should be restricted for interoperability.

This has a consequence for input methods and editors:
users can type base characters and diacritics, and the
editor will, by default, use a canonical ordering, that the user
may fix if needed for a particular language with a control
command that would swap two misplaced diacritics by
automatically inserting a CGJ only if needed because both
diacritics have distinct combining classes: this editor control
command would have no other effect if executed after two
diacritics with identical combining classes, or after a single diacritic,
and the editor should make its best effort not to let the user
enter ill-formed or defective sequences.
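
One possible reading of that editor command, sketched in Python. The function name `join_marks` is hypothetical, and the rule it applies (insert CGJ only when canonical ordering would otherwise swap the two marks) is an interpretation of the paragraph above, not an established algorithm.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def join_marks(first: str, second: str) -> str:
    """Return the two marks in the requested order, adding CGJ only if needed.

    Canonical reordering swaps adjacent marks when the first has a higher
    non-zero combining class than the second; only then is CGJ inserted.
    """
    a = unicodedata.combining(first)
    b = unicodedata.combining(second)
    if a > b > 0:
        return first + CGJ + second
    return first + second
```

For example, requesting ring above (class 230) before dot below (class 220) yields the marks with CGJ between them, while the opposite order, or two marks of the same class, comes back unchanged.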

-- 
Philippe.
Spam not tolerated: any unsolicited message will be
reported to your Internet service providers.




Re: Conflicting principles

2003-08-06 Thread Michael Everson
At 16:16 -0400 2003-08-06, John Cowan wrote:

I would like to ask the old farts^W^Wrespected elders of the UTC
which principle they consider more important, abstractly speaking:
the principle that combining marks always follow their base characters
(a typographical principle), or that text is stored, with a few minor
exceptions, in phonetic order (a lexicographical principle).

Are you thinking of the Tengwar?
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: Questions on ZWNBS - for line initial holam plus alef

2003-08-06 Thread Kenneth Whistler
Philippe Verdy said:

  The same thing can be said about any inserted invisible character,
  combining or not.
  
  How is: a, ring above, null, dot below supposed to be different from
  a, dot below, null, ring above
  
  How is: a, ring above, LRM, dot below supposed to be different from
  a, dot below, LRM, ring above
  
  In display, they might not be distinct, unless you were doing some
  kind of show-hidden display. Yet these sequences are not canonically
  equivalent, and the presence of an embedded control character or an
  embedded format control character would block canonical reordering.
 
 
 I disagree with you, using a LRM mark in the middle of a combining
 sequence is conforming to canonicalization rules but is clearly
 ill-formed, 

It is not. TUS 4.0, p. 71:

D17a Defective combining character sequence: A combining character
 sequence that does not start with a base character.
 
 * Defective combining character sequences occur when a sequence
   of combining characters appears at the start of a string or
   follows a control or format character. Such sequences are
   defective from the point of view of handling of combining
   marks, but are not ill-formed.
                  ^^^^^^^^^^^^^^
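
Read literally, D17a and its note can be turned into a simple scan. The Python sketch below is a simplification under stated assumptions: "combining character" is taken to mean general category M*, "control or format character" is taken to mean categories Cc and Cf, and the name `defective_positions` is invented for the example.

```python
import unicodedata

def defective_positions(text: str) -> list:
    """Return indices where a defective combining character sequence starts.

    A mark counts as defective here if it opens the string or directly
    follows a control (Cc) or format (Cf) character.
    """
    positions = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("M"):
            continue
        if i == 0 or unicodedata.category(text[i - 1]) in ("Cc", "Cf"):
            positions.append(i)
    return positions
```

Under these assumptions a mark after NULL or LRM is flagged, whereas a mark after CGJ is not, since CGJ is itself a combining mark and continues the sequence.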

 as well as using a NULL control in the middle, which
 breaks the combining sequence.

I'm not claiming it doesn't break the combining sequence. Of
course it does. It creates a defective combining character
sequence, and that poses a challenge for rendering, since it
departs from the usual expectations for normal combining
character sequences. The renderer has to split hairs between
the fact that it is dealing with a defective combining
character sequence and the fact that it is dealing with a
default ignorable character which is supposed to be ignored
for text processes it is not immediately applicable to.

But I challenge you to find anything in the standard that
*prohibits* such sequences from occurring.

And *if* they occur, they are not canonically equivalent, which
was the point I was making to Kent.

 The proposal to use CGJ however is legal: it does not break the
 combining sequences and grapheme clusters, and thus the whole
 encoded sequence encoded with CGJ will be considered by
 rendering engines, where CGJ is a no-op for rendering but not for
 the canonical ordering ...

Well, yes, which is why I have been advocating it as the
solution to the Biblical Hebrew text representation problem.
I agree with you about that. But it need not be characterized
as legal in opposition to the other examples I cited above.
All of these sequences are legal and allowed by the
standard.

--Ken