Re: FW:transform a (UNICODE) accented character to its equivalent (UNICODE) non-accented character
Magda Danish (Unicode) scripsit: I'm looking for the easiest and more stable way to transform an (UNICODE) accented character to its equivalent (UNICODE) non-accented character. The following mapping table is an approximation to that. 00C0;0041 00C1;0041 00C2;0041 00C3;0041 00C4;0041 00C5;0041 00C7;0043 00C8;0045 00C9;0045 00CA;0045 00CB;0045 00CC;0049 00CD;0049 00CE;0049 00CF;0049 00D1;004E 00D2;004F 00D3;004F 00D4;004F 00D5;004F 00D6;004F 00D9;0055 00DA;0055 00DB;0055 00DC;0055 00DD;0059 00E0;0061 00E1;0061 00E2;0061 00E3;0061 00E4;0061 00E5;0061 00E7;0063 00E8;0065 00E9;0065 00EA;0065 00EB;0065 00EC;0069 00ED;0069 00EE;0069 00EF;0069 00F1;006E 00F2;006F 00F3;006F 00F4;006F 00F5;006F 00F6;006F 00F9;0075 00FA;0075 00FB;0075 00FC;0075 00FD;0079 00FF;0079 0100;0041 0101;0061 0102;0041 0103;0061 0104;0041 0105;0061 0106;0043 0107;0063 0108;0043 0109;0063 010A;0043 010B;0063 010C;0043 010D;0063 010E;0044 010F;0064 0112;0045 0113;0065 0114;0045 0115;0065 0116;0045 0117;0065 0118;0045 0119;0065 011A;0045 011B;0065 011C;0047 011D;0067 011E;0047 011F;0067 0120;0047 0121;0067 0122;0047 0123;0067 0124;0048 0125;0068 0128;0049 0129;0069 012A;0049 012B;0069 012C;0049 012D;0069 012E;0049 012F;0069 0130;0049 0134;004A 0135;006A 0136;004B 0137;006B 0139;004C 013A;006C 013B;004C 013C;006C 013D;004C 013E;006C 0143;004E 0144;006E 0145;004E 0146;006E 0147;004E 0148;006E 014C;004F 014D;006F 014E;004F 014F;006F 0150;004F 0151;006F 0154;0052 0155;0072 0156;0052 0157;0072 0158;0052 0159;0072 015A;0053 015B;0073 015C;0053 015D;0073 015E;0053 015F;0073 0160;0053 0161;0073 0162;0054 0163;0074 0164;0054 0165;0074 0168;0055 0169;0075 016A;0055 016B;0075 016C;0055 016D;0075 016E;0055 016F;0075 0170;0055 0171;0075 0172;0055 0173;0075 0174;0057 0175;0077 0176;0059 0177;0079 0178;0059 0179;005A 017A;007A 017B;005A 017C;007A 017D;005A 017E;007A 01A0;004F 01A1;006F 01AF;0055 01B0;0075 01CD;0041 01CE;0061 01CF;0049 01D0;0069 01D1;004F 01D2;006F 01D3;0055 01D4;0075 01D5;0055 01D6;0075 01D7;0055 
01D8;0075 01D9;0055 01DA;0075 01DB;0055 01DC;0075 01DE;0041 01DF;0061 01E0;0041 01E1;0061 01E2;00C6 01E3;00E6 01E6;0047 01E7;0067 01E8;004B 01E9;006B 01EA;004F 01EB;006F 01EC;004F 01ED;006F 01EE;01B7 01EF;0292 01F0;006A 01F4;0047 01F5;0067 01F8;004E 01F9;006E 01FA;0041 01FB;0061 01FC;00C6 01FD;00E6 01FE;00D8 01FF;00F8 0200;0041 0201;0061 0202;0041 0203;0061 0204;0045 0205;0065 0206;0045 0207;0065 0208;0049 0209;0069 020A;0049 020B;0069 020C;004F 020D;006F 020E;004F 020F;006F 0210;0052 0211;0072 0212;0052 0213;0072 0214;0055 0215;0075 0216;0055 0217;0075 0218;0053 0219;0073 021A;0054 021B;0074 021E;0048 021F;0068 0226;0041 0227;0061 0228;0045 0229;0065 022A;004F 022B;006F 022C;004F 022D;006F 022E;004F 022F;006F 0230;004F 0231;006F 0232;0059 0233;0079 0385;00A8 0386;0391 0388;0395 0389;0397 038A;0399 038C;039F 038E;03A5 038F;03A9 0390;03B9 03AA;0399 03AB;03A5 03AC;03B1 03AD;03B5 03AE;03B7 03AF;03B9 03B0;03C5 03CA;03B9 03CB;03C5 03CC;03BF 03CD;03C5 03CE;03C9 03D3;03D2 03D4;03D2 0400;0415 0401;0415 0403;0413 0407;0406 040C;041A 040D;0418 040E;0423 0419;0418 0439;0438 0450;0435 0451;0435 0453;0433 0457;0456 045C;043A 045D;0438 045E;0443 0476;0474 0477;0475 04C1;0416 04C2;0436 04D0;0410 04D1;0430 04D2;0410 04D3;0430 04D6;0415 04D7;0435 04DA;04D8 04DB;04D9 04DC;0416 04DD;0436 04DE;0417 04DF;0437 04E2;0418 04E3;0438 04E4;0418 04E5;0438 04E6;041E 04E7;043E 04EA;04E8 04EB;04E9 04EC;042D 04ED;044D 04EE;0423 04EF;0443 04F0;0423 04F1;0443 04F2;0423 04F3;0443 04F4;0427 04F5;0447 04F8;042B 04F9;044B 0622;0627 0623;0627 0624;0648 0625;0627 0626;064A 06C0;06D5 06C2;06C1 06D3;06D2 0929;0928 0931;0930 0934;0933 0958;0915 0959;0916 095A;0917 095B;091C 095C;0921 095D;0922 095E;092B 095F;092F 09CB;09C7 09CC;09C7 09DC;09A1 09DD;09A2 09DF;09AF 0A33;0A32 0A36;0A38 0A59;0A16 0A5A;0A17 0A5B;0A1C 0A5E;0A2B 0B48;0B47 0B4B;0B47 0B4C;0B47 0B5C;0B21 0B5D;0B22 0B94;0B92 0BCA;0BC6 0BCB;0BC7 0BCC;0BC6 0C48;0C46 0CC0;0CBF 0CC7;0CC6 0CC8;0CC6 0CCA;0CC6 0CCB;0CC6 0D4A;0D46 0D4B;0D47 0D4C;0D46 0DDA;0DD9 
0DDC;0DD9 0DDD;0DD9 0DDE;0DD9 0F43;0F42 0F4D;0F4C 0F52;0F51 0F57;0F56 0F5C;0F5B 0F69;0F40 0F73;0F71 0F75;0F71 0F76;0FB2 0F78;0FB3 0F81;0F71 0F93;0F92 0F9D;0F9C 0FA2;0FA1 0FA7;0FA6 0FAC;0FAB 0FB9;0F90 1026;1025 1E00;0041 1E01;0061 1E02;0042 1E03;0062 1E04;0042 1E05;0062 1E06;0042 1E07;0062 1E08;0043 1E09;0063 1E0A;0044 1E0B;0064 1E0C;0044 1E0D;0064 1E0E;0044 1E0F;0064 1E10;0044 1E11;0064 1E12;0044 1E13;0064 1E14;0045 1E15;0065 1E16;0045 1E17;0065 1E18;0045 1E19;0065 1E1A;0045 1E1B;0065 1E1C;0045 1E1D;0065 1E1E;0046 1E1F;0066 1E20;0047 1E21;0067 1E22;0048 1E23;0068 1E24;0048 1E25;0068 1E26;0048 1E27;0068 1E28;0048 1E29;0068 1E2A;0048 1E2B;0068 1E2C;0049 1E2D;0069 1E2E;0049 1E2F;0069 1E30;004B 1E31;006B 1E32;004B 1E33;006B 1E34;004B 1E35;006B 1E36;004C 1E37;006C 1E38;004C 1E39;006C 1E3A;004C 1E3B;006C 1E3C;004C 1E3D;006C 1E3E;004D 1E3F;006D 1E40;004D 1E41;006D 1E42;004D 1E43;006D 1E44;004E 1E45;006E 1E46;004E 1E47;006E 1E48;004E 1E49;006E 1E4A;004E 1E4B;006E 1E4C;004F 1E4D;006F
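Most pairs in the table above appear to match the first code point of each character's full canonical decomposition. The sketch below shows that reading, assuming Python's standard unicodedata module; the helper names base_of and strip_accents are hypothetical and not from the original message. It applies the rule to every character that has a canonical decomposition, which is broader than the table (Hangul syllables, for instance, would be reduced to their leading jamo), so a real implementation would probably restrict the ranges it touches.

```python
import unicodedata


def base_of(ch: str) -> str:
    """Return the first code point of the canonical decomposition (NFD) of ch."""
    return unicodedata.normalize("NFD", ch)[0]


def strip_accents(text: str) -> str:
    """Map each character of text to its base, character by character.

    Combining marks that are already separate code points in the input
    are left untouched; only precomposed characters are reduced.
    """
    return "".join(base_of(ch) for ch in text)


if __name__ == "__main__":
    for cp in (0x00C0, 0x01D5, 0x01E2, 0x0385):
        print(f"{cp:04X};{ord(base_of(chr(cp))):04X}")
```

Printing in the same `XXXX;YYYY` shape as the table makes it easy to diff the generated pairs against a hand-maintained list.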
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 04/08/2003 17:36, Kenneth Whistler wrote: Peter Kirk asked: A similar issue which is not Hebrew related would be a (mythical) requirement to display a diacritic like 0315, 031B or 0322 in isolation. It would not always be appropriate to use a space or NBSP as a base character as this would indent the glyph from the beginning of a line in a way which might not be wanted. What would be the recommended encoding if one wanted to display one of these characters with no leading white space? If you just want to display a nonspacing mark in isolation, then you apply it to a SPACE (or NO-BREAK SPACE) and typically let the metrics of the font then handle how the mark is going to appear floating in space as it were. If you want to display some character like U+0315 COMBINING COMMA ABOVE RIGHT *and* you want to do it in isolation *and* you want it to occur at the beginning of a line *and* you want there to be no display width between the margin and the left edge of the display bits of the glyph, then you have stepped over the boundaries of what is reasonable to expect plain text to convey. Feel free to make use of the higher-level capabilities of your word processor or page layout program to individually adjust the positioning of particular glyphs displayed in particular fonts. Thank you. Understood. More generally, however, when the issue of the relative position of a non-spacing mark with respect to its base glyph is what is in question, the standard recommends (and uses) the convention of displaying the non-spacing mark on a dotted circle as a base. This makes it clear that we are talking about the non-spacing mark itself, but also makes clear the positional differences between left, centered, and right forms, for example. If I want to do this, should I explicitly encode a dotted circle, or should I encode nothing and expect the font to generate the dotted circle, as it often does? --Ken -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote: on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. The NFD decomposition of spacing marks is already defined as a SPACE plus a non-spacing combining character. This officially documents the usage of SPACE as a base character, and its use in combining sequences. In the context of XML processing, where strings should (must?) be presented in NFC form, this extra SPACE will be invisible, hidden within the precomposed sequence, so this space does not have the line-breaking property. Breaking properties apply only to combining sequences, not to isolated encoded characters. It's illegal to break in the middle of a combining sequence. So as soon as a SPACE is followed by a combining character, it loses its breaking properties, as those properties are only defined for the combining sequence containing only a SPACE. So I don't think there's any ambiguity: parsers and renderers must correctly identify combining sequences before applying any algorithm. This means that an algorithm like normalization of whitespace sequences in XML or HTML should not include SPACEs that are used as base characters in a combining sequence, and so it should keep two spaces if the intent is to encode a logical space followed by a logical spacing diacritic. (This is not a problem for XML which processes strings in their NFC form). -- Philippe. Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
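The whitespace rule proposed in this message (do not fold a SPACE that serves as the base of a combining sequence) could be sketched roughly as follows. This is only an illustration in Python using the standard unicodedata module; collapse_spaces is a hypothetical name, only U+0020 is treated as whitespace, and nothing here describes how XML or HTML processors are actually specified to behave.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def collapse_spaces(text: str) -> str:
    """Fold runs of plain SPACE into one, but keep any SPACE that is
    immediately followed by a combining mark (a SPACE used as a base)."""
    out = []
    last_was_plain_space = False
    for i, ch in enumerate(text):
        used_as_base = ch == " " and i + 1 < len(text) and is_mark(text[i + 1])
        if ch == " " and not used_as_base:
            if last_was_plain_space:
                continue  # drop the redundant plain space
            last_was_plain_space = True
        else:
            last_was_plain_space = False
        out.append(ch)
    return "".join(out)
```

Under this sketch, "one logical space followed by a spacing diacritic" survives as two SPACE characters plus the mark, which is the outcome the message argues for.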
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
On 06/08/2003 03:54, Philippe Verdy wrote: On Wednesday, August 06, 2003 1:59 AM, Curtis Clark [EMAIL PROTECTED] wrote: on 2003-08-05 15:31 Peter Kirk wrote: Thank you, Mark. This helps to clarify things, but still doesn't explicitly answer my question of how to encode a sentence like In this language the diacritic ^ may appear above the letters ..., but instead of ^ I want to use a combining character and want to display exactly one space before the combining character - do I encode two spaces or one? In this language the diacritic may appear above the letters... Two spaces, at least in Thunderbird Mail. The NFD decomposition of spacing marks is already defined as a SPACE plus a non-spacing combining character. ... Really? It looks to me as if U+00B4 and U+02D8 to U+02DD have only compatibility equivalences to space plus diacritic, and U+005E and U+0060 don't even have compatibility equivalences. ... This means that an algorithm like normalization of whitespace sequences in XML or HTML should not include SPACEs that are used as base characters in a combining sequence, and so it should keep two spaces if the intent is to encode a logical space followed by a logical spacing diacritic. (This is not a problem for XML which processes strings in their NFC form). It is, because there are very many combining marks which do not have spacing equivalents (even for compatibility), and so with these the NFC form will certainly be space plus diacritic. -- Peter Kirk [EMAIL PROTECTED] http://web.onetel.net.uk/~peterkirk/
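The decomposition claims in this exchange can be checked against the Unicode Character Database. A minimal sketch, assuming Python's standard unicodedata module (the helper names are hypothetical, and the data reflects whatever Unicode version the local Python ships, not necessarily 4.0):

```python
import unicodedata


def decomposition_of(cp: int) -> str:
    """Raw decomposition field from the UCD; a '<compat>' prefix marks a
    compatibility (not canonical) mapping, and '' means no mapping."""
    return unicodedata.decomposition(chr(cp))


def space_plus_mark_survives_nfc(mark: str) -> bool:
    """True if SPACE followed by mark is left unchanged by NFC."""
    seq = " " + mark
    return unicodedata.normalize("NFC", seq) == seq


if __name__ == "__main__":
    for cp in (0x00B4, 0x02D8, 0x02DD, 0x005E, 0x0060):
        print(f"U+{cp:04X}: {decomposition_of(cp) or '(none)'}")
    # U+0301 COMBINING ACUTE ACCENT and U+0315 COMBINING COMMA ABOVE RIGHT
    for mark in ("\u0301", "\u0315"):
        print(f"SPACE + U+{ord(mark):04X} stable under NFC:",
              space_plus_mark_survives_nfc(mark))
```

Because the spacing clones carry only compatibility mappings, NFD leaves them alone and NFC never folds SPACE plus a mark into them; only the NFKD and NFKC forms expose the SPACE.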
RE: Does Unicode 3.1 take care of all characters of 'Hong Kong Supplimentary Character Set - 2001' (HKSCS-2001) ?
Sourav, However, I could not map the block you mentioned to the block names provided in Unicode site (http://www.unicode.org/charts/). I tried to map them based on the similarity of names and specified the actual block down below. Could you please once verify it? The block names are the ones used by the HKSCS web site. Specifically http://www.info.gov.hk/digital21/eng/hkscs/document.html Section 3 page 2 describes the mapping in detail with the ranges. John GIFT
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk said: From what Ken says, it sounds like it will be wrong from whenever Unicode 4.0 is officially issued Actually Unicode 4.0 was officially issued on April 17, 2003. What we are waiting on now is for the publication of the text of the book to catch up to that fact. ;-) because this paragraph has been excised from that standard. But until then it seems to be correct, SPACE is indeed considered a format character. Nope. It is incorrect to try to mix and match between versions of the standard. In Unicode 3.0 this was an ambiguity in the meaning and usage of the term format character, and for Unicode 3.0, we can all see how people who ran into section 4.5 of the standard could be a little confused about the status of SPACE. The actual intent of that offending paragraph was to attempt to explain the somewhat procrustean nature of the General Category classes, which may not do justice to the complicated behavior of some of the characters in Unicode, rather than to explain the status of SPACE in particular. I was misled by Jim's reference to the URL of the final draft (as clearly stamped on the first page) of 4.0, but since in fact he was quoting from 3.0 what he says can hardly be considered obsolete yet. Actually it can. And that would have been obvious to everyone if a preview version of Chapter 4 had also been posted. Once again, I appeal to people to stop trying to second-guess the text of the standard. The final pdf for the online version is in preparation even as I write this. The final final proofs for the book itself have already been produced by the printer -- all they need to do now is turn on the press and start the binder. If everyone would just go off for a week or two on their August vacation, like they should be, we could all come back about Labor Day and we wouldn't have to be having these discussions. ;-) --Ken
RE: Questions on ZWNBS - for line initial holam plus alef
Kent Karlsson responded: I see no particular *technical* problem with using WJ, though. In contrast to the suggestion of using CGJ (re. another problem) anywhere else but at the end of a combining sequence. CGJ has combining class 0, despite being invisible and not (visually) interfering with any other combining mark. Using CGJ at a non-final position in a combining sequence puts in doubt the entire idea with combining classes and normal forms. Why? See above (I DID write the motivation!). I guess that I did not (and still do not) see the motivation for your final statement. Combining classes are generally assigned according to typographic placement. Combining characters (except those that are really letters) that have the same placement, and interfere typographically are assigned the same combining class, while those that don't get different classes, and the relative order is then considered unimportant (canonically equivalent). How is then, e.g. a, ring above, cgj, dot below supposed to be different from a, dot below, cgj, ring above (supposing all involved characters are fully supported), when a, ring above, dot below is NOT supposed to be much different from a, dot below, ring above (them being canonically equivalent)? An invisible combining character does not interfere typographically with anything, it being invisible! The same thing can be said about any inserted invisible character, combining or not. How is: a, ring above, null, dot below supposed to be different from a, dot below, null, ring above How is: a, ring above, LRM, dot below supposed to be different from a, dot below, LRM, ring above In display, they might not be distinct, unless you were doing some kind of show-hidden display. Yet these sequences are not canonically equivalent, and the presence of an embedded control character or an embedded format control character would block canonical reordering. 
Of course, they *might* be distinct in rendering, depending on what assumptions the renderer makes about default ignorable characters and their interaction with combining character sequences. But you cannot depend on them being distinct in display -- the standard doesn't mandate the particulars here. Whether you think it is *reasonable* or not that there should be non-canonically equivalent ways of representing the same visual display, sequences such as those above, including sequences with CGJ, are possible and allowed by the standard. They are: a. well-formed sequences, conformantly interpretable b. could be displayed by reasonable renderers, making reasonable assumptions, as visually identical I have been pointing out use of the CGJ, which *exists* as an encoded character, and which has a particular set of properties defined, would result in the kinds of non-canonically equivalent ordering distinctions required in Hebrew, if inserted into vowel sequences. Those are facts about the current standard, as currently defined. And unless you or someone else convinces the UTC to establish cooccurrence constraints on CGJ or to change its properties, they will continue to be current facts about the standard. The other invisible (per se!) combining characters with combining class 0, the variation selectors, are ok, since their *conforming* use is very highly constrained. Maybe I've been wrong, but I have taken CGJ as similarly constrained as it was given a semantics only when followed by a base character (but now it seems to have no semantics at all). There was no such constraint defined for CGJ. The current statement about CGJ is merely that it should be ignored in language-sensitive sorting and searching unless it specifically occurs within a tailored collation element mapping. There is no constraint on what particular sequences involving CGJ could be tailored that way, and hence no constraint on what particular sequences CGJ might occur in, in Unicode plain text.
A combining character sequence is a base character followed by any number of combining characters. There is no constraint in that definition that the combining characters have to have non-zero combining class. Well, you cannot *conformantly* place a VS anywhere in a combining sequence! Only certain combinations of base+vs are allowed in any given version of Unicode. (Breaking that does not make the combining sequence ill-formed, or illegal, but would make it non-conformant, just like using an unassigned code point.) Actually, it is not non-conformant like using an unassigned code point would be. The latter is directly subject to conformance clause C6: C6 A process shall not interpret an unassigned code point as an abstract character. The case for variation sequences is subtly different. Suppose I encounter a variation sequence X, VS1, where X could be any Unicode character. X itself is conformantly interpretable. VS1 itself is conformantly
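The equivalence argument in this message (marks with different combining classes reorder freely, while an embedded character of class 0 blocks the reordering) can be tried out directly. A minimal sketch, assuming Python's standard unicodedata module; canonically_equivalent is a hypothetical helper that compares NFD forms, and the combining class values in the comments are those from the Unicode Character Database:

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, combining class 220
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0
LRM = "\u200E"         # LEFT-TO-RIGHT MARK, combining class 0
NULL = "\u0000"        # control character, combining class 0


def canonically_equivalent(s: str, t: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


if __name__ == "__main__":
    print("no separator:",
          canonically_equivalent("a" + RING_ABOVE + DOT_BELOW,
                                 "a" + DOT_BELOW + RING_ABOVE))
    for name, sep in (("CGJ", CGJ), ("LRM", LRM), ("NULL", NULL)):
        print(f"with {name}:",
              canonically_equivalent("a" + RING_ABOVE + sep + DOT_BELOW,
                                     "a" + DOT_BELOW + sep + RING_ABOVE))
```

Canonical ordering only exchanges adjacent marks whose combining classes are both non-zero, so any class 0 character between them pins the order; this says nothing about how a renderer will display the result.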
Re: Display of Isolated Nonspacing Marks (was Re: Questions on ZWNBS...)
Peter Kirk peter dot r dot kirk at ntlworld dot com wrote: Or it may not. It may be a deficiency in the level of Unicode support afforded by the fonts and rendering engines. ... If there are such deficiencies in fonts and rendering engines which purport to be Unicode compliant, that suggests a lack of clarity in the standard which should be rectified. I wish I had a dollar for every Unicode-compliant font, rendering engine, or other software that was in some way less compliant than advertised. Only a fraction of the non-compliances are traceable to ambiguities or deficiencies in the Unicode Standard. ... It may simply reflect a difference between your requirements and what the standard promises, and doesn't promise. If Unicode doesn't promise what I require, surely it is at least reasonable for me to ask on this list whether it ought to be extended or clarified to do so. The UTC may choose not to make any changes, but I don't see why they shouldn't even be asked to. Absolutely, you are allowed to ask. Go ahead. I wasn't trying to prevent questions from being asked, only trying to state why I think the problem is out of scope for Unicode. The standard doesn't say anything about width in this case. It leaves it up to the display engine, which is as it should be. The standard does say, section 2.10 of 4.0, that In rendering, the combination of a base character and a nonspacing character may have a different advance width than the base character itself. I apologize for missing this reference. And any intelligent typographer will realise that this may is a must, with regular character designs but not of course in monospace, in some cases like the example given of i with circumflex. This sentence applies to spaces with diacritics as space is a base character, as we have been informed. The subsection of 2.10 entitled Spacing Clones of European Diacritical Marks (by the way, why European when the text appears to apply to all diacritical marks?) 
should suggest to any intelligent typographer that the sequence space, diacritic is intended to be spaced as the diacritic and not as a space, but it would help for this to be clarified as not all typographers are very intelligent and some may not be aware that this space has actually lost most of the properties of a space e.g. line breaking and is being used only By convention. Like Freud's cigar, sometimes a may is just a may. And I suspect the phrase any intelligent typographer MAY generate some flak from typographers on this list who consider themselves intelligent enough yet have a different opinion. I'm not a typographer (intelligent or otherwise), but I'm having a tough time seeing how Section 2.10 *requires* fonts and rendering engines to give a space-plus-combining-diacritic combination the exact minimum width of the diacritic alone, or to leave equal space before and after such a combination. All I think it is saying is that, for example, the combination i-plus-tilde may be wider than i alone, because tilde is wider than i. When the specific alignment of isolated glyphs is important to me, I use markup. I'm a big supporter of plain text, as many members of this list know, but the exact spacing of isolated combining marks seems like a layout issue to me. OK, what kind of markup should I use, in any well-known markup language, to ensure that an isolated diacritic is centred in the space between the words before and after it? All right, you've got me there. I'll have to think about it. But I still think this is a layout problem, a problem having to do with glyphs and not characters. -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/
Fw: Questions on ZWNBS - for line initial holam plus alef
On Thursday, August 07, 2003 1:13 AM, Kenneth Whistler [EMAIL PROTECTED] wrote: Well, yes, which is why I have been advocating it as the solution to the Biblical Hebrew text representation problem. I agree with you about that. But it need not be characterized as legal in opposition to the other examples I cited above. All of these sequences are legal and allowed by the standard. Once again sorry if I used the terms ill-formed or well-formed instead of defective or non defective (normal?). Such distinction in the standard does not help its understanding when discussing about interoperability of text processing where neither ill-formed nor defective sequences should be used if interoperability is the main focus (and also normally the design focus for Unicode). The canonical equivalences (NFC, NFD, canonical ordering) is needed now for XML processing and in fact it greatly reduces the number of ill-formed, invalid, or defective sequences or whatever bad encoding of actual text, to simplify its processing. Still these equivalences don't solve all the issues and create their own (and this is now a good reason to use CGJ to override the canonical ordering of combining diacritics). Of course there may be a lot of strings created with Unicode which are not ill-formed and not canonically equivalent (per NFC, NFD, canonical ordering), but I won't enter in that zone. For XML what is relevant is that it processes strings in NFC form and thus implies only canonical equivalences, but XML will still process defective sequences by correctly processing characters per its canonical combining sequences. I'd like to see a more formal rule for defective uses of CGJ used to fix canonical ordering. What I suggested was to specify that only some sequences with CGJ would be non defective, if the CGJ appears before a base character or between two combining characters. 
The character model needs then to be refined to be more precise to document which uses are considered non defective, and which ones are not. So a sequence ..., ring above, CGJ, cedilla, ... would not be defective as it fixes the canonical ordering, even if in this case it does not interact graphically (note that this statement supposes that the cedilla effectively appears below, something which is wrong with some languages, where the cedilla appears in fact like an acute accent above right...). The example of the effective rendering of diacritics at the presupposed placement indicated by their combining class is significant: it shows that combining classes just handle some common placement rules, but not every case, and a particular language or renderer may need to place diacritics on other positions, in which case the canonical ordering would have an impact on the renderer. That's a good enough reason to justify and document the use of CGJ as a combining class override for diacritics, whose usage should be restricted for interoperability. This has a consequence for input methods and editors: users can type base characters and diacritics, and the editor will, by default, use a canonical ordering, that the user may fix if needed for a particular language with a control command that would swap two misplaced diacritics by automatically inserting a CGJ only if needed because both diacritics have distinct combining classes: this editor control command would have no other effect if executed after two diacritics with identical combining, or after a single diacritic, and the editor should make its best effort to not allow user enter ill-formed or defective sequences. -- Philippe. Spams non tolérés: tout message non sollicité sera rapporté à vos fournisseurs de services Internet.
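The editor command suggested in this message might look something like the sketch below. It is written in Python with the standard unicodedata module, swap_last_two_marks is a hypothetical name, and it rests on one reading of the proposal: the last two diacritics are swapped, and a CGJ is inserted only when canonical ordering would otherwise undo the swap. After a single diacritic, or when either trailing character has combining class 0, the text is returned unchanged.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def swap_last_two_marks(text: str) -> str:
    """Swap the two trailing diacritics of text, protecting the new order
    with CGJ only when normalization would otherwise reorder them."""
    if len(text) < 3:
        return text  # need a base plus two marks
    first, second = text[-2], text[-1]
    c_first = unicodedata.combining(first)
    c_second = unicodedata.combining(second)
    if c_first == 0 or c_second == 0:
        return text  # not two reorderable diacritics; do nothing
    # After the swap, `second` precedes `first`. Canonical ordering would
    # move them back only if the class of `second` is the higher one.
    joiner = CGJ if c_second > c_first else ""
    return text[:-2] + second + joiner + first


if __name__ == "__main__":
    swapped = swap_last_two_marks("a\u0323\u030A")  # dot below, ring above
    print([f"U+{ord(c):04X}" for c in swapped])
    print("stable under NFD:",
          unicodedata.normalize("NFD", swapped) == swapped)
```

Checking the class ordering rather than mere inequality keeps the command from inserting a CGJ when the swap happens to produce canonical order anyway.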
Re: Conflicting principles
At 16:16 -0400 2003-08-06, John Cowan wrote: I would like to ask the old farts^W^Wrespected elders of the UTC which principle they consider more important, abstractly speaking: the principle that combining marks always follow their base characters (a typographical principle), or that text is stored, with a few minor exceptions, in phonetic order (a lexicographical principle). Are you thinking of the Tengwar? -- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Questions on ZWNBS - for line initial holam plus alef
Philippe Verdy said: The same thing can be said about any inserted invisible character, combining or not. How is: a, ring above, null, dot below supposed to be different from a, dot below, null, ring above How is: a, ring above, LRM, dot below supposed to be different from a, dot below, LRM, ring above In display, they might not be distinct, unless you were doing some kind of show-hidden display. Yet these sequences are not canonically equivalent, and the presence of an embedded control character or an embedded format control character would block canonical reordering. I disagree with you, using a LRM mark in the middle of a combining sequence is conforming to canonicalization rules but is clearly ill-formed, It is not. TUS 4.0, p. 71: D17a Defective combining character sequence: A combining character sequence that does not start with a base character. * Defective combining character sequences occur when a sequence of combining characters appears at the start of a string or follows a control or format character. Such sequences are defective from the point of view of handling of combining marks, but are *not ill-formed*. as well as using a NULL control in the middle, which breaks the combining sequence. I'm not claiming it doesn't break the combining sequence. Of course it does. It creates a defective combining character sequence, and that poses a challenge for rendering, since it departs from the usual expectations for normal combining character sequences. The renderer has to split hairs between the fact that it is dealing with a defective combining character sequence and the fact that it is dealing with a default ignorable character which is supposed to be ignored for text processes it is not immediately applicable to. But I challenge you to find anything in the standard that *prohibits* such sequences from occurring. And *if* they occur, they are not canonically equivalent, which was the point I was making to Kent.
The proposal to use CGJ however is legal: it does not break the combining sequences and grapheme clusters, and thus the whole encoded sequence encoded with CGJ will be considered by rendering engines, where CGJ is a no-op for rendering but not for the canonical ordering ... Well, yes, which is why I have been advocating it as the solution to the Biblical Hebrew text representation problem. I agree with you about that. But it need not be characterized as legal in opposition to the other examples I cited above. All of these sequences are legal and allowed by the standard. --Ken
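The D17a wording quoted in this message (combining characters at the start of a string, or following a control or format character) can be approximated with general categories. A rough sketch, assuming Python's standard unicodedata module; defective_starts is a hypothetical helper, and treating categories Mn/Mc/Me as "combining" and Cc/Cf as "control or format" is a simplification of the standard's definitions:

```python
import unicodedata


def defective_starts(text: str) -> list[int]:
    """Indices where a combining mark begins a defective combining
    character sequence in the sense quoted from D17a."""
    positions = []
    for i, ch in enumerate(text):
        if not unicodedata.category(ch).startswith("M"):
            continue  # not a combining mark
        if i == 0 or unicodedata.category(text[i - 1]) in ("Cc", "Cf"):
            positions.append(i)
    return positions


if __name__ == "__main__":
    samples = {
        "plain": "a\u030A\u0323",
        "with LRM": "a\u030A\u200E\u0323",
        "with NULL": "a\u030A\u0000\u0323",
        "with CGJ": "a\u030A\u034F\u0323",
    }
    for name, s in samples.items():
        print(f"{name}: {defective_starts(s)}")
```

CGJ is itself a nonspacing mark, so it does not start a defective sequence here, whereas LRM and NULL leave the following mark without a base. None of the samples is ill-formed; the helper flags defective sequences only.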