One of the early problems encountered with Unicode was that there can be multiple ways of representing the same text. For many scripts, the solution was canonical equivalence - the multiple ways were declared to be equivalent, and anything that thought they had different meanings and should *therefore* be treated differently was non-compliant with the Unicode standard.
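
As a minimal illustration of canonical equivalence, here is a sketch using Python's standard unicodedata module. The helper name canonically_equivalent is only an illustrative choice, not part of any standard API; it compares the NFD forms of two strings.

```python
import unicodedata

# Two spellings of the same text: precomposed versus base letter plus mark.
composed = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # e + COMBINING ACUTE ACCENT

# Marks with different non-zero combining classes, typed in either order.
below_first = "q\u0323\u0307"   # COMBINING DOT BELOW (ccc 220), then DOT ABOVE (ccc 230)
above_first = "q\u0307\u0323"   # the same marks in the opposite order


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: true if the two strings have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(composed == decomposed)                             # False: code points differ
print(canonically_equivalent(composed, decomposed))       # True
print(canonically_equivalent(below_first, above_first))   # True: marks are reordered
```

A compliant process is expected to treat each of these pairs as the same text, even though a naive code point comparison tells them apart.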
Where canonical equivalence actually leads to the wrong conclusion, a method was subsequently found to make sequences canonically inequivalent: U+034F COMBINING GRAPHEME JOINER (CGJ). It generally takes extra effort to insert this character.
However, canonical equivalence hit a severe problem with two-part Indic vowels, and the use of non-zero canonical combining classes in Indic scripts is generally low. A similar issue might arise with graphically non-interacting subordinated consonants, especially when encoded as virama/coeng plus base consonant. One solution to this problem is for renderers to produce a strange rendering if characters appear in a non-standard order. However, character strings are not just rendered and compared for identity. They are also transliterated, sorted into alphabetical order, and may be input to automatic speech generation systems with limited capabilities for resolving homographs. This may require some way of tagging an apparently incorrectly ordered string, analogous to the use of 'sic' in English, to indicate that the text is intended not to accord with the 'standard' character order.
What characters are available for such a rôle? CGJ is a possibility, but I am concerned that it may be being overworked. It is already suggested as a solution for sorting when a digraph is treated as a letter but accidental sequences of the same characters are not: compare the Welsh letter 'ng' (which comes between 'g' and 'h' in the alphabet) with an 'accidental' sequence such as in 'Bangor' and 'Llangollen'. Such tagging characters probably don't work now, but it may be possible to persuade the suppliers to heed them. The ideal character would be disallowed in domain names, which should allay the greatest security worries about simply rendering the text as it stands.
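
The two existing jobs of CGJ mentioned above can be sketched in Python. The first part uses only the standard unicodedata module. The second part is a toy sort key of my own devising (the names WELSH, RANK and welsh_key are hypothetical); real collation would use a tailored Unicode Collation Algorithm implementation, and whether a given library heeds CGJ is something to check rather than assume.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0

# 1. CGJ blocks canonical reordering of the marks on either side of it.
plain = "q\u0307\u0323"               # DOT ABOVE (ccc 230), then DOT BELOW (ccc 220)
tagged = "q\u0307" + CGJ + "\u0323"   # same marks, with CGJ in between

nfd_plain = unicodedata.normalize("NFD", plain)     # marks get swapped
nfd_tagged = unicodedata.normalize("NFD", tagged)   # order is preserved

# 2. Toy Welsh sort key: 'ng' is one letter unless CGJ separates n and g.
WELSH = ["a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
         "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
         "t", "th", "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(WELSH)}


def welsh_key(word: str) -> list[int]:
    """Hypothetical sort key: greedy digraph matching, broken by CGJ."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            i += 1            # CGJ itself carries no weight
            continue
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])   # a digraph counts as a single letter
            i += 2
        else:
            # Characters outside the alphabet sort after every letter.
            key.append(RANK.get(word[i], len(WELSH)))
            i += 1
    return key


print(nfd_plain == plain)     # False: normalization reordered the marks
print(nfd_tagged == tagged)   # True: CGJ kept the typed order
print(sorted(["h", "ng", "g"], key=welsh_key))                   # ['g', 'ng', 'h']
print(welsh_key("Bangor") == welsh_key("Ban" + CGJ + "gor"))     # False
```

The same character is thus already doing duty both for normalization and for collation, which is the overworking I have in mind.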
Some potential ambiguities arise from Sanskrit, and were raised long ago by Peter Constable on the Unicode Indic list on 28 August 2006 under the heading 'contrastive /Crv/ and /Cvr/ in Telugu, Malayalam'. The cases he gave were 'grva' v. 'gvra', 'drva' v. 'dvra' and 'srva' v. 'svra'. For the Khmer script, the KhmerOS font renders the pairs identically, which did surprise me, as I had got it into my head that one could tell from the depth of the <COENG, RO> where the RO came in the sequence of conjoined letters.
Richard.

