What I really wish we had is a machine-readable set of regexes for each complex script (and for each language-script combination that differs from the default for that script).
Such a regex R could be used for determining the well-formed ordering of code points within words. The regex need not be for syllables, or grapheme clusters, or any other formal construct. The *only* requirement it would need to fulfill is that you could determine well-formed words with:

word := (R)+

That is, if R were (C V C? | V C?) then any of

CVC
CVCVC
VC
V
CV

would pass the test, but CCV would fail.

Ideally R would be as simple as possible (but no simpler).

Mark

On Tue, Jan 10, 2017 at 9:06 AM, Asmus Freytag <asm...@ix.netcom.com> wrote:

> On 1/9/2017 2:24 PM, Richard Wordingham wrote:
>
> Where, if anywhere, is the encoding of plain text specified? I am
> particularly concerned with the arrangement of the code sequences for
> non-spacing abstract characters once one has determined an encoding for
> the abstract characters.
>
> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
> "Ordering of Syllable Components" would lead one to believe that the
> word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.
>
> Richard,
>
> the group of Khmer experts that developed the recent label generation
> rules for root zone domain names considers that ordering the only one
> supported, a specification you find here:
> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
>
> That document states:
>
> *7.4 Context of COENG Sign (U+17D2)*
> The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting consonants must
> occur between two consonants. If it occurs between any other categories, it
> is not in a valid context so the label is not well formed. Further, the
> consonant following it must not include ឡ KHMER LETTER LA (U+17A1), ...
>
> So, you are not alone in thinking that the COENG goes between consonants.
>
> Did they just make this up? No, they followed what is laid out in the
> standard:
>
> On page 621 of Unicode 9.0.0
> (http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf) you find:
>
> *Subscript Consonants.* Subscript consonant signs differ from independent
> consonant characters and are called coeng (literally, “foot, leg”) after
> their subscript position. While a consonant character can constitute an
> orthographic syllable by itself, a subscript consonant sign cannot. Note
> that U+17A1 KHMER LETTER LA does not have a corresponding subscript
> consonant sign in standard Khmer.... Subscript consonant signs are used to
> represent any consonant following the first consonant in an orthographic
> syllable.
>
> and on page 624:
>
> .... each of these [subscript consonant] signs is represented by the
> sequence of two characters: a special control character (U+17D2 KHMER
> SIGN COENG) and a corresponding consonant character.
>
> That text fixes the order MAIN CONSONANT + COENG OPERATOR + SUBSCRIPT
> CONSONANT with sufficient clarity (as do all the examples and tables).
>
> However, on further investigation,
> I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
> U+17BB> would not be compliant with the Unicode standard. Have I
> missed anything?
>
> In this example, your coeng operator U+17D2 is out of order: while it is
> followed by a consonant, it does not in turn immediately follow the main
> consonant, because a sign NIKAHIT is inserted in your example.
>
> Again, from the Root Zone LGR document we find an explicit rule:
>
> *7.10 Context of NIKAHIT SIGN (U+17C6)*
> The sign ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a
> consonant or a shifter or one of the subset of dependent vowels tagged
> “dependent-vowel-1” in the repertoire table (ា ុ), i.e. vowel signs AA and
> U.
>
> That would allow the NIKAHIT to be placed where you suggest, if it were
> not for the rule on the coeng operator (7.4).
>
> Now, it is a known fact that the label generation rules are slightly more
> restrictive than the rules for general text. (See also section 5 in that
> document.)
>
> See the text on p. 622 in TUS 9.0.0 where the following *exception* is
> noted:
>
> "The subscript consonant signs in the Khmer script can be used to denote a
> final consonant, although this practice is uncommon."
>
> The associated example shows MAIN CONSONANT + VOWEL + NIKAHIT + COENG +
> FINAL CONSONANT.
>
> Another exception, noted on p. 623, is the following:
>
> "While these subscript consonant signs are usually attached to a consonant
> character, they can also be attached to an independent vowel character.
> Although this practice is relatively rare, it is used in one very common
> word, meaning “to give.”"
>
> Taken together, it would appear that, unless your example fits the first
> of these two exceptions, the NIKAHIT in it is out of order.
>
> (The label generation rules disallow both of these exceptions,
> in an attempt to streamline the rules, sacrificing a number of potential
> domain names. Equivalent rule sets for validating text would have to be
> more complete.)
>
> One might hope that the subsection about 'logical order' in TUS 9.0
> Section 2.2 Unicode Design Principles would help, but:
>
> 1) Section 3 'Conformance' says nothing about logical order; and
> 2) The subsection about 'logical order' seems to assume that there
> exists a common practice; it does not actually place any requirement
> on this common practice.
>
> Richard.
>
> I don't think either of these general sections is intended to provide the
> correct or expected ordering of characters for complex scripts. Any
> preferred ordering that doesn't result by happenstance from normalization
> would presumably be described in the text of the script section, such as
> Section 16.4 Khmer, in TUS 9.0.0.
>
> http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf
>
> A./
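
To make the word := (R)+ idea at the top of this message concrete, here is a minimal Python sketch. Everything in it is illustrative rather than part of any specification: the toy alphabet uses the literal letters C and V as stand-ins for consonant and vowel code points, the function names are made up, and the Khmer check encodes only the part of LGR rule 7.4 that is quoted above (the rule text is elided after LA, so the rest is not modelled). The consonant range U+1780..U+17A2 is my reading of the Khmer code chart and should be treated as an assumption.

```python
import re

# Toy version of R = (C V C? | V C?), with "C" and "V" as stand-in symbols.
R = r"(?:CVC?|VC?)"
WORD = re.compile(rf"(?:{R})+")  # word := (R)+


def is_well_formed(word: str) -> bool:
    """True if the whole string can be parsed as one or more repetitions of R."""
    return WORD.fullmatch(word) is not None


# Partial sketch of LGR rule 7.4 as quoted: COENG (U+17D2) must sit between
# two consonants, and the following consonant must not be LA (U+17A1).
# Assumption: Khmer consonant letters are U+1780..U+17A2.
CONSONANT = "\u1780-\u17A2"
FOLLOWER = "\u1780-\u17A0\u17A2"  # consonants other than LA
COENG = "\u17D2"
BAD_COENG = re.compile(
    rf"(?<![{CONSONANT}]){COENG}"   # COENG not preceded by a consonant
    rf"|{COENG}(?![{FOLLOWER}])"    # COENG not followed by an allowed consonant
)


def coeng_context_ok(text: str) -> bool:
    """True if every COENG in text satisfies the quoted part of rule 7.4."""
    return BAD_COENG.search(text) is None


if __name__ == "__main__":
    for w in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
        print(w, is_well_formed(w))
    # KHA, COENG, NYO, U, NIKAHIT versus KHA, NIKAHIT, COENG, NYO, U
    print(coeng_context_ok("\u1781\u17D2\u1789\u17BB\u17C6"))
    print(coeng_context_ok("\u1781\u17C6\u17D2\u1789\u17BB"))
```

Under this sketch the first Khmer sequence passes and the second fails, because its COENG follows NIKAHIT rather than a consonant. Since the label generation rules are stricter than general text, a validator for running text would need to relax this check to admit the two exceptions from TUS 9.0 cited above.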