On Fri, 8 Jun 2018 20:45:26 +0200 Philippe Verdy via Unicode <unicode@unicode.org> wrote:
> 2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode
> <unicode@unicode.org>:

> The way tailoring is designed in CLDR using only data used by a
> generic algorithm, and not a custom algorithm, is not the only way to
> collate Lao. You can perfectly add new custom algorithm primitives
> that will use new collation data rules that can be inserted as
> "hooks" in UCA (which provides several points at which it is
> possible, but UCA just makes these hooks act as "no-op").

The ideal is to have a common library rather than add specific routines
to support specific languages. Now, this can be done in a common
library; ICU break iterators have dedicated routines for CJK and for
Siamese. I wonder if this could be done for Lao and possibly Tai Lue.

I've a vague recollection that UCA collation for Tai Lue in the New Tai
Lue script only needs thousands of contractions, so it may work well
enough in the main CLDR collation algorithm. Martin Hosken provided the
numbers, probably on the Unicore list, when New Tai Lue formally
switched from phonetic to visual order. Taking the definition of
logical order literally, the change legitimised the logical order of
New Tai Lue.

> You can be much faster if you create a specific library for Lao, that
> would still be able to process the basic collation rules and then
> make more advanced inferences based on larger cluster boundaries than
> just those considered in the standard basic UCA, so it is perfectly
> possible to extend it to cover more complex Lao syllables and various
> specific quirks (such as hyphenation in the middle of clusters, as
> seen in some Indic scripts using left matras).

How is this hyphenation done? The answer probably belongs in the
thread entitled 'Hyphenation Markup', unless it's restricted to the
visual order scripts.
If it's occurring in the visual order scripts, we may need to add
contractions for <preposed vowel, soft hyphen, consonant>; U+00AD
breaks contractions, and, indeed, may be used for exactly that purpose,
as it is generally easier to type than CGJ.

While I've seen line-breaking after a left matra in Thai, I've never
*seen* a hyphen after a left matra.

Richard.
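
To illustrate the remark about U+00AD breaking contractions, here is a
minimal Python sketch. It is a toy longest-match tokeniser, not the UCA
or ICU algorithm. The contraction table, the small sample of Lao
consonants, and the function names `collation_units` and
`with_shy_contractions` are all assumptions made for illustration.

```python
SOFT_HYPHEN = "\u00AD"

# Hypothetical contraction table of <preposed vowel, consonant> pairs.
# A real tailoring would need an entry for every vowel/consonant pair.
PREPOSED_VOWELS = "\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4"  # LAO VOWEL SIGN E, EI, O, AY, AI
CONSONANTS = "\u0E81\u0E82\u0E84"  # LAO LETTER KO, KHO SUNG, KHO TAM (sample only)
CONTRACTIONS = {v + c for v in PREPOSED_VOWELS for c in CONSONANTS}


def collation_units(text, contractions=CONTRACTIONS):
    """Split text into units by greedy longest match against the table.

    Any character that is not part of a matched contraction becomes a
    unit on its own, so an intervening U+00AD stops a pair from matching.
    """
    max_len = max(map(len, contractions), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in contractions:
                units.append(candidate)
                i += n
                break
    return units


def with_shy_contractions(contractions=CONTRACTIONS):
    """Return the table extended with <vowel, U+00AD, consonant> entries."""
    extra = {c[0] + SOFT_HYPHEN + c[1] for c in contractions if len(c) == 2}
    return contractions | extra
```

With the base table, `collation_units("\u0EC0\u0E81")` yields one unit,
while `collation_units("\u0EC0\u00AD\u0E81")` yields three, because the
soft hyphen sits between the vowel and the consonant. Passing
`with_shy_contractions()` as the table makes the hyphenated sequence
match as a single unit again, at the cost of doubling the number of
entries.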