2018-06-08 19:41 GMT+02:00 Richard Wordingham via Unicode < [email protected]>:
> On Fri, 8 Jun 2018 13:40:21 +0200
> Mark Davis ☕️ <[email protected]> wrote:
>
> > Mark
> >
> > On Fri, Jun 8, 2018 at 10:06 AM, Richard Wordingham via Unicode <
> > [email protected]> wrote:
> >
> > > On Fri, 8 Jun 2018 05:32:51 +0200 (CEST)
> > > Marcel Schneider via Unicode <[email protected]> wrote:
> > >
> > > > Thank you for confirming. All witnesses concur to invalidate the
> > > > statement about uniqueness of ISO/IEC 10646 ‐ Unicode synchrony. —
> > > > After being invented in its actual form, sorting was standardized
> > > > simultaneously in ISO/IEC 14651 and in Unicode Collation
> > > > Algorithm, the latter including practice‐oriented extra
> > > > features.
> > >
> > > The UCA contains features essential for respecting canonical
> > > equivalence. ICU works hard to avoid the extra effort involved,
> > > apparently even going to the extreme of implicitly declaring that
> > > Vietnamese is not a human language.
>
> > A bit over the top, eh?
>
> Then remove the "no known language" from the bug list, or declare that
> you don't know SE Asian languages.
>
> The root problem is that the UCA cannot handle syllable by syllable
> comparisons; if the UCA could handle that, the correct collation of
> unambiguous true Lao would become simple. The CLDR algorithm provides
> just enough memory to make Lao collation possible; however, ICU isn't
> fast enough to load a collation from customisation - it takes hours!
> One could probably do better if one added suffix contractions, but
> adding that capability might be nightmare.

The way tailoring is designed in CLDR, using only data consumed by a generic algorithm rather than a custom algorithm, is not the only way to collate Lao. You can perfectly well add new custom algorithm primitives that use new collation data rules and that are inserted as "hooks" in the UCA (which provides several points at which this is possible; the UCA just makes these hooks act as "no-ops").
You can be much faster if you create a specific library for Lao that would still be able to process the basic collation rules and then make more advanced inferences based on larger cluster boundaries than just those considered in the standard basic UCA. So it is perfectly possible to extend it to cover more complex Lao syllables and various specific quirks (such as hyphenation in the middle of clusters, as seen in some Indic scripts using left matras). Not everything has to be specified by the UCA itself, notably if it is specific to a script (or sometimes only to a single locale, i.e. a specific combination of a script, language, orthographic convention, and stylistic convention for some kinds of documents or presentations).
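
To make the "hook" idea concrete, here is a minimal Python sketch. It is not the UCA, CLDR or ICU API: the names noop_hook, lao_prevowel_hook and make_sort_key are hypothetical, and raw code points stand in for real collation weights. The only point is that a generic key builder can call a script-specific step that defaults to a no-op. The Lao hook assumes every prefixed vowel (U+0EC0..U+0EC4) is immediately followed by its consonant, which is a large simplification of real Lao syllable structure.

```python
from typing import Callable, Tuple

# Lao vowels that are stored before the consonant they follow in speech.
LAO_PREVOWELS = set("\u0EC0\u0EC1\u0EC2\u0EC3\u0EC4")

# Hypothetical hook type: rewrites the text before weights are assigned.
Hook = Callable[[str], str]


def noop_hook(text: str) -> str:
    """Default hook: leave the text untouched (the "no-op" case)."""
    return text


def lao_prevowel_hook(text: str) -> str:
    """Swap each prefixed vowel with the character that follows it.

    Assumption (toy only): the following character is the consonant.
    """
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LAO_PREVOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just reordered
        else:
            i += 1
    return "".join(chars)


def make_sort_key(hook: Hook = noop_hook) -> Callable[[str], Tuple[int, ...]]:
    """Build a sort-key function; code points stand in for real weights."""
    def sort_key(text: str) -> Tuple[int, ...]:
        return tuple(ord(ch) for ch in hook(text))
    return sort_key


words = ["\u0E82\u0EB2", "\u0EC0\u0E81"]  # KHO + AA, prevowel E + KO
generic_order = sorted(words, key=make_sort_key())
lao_order = sorted(words, key=make_sort_key(lao_prevowel_hook))
```

With the default hook the word starting with the prefixed vowel sorts last, because its first code point is the vowel; with the Lao hook it sorts first, because its consonant KO is compared before KHO. A real implementation would work on whole syllables and on proper collation elements, not on code points.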

