On Sun, 9 Aug 2015 21:14:38 +0200 Mark Davis ☕️ <[email protected]> wrote:
> Mark <https://google.com/+MarkDavis>
>
> *— Il meglio è l’inimico del bene —*
>
> On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham <
> [email protected]> wrote:
>
> > On Sun, 9 Aug 2015 17:10:01 +0200
> > Mark Davis ☕️ <[email protected]> wrote:

> > > For example, perhaps the addition of real data to CLDR for a
> > > "basic-validity-check" on a language-by-language basis.

> > CLDR is currently not useful. Are you really going to get Mayan
> > time formats when the script is encoded? Without them, there will
> > be no CLDR data.

> That is a misunderstanding. CLDR provides both locale (language)
> specific data for formatting, collation, etc., but also data about
> languages. It is not limited to the first.

I'm basing my statement on the 'minimal data commitment' listed in
http://cldr.unicode.org/index/cldr-spec/minimaldata . If there is a
sustained failure to provide 4 main date/time formats, the locale may
be removed.

> > > It might be
> > > possible to use a BNF grammar for the components, for which we are
> > > already set up.

> > Are you sure?

> I said "might be possible". That normally indicates that a degree of
> uncertainty. That is, "no, I'm not sure".

> There is no reason to be unnecessarily argumentative; it doesn't
> exactly encourage people to explore solutions to a problem.

I was responding to the 'for which we are already set up'. The problem
is that canonical equivalence can make it very difficult to specify a
syntax. The text segmentation appendices suggest that you have already
hit trouble with canonical equivalence; I suspect you have tools set up
to prevent such problems recurring.

With a view to analysing the requirements of the USE, I investigated
the effects of canonical equivalence on regular expressions. I
eventually discovered the relevant mathematical theory - it replaces
strings by 'traces', which for our purposes are fully decomposed
character strings modulo canonical equivalence.
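
As a rough illustration of what 'modulo canonical equivalence' means
here, the sketch below uses Python's standard unicodedata module. The
helper name canonically_equivalent is hypothetical and is not taken
from any tool mentioned in this thread. It shows that two combining
marks with different canonical combining classes commute (both orders
belong to the same trace), while two marks with the same class do not.

```python
import unicodedata

def canonically_equivalent(s, t):
    """Hypothetical helper: True if s and t have the same NFD form."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, combining class 230
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230

# Different combining classes: the two orders are canonically equivalent,
# so the marks commute.
commuting = canonically_equivalent(
    "a" + DOT_BELOW + DIAERESIS, "a" + DIAERESIS + DOT_BELOW
)

# Same combining class: normalization never reorders these marks, so the
# two orders are distinct strings and the marks do not commute.
non_commuting = canonically_equivalent(
    "a" + DIAERESIS + ACUTE, "a" + ACUTE + DIAERESIS
)

print(commuting, non_commuting)
```

A pattern that is meant to respect canonical equivalence has to accept
either order of a commuting pair, which is where the difficulty with
regular expressions comes from.
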
I found very little interest in the matter on this list. I gave the
example of the regular expression

[:InPC=Top:]*[:InPC=Bottom:]*

Usefully converting that expression to specify NFD equivalents in
accordance with UTS#18 Version 17 Section 2.1 is non-trivial, though it
is doable. I have a feeling that some have claimed that an expression
like that is already in NFD.

> I don't think any algorithmic description would get all and only those
> strings that would be acceptable to writers of the language. What
> you'd end up with is a mechanism that had three values: clearly ok
> (eg, cat), clearly bogus (eg, a\u0308\u0308\u0308\u0308), and
> somewhere in between.

What have you got against 8th derivatives? -:)

You are looking at a different issue to me. One of the issues is
rather that for a word of one syllable, there should only be one order
per meaning, appearance and pronunciation for a pair of non-commuting
combining marks. For non-Indic scripts, that is generally handled by
ensuring that different orders of non-commuting combining marks render
differently.

> If the goal for the script rules is to cover all languages customarily
> written with that script, one way to do that is to develop the
> language rules as they come, and make sure that the script rules are
> broadened if necessary for each language. But there is also utility
> to having the language rules, especially for high-frequency languages.

The language rules serve a different function. The sequence
"xxxxlttttuuupppp" is clearly not English, but it is a perfectly
acceptable string for sorting, searching and rendering.

Richard.

