Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-21 Thread Richard Wordingham
On Sat, 19 May 2012 01:12:17 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: This will then work for DUCET 6.1.0, work for Danish, and work for my mischievous 0302 COMBINING CIRCUMFLEX ACCENT+0067 LATIN SMALL LETTER G contraction. There is a very similar rule in CLDR for

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Richard Wordingham
On Sat, 19 May 2012 01:12:17 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: Just in case you haven't already thought of it, one reasonable scheme would be to decompose input if and only if searching for contractions or the input character could *hide* the start of a

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Richard Wordingham
On Sun, 20 May 2012 16:15:24 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: CORRECTION: For the general case, we ought to be able to express a rule such as 'ignore the countering of soft-dottedness', as in Lithuanian casing, but I don't see any finite method of expressing it

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Richard Wordingham
On Sun, 20 May 2012 17:05:00 +0100 Richard Wordingham richard.wording...@ntlworld.com wrote: CORRECTION to correction I wrote rules for soft-dotted indecomposable+0307+ccc=203 when, of course, I meant rules for soft-dotted indecomposable+0307+ccc=230 Sorry about that. Richard.

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-20 Thread Markus Scherer
Hi Richard, This is essentially the same problem as http://bugs.icu-project.org/trac/ticket/9319 right? (Contractions overlapping with decomposition mappings.) Would you mind adding a reply to that with the Lithuanian issue? Thanks, markus

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Thu, 17 May 2012 21:32:19 -0700 Markus Scherer markus@gmail.com wrote: On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: As I've already said, DUCET 6.1.0 omits a contraction for 0FB2+0F71, and so CE(0FB2, 0334, 0F71, 0F80) =

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Thu, 17 May 2012 21:32:19 -0700 Markus Scherer markus@gmail.com wrote: Ok, but assuming we didn't add 0FB2+0F71, why can't we add the contraction 0FB2+0F81 and have the 0334 and any other non-starter be handled via discontiguous matching? Time for me to make a pronouncement on

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Markus Scherer
Back to first principles. UCA conformance requires getting the same results as the Main Algorithm. This can be done easily with NFD input text, or by implementing Step 1 which normalizes the input to NFD. Everything else is a performance optimization, and there are trade-offs. We also want
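
As a note on the Step 1 remark above, here is a minimal Python sketch of what normalising to NFD does to the Tibetan vowel signs discussed in this thread. It uses only the standard `unicodedata` module. The helper names `to_nfd` and `describe` are illustrative and not part of any collation library, and the results reflect the Unicode data bundled with the Python build in use.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Normalise the input to NFD, as Step 1 of the UCA Main Algorithm does."""
    return unicodedata.normalize("NFD", text)


def describe(text: str) -> list[tuple[str, int]]:
    """Return (code point, canonical combining class) pairs for inspection."""
    return [(f"{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]


# Each of these composite vowel signs has combining class 0 itself,
# yet decomposes into a sequence of non-starters.
for composite in ("\u0F73", "\u0F75", "\u0F81"):
    print(describe(composite), "->", describe(to_nfd(composite)))
```

A collator that skips this step sees 0F73 as a single character, which is why contractions written against the decomposed form need extra care for un-normalised input.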

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Fri, 18 May 2012 09:51:34 -0700 Markus Scherer markus@gmail.com wrote: There is nothing that requires us to get correct results *without normalization* for all FCD strings or any other particular input conditions (except NFD input). So long as you don't claim conformance to the CLDR

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Mark Davis ☕
There is an action item from the UTC and CLDR committees to clarify the meanings of the setting; they are supposed to allow some degree of variation. -- Mark https://plus.google.com/114199149796022210033 *— Il meglio è l’inimico del bene —* On Fri, May 18,

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-18 Thread Richard Wordingham
On Fri, 18 May 2012 09:51:34 -0700 Markus Scherer markus@gmail.com wrote: On inspection, we think we can do better (and want to), probably by adding overlap contractions. If we get into trouble with that, we will think of alternatives. One is to decompose more characters even in FCD

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Asmus Freytag
On 5/16/2012 9:46 PM, Mark Davis ☕ wrote: No, it's not. Including x in Lao for some pedagogical (I'm guessing) purpose is completely out of scope. That'd be like including π in Latin because it sometimes occurs in the middle of English text.

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread vanisaac
From: Mark Davis ☕ m...@macchiato.com On Wed, May 16, 2012 at 9:20 PM, vanis...@boil.afraid.org wrote: From: Ken Whistler kenw_at_sybase.com Orthographies which mix in random characters from other scripts do not (or should not) drive the identity of characters for *scripts* per se. And

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
*Please* use a different email subject line for the x vs. Lao discussion. markus On Thu, May 17, 2012 at 1:57 AM, vanis...@boil.afraid.org wrote: Well, I was speaking of the general case, not this specific example. Orthographies which mix in random characters from other scripts do not, and

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Richard Wordingham
On Wed, 16 May 2012 16:03:08 -0700 Markus Scherer markus@gmail.com wrote: The problem is a contraction x+0F72 and input text x+0F73 where the inner 0F71 should be skipped. We can avoid this by adding a contraction for x+0F73 (and one for the equivalent x+0F71+0F72). On the other hand,

Mark-Driven Script Categorisation (was: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm)

2012-05-17 Thread Richard Wordingham
On Wed, 16 May 2012 21:46:17 -0700 Mark Davis ☕ m...@macchiato.com wrote: No, it's not. Including x in Lao for some pedagogical (I'm guessing) purpose is completely out of scope. That'd be like including π in Latin because it sometimes occurs in the middle of English text. No, it's more

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
On Thu, May 17, 2012 at 1:02 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: As x = 0F71, we also need the contractions of x+0F73 (or x+0F71+0F72) with 0F72, 0F74 and 0F80 to give the pair of long vowels. We don't need to worry about x+0F73,0F73 because that is not FCD. I am
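
The remark that x+0F73 followed by 0F73 is not FCD can be checked with a small sketch. It assumes the usual FCD test described in Unicode Technical Note #5: each character's leading combining class must not be lower than the previous character's trailing class, unless the leading class is zero. `lead_trail_ccc` and `is_fcd` are hypothetical helper names, and x is taken to be 0F71 as stated in the quoted message.

```python
import unicodedata


def lead_trail_ccc(ch: str) -> tuple[int, int]:
    """Combining classes of the first and last characters of ch's NFD form."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (unicodedata.combining(decomposed[0]),
            unicodedata.combining(decomposed[-1]))


def is_fcd(text: str) -> bool:
    """True if concatenating per-character decompositions stays in canonical order."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        # A non-zero leading class lower than the previous trailing class
        # means canonical reordering would be needed across the boundary.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


print(is_fcd("\u0F71\u0F73"))        # x+0F73 with x = 0F71
print(is_fcd("\u0F71\u0F73\u0F73"))  # x+0F73 followed by 0F73
```

Under this test the first string passes and the second fails, because the second 0F73 begins with 0F71 (class 129) after a trailing 0F72 (class 130).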

Re: Mark-Driven Script Categorisation (was: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm)

2012-05-17 Thread Philippe Verdy
2012/5/17 Richard Wordingham richard.wording...@ntlworld.com: On Wed, 16 May 2012 21:46:17 -0700 Mark Davis ☕ m...@macchiato.com wrote: No, it's not. Including x in Lao for some pedagogical (I'm guessing) purpose is completely out of scope. That'd be like including π in Latin because it

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Richard Wordingham
On Thu, 17 May 2012 13:39:08 -0700 Markus Scherer markus@gmail.com wrote: On Thu, May 17, 2012 at 1:02 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: As x = 0F71, we also need the contractions of x+0F73 (or x+0F71+0F72) with 0F72, 0F74 and 0F80 to give the pair of

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: If using DUCET, the collation elements for 0F71+0F71+0F72 are those for 0F73, 0F71, namely (at 6.1.0): [.2572.0020.0002.0F73][.2570.0020.0002.0F71]. The correct collation elements for FCD sequence
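
A simplified sketch of the blocking test behind discontiguous contraction matching may help with the 0F71+0F71+0F72 example. It checks only whether a non-starter is unblocked relative to an earlier character, following the UCA rule that an intervening character blocks if its combining class is zero or not lower than that of the candidate. `is_unblocked` is a hypothetical helper, and no collation element table is consulted.

```python
import unicodedata


def is_unblocked(text: str, start: int, index: int) -> bool:
    """True if text[index] is a non-starter not blocked from text[start]."""
    ccc = unicodedata.combining(text[index])
    if ccc == 0:
        return False  # starters never take part in discontiguous matching
    # Every intervening character must be a non-starter with a strictly
    # lower combining class than the candidate.
    return all(0 < unicodedata.combining(b) < ccc
               for b in text[start + 1:index])


nfd = unicodedata.normalize("NFD", "\u0F71\u0F71\u0F72")
print(is_unblocked(nfd, 0, 2))  # final 0F72 relative to the first 0F71

blocked = unicodedata.normalize("NFD", "\u0F71\u0F72\u0F72")
print(is_unblocked(blocked, 0, 2))  # second 0F72 sits behind an equal class
```

For 0F71+0F71+0F72 the final 0F72 is unblocked relative to the first 0F71, which is consistent with the quoted result that the contraction equivalent to 0F73 matches first and the skipped 0F71 follows it.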

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Richard Wordingham
On Thu, 17 May 2012 15:42:37 -0700 Markus Scherer markus@gmail.com wrote: On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: HOWEVER, you must *not* have the added contraction for 0F71+0F71. If we don't have this prefix contraction, then we will

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread Markus Scherer
On Thu, May 17, 2012 at 4:29 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: On Thu, 17 May 2012 15:42:37 -0700 Markus Scherer markus@gmail.com wrote: On Thu, May 17, 2012 at 3:00 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: HOWEVER, you must *not*

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Richard Wordingham
On Tue, 15 May 2012 21:33:03 -0700 Markus Scherer markus@gmail.com wrote: On Tue, May 15, 2012 at 4:42 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: I am puzzled as to how an implementation can compliantly implement the tailoring of normalisation in the UCA. I think

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Markus Scherer
On Wed, May 16, 2012 at 1:24 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: Section 5.1 of the UCA says that one may have a parametric normalisation tailoring. Aha :-) When you write "normalisation tailoring" it sounds like you are tailoring the normalization algorithm or

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Richard Wordingham
On Wed, 16 May 2012 09:17:51 -0700 Markus Scherer markus@gmail.com wrote: On Wed, May 16, 2012 at 1:24 AM, Richard Wordingham richard.wording...@ntlworld.com wrote: Section 5.1 of the UCA says that one may have a parametric normalisation tailoring. Section 5.1 is about runtime

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Ken Whistler
On 5/16/2012 2:54 PM, Richard Wordingham wrote: Similar remarks apply to 'reorder'. What if I move 'Q' and 'q' into the Cyrillic sequence? (I've a recollection that this letter is used in Kurdish written in Cyrillic.) Obsolete recollection. See: 051A;CYRILLIC CAPITAL LETTER

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Markus Scherer
On Wed, May 16, 2012 at 2:54 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: The tailoring 'locale' is not orthogonal. Well, right, that one selects the Collation Element Table :-) The tailoring 'caseFirst' rather reshuffles the tertiary weights. I am not entirely convinced

Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread vanisaac
From: Ken Whistler kenw_at_sybase.com On 5/16/2012 2:54 PM, Richard Wordingham wrote: I have been wondering if U+0078 LATIN SMALL LETTER X should be made common script because of its use for displaying Lao vowels, but perhaps the principle of separation of scripts should lead to LAO

Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread Mark Davis ☕
No, it's not. Including x in Lao for some pedagogical (I'm guessing) purpose is completely out of scope. That'd be like including π in Latin because it sometimes occurs in the middle of English text. -- Mark https://plus.google.com/114199149796022210033 *— Il

Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-15 Thread Richard Wordingham
I am puzzled as to how an implementation can compliantly implement the tailoring of normalisation in the UCA. Can an implementation be said to compliantly implement the tailoring of normalisation if nominally turning it off actually has no effect? If it can, my puzzlement goes away. Simply

Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-15 Thread Markus Scherer
On Tue, May 15, 2012 at 4:42 PM, Richard Wordingham richard.wording...@ntlworld.com wrote: I am puzzled as to how an implementation can compliantly implement the tailoring of normalisation in the UCA. I think you mean something like implement tailorings where contractions overlap with