On Fri, Jun 17, 2011 at 11:43 AM, Pablo Castro
<pablo.cas...@microsoft.com> wrote:
>
> From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On 
> Behalf Of Keean Schupke
> Sent: Tuesday, May 31, 2011 11:51 PM
>
>>> On 1 June 2011 01:37, Pablo Castro <pablo.cas...@microsoft.com> wrote:
>>>
>>> -----Original Message-----
>>> From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh 
>>> Gregor
>>> Sent: Tuesday, May 31, 2011 3:49 PM
>>>
>>> >> On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
>>> >> <pablo.cas...@microsoft.com> wrote:
>>> >> > No, that was poor wording on my part, I keep using "locale" in the 
>>> >> > wrong context. I meant to have the API take a proper collation 
>>> >> > identifier. The identifier can be as specific as the caller wants it 
>>> >> > to be. The implementation could choose to not honor some specific 
>>> >> > detail if it can't handle it (to the extent that doing so is allowed 
>>> >> > by the specification of collation names), or fail because it considers 
>>> >> > that not handling a particular aspect of the collation identifier 
>>> >> > would severely deviate from the caller's expectations.
>>> >>
>>> >> I'm not sure I understand you.  My personal opinion is that there
>>> >> should be no undefined behavior here.  If authors are allowed to pass
>>> >> collation identifiers, the spec needs to say exactly how they're to be
>>> >> interpreted, so the same identifier passed to two different browsers
>>> >> will result in the same collation, i.e., the same strings need to sort
>>> >> the same cross-browser.  Having only binary collation is better than
>>> >> having non-binary collations but not defining them, IMO.
>>> I thought BCP47 allowed implementations to drop subtags if needed. I just 
>>> re-read the spec, and it seems it only allows that in constrained cases 
>>> where the whole name won't fit in your buffer (which wouldn't apply to the 
>>> context discussed here). My first instinct is that full consistency in 
>>> collation is quite a bit to guarantee, but that seems to be what the spec 
>>> is shooting for.
>>>
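[For illustration only, not from the original messages: a BCP 47 tag carrying a 
Unicode collation extension looks like "de-DE-u-co-phonebk" (German as used in 
Germany, phonebook collation). Dropping the "-u-co-phonebk" subtags still 
leaves a well-formed tag, but silently changes the resulting sort order, which 
is the consistency concern being discussed above.]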
>>> >> > Given the amount of debate on this, could we at least agree that we 
>>> >> > can do binary for v1? We can then have an open item for v2 on taking 
>>> >> > collation names and sorting according to UCA, or on taking callbacks 
>>> >> > and such.
>>> >>
>>> >> I'm okay with supporting only binary to start with.
>>> Great. I'll still wait a bit to see what other folks think, and then update 
>>> the bug one way or the other.
>>>
>>> Thanks
>>> -pablo
>>>
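[For illustration only, not from the original messages: a quick JavaScript 
sketch of what "binary" ordering means in practice compared to a locale-aware 
collation. The example strings are made up here.]

    // Default Array.prototype.sort compares strings by UTF-16 code units,
    // which is the "binary" ordering discussed above.
    var words = ["zebra", "apple", "Éclair"];
    console.log(words.sort());
    // => ["apple", "zebra", "Éclair"]   ("É" is U+00C9, above all ASCII letters)
    // A UCA/locale-aware collation would instead yield something like
    // ["apple", "Éclair", "zebra"].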
>>> The discussion sounds like it is headed in the right direction. Are there 
>>> any issues with non-Unicode encodings that need to be dealt with (HTTP 
>>> headers default to ISO-8859-1, I think)? Would people be expected to 
>>> convert on read into UTF-16 strings, or to use typed arrays?
>
> I asked around here, and folks pointed out that the JavaScript spec seems to 
> describe exactly what we need. Looking at [1] (ECMA-262, section 11.8.5), the 
> relevant fragment starting at step 4 goes:
>
> Else, both px and py are Strings
>    a. If py is a prefix of px, return false. (A String value p is a prefix of 
> String value q if q can be the result of concatenating p and some other 
> String r. Note that any String is a prefix of itself, because r may be the 
> empty String.)
>    b. If px is a prefix of py, return true.
>    c. Let k be the smallest nonnegative integer such that the character at 
> position k within px is different from the character at position k within py. 
> (There must be such a k, for neither String is a prefix of the other.)
>    d. Let m be the integer that is the code unit value for the character at 
> position k within px.
>    e. Let n be the integer that is the code unit value for the character at 
> position k within py.
>    f. If m < n, return true. Otherwise, return false.
>
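[For illustration only, not part of the original message: a minimal JavaScript 
sketch of the step-4 comparison quoted above. The function name is made up, and 
it returns -1/0/1 in the style of a sort comparator rather than the spec's 
boolean result.]

    // Compare two strings the way the quoted algorithm does: by the first
    // differing UTF-16 code unit, with a proper prefix sorting before any
    // longer string that it prefixes.
    function compareStringKeys(px, py) {
      var len = Math.min(px.length, py.length);
      for (var k = 0; k < len; k++) {
        var m = px.charCodeAt(k);   // code unit at position k within px
        var n = py.charCodeAt(k);   // code unit at position k within py
        if (m !== n) return m < n ? -1 : 1;
      }
      // Otherwise one string is a prefix of the other (or they are equal).
      return px.length === py.length ? 0 : (px.length < py.length ? -1 : 1);
    }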
> It also has a note below indicating:
>
> NOTE 2 The comparison of Strings uses a simple lexicographic ordering on 
> sequences of code unit values. There is no attempt to use the more complex, 
> semantically oriented definitions of character or string equality and 
> collating order defined in the Unicode specification. Therefore String values 
> that are canonically equal according to the Unicode standard could test as 
> unequal. In effect this algorithm assumes that both Strings are already in 
> normalised form. Also, note that for strings containing supplementary 
> characters, lexicographic ordering on sequences of UTF-16 code unit values 
> differs from that on sequences of code point values.
>
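[For illustration only, not from the original messages: a concrete example of 
the NOTE 2 caveat about supplementary characters.]

    // "\uD800\uDC00" is U+10000 encoded as a surrogate pair; "\uFF5E" is a
    // single BMP code unit. By UTF-16 code units the supplementary character
    // sorts first (0xD800 < 0xFF5E), even though its code point (0x10000)
    // is the larger of the two.
    console.log("\uD800\uDC00" < "\uFF5E");   // true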
> This is very much in line with what we've been discussing, and it has the 
> extra benefit of being compatible with JavaScript's own string ordering.
>
> So it looks like we could reference (or inline) this in the spec and have a 
> fully specified order for keys with string content.
>
> Thoughts?

Sounds great! Thanks for doing the research here!

/ Jonas
