On Fri, Jun 17, 2011 at 11:43 AM, Pablo Castro <pablo.cas...@microsoft.com> wrote:
>
> From: keean.schu...@googlemail.com [mailto:keean.schu...@googlemail.com] On
> Behalf Of Keean Schupke
> Sent: Tuesday, May 31, 2011 11:51 PM
>
>>> On 1 June 2011 01:37, Pablo Castro <pablo.cas...@microsoft.com> wrote:
>>>
>>> -----Original Message-----
>>> From: simetri...@gmail.com [mailto:simetri...@gmail.com] On Behalf Of Aryeh
>>> Gregor
>>> Sent: Tuesday, May 31, 2011 3:49 PM
>>>
>>> >> On Tue, May 31, 2011 at 6:39 PM, Pablo Castro
>>> >> <pablo.cas...@microsoft.com> wrote:
>>> >> > No, that was poor wording on my part, I keep using "locale" in the
>>> >> > wrong context. I meant to have the API take a proper collation
>>> >> > identifier. The identifier can be as specific as the caller wants it
>>> >> > to be. The implementation could choose to not honor some specific
>>> >> > detail if it can't handle it (to the extent that doing so is allowed
>>> >> > by the specification of collation names), or fail because it considers
>>> >> > that not handling a particular aspect of the collation identifier
>>> >> > would severely deviate from the caller's expectations.
>>> >>
>>> >> I'm not sure I understand you. My personal opinion is that there
>>> >> should be no undefined behavior here. If authors are allowed to pass
>>> >> collation identifiers, the spec needs to say exactly how they're to be
>>> >> interpreted, so the same identifier passed to two different browsers
>>> >> will result in the same collation, i.e., the same strings need to sort
>>> >> the same cross-browser. Having only binary collation is better than
>>> >> having non-binary collations but not defining them, IMO.
>>>
>>> I thought BCP47 allowed implementations to drop subtags if needed. I just
>>> re-read the spec and it seems that it only allows to do that in constrained
>>> cases where you can't fit the whole name in your buffer (which wouldn't
>>> apply to the context discussed here). My first instinct is that this is
>>> quite a bit to guarantee (full consistency in collation), but it seems that
>>> that's what the spec is shooting for.
>>>
>>> >> > Given the amount of debate on this, could we at least agree that we
>>> >> > can do binary for v1? We can then have an open item for v2 on taking
>>> >> > collation names and sort according to UCA or taking callbacks and such.
>>> >>
>>> >> I'm okay with supporting only binary to start with.
>>>
>>> Great. I'll still wait a bit to see what other folks think, and then update
>>> the bug one way or the other.
>>>
>>> Thanks
>>> -pablo
>>>
>>> The discussion sounds like it is headed in the right direction. Are there
>>> any issues with non-unicode encodings that need to be dealt with (HTTP
>>> headers default to ISO-8859 I think). Would people be expected to convert
>>> on read into UTF-16 strings or use typed-arrays?
>
> I asked around here and folks actually pointed out that the JavaScript spec
> seems to be describing exactly what we needed. Looking at here [1], section
> 11.8.5, the relevant fragment starting at step 4 goes:
>
> Else, both px and py are Strings
> a. If py is a prefix of px, return false. (A String value p is a prefix of
>    String value q if q can be the result of concatenating p and some other
>    String r. Note that any String is a prefix of itself, because r may be the
>    empty String.)
> b. If px is a prefix of py, return true.
> c. Let k be the smallest nonnegative integer such that the character at
>    position k within px is different from the character at position k within py.
>    (There must be such a k, for neither String is a prefix of the other.)
> d. Let m be the integer that is the code unit value for the character at
>    position k within px.
> e. Let n be the integer that is the code unit value for the character at
>    position k within py.
> f. If m < n, return true. Otherwise, return false.
>
> It also has a note below indicating:
>
> NOTE 2 The comparison of Strings uses a simple lexicographic ordering on
> sequences of code unit values. There is no attempt to use the more complex,
> semantically oriented definitions of character or string equality and
> collating order defined in the Unicode specification. Therefore String values
> that are canonically equal according to the Unicode standard could test as
> unequal. In effect this algorithm assumes that both Strings are already in
> normalised form. Also, note that for strings containing supplementary
> characters, lexicographic ordering on sequences of UTF-16 code unit values
> differs from that on sequences of code point values.
>
> Which is very much in line with what we've been discussing, and has the extra
> feature of being compatible with JavaScript order.
>
> So it looks like we could reference (or inline) this in the spec and have a
> fully specified order for keys with string content.
>
> Thoughts?
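
[For reference, a minimal JavaScript sketch of the comparison described in the quoted step 4. This is an editor's illustration, not code from the thread, and the function name is made up; it compares strings by UTF-16 code unit, which is what the "<" operator and the proposed binary key order do.]

  // Compare two strings by UTF-16 code unit, per ES5 11.8.5 step 4.
  // Returns -1 if px sorts before py, 1 if it sorts after, 0 if they are equal.
  function compareByCodeUnit(px, py) {
    var len = Math.min(px.length, py.length);
    for (var k = 0; k < len; k++) {
      var m = px.charCodeAt(k); // code unit value at position k within px
      var n = py.charCodeAt(k); // code unit value at position k within py
      if (m !== n) return m < n ? -1 : 1;
    }
    // Otherwise one string is a prefix of the other (or they are equal);
    // the shorter one sorts first, matching steps 4a and 4b.
    return px.length === py.length ? 0 : (px.length < py.length ? -1 : 1);
  }
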
Sounds great! Thanks for doing the research here! / Jonas
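
[And a small illustration of the NOTE 2 caveat about supplementary characters; again an editor's sketch, with the example characters chosen purely for illustration. A string holding U+10000, stored as the surrogate pair 0xD800 0xDC00, sorts before U+FF61 under code-unit order even though its code point is larger.]

  var a = "\uD800\uDC00"; // U+10000, a supplementary character (surrogate pair)
  var b = "\uFF61";       // U+FF61, a single BMP code unit
  a < b;                  // true: first code units compare as 0xD800 < 0xFF61
  // By code point the order would be reversed, since 0x10000 > 0xFF61.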