On 23/06/2009, at 11:43 PM, Paul Davis wrote:
> Interesting point. I take this as a pretty clear reason for discarding
> UCA for member ordering. Normalization isn't affected by locale right?
> I haven't seen anything to suggest as such so I assume not.
IIUC each normalization form is strictly functional.
IMO, given that ICU provides normalization functions, CouchDB should
use them in this case, exposing the canonicalisation transformation,
and a shortcut producing the hash, as a client-accessible endpoint. I
say this from a 'why not do it right?' perspective.
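
As a rough illustration of the canonicalise-then-hash idea, here is a
minimal sketch in Python. It makes several assumptions that are not
settled in this thread: it uses Python's unicodedata module as a
stand-in for ICU's normalization functions, it picks NFC as the
normalization form, it picks SHA-256 as the hash, and the names
normalize, canonicalize and canonical_hash are made up for the example.
It also says nothing about canonical number formatting, which a real
canonical form would have to define.

```python
import hashlib
import json
import unicodedata


def normalize(value):
    """Recursively apply NFC to every string (keys and values)."""
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, list):
        return [normalize(item) for item in value]
    if isinstance(value, dict):
        # Note: two keys that normalize to the same string would collide
        # here; a real implementation would have to decide what to do.
        return {normalize(k): normalize(v) for k, v in value.items()}
    return value  # numbers, booleans and null pass through unchanged


def canonicalize(doc):
    """Return illustrative canonical UTF-8 bytes for a JSON document."""
    text = json.dumps(
        normalize(doc),
        sort_keys=True,         # Python compares str by code point
        ensure_ascii=False,     # keep non-ASCII characters as themselves
        separators=(",", ":"),  # no insignificant whitespace
    )
    return text.encode("utf-8")


def canonical_hash(doc):
    """Hash of the canonical form; SHA-256 is an assumption."""
    return hashlib.sha256(canonicalize(doc)).hexdigest()
```

With this sketch, a document containing the precomposed character
U+00E9 and one containing "e" followed by the combining acute accent
U+0301 produce the same canonical bytes and therefore the same hash,
and no locale is involved at any point.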
Member ordering could be binary, over either the code points (e.g. as
32-bit values) or the bytes of the UTF-8 representation. Given the ease
of creating a UTF-8 iterator, that is probably best. UTF-16 is the most
common native encoding, but you don't want to do a byte-level
collation over a UTF-16/32 encoding because the result is dependent on
byte ordering.
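
To make the ordering point concrete, here is a small Python sketch. The
sample keys and the variable names are chosen purely for illustration.
It sorts the same keys by code point, by UTF-8 bytes, and by the bytes
of the two UTF-16 byte orders.

```python
# Sample keys: ASCII, Latin-1 range, a BMP character above the
# surrogate range, and a supplementary-plane character.
keys = ["z", "a", "\u00e9", "\u0100", "\uff5e", "\U0001f600"]

# Python compares str values code point by code point.
by_codepoint = sorted(keys)

# Byte-wise comparison of the UTF-8 encoding.
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))

# Byte-wise comparison of UTF-16 in each byte order (no BOM).
by_utf16_be = sorted(keys, key=lambda s: s.encode("utf-16-be"))
by_utf16_le = sorted(keys, key=lambda s: s.encode("utf-16-le"))

for name, order in [
    ("code point", by_codepoint),
    ("UTF-8 bytes", by_utf8),
    ("UTF-16-BE bytes", by_utf16_be),
    ("UTF-16-LE bytes", by_utf16_le),
]:
    print(name, [f"U+{ord(c):04X}" for c in order])
```

The UTF-8 byte ordering matches the code point ordering, so the two
options above give the same result. The two UTF-16 orderings differ
from each other, and even the big-endian one differs from code point
order, because surrogate pairs sort below U+E000..U+FFFF.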
The problem with this is that the canonical form might look bizarre
for a non-ASCII document, but a canonical collation is by definition
always going to look wrong to someone. For the current use, as an
intermediate form destined only for hashing, this doesn't matter anyway.
Having said that, IMO it would be a good i18n feature to be able to set
the locale of a database, maybe even at the granularity of a view,
defaulting to the database's locale. The key ordering should respect
that locale. An option to normalize keys would also be a good idea.
The reason for setting a locale at the view level is that it might be
useful to create multiple views with different locales, to present
different localized result orderings to end users. One immediate issue
is that the locale would have to be injected into view servers to
prevent possible weirdness.
I think it's easier and better to do these kinds of things on the
server because you know you have the facilities to do it there (e.g.
ICU), whereas making it a client issue impedes use of the data by
different clients.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
On the other side, you have the customer and/or user, and they tend to
do what we call "automating the pain." They say, "What is it we're
doing now? How would that look if we automated it?" Whereas, what the
design process should properly be is one of saying, "What are the
goals we're trying to accomplish and how can we get rid of all this
task crap?"
-- Alan Cooper