On 29/12/2008, at 2:15 PM, Chris Anderson wrote:

>> Especially once CouchDB handles Unicode
>> collation properly.
>
> I wasn't aware there was a problem with CouchDB's Unicode collation.
> Is there a ticket you can point me to?

No, I haven't raised one. The issue is that collation cannot be specified per database, which IMO it needs to be, and I haven't seen anything in the code that deals with collation, i.e. I suspect it simply relies on the OS locale and ICU's default handling. I haven't thought about it enough to know whether persisted strings should be stored in a normalised form, but comparison certainly needs to use both normalisation and a specified collation order.
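To illustrate the normalisation half of that point, here is a minimal Python sketch. It only uses the standard library's unicodedata module; the helper name normalised_key is hypothetical, and it says nothing about what CouchDB actually does internally. It shows that two canonically equivalent strings compare unequal until they are normalised, and that plain code-point ordering is not a locale-aware collation order.

```python
import unicodedata

# "é" as a single precomposed code point, and as "e" + combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

def normalised_key(s, form="NFC"):
    """Hypothetical helper: return s in a Unicode normalisation form."""
    return unicodedata.normalize(form, s)

# Raw comparison treats the two spellings as different strings.
raw_equal = composed == decomposed

# After normalisation they compare equal.
normalised_equal = normalised_key(composed) == normalised_key(decomposed)

# Sorting by code point puts "é" (U+00E9) after "z" (U+007A), which is
# not what most language-specific collation orders would produce.
by_code_point = sorted(["zebra", "\u00e9clair", "apple"])
```

Normalisation alone does not give a collation order; that would still need a collator configured for a specific locale, which is the part I think should be selectable per database.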

It also affects which end-of-collation-order character one uses for prefix key searching, and would affect the computation of succ(string). That issue alone leads me to think that CouchDB needs to do more in this area, because it's quite difficult to fix in the client, whereas CouchDB is already fully Unicode-aware via ICU. As an example, I think the key boundary testing API could be richer, eliminating the need for the current key hacks, especially the use of a high-numeric-value Unicode character for prefix ranges.
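As a rough sketch of the two client-side approaches mentioned above, in Python: the first builds a prefix range by appending a high-valued sentinel character (U+FFF0 is assumed here as the sentinel; the function names are hypothetical), and the second computes an exclusive upper bound as succ(string). The successor here is computed purely by code point, so it is only correct if keys are compared in code-point order, which is exactly the assumption that breaks under a locale-specific collation.

```python
MAX_CODE_POINT = 0x10FFFF
HIGH_SENTINEL = "\ufff0"  # assumed sentinel; any "sorts last" character is a guess

def prefix_range_with_sentinel(prefix):
    """Return (startkey, endkey) for a prefix search using the sentinel hack."""
    return prefix, prefix + HIGH_SENTINEL

def succ_by_code_point(s):
    """Return the smallest string greater than every string starting with s,
    under code-point ordering only. Returns None if no such string exists."""
    chars = list(s)
    while chars:
        cp = ord(chars[-1])
        if cp < MAX_CODE_POINT:
            nxt = cp + 1
            # Skip the surrogate range, which holds no valid characters.
            if 0xD800 <= nxt <= 0xDFFF:
                nxt = 0xE000
            chars[-1] = chr(nxt)
            return "".join(chars)
        # Last character is already the maximum: drop it and carry.
        chars.pop()
    return None

start, end = prefix_range_with_sentinel("user:")
upper = succ_by_code_point("user:")
```

With the sentinel version the range would be queried inclusively (startkey to endkey); with the successor version the upper bound is exclusive. Neither is reliable in the client without knowing the collation order the server applies, which is why I'd rather the server offered this directly.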

As I say, I haven't thought enough about it to raise a ticket, but I feel strongly that it needs to be dealt with, and I suspect it's more obvious to me because I'm deploying for an Asian/Arabic-script localised environment.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

One should respect public opinion insofar as is necessary to avoid starvation and keep out of prison, but anything that goes beyond this is voluntary submission to an unnecessary tyranny.
  -- Bertrand Russell

