On 29/12/2008, at 2:15 PM, Chris Anderson wrote:
>> Especially once CouchDB handles Unicode
>> collation properly.
> I wasn't aware there was a problem with CouchDB's unicode collation.
> Is there a ticket you can point me to?
No, I haven't raised it. The issue is that collation cannot be
specified per db, which IMO it needs to be, and I haven't seen
anything in the code that does anything wrt collation, i.e. I suspect
it simply relies on the OS locale and ICU's default handling. I
haven't thought about it enough to know whether persisted strings
should be stored in a normalised form, but certainly comparison needs
should be stored in a normalized form, but certainly comparison needs
to use both normalisation and a specified collation order.
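
To make the normalisation half of that concrete, here is a minimal sketch in Python using only the standard library. It is not CouchDB code; the helper name normalised_key is hypothetical and exists only for illustration. It shows two canonically equivalent spellings of the same word comparing unequal as raw strings and equal once both are normalised. A specified collation order would need a collator (such as ICU's) applied on top of this, which is not shown.

```python
import unicodedata


def normalised_key(s: str, form: str = "NFC") -> str:
    """Hypothetical helper: return s in a Unicode normalisation form.

    Canonically equivalent strings map to the same result, so they
    compare equal afterwards.
    """
    return unicodedata.normalize(form, s)


composed = "caf\u00e9"      # 'é' stored as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# Raw comparison sees two different code point sequences.
raw_equal = composed == decomposed

# After normalisation both spellings collapse to the same sequence.
normalised_equal = normalised_key(composed) == normalised_key(decomposed)
```

Whether the stored form or only the comparison key is normalised is exactly the open question above; the sketch takes no position on it.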
It also affects what end-of-collation-order character one uses for
prefix key searching, and would affect the computation of
succ(string). That issue alone leads me to think that CouchDB needs to
do more in that area because it's quite difficult to fix in the
client, whereas CouchDB is already fully Unicode-aware via ICU. As an
example, I think the key boundary testing API could be richer,
eliminating the need for the current key hacks, especially the use of
a high-numeric-value Unicode character for prefix ranges.
As I say, I haven't thought enough about it to raise a ticket, but I
feel strongly that it needs to be dealt with, and I suspect it's more
obvious to me because I'm deploying for an Asian/Arabic-script
localised environment.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
One should respect public opinion insofar as is necessary to avoid
starvation and keep out of prison, but anything that goes beyond this
is voluntary submission to an unnecessary tyranny.
-- Bertrand Russell