On Fri, Aug 04, 2006 at 10:02:58PM -0700, Cory Nelson wrote:
> On 8/4/06, Trevor Talbot <[EMAIL PROTECTED]> wrote:
> >On 8/4/06, Cory Nelson <[EMAIL PROTECTED]> wrote:
> >
> >> But, since you brought it up - I have no expectations of SQLite
> >> integrating a full Unicode locale library, however it would be a great
> >> improvement if it would respect the current locale and use wcs*
> >> functions when available, or at least order by standard Unicode order
> >> instead of completely mangling things on UTF-8 codes.
> >
> >What do you mean by "standard Unicode order" in this context?
> >
> 
> Convert UTF-8 to UTF-16 (or both to UCS-4 if you want to be entirely
> correct) while sorting, to at least make them follow the same pattern.

Huh?

UTF-8 handled in the naive way (using "memcmp", like sqlite does) will
automagically give you sorting by Unicode codepoint (probably the only
useful meaning of "standard Unicode order" here).
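
A quick way to convince yourself of this, as a minimal sketch: Python's
bytes comparison stands in for "memcmp" plus a length tie-break here, and
the sample strings are arbitrary picks of mine, not anything sqlite uses.

```python
# Arbitrary sample characters spanning 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["z", "\U0010ffff", "\u20ac", "a", "\U00010000", "\u00e9", "\uffff"]

# Order by code point value.
by_code_point = sorted(samples, key=lambda s: [ord(c) for c in s])

# Order by raw UTF-8 bytes; bytes objects compare lexicographically,
# which is what a memcmp-style comparison does.
by_utf8_bytes = sorted(samples, key=lambda s: s.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
```

For these samples the two orderings come out identical, including the
characters above U+FFFF.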

UTF-16 handled in the naive way (either using "memcmp" or
lexicographically on 2-byte integers) will sort things by codepoint
within the BMP, but characters above U+FFFF are encoded as surrogate
pairs (0xD800-0xDFFF) and so sort before U+E000-U+FFFF, a weird order
that falls out of details of the UTF-16 standard accidentally.[1]
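
To make the UTF-16 quirk concrete, here is a small sketch comparing one
high BMP character against one supplementary character. The helper name
utf16_units and the two sample characters are my own choices for
illustration.

```python
def utf16_units(s):
    """Return the string as a list of 16-bit UTF-16 code units."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp_high = "\uff5e"            # U+FF5E, a single code unit 0xFF5E
supplementary = "\U00010000"   # U+10000, surrogate pair 0xD800 0xDC00

# Python compares str by code point, so this is the code point order.
code_point_order = bmp_high < supplementary

# Bytewise comparison of the UTF-8 encodings.
utf8_order = bmp_high.encode("utf-8") < supplementary.encode("utf-8")

# Lexicographic comparison on 2-byte integers.
utf16_order = utf16_units(bmp_high) < utf16_units(supplementary)
```

Here code_point_order and utf8_order agree (U+FF5E sorts first), while
utf16_order disagrees, because the leading surrogate 0xD800 is numerically
smaller than 0xFF5E.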

Perhaps you're using a legacy system that standardized on UTF-16
before the BMP ran out, and want to be compatible with its
idiosyncratic sorting -- then converting things to UTF-16 before
comparing makes sense.  But that's not really appropriate to make as a
general recommendation... better to convert UTF-16 to UTF-8, if you
want to be entirely correct :-).

[1] see e.g. http://icu.sourceforge.net/docs/papers/utf16_code_point_order.html

-- Nathaniel

-- 
Details are all that matters; God dwells there, and you never get to
see Him if you don't struggle to get them right. -- Stephen Jay Gould
