Re: [sqlite] Unicode support

Beau Wilkinson Tue, 17 Nov 2009 14:06:38 -0800

>> On 17 Nov 2009, at 5:52pm, Igor Tandetnik wrote:
>>
>>> But for your goals, it has to be sortable, right? In a proper
>>> Unicode collation, U+0041 U+0301 would behave quite differently from
>>> U+0301 U+0041. Consider "A ' E" (where ' stands for a combining
>>> acute accent). In most locales, this would sort between AE and BE.
>>> Now, if we reverse it naively, we'll end up with "E ' A", with the
>>> accent now attached to E and not A. The result would sort between EA
>>> and FA, rather than between EA and EB as you would probably want.
>>
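
To make the quoted example concrete, here is a minimal Python sketch. The helper names are made up for illustration, and grouping a base character with the combining marks that follow it is only a rough stand-in for real grapheme cluster segmentation.

```python
import unicodedata

def naive_reverse(s):
    # Reverses code point by code point, so a combining mark ends up
    # following (and attaching to) a different base letter.
    return s[::-1]

def cluster_reverse(s):
    # Hypothetical helper: keep each combining mark glued to the base
    # character that precedes it, then reverse the resulting clusters.
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

s = "A\u0301E"                   # A, combining acute accent, E
print(ascii(naive_reverse(s)))    # 'E\u0301A'  (accent now on E)
print(ascii(cluster_reverse(s)))  # 'EA\u0301'  (accent stays on A)
```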

I think that as a general rule, the "combining" accents should be disregarded
during collation.

For example: if a string contains the letter "a" plus a "combining acute 
accent," to me that seems like a hint that what we have is basically a letter 
"a," not a distinct letter with its own place in the collation sequence. This 
should be collated as an "a" that just happens to be accented, for whatever 
reason.

In Spanish, for example, a diaeresis is sometimes placed over the letter "U"
to show that the "U" is pronounced rather than silent, as in "pingüino." It
does not make the "U" into a different letter or significantly affect the
collation sequence. (At most, it is a tie-breaker between two otherwise
identical words.)
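
A rough sketch of that idea in Python, assuming a made-up sort_key() helper: the primary key ignores combining marks, and the decomposed original only breaks ties. This is not the Unicode Collation Algorithm, just an illustration of "accent as tie-breaker."

```python
import unicodedata

def strip_marks(s):
    # Decompose, then drop every combining mark (combining class != 0).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sort_key(s):
    # Hypothetical key: unaccented text first, accents only as tie-breaker.
    return (strip_marks(s).casefold(), unicodedata.normalize("NFD", s))

words = ["ping\u00fcino", "pinza", "pinguino", "pingo"]
print(sorted(words, key=sort_key))
```

With this key, "pinguino" and "pingüino" sort next to each other, after "pingo" and before "pinza", with the unaccented spelling first.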

So I think the Spanish diaeresis represents a legitimate use of the Unicode
"combining diaeresis." In fact, I would submit that encoding Spanish's "U with
diaeresis" using code point U+00FC is just wrong, in the same way that coding
the letter "O" as ASCII 0x30 (zero) is wrong. We do not need to worry about
cleaning up such a mistake in our collation code.
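
For reference, the two encodings in question look like this in Python. Unicode defines them as canonically equivalent, so normalization converts one into the other; same_text() is a made-up helper name for this sketch.

```python
import unicodedata

precomposed = "\u00fc"   # single code point: U WITH DIAERESIS
combining = "u\u0308"    # "u" followed by COMBINING DIAERESIS

def same_text(a, b):
    # Hypothetical helper: compare two strings after NFC normalization.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(precomposed == combining)            # False: different code points
print(same_text(precomposed, combining))   # True: canonically equivalent
print(ascii(unicodedata.normalize("NFD", precomposed)))  # 'u\u0308'
```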

In German and the Scandinavian languages, the opposite is true. Putting a
diaeresis over a letter makes a new letter, which collates differently.
"Combining accent" code points are not appropriate in these languages, and
their use should not be supported by a collation algorithm. Rather, these
letters should be encoded using single code points.

I think a better approach (to the design of Unicode) would have been for 
Spanish and German (for instance) to share absolutely nothing in the encoding 
standards. Each language ought to have its own little span of letters, 
immortalized into the standard in correct order-of-collation, with no sharing 
of "code points," "characters," or anything else.

Unicode screws this up, as it does with so many things, and this is a big
reason why it's widely reviled (or ignored) by many programmers. This is
editorial commentary, but I do not necessarily think it is irrelevant. I get
the feeling that something better than Unicode must be brewing somewhere.

Of course, sometimes bad standards have a life of their own, because they give
us license to refuse to implement things and still look smart in so refusing. I
suggest that this is a very detrimental pattern, though.

________________________________________
From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] On 
Behalf Of Igor Tandetnik [itandet...@mvps.org]
Sent: Tuesday, November 17, 2009 1:01 PM
To: sqlite-users@sqlite.org
Subject: Re: [sqlite] Unicode support

Simon Slavin <slav...@bigfraud.org> wrote:
> On 17 Nov 2009, at 6:37pm, Igor Tandetnik wrote:
>
>> Simon Slavin <slav...@bigfraud.org> wrote:
>>> First split the string into characters, then reassemble them in
>>> reverse order.
>>
>> The problem is, in Unicode it's not quite clear what constitutes a
>> "character". Are we talking about codepoints, sort elements,
>> graphemes? Depending on the application, either definition might
>> make sense.
>
> I agree about the problem, but sort elements is the obvious answer in
> this case.

This would mean that the result of the hypothetical flip() function would be 
locale-dependent. E.g. in Spanish Traditional sort, a combination 'ch' sorts as 
if it were a single letter between 'c' and 'd', forming a single sort element 
(a so-called contraction). So should 'a ch b' reverse to 'b ch a' under Spanish 
Traditional sort, and to 'b hc a' otherwise? Would you pass a desired locale as 
a parameter to flip(), in order to achieve that?
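
As a sketch of what such a locale-dependent flip() might look like (in Python rather than as an SQLite function), assume the caller passes the locale's contractions explicitly. The SPANISH_TRADITIONAL table and the helper names are hypothetical and stand in for real collation data.

```python
def sort_elements(s, contractions=()):
    # Split s into sort elements, treating each listed contraction as a
    # single unit. Longer contractions are tried first; empty ones ignored.
    ordered = sorted((c for c in contractions if c), key=len, reverse=True)
    elements = []
    i = 0
    while i < len(s):
        for c in ordered:
            if s.startswith(c, i):
                elements.append(c)
                i += len(c)
                break
        else:
            elements.append(s[i])
            i += 1
    return elements

def flip(s, contractions=()):
    # Reverse the order of sort elements, not of individual characters.
    return "".join(reversed(sort_elements(s, contractions)))

# Hypothetical stand-in for a locale's contraction table.
SPANISH_TRADITIONAL = ("ch", "ll")

print(flip("achb", SPANISH_TRADITIONAL))  # bcha
print(flip("achb"))                       # bhca
```

The same input gives two different answers depending on the table passed in, which is exactly the locale dependence in question.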

Igor Tandetnik

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

