On Mon, Nov 30, 2009 at 08:15:58AM +0000, Simon Slavin wrote:
> On 30 Nov 2009, at 1:58am, Igor Tandetnik wrote:
> > Note that Unicode collation is not as simple as you might think. Did
> > you know that in Estonian, 'y' sorts between 'i' and 'j'? Or that in
> > German phonebook sort, 'oe' sorts as if it were a single letter
> > between 'o' and 'p'? Basically, your simplistic approach would only
> > work for plain unaccented Latin letters and English collation rules.
>
> I spent a lot of time annoyed about this, and ended up deciding that
> the only way to do Unicode sorting correctly is to store the language
> (or collation method) with each piece of Unicode text. Of course,
> this still gives you the problem of working out which order two pieces
> of text go in if they are in two different languages. Perhaps you
> also need a 'default language' marker for the entire column.
Consider a column that contains a person's last name. Q: do proper
names have a language? A: No, since people can be from all over and
even within a single country may have last names of various radically
different origins.
Now consider a column that contains a person's self-description. Q:
what language will it be in? A: If you don't specify it, it could be in
one (or more!) language(s) of the person's choice. You might not need
to sort by such a column though...
IMO you'll have two types of text to sort: a) generic text (e.g., proper
names), b) localized text (e.g., message catalogs). For (a) you'll want
to pick a collation, _any_ collation. For (b) you'll know the language
of the relevant text and can then sort with a specific collation. If
you think there is a third class of text then you'll be best off forcing
it into (a) or (b), IMO, as statements like this make no sense:
CREATE TABLE foo (txt, lang);
...
-- this makes no sense:
SELECT txt FROM foo ORDER BY txt COLLATE lang; -- where lang is a
-- column of foo;
Unfortunately you can't parametrize collations in COLLATE clauses in
SQLite3! I.e., this gives an error:
sqlite> CREATE TABLE toy(a);
sqlite> EXPLAIN SELECT a FROM toy ORDER BY a COLLATE :a ASC;
SQL error: near ":a": syntax error
Therefore for (b) you'll have to compile statements for each collation
that you want to use. (Say you have per-language tables: you'll need
separate SELECT statements for each table. Say you have a single table
with a text column and another column indicating language: you'll need
to compile a statement for each language since the collation cannot be
parametrized.)
Making collation functions parametrizable might be a useful extension,
so you could then write:
sqlite> CREATE TABLE toy(txt, lang);
sqlite> explain SELECT txt FROM toy WHERE lang = :a ORDER BY txt COLLATE :b ASC;
where :a specifies some language and :b specifies a collation for that
language.
Note too that Unicode has codepoints for specifying the language that
the subsequent text is written in. Such codepoints could be used for
deriving a collation from some text. But again, I don't think this will
prove useful, both, for the reasons given above (SELECT ... ORDER BY
... COLLATE lang_of(...) makes no sense) and also because users won't
know how to ensure that such language tags are embedded in the text that
they write.
Nico
--
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users