On Mon, Apr 2, 2012 at 4:24 PM, Simon Slavin <[email protected]> wrote:
> On 2 Apr 2012, at 9:58pm, Alexey Pechnikov <[email protected]> wrote:
>> Description: Unicode string library for C
>> The 'libunistring' library implements Unicode strings (in the UTF-8,
>> UTF-16, and UTF-32 encodings), together with functions for
>> Unicode characters (character names, classifications, properties) and
>> functions for string processing (formatted output, width, word
>> breaks, line breaks, normalization, case folding, regular expressions).
>
> Trying to figure out what SQLite would want from Unicode characters, I don't
> end up with any of those. I think all it wants is sorting, so SQLite can
> make an index properly. And I don't really care whether it's case-sensitive
> or not since my software can do case conversion on input. Because they're in
> standard functions, string length and substring substitution would be nice
> but I can live without them working properly.
SQLite3 needs:
- string comparison with normalization-insensitivity, unless
  SQLite3 were to normalize TEXT values on INSERT/UPDATE
  (which I don't recommend, though for index keys normalization
  is effectively required; see below)
- string comparison with case-insensitivity as an option (for LIKE)
- string normalization and case-folding functions, which are
  needed for computing index key prefixes from the literal
  portions of LIKE and GLOB patterns, so that index cursors
  can be positioned correctly
- preferably a way to specify a collation for Unicode (i.e., a
  language, since collation rules may vary by language); see the
  sketch after this list
- preferably a way to specify not to use locale environment
variables (see Igor's comments)
- functionality needed to implement SQLite3's built-in string functions
- i.e., trim(), ltrim(), rtrim(), replace(), substr(), lower(),
upper(), min(), max(), and length()
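To make the collation point concrete, here is a minimal sketch of a
normalization-insensitive collation, plugged in through SQLite3's
existing sqlite3_create_collation() API and delegating to
libunistring's u8_normcmp().  The collation name, the choice of NFC,
and the error handling are mine for illustration, not anything
SQLite3 ships:

    /* Sketch: a collation that compares TEXT values as if both
     * sides were normalized to NFC.  Assumes libunistring. */
    #include <stdint.h>
    #include <sqlite3.h>
    #include <uninorm.h>    /* u8_normcmp(), UNINORM_NFC */

    static int nfc_compare(void *unused, int n1, const void *s1,
                           int n2, const void *s2)
    {
        int result = 0;
        (void)unused;
        if (u8_normcmp((const uint8_t *)s1, (size_t)n1,
                       (const uint8_t *)s2, (size_t)n2,
                       UNINORM_NFC, &result) != 0)
            return 0;   /* error path; a real collation must still
                         * be total and deterministic, so this needs
                         * proper handling */
        return result;  /* negative, zero, or positive */
    }

    int register_unicode_collation(sqlite3 *db)
    {
        return sqlite3_create_collation(db, "UNICODE_NFC",
                                        SQLITE_UTF8, NULL,
                                        nfc_compare);
    }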
Incidentally, length() claims to return a count of characters, but it
actually counts *codepoints*. Counting characters is a lot harder
than counting codepoints... Codepoint counting in UTF-* is trivial;
character counting requires tables of combining codepoint ranges and
code to skip combining codepoints. Counting graphemes is harder
still. Getting these things right is non-trivial. Ideally there
would be an option to the length() function to request counts of
different possible things: UTF-8 units (bytes), UTF-16 units,
codepoints, characters, glyphs, and graphemes, though just stopping at
characters would do.
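Just to show how trivial codepoint counting is in UTF-8 -- every byte
except the 10xxxxxx continuation bytes begins a codepoint, so it's a
one-pass scan:

    #include <stddef.h>

    /* Count codepoints in a UTF-8 buffer by skipping continuation
     * bytes (those of the form 10xxxxxx).  Assumes well-formed
     * UTF-8; this is the "codepoints" count above, not characters. */
    size_t utf8_codepoint_count(const unsigned char *s, size_t nbytes)
    {
        size_t n = 0;
        for (size_t i = 0; i < nbytes; i++)
            if ((s[i] & 0xC0) != 0x80)
                n++;
        return n;
    }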
Similar comments apply to string indices in functions like substr()!
In practice one generally wants to count characters when dealing with
substring operations, but storage units when dealing with
transmission. Using codepoint counts in substr() risks breaking
combining codepoint sequences and thus producing garbage.
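To see that failure mode concretely, take 'e' followed by U+0301
COMBINING ACUTE ACCENT: one user-perceived character, but two
codepoints:

    #include <stdio.h>

    int main(void)
    {
        /* U+0065 'e' + U+0301 COMBINING ACUTE ACCENT: one
         * character, two codepoints, three UTF-8 bytes. */
        const char *e_acute = "e\xCC\x81";
        /* A codepoint-based substr(s, 1, 1) takes just the first
         * codepoint, which here is the first byte: */
        printf("%.1s\n", e_acute);  /* prints a bare 'e'; the
                                     * accent is silently lost */
        return 0;
    }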
I think the OpenSolaris u8_textprep code is good enough for the
collation requirements, but it probably isn't sufficient for the
SQLite3 string functions; I'd have to look carefully. I suspect
that ICU and libunistring meet all the requirements.
> One problem is that, as someone explained to me last year, sorting of unicode
> characters depends on which language you're using (and other things if you're
> fussy). So for every index you make you'd have to declare the language, and
> SQLite would have to store it.
SQLite3 allows you to specify collations though, so that's not that
big a deal. For a web application, say, it's very difficult to
implement sorting that satisfies all possible users: no single index
collation will suit everyone, not unless you were willing to maintain
a multitude of indexes, one per collation. Sorting, then, has to be
done on result sets -- that is, without the benefit of indexes in
most cases, which means it will be slow for any queries that return
large row sets.
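For completeness: once a collation like the UNICODE_NFC sketch above
is registered, an index can be declared with it, at which point sorts
and lookups through that index get the Unicode order. The table and
column here are made up, and the collation has to be registered on
every connection that touches the index:

    #include <sqlite3.h>

    int create_unicode_index(sqlite3 *db)
    {
        return sqlite3_exec(db,
            "CREATE INDEX IF NOT EXISTS people_name_unicode"
            " ON people(name COLLATE UNICODE_NFC)",
            NULL, NULL, NULL);
    }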
In practice, though, this is not that big a deal. And there will be a
tendency to simplify collations. For example, the Royal Spanish
Academy no longer requires that 'ch' sort after 'c', nor that 'll'
sort after 'l' [*]. I suspect most users won't really care, but
whether they do will depend on the application and the user.
> I was trying to figure out whether SQLite could make use of the OS's unicode
> library (using different compilation directives for each platform which
> supports unicode) but I'm only really familiar with the Mac operating system
> and I don't know how Windows or Linux does these things.
There are no standard C libraries that deal with Unicode in sufficient
detail. In particular, the wchar_t functions are useless for the
purposes of SQLite3 because they try to hide too much detail, and
because in some cases they attempt to hide even the codeset.
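A quick illustration of the hiding: wchar_t's width and encoding are
implementation-defined -- 16-bit UTF-16 units on Windows, 32-bit
codepoints on most Unix systems -- so even wcslen() counts different
things depending on the platform:

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Prints 2 on Windows, typically 4 elsewhere; the encoding
         * behind it is just as implementation-defined as the width. */
        printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
        return 0;
    }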
[*] http://servicios.larioja.com/romanpaladino/g02.htm claims that the
Academy changed this in 1994, and that people started noticing this in
phone books in 1996, and that they complained.
http://es.wikipedia.org/wiki/Ortograf%C3%ADa_del_espa%C3%B1ol goes
into more detail. The 'ch' and 'll' digraphs stopped sorting as
separate letters in 1994, and from 2010 forwards are no longer
considered distinct letters in the alphabet (though they'd always been
encoded as codepoint pairs in all codesets, so this only really
affects primary school teachers).
And the normative reference for this change is
http://www.rae.es/rae/gestores/gespub000018.nsf/%28voAnexos%29/arch8100821B76809110C12571B80038BA4A/$File/CuestionesparaelFAQdeconsultas.htm#novOrto1
Nico