Re: [sqlite] Using ICU case folding support

Dan Kennedy Wed, 19 Mar 2014 08:34:06 -0700

On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:


I've created test database:

sqlite> CREATE TABLE test (x COLLATE NOCASE);
sqlite> INSERT INTO test VALUES ('s');
sqlite> INSERT INTO test VALUES ('S');
sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic

Then created index in ICU-disabled SQLite version:

sqlite> SELECT 'ё' LIKE 'Ё';
0
sqlite> .schema
CREATE TABLE test (x COLLATE NOCASE);
sqlite> CREATE INDEX idx_x ON test (x);

Then tried it in ICU-enabled SQLite version:


ICU-enabled or nunicode-enabled?

ICU does not modify the behaviour of existing collation sequences. So there is no problem there (apart from the original problem - that the ICU extension does not provide anything that can be used to create a case-independent collation sequence).
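
For reference, the built-in NOCASE collation folds only the 26 ASCII letters, which is why the test table above treats 's' and 'S' as equal but not 'ё' and 'Ё'. A minimal check, assuming Python's standard sqlite3 module (my choice for illustration, nothing in this thread depends on it); the helper name nocase_equal is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def nocase_equal(a, b):
    """Return True if SQLite's built-in NOCASE collation treats a and b as equal."""
    # The postfix COLLATE operator makes the comparison use NOCASE.
    row = conn.execute("SELECT ? = ? COLLATE NOCASE", (a, b)).fetchone()
    return bool(row[0])

# ASCII letters are folded by NOCASE.
ascii_pair = nocase_equal("s", "S")

# U+0451 and U+0401 (Cyrillic io, small and capital) are not folded.
cyrillic_pair = nocase_equal("\u0451", "\u0401")

print(ascii_pair, cyrillic_pair)
```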


An index is a sorted list. And queries like this:

sqlite> SELECT * FROM test WHERE x = 'ё';

do a binary search of that list to find keys equal to 'ё'. But to do a binary search of an ordered list, you need to be using a comparison function compatible with that used to sort the list in the first place. Say I have the following list, sorted using a unicode aware NOCASE collation:


  (Ä, ä, Ë, ë, f)

Also assume that all characters in the list have umlauts adorning them.

Then I open the db using regular SQLite and try searching for "ä". Obviously the binary search fails - the first comparison compares the seek key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" and goes on to search the right-hand side of the index. The search fails.
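
The failed search can be reproduced with a few lines of Python. This is only a sketch of the idea, not SQLite's b-tree code: str.casefold stands in for a unicode-aware NOCASE collation, plain string comparison stands in for the byte-wise ordering regular SQLite applies to non-ASCII text, and the trailing "f" is replaced by "ö" so that the list really is sorted under this simple stand-in comparator. All function names are made up for the example.

```python
def compare_binary(a, b):
    """Code point order; for UTF-8 text this matches byte-wise comparison."""
    return (a > b) - (a < b)

def compare_unicode_nocase(a, b):
    """Stand-in for a unicode-aware NOCASE collation (assumption: str.casefold)."""
    return compare_binary(a.casefold(), b.casefold())

def binary_search(keys, target, compare):
    """Return the index of a key equal to target, or -1 if the search misses."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = compare(target, keys[mid])
        if c == 0:
            return mid
        if c < 0:
            hi = mid - 1   # continue in the left-hand side
        else:
            lo = mid + 1   # continue in the right-hand side
    return -1

# Ä, ä, Ë, ë, ö: sorted under compare_unicode_nocase, not under compare_binary.
index_keys = ["\u00c4", "\u00e4", "\u00cb", "\u00eb", "\u00f6"]
seek_key = "\u00e4"  # ä

# Same comparator that sorted the list: the key is found.
found_with_original = binary_search(index_keys, seek_key, compare_unicode_nocase)

# Different comparator: the first probe hits Ë, decides ä is larger, goes right.
found_with_binary = binary_search(index_keys, seek_key, compare_binary)

print(found_with_original, found_with_binary)
```

With the comparator that sorted the list, the search lands on a matching key; with the byte-wise one it returns -1 even though "ä" is sitting at position 1.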

Then say this search is part of a delete operation to remove a row from the database. The table row itself might be removed correctly, but the corresponding index key is not - because a search fails to find it. At that point you have an inconsistent table and index. A corrupt database.
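
A toy model of that delete, again in Python and again only a sketch: the "table" is a dict keyed by rowid, the "index" is a list of (key, rowid) pairs sorted with a casefold-based comparator, and none of the names correspond to anything inside SQLite. Real SQLite may behave differently in detail (newer versions can raise an error instead of silently leaving the key behind), but the mechanism is the same.

```python
def compare_binary(a, b):
    """Code point order; for UTF-8 text this matches byte-wise comparison."""
    return (a > b) - (a < b)

def compare_unicode_nocase(a, b):
    """Stand-in for a unicode-aware NOCASE collation (assumption: str.casefold)."""
    return compare_binary(a.casefold(), b.casefold())

def find(index, key, compare):
    """Binary search over (key, rowid) entries; return the position or -1."""
    lo, hi = 0, len(index) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        c = compare(key, index[mid][0])
        if c == 0:
            return mid
        if c < 0:
            hi = mid - 1
        else:
            lo = mid + 1
    return -1

def delete_row(table, index, rowid, compare):
    """Remove the table row, then remove its index entry if the search finds it."""
    key = table.pop(rowid)
    pos = find(index, key, compare)
    if pos != -1:
        del index[pos]

def orphans_after_delete(compare):
    """Delete rowid 1 using `compare`; return rowids left in the index only."""
    table = {1: "\u00e4", 2: "\u00cb", 3: "\u00f6"}          # ä, Ë, ö
    index = [(key, rowid) for rowid, key in table.items()]  # in nocase order
    delete_row(table, index, 1, compare)
    return [rowid for _key, rowid in index if rowid not in table]

orphans_same = orphans_after_delete(compare_unicode_nocase)
orphans_diff = orphans_after_delete(compare_binary)
print(orphans_same, orphans_diff)
```

With the original comparator nothing is left behind. With the byte-wise comparator the index still holds an entry for rowid 1 after the row is gone, which is the inconsistency described above.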

In the future, we might have a similar problem in FTS. FTS offers a home-grown tokenizer named "unicode61" that folds case in the same unicode-aware way as nunicode. If the unicode standard changes to define more pairs of case equivalent characters, we will not be able to simply upgrade "unicode61". For the same reasons - modifying the comparison function creates an incompatible system. Instead, we would name it "unicode62" or similar, to be sure that databases created using the old version continue to use it.


Dan.

