On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:

I've created test database:

sqlite> CREATE TABLE test (x COLLATE NOCASE);
sqlite> INSERT INTO test VALUES ('s');
sqlite> INSERT INTO test VALUES ('S');
sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic

Then created index in ICU-disabled SQLite version:

sqlite> SELECT 'ё' LIKE 'Ё';
0
sqlite> .schema
CREATE TABLE test (x COLLATE NOCASE);
sqlite> CREATE INDEX idx_x ON test (x);

Then tried it in ICU-enabled SQLite version:

ICU-enabled or nunicode-enabled?

ICU does not modify the behaviour of existing collation sequences. So there is no problem there (apart from the original problem - that the ICU extension does not provide anything that can be used to create a case-independent collation sequence).

An index is a sorted list. And queries like this:

sqlite> SELECT * FROM test WHERE x = 'ё';

do a binary search of that list to find keys equal to 'ё'. But to do a binary search of an ordered list, you need to be using a comparison function compatible with that used to sort the list in the first place. Say I have the following list, sorted using a unicode aware NOCASE collation:

  (Ä, ä, Ë, ë, f)

Also assume that all characters in the list have umlauts adorning them.

Then I open the db using regular SQLite and try searching for "ä". Obviously the binary search fails - the first comparison compares the seek key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" and goes on to search the right-hand side of the index. The search fails.

Then say this search is part of a delete operation to remove a row from the database. The table row itself might be removed correctly, but the corresponding index key is not - because a search fails to find it. At that point you have an inconsistent table and index. A corrupt database.

In the future, we might have a similar problem in FTS. FTS offers a home-grown tokenizer named "unicode61" that folds case in the same unicode-aware way as nunicode. If the unicode standard changes to define more pairs of case equivalent characters, we will not be able simply upgrade "unicode61". For the same reasons - modifying the comparison function creates an incompatible system. Instead, we would name it "unicode62" or similar, to be sure that databases created using the old version continue to use it.

Dan.

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to