On Wed, Mar 19, 2014 at 8:36 AM, Alex Loukissas <a...@maginatics.com> wrote:
> On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy <danielk1...@gmail.com>wrote: > >> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote: >> >>> >>> I've created test database: >>> >>> >>> sqlite> CREATE TABLE test (x COLLATE NOCASE); >>> sqlite> INSERT INTO test VALUES ('s'); >>> sqlite> INSERT INTO test VALUES ('S'); >>> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic >>> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic >>> >>> Then created index in ICU-disabled SQLite version: >>> >>> sqlite> SELECT 'ё' LIKE 'Ё'; >>> 0 >>> sqlite> .schema >>> >>> CREATE TABLE test (x COLLATE NOCASE); >>> sqlite> CREATE INDEX idx_x ON test (x); >>> >>> Then tried it in ICU-enabled SQLite version: >>> >> >> ICU-enabled or nunicode-enabled? >> >> ICU does not modify the behaviour of existing collation sequences. So >> there is no problem there (apart from the original problem - that the ICU >> extension does not provide anything that can be used to create a >> case-independent collation sequence). >> >> An index is a sorted list. And queries like this: >> >> >> sqlite> SELECT * FROM test WHERE x = 'ё'; >>> >> >> do a binary search of that list to find keys equal to 'ё'. But to do a >> binary search of an ordered list, you need to be using a comparison >> function compatible with that used to sort the list in the first place. Say >> I have the following list, sorted using a unicode aware NOCASE collation: >> >> (Ä, ä, Ë, ë, f) >> >> Also assume that all characters in the list have umlauts adorning them. >> >> Then I open the db using regular SQLite and try searching for "ä". >> Obviously the binary search fails - the first comparison compares the seek >> key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë" >> and goes on to search the right-hand side of the index. The search fails. >> >> Then say this search is part of a delete operation to remove a row from >> the database. The table row itself might be removed correctly, but the >> corresponding index key is not - because a search fails to find it. At that >> point you have an inconsistent table and index. A corrupt database. >> >> In the future, we might have a similar problem in FTS. FTS offers a >> home-grown tokenizer named "unicode61" that folds case in the same >> unicode-aware way as nunicode. If the unicode standard changes to define >> more pairs of case equivalent characters, we will not be able simply >> upgrade "unicode61". For the same reasons - modifying the comparison >> function creates an incompatible system. Instead, we would name it >> "unicode62" or similar, to be sure that databases created using the old >> version continue to use it. >> >> >> Dan. >> >> _______________________________________________ >> sqlite-users mailing list >> sqlite-users@sqlite.org >> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users >> > > Thanks everyone for your comments. IIUC, the correct way of going about > what I want to do is to use BINARY collation on the column I'm interested > in and when I want to do unicode-aware case-insensitive lookups, they > should look something like SELECT * FROM table WHERE LOWER(col_name) = > LOWER(key), correct? It seems like with ICU support, LOWER( ) will call > u_foldCase under the covers, which is what I want. > > Alex > > Actually, it looks like the ICU extension doesn't provide what I want here and the preferred way forward is to define my own collation sequence, as described in http://www.sqlite.org/datatype3.html and use direct calls to ICU caseCompare (which does case folding underneath). I would suggest that a new collation sequence is added to sqlite with ICU extension enabled (e.g. ICU_NOCASE) - will be very useful IMHO. Is there a proper way of filing such feature requests? Thanks! Alex -- Alex Loukissas www.maginatics.com _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users