On Wed, Mar 19, 2014 at 8:36 AM, Alex Loukissas <a...@maginatics.com> wrote:

> On Wed, Mar 19, 2014 at 8:32 AM, Dan Kennedy <danielk1...@gmail.com>wrote:
>
>> On 03/19/2014 09:44 PM, Aleksey Tulinov wrote:
>>
>>>
>>> I've created test database:
>>>
>>>
>>> sqlite> CREATE TABLE test (x COLLATE NOCASE);
>>> sqlite> INSERT INTO test VALUES ('s');
>>> sqlite> INSERT INTO test VALUES ('S');
>>> sqlite> INSERT INTO test VALUES ('ё'); -- Russian e with diacritic
>>> sqlite> INSERT INTO test VALUES ('Ё'); -- Russian E with diacritic
>>>
>>> Then created index in ICU-disabled SQLite version:
>>>
>>> sqlite> SELECT 'ё' LIKE 'Ё';
>>> 0
>>> sqlite> .schema
>>>
>>> CREATE TABLE test (x COLLATE NOCASE);
>>> sqlite> CREATE INDEX idx_x ON test (x);
>>>
>>> Then tried it in ICU-enabled SQLite version:
>>>
>>
>> ICU-enabled or nunicode-enabled?
>>
>> ICU does not modify the behaviour of existing collation sequences. So
>> there is no problem there (apart from the original problem - that the ICU
>> extension does not provide anything that can be used to create a
>> case-independent collation sequence).
>>
>> An index is a sorted list. And queries like this:
>>
>>
>>  sqlite> SELECT * FROM test WHERE x = 'ё';
>>>
>>
>> do a binary search of that list to find keys equal to 'ё'. But to do a
>> binary search of an ordered list, you need to be using a comparison
>> function compatible with that used to sort the list in the first place. Say
>> I have the following list, sorted using a unicode aware NOCASE collation:
>>
>>   (Ä, ä, Ë, ë, f)
>>
>> Also assume that all characters in the list have umlauts adorning them.
>>
>> Then I open the db using regular SQLite and try searching for "ä".
>> Obviously the binary search fails - the first comparison compares the seek
>> key "ä" with "Ë", incorrectly concludes that the key "ä" is larger than "Ë"
>> and goes on to search the right-hand side of the index. The search fails.
>>
>> Then say this search is part of a delete operation to remove a row from
>> the database. The table row itself might be removed correctly, but the
>> corresponding index key is not - because a search fails to find it. At that
>> point you have an inconsistent table and index. A corrupt database.
>>
>> In the future, we might have a similar problem in FTS. FTS offers a
>> home-grown tokenizer named "unicode61" that folds case in the same
>> unicode-aware way as nunicode. If the unicode standard changes to define
>> more pairs of case equivalent characters, we will not be able simply
>> upgrade "unicode61". For the same reasons - modifying the comparison
>> function creates an incompatible system. Instead, we would name it
>> "unicode62" or similar, to be sure that databases created using the old
>> version continue to use it.
>>
>>
>> Dan.
>>
>> _______________________________________________
>> sqlite-users mailing list
>> sqlite-users@sqlite.org
>> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>>
>
> Thanks everyone for your comments. IIUC, the correct way of going about
> what I want to do is to use BINARY collation on the column I'm interested
> in and when I want to do unicode-aware case-insensitive lookups, they
> should look something like SELECT * FROM table WHERE LOWER(col_name) =
> LOWER(key), correct? It seems like with ICU support, LOWER( ) will call
> u_foldCase under the covers, which is what I want.
>
> Alex
>
>
Actually, it looks like the ICU extension doesn't provide what I want here
and the preferred way forward is to define my own collation sequence, as
described in http://www.sqlite.org/datatype3.html and use direct calls to
ICU caseCompare (which does case folding underneath).

I would suggest that a new collation sequence is added to sqlite with ICU
extension enabled (e.g. ICU_NOCASE) - will be very useful IMHO. Is there a
proper way of filing such feature requests?

Thanks!
Alex

-- 
Alex Loukissas

www.maginatics.com
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to