On Wed, May 23, 2012 at 2:11 PM, Courtney Grimland <
cgriml...@cfa.harvard.edu> wrote:

> I've downloaded and compiled icu.c according to the instructions in the
> included README (though I had to add -fPIC to the compiler options).
>
> Now, when searching a table, I'm not getting the kind of
> diacritic-insensitive behavior I was expecting:
>

The ICU extension makes use of the u_foldCase() function inside of the LIKE
routine.  u_foldCase() only does "simple" case folding.  It does not do
"full" case folding (hence "'FUSSBALL' LIKE 'fußball'" is FALSE) nor does
u_foldCase() remove diacritics.  Nor does u_foldCase() handle
locale-specific case folding associated with Turkish.  Nor does it do
context-sensitive case folding as is sometimes required in Greek.
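
As a rough illustration of the "simple" versus "full" distinction, here is a small Python sketch. Python is used only for demonstration and is not part of the ICU extension. str.casefold() performs full case folding, while str.lower() serves here as a stand-in for a folding that leaves 'ß' alone; it is not claimed to behave exactly like u_foldCase().

```python
# Full case folding (str.casefold) maps 'ß' to 'ss'.  A folding that keeps
# 'ß' as a single code point cannot make these two strings compare equal.
a = "FUSSBALL"
b = "fu\u00dfball"  # 'fußball'

full_equal = a.casefold() == b.casefold()  # both sides fully folded
rough_equal = a.lower() == b.lower()       # 'ß' is left unchanged by lower()

print("full folding equal: ", full_equal)
print("rough folding equal:", rough_equal)
```

With full folding the two spellings compare equal; with the rougher folding they do not, which matches the FALSE result described above.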

I have no idea what other database engines do here.  Does anybody else know?

One can easily see the need to do matching that ignores diacritics.  In
fact, Dan and I were both hard at work on that problem when your email
arrived.  But it is truly a hard problem.

What if the strings are not in Unicode NFC (Normal Form C)?  Should LIKE
convert them to NFC first?  (Can you say "runs slower and uses more
memory"?)
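
To make the normalization question concrete, the following Python sketch (an illustration only, using the standard unicodedata module) shows the same visible name stored in two different ways. The strings are made-up sample data.

```python
import unicodedata

composed = "Andr\u00e9"     # 'é' as one precomposed code point (already NFC)
decomposed = "Andre\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render identically but differ code point by code point.
same_raw = composed == decomposed

# Normalizing both sides to NFC makes them comparable, at the cost of an
# extra pass over (and usually a copy of) each string.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))

print(len(composed), len(decomposed), same_raw, same_nfc)
```

A LIKE that compares code points without normalizing would treat these two strings as different, even though a reader would see the same text.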

Should LIKE do full case folding, rather than just the simple case folding
that u_foldCase() provides?  Understand that full case folding will
sometimes cause single code points to be translated into three or four code
points.  This adds interesting complications when trying to match the "_"
wildcard character in LIKE.
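
The expansion problem can be sketched in Python, whose str.casefold() does full case folding. The regular expression below is only a stand-in for the LIKE pattern 'stra_e' (with "." playing the role of "_"); it is an assumption made for illustration and says nothing about how SQLite implements LIKE.

```python
import re

# Some single code points grow when fully case folded.
samples = ("\u00df", "\ufb03", "\u0390")  # ß, the ffi ligature, ΐ
expansion = {ch: len(ch.casefold()) for ch in samples}
for ch, n in expansion.items():
    print(f"U+{ord(ch):04X} folds to {n} code point(s)")

# A one-character wildcard that matched the original text no longer
# matches once the text has been folded, because 'ß' became 'ss'.
text = "stra\u00dfe"  # 'straße'
pattern = "stra.e"    # stand-in for LIKE 'stra_e'
match_before = re.fullmatch(pattern, text) is not None
match_after = re.fullmatch(pattern, text.casefold()) is not None
print(match_before, match_after)
```

So an implementation has to decide whether "_" counts code points in the stored text or in the folded text, and the two answers differ.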

And why stop with just diacritic removal?  Why not do full transliteration
after the fashion of unidecode?
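
For comparison, diacritic removal on its own can be approximated outside of LIKE. The sketch below is a workaround written for illustration, not something the ICU extension provides: strip_marks is a hypothetical helper registered through Python's sqlite3 module, and the table holds made-up sample data modeled on the query in the original message.

```python
import sqlite3
import unicodedata


def strip_marks(text):
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    if text is None:
        return None
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


conn = sqlite3.connect(":memory:")
conn.create_function("strip_marks", 1, strip_marks)  # expose the helper to SQL
conn.execute("CREATE TABLE owner (id INTEGER PRIMARY KEY, firstname TEXT)")
conn.execute("INSERT INTO owner (firstname) VALUES (?)", ("Andr\u00e9-Marie",))

# The built-in LIKE compares 'é' with 'e' and finds no match.
plain = conn.execute(
    "SELECT firstname FROM owner WHERE firstname LIKE '%dre%'").fetchall()

# Stripping marks from the column first makes the unaccented pattern match.
folded = conn.execute(
    "SELECT firstname FROM owner "
    "WHERE strip_marks(firstname) LIKE '%dre%'").fetchall()

print(plain, folded)
```

This only removes marks that Unicode can decompose. Letters such as 'ø' have no canonical decomposition and pass through unchanged, which is where transliteration would have to take over.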

As you can see, this can get arbitrarily complex.  We still don't have a
good answer.  Your input is welcomed.


>
>
> sqlite> .load lib/libSQLiteICU.so
> sqlite> select * from owner where firstname like '%dré%';
> id    firstname      last  emai  phon  netid
> ----  -------------  ----  ----  ----  -------------
> 2     André-Marie   Ampère  amp...@example.com  555-2222  ampere
> sqlite> select * from owner where firstname like '%dre%';
> sqlite>
>
>
> I expected both statements to return the same result.  Am I overlooking
> something or do I misunderstand the capabilities of ICU's "unicode-aware
> LIKE operator"?
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
D. Richard Hipp
d...@sqlite.org
