On Wed, May 23, 2012 at 2:11 PM, Courtney Grimland <cgriml...@cfa.harvard.edu> wrote:
> I've downloaded and compiled icu.c according to the instructions in the
> included README (though I had to add -fPIC to the compiler options).
>
> Now, when searching a table, I'm not getting the kind of
> diacritic-insensitive behavior I was expecting:
>

The ICU extension makes use of the u_foldCase() function inside of the LIKE routine. u_foldCase() only does "simple" case folding. It does not do "full" case folding (hence "'FUSSBALL' LIKE 'fußball'" is FALSE), nor does u_foldCase() remove diacritics. Nor does u_foldCase() handle the locale-specific case folding associated with Turkish. Nor does it do context-sensitive case folding, as is sometimes required in Greek.

I have no idea what other database engines do here. Does anybody else know?

One can easily see the need to do matching that ignores diacritics. In fact, Dan and I were both hard at work on that problem when your email arrived. But it is truly a hard problem.

What if the strings are not in Unicode NFC (Normalization Form C)? Should LIKE convert them to NFC first? (Can you say "runs slower and uses more memory"?)

Should LIKE do full case folding, rather than just the simple case folding that u_foldCase() provides? Understand that full case folding will sometimes cause single code points to be translated into three or four code points. This adds interesting complications when trying to match the "_" wildcard character in LIKE.

And why stop with just diacritic removal? Why not do full transliteration after the fashion of unidecode?

As you can see, this can get arbitrarily complex. We still don't have a good answer. Your input is welcomed.

>
> sqlite> .load lib/libSQLiteICU.so
> sqlite> select * from owner where firstname like '%dré%';
> id firstname last emai phon netid
> ---- ------------- ---- ---- ---- -------------
> 2 André-Marie Ampère amp...@example.com 555-2222 ampere
> sqlite> select * from owner where firstname like '%dre%';
> sqlite>
>
> I expected both statements to return the same result. Am I overlooking
> something or do I misunderstand the capabilities of ICU's "unicode-aware
> LIKE operator"?

--
D. Richard Hipp
d...@sqlite.org
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
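
Until LIKE itself handles this, one possible workaround is to fold both sides of the comparison in an application-defined SQL function. The following is only a sketch of that idea, not anything the ICU extension provides: the function name fold() is made up for illustration, the owner table is cut down to three columns, and Python's sqlite3 and unicodedata modules stand in for whatever host language is in use. It decomposes each string, drops the combining marks, and applies full case folding.

```python
import sqlite3
import unicodedata


def fold(text):
    """Hypothetical helper: strip combining marks and apply full case folding."""
    if text is None:
        return None
    # Decompose, so that 'é' becomes 'e' followed by U+0301 (combining acute).
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (general categories Mn, Mc and Me).
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )
    # str.casefold() performs full case folding, so 'ß' becomes 'ss'.
    return unicodedata.normalize("NFC", stripped.casefold())


conn = sqlite3.connect(":memory:")
# Register fold() as a one-argument SQL function (the name is an assumption).
conn.create_function("fold", 1, fold)

conn.execute("create table owner (id integer, firstname text, last text)")
conn.execute("insert into owner values (2, 'André-Marie', 'Ampère')")

# Fold the column and the pattern alike; '%' and '_' pass through unchanged.
query = "select id, firstname from owner where fold(firstname) like fold(?)"
for pattern in ("%dré%", "%dre%"):
    print(pattern, conn.execute(query, (pattern,)).fetchall())
```

This is deliberately crude. Letters such as 'ø' have no canonical decomposition and pass through untouched, dropping all combining marks is wrong for scripts where they carry meaning, and the Turkish and Greek cases mentioned above are not addressed. Because full folding can change the length of a string, a "_" in the pattern is matched against the folded text rather than the original. Calling fold() on the column also means an ordinary index on firstname cannot be used for the search.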