Andreas Karlsson wrote:

> > Nondeterministic collations do address this by allowing canonically
> > equivalent code point sequences to compare as equal. You still need a
> > collation implementation that actually does compare them as equal; ICU
> > does this, glibc does not AFAICT.
>
> Ah, right! You could use -ks-identic[1] for this.

Strings that differ like that are considered equal even at this level:

postgres=# create collation identic (locale='und-u-ks-identic', provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identic";
 ?column?
----------
 t
(1 row)

There's a separate setting "colNormalization", or "kk" in BCP 47.

From http://www.unicode.org/reports/tr35/tr35-collation.html#Normalization_Setting

  "The UCA always normalizes input strings into NFD form before the
  rest of the algorithm. However, this results in poor performance.
  With normalization=off, strings that are in [FCD] and do not contain
  Tibetan precomposed vowels (U+0F73, U+0F75, U+0F81) should sort
  correctly. With normalization=on, an implementation that does not
  normalize to NFD must at least perform an incremental FCD check and
  normalize substrings as necessary"

But even setting this to false does not mean that NFD and NFC forms of
the same text compare as different:

postgres=# create collation identickk (locale='und-u-ks-identic-kk-false', provider='icu', deterministic=false);
CREATE COLLATION

postgres=# select 'é' = E'e\u0301' collate "identickk";
 ?column?
----------
 t
(1 row)

AFAIU such strings may only compare as different when they're not
in FCD form (http://unicode.org/notes/tn5/#FCD).

There are also ICU-specific explanations about FCD here:
http://source.icu-project.org/repos/icu/icuhtml/trunk/design/collation/ICU_collation_design.htm#Normalization

It looks like setting colNormalization to false might provide a
performance benefit when you know your contents are in FCD form,
which is mostly the case according to ICU:

  "Note that all NFD strings are in FCD, and in practice most NFC
  strings will also be in FCD; for that matter most strings (of whatever
  ilk) will be in FCD.
  We guarantee that if any input strings are in FCD, that we will get
  the right results in collation without having to normalize".
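
To make the FCD condition a bit more concrete, here is a rough Python
sketch of the pairwise check as I understand it from the tn5 note. It is
only an illustration built on Python's unicodedata module, not what ICU
actually runs, and the helper names lead_trail_ccc and is_fcd are made up
for this example.

```python
import unicodedata


def lead_trail_ccc(ch):
    """Return (leading, trailing) canonical combining class of NFD(ch)."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def is_fcd(s):
    """Hypothetical helper: True if s passes the pairwise FCD test.

    For each adjacent pair of characters, the leading combining class of
    the second must be 0, or must not be lower than the trailing
    combining class of the first.
    """
    prev_trail = 0
    for ch in s:
        lead, trail = lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


print(is_fcd("\u00e9"))        # precomposed é (NFC)
print(is_fcd("e\u0301"))       # e + combining acute (NFD)
print(is_fcd("\u00e9\u0323"))  # é + combining dot below
```

The first two strings pass the check; the third does not, because the
combining dot below (class 220) follows a character whose decomposition
ends with the acute accent (class 230). If I read the ICU notes
correctly, that is the kind of input for which kk-false gives no
guarantee.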

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite