I wrote:

> Consider matching '\d' in a regexp. With C.UTF-8 (glibc-2.35), we
> only match ASCII characters 0-9, or 10 codepoints. With
> "en-US-u-va-posix-x-icu" we match 660 codepoints comprising all the
> digit characters in all languages, plus a bunch of variants for
> mathematical symbols.
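As an aside, the same ASCII-versus-Unicode split for \d can be seen outside PostgreSQL; this is a sketch using Python's re module as an analogy (Python's \d is Unicode-aware by default, like ICU, and restricted to [0-9] with re.ASCII, like a glibc locale):

```python
import re

# Two ASCII digits followed by ARABIC-INDIC DIGIT THREE (U+0663)
# and ARABIC-INDIC DIGIT FOUR (U+0664).
s = "42\u0663\u0664"

# Unicode-aware \d (the default for str patterns) matches all four
# codepoints, roughly what an ICU locale does for \d in a regexp.
print(re.findall(r"\d", s))            # ['4', '2', '٣', '٤']

# ASCII-only \d matches just [0-9], like a glibc-based locale.
print(re.findall(r"\d", s, re.ASCII))  # ['4', '2']
```

The digits chosen here are only an example; any codepoint in Unicode category Nd behaves the same way.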
BTW this is not specifically a C.UTF-8 versus "en-US-u-va-posix-x-icu" difference. I think that any glibc-based locale will consider that \d in a regexp means [0-9], and that any ICU locale will make \d match a much larger variety of characters. While moving to ICU by default, we should expect that differences like that will affect apps in a way that might be more or less disruptive.

Another known difference is that upper() with ICU does not do a character-by-character conversion. For instance:

WITH words(w) AS (VALUES ('muß'), ('ﬁnal'))
SELECT w,
       length(w),
       upper(w COLLATE "C.utf8") AS "upper (libc)",
       length(upper(w COLLATE "C.utf8")),
       upper(w COLLATE "en-x-icu") AS "upper (ICU)",
       length(upper(w COLLATE "en-x-icu"))
FROM words;

   w   | length | upper (libc) | length | upper (ICU) | length
-------+--------+--------------+--------+-------------+--------
 muß   |      3 | MUß          |      3 | MUSS        |      4
 ﬁnal  |      4 | ﬁNAL         |      4 | FINAL       |      5

The fact that the resulting string is larger than the original might cause problems. In general, we can't abstract away the fact that ICU semantics are different.

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite