	Jeff Davis wrote:

> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character before mapping to titlecase, and digits are not Cased. ICU
> has a configurable adjustment, and defaults in a way that produces
> '123abc'.
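
To make the quoted difference concrete, here is a rough Python sketch of the two word-start rules. It is not PostgreSQL or ICU code. The function names are made up for illustration, words are approximated as runs of str.isalnum() characters rather than real Unicode word boundaries, and the Cased property is approximated with isupper()/islower() plus the Lt category.

```python
import unicodedata


def is_cased(ch: str) -> bool:
    # Approximation (assumption) of the Unicode "Cased" property:
    # uppercase, lowercase, or titlecase letter.
    return ch.isupper() or ch.islower() or unicodedata.category(ch) == "Lt"


def initcap_first_alnum(s: str) -> str:
    # Rule 1: the first alphanumeric of each word is the capitalization
    # target; everything after it in the word is lowercased.
    out = []
    prev_alnum = False
    for ch in s:
        if ch.isalnum():
            out.append(ch.lower() if prev_alnum else ch.upper())
            prev_alnum = True
        else:
            out.append(ch)
            prev_alnum = False
    return "".join(out)


def initcap_first_cased(s: str) -> str:
    # Rule 2: within each word, skip ahead to the first Cased character,
    # titlecase it, and lowercase whatever follows it in the word.
    out = []
    in_word = False
    seen_cased = False
    for ch in s:
        if ch.isalnum():
            if not in_word:
                in_word = True
                seen_cased = False
            if seen_cased:
                out.append(ch.lower())
            elif is_cased(ch):
                out.append(ch.title())
                seen_cased = True
            else:
                out.append(ch)  # e.g. a digit: left untouched
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


print(initcap_first_alnum("123abc"))
print(initcap_first_cased("123abc"))
```

Under the first rule the digit '1' uses up the capitalization slot, so 'a' stays lowercase; under the second rule the digits are skipped because they are not Cased, so 'a' gets titlecased.
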
Even aside from ICU, there's a different behavior between glibc
and pg_c_utf8 for codepoints in the decimal digit category
outside of the US-ASCII range '0'..'9':

select initcap(concat(chr(0xff11), 'a') collate "C.utf8");   -- glibc 2.35
 initcap
---------
 1a

select initcap(concat(chr(0xff11), 'a') collate "pg_c_utf8");
 initcap
---------
 1A

Both collations consider that chr(0xff11) is not a digit
(isdigit()=>false), but C.utf8 says that it's alpha, whereas pg_c_utf8
says it's neither digit nor alpha.

AFAIU this is why, in the above initcap() calls, pg_c_utf8 considers
that 'a' is the first alphanumeric, whereas C.utf8 considers that '1'
is the first alphanumeric, leading to different capitalizations.

Comparing the 3 providers:

WITH v(provider,type,result) AS (values
 ('ICU', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "unicode"),
 ('glibc', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "C.utf8"),
 ('builtin', 'isalpha', chr(0xff11) ~ '[[:alpha:]]' collate "pg_c_utf8"),
 ('ICU', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "unicode"),
 ('glibc', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "C.utf8"),
 ('builtin', 'isdigit', chr(0xff11) ~ '[[:digit:]]' collate "pg_c_utf8")
 )
select * from v
\crosstabview

 provider | isalpha | isdigit
----------+---------+---------
 ICU      | f       | t
 glibc    | t       | f
 builtin  | f       | f

Are we fine with pg_c_utf8 differing from both ICU's point of view
(U+ff11 is a digit and not alpha) and glibc's point of view (U+ff11 is
not a digit, but it's alpha)?

Aside from initcap(), this is going to be significant for regular
expressions.


Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite