	Thomas Munro wrote:

> It looks like for technical reasons
> inside glibc, that couldn't be done before 2.35:
>
> https://sourceware.org/bugzilla/show_bug.cgi?id=17318
>
> That strengthens my opinion that C.UTF-8 (the real C.UTF-8 supplied
> by the glibc project) isn't supposed to be versioned, but it's
> extremely unfortunate that a bunch of OSes (Debian and maybe more)
> have been sorting text in some other order under that name for
> years.

Yes. This is consistent with the Debian/Ubuntu patches in
glibc/localedata/locales/C

glibc-2.35 is not patched, and upstream has this:

  LC_COLLATE
  % The keyword 'codepoint_collation' in any part of any LC_COLLATE
  % immediately discards all collation information and causes the
  % locale to use strcmp/wcscmp for collation comparison.  This is
  % exactly what is needed for C (ASCII) or C.UTF-8.
  codepoint_collation
  END LC_COLLATE

But in older versions, glibc doesn't have the locales/C data file.
Debian adds it in debian/patches/localedata/C with this kind of
content:

* glibc 2.31 Debian 11

  LC_COLLATE
  order_start forward
  <U0000>
  ..
  <U007F>
  <U0080>
  ..
  <U00FF>
  etc...

But as explained in the above-linked bugzilla entry, that did not
result in true byte-comparison semantics, for several reasons that
got fixed in 2.35.

So this looks like a solved problem for anyone starting to use these
collations with glibc 2.35 or newer (or with other OSes that don't
have a compatibility issue with them in the first place).

But Debian/Ubuntu users upgrading from the older C.* to 2.35+ will
not get the usual warning about the need to reindex.

I understand that my proposal to version C.* like any other collation
might be erring on the side of caution, but ignoring these collation
changes on at least one major OS does not feel right either.
Maybe we should consider doing platform-dependent checks?


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
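
To make the idea of a platform-dependent check a bit more concrete,
here is a minimal sketch in Python, not a proposed patch. It assumes
that the only relevant boundary is glibc 2.35 (the release where,
per the bugzilla entry above, C.UTF-8 got true codepoint collation),
and that the glibc version in use when the index was built is known
from somewhere. The names parse_glibc_version and
c_utf8_may_have_changed are hypothetical, invented for this example.

```python
import re

# Assumption from the discussion above: upstream C.UTF-8 uses
# 'codepoint_collation' (strcmp/wcscmp semantics) starting with 2.35.
FIRST_CODEPOINT_GLIBC = (2, 35)


def parse_glibc_version(text):
    """Return (major, minor) from strings like '2.31' or '2.35-0ubuntu3'."""
    m = re.match(r"(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError("unrecognized glibc version: %r" % (text,))
    return int(m.group(1)), int(m.group(2))


def c_utf8_may_have_changed(old_version, new_version):
    """True if an upgrade crosses the 2.35 boundary.

    Hypothetical helper: it only flags the transition from a pre-2.35
    (possibly distro-patched) C.UTF-8 to the upstream one. It says
    nothing about changes between two pre-2.35 distro versions.
    """
    old = parse_glibc_version(old_version)
    new = parse_glibc_version(new_version)
    return old < FIRST_CODEPOINT_GLIBC <= new


if __name__ == "__main__":
    # e.g. Debian 11 (glibc 2.31) upgraded to a release shipping 2.35+
    print(c_utf8_may_have_changed("2.31", "2.35"))
```

In a Python context the running glibc version could be obtained with
platform.libc_ver(); where the old version would come from is left
open here, since that is exactly the part that versioning C.* would
provide.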