On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
>
>   SELECT 'á' = 'á'; -- false
>
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respectively, so memcmp() returns non-zero. But it's really the same
> character with just a different representation, and if you normalize
> them they are equal:
>
>   SELECT normalize('á') = normalize('á'); -- true
I think you misunderstand Unicode normalization and equivalence. There is
no standard Unicode `normalize()` that would cause the above equality
predicate to be true. If you normalize to NFD (normal form decomposed),
then a _prefix_ of those two strings will be equal, but that's clearly not
what you're looking for.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.
There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints, in this specific
case). It would be nice to be able to preserve the input form when
storing strings but still be able to index and match them
form-insensitively: both equivalent representations of 'á' should compare
equal, and for UNIQUE indexes they should be considered the same key.

You could also have functions that perform lossy normalization, in the
sort of way that soundex does, such as first normalizing to NFD and then
dropping all combining codepoints, which could then allow 'á' to compare
equal to 'a'. But that would not be a Unicode normalization function.

Nico
--
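P.S.: A quick sketch of the two ideas above, using Python's unicodedata
module rather than PostgreSQL internals (the `strip_accents` name and the
whole approach are just illustrative, not anything shipped anywhere):

```python
import unicodedata

precomposed = "\u00e1"   # 'á' as a single codepoint (LATIN SMALL LETTER A WITH ACUTE)
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# Form-insensitive comparison: normalize both sides to the same form
# (NFC here) before comparing, while the stored strings keep their
# original input form.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)

def strip_accents(s: str) -> str:
    """Lossy 'normalization': NFD-decompose, then drop combining codepoints.
    Not a Unicode normalization function -- it loses information."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c))

# Both representations of 'á' collapse to plain 'a'.
assert strip_accents(precomposed) == "a"
assert strip_accents(decomposed) == "a"
```

The first assertion is the form-insensitive equality a normalized index
would give you; the `strip_accents` part is the soundex-like lossy case.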