On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
>
>   SELECT 'á' = 'á'; -- false
>
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respectively, so memcmp() returns non-zero. But it's really the same
> character with just a different representation, and if you normalize
> them they are equal:
>
>   SELECT normalize('á') = normalize('á'); -- true
I think you misunderstand Unicode normalization and equivalence. There is
no standard Unicode `normalize()` that would cause the above equality
predicate to be true. If you normalize to NFD (normal form decomposed),
then a _prefix_ of those two strings will be equal, but that's clearly not
what you're looking for.

PostgreSQL already has Unicode normalization support, though it would be
nice to also have form-insensitive indexing and equality predicates.
There are two ways to write 'á' in Unicode: one is pre-composed (one
codepoint) and the other is decomposed (two codepoints, in this specific
case). It would be nice to be able to preserve the input form when
storing strings but still be able to index and match them
form-insensitively: both equivalent representations of 'á' should compare
equal, and for UNIQUE indexes they should be considered the same key.

You could also have functions that perform lossy normalization, in the
sort of way that soundex does, such as first normalizing to NFD and then
dropping all combining codepoints, which could then allow 'á' to compare
equal to 'a'. But that would not be a Unicode normalization function.

Nico
--
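P.S.: A quick sketch of the two ideas above, using Python's unicodedata
module rather than PostgreSQL internals (the `strip_accents` name and the
whole approach are just illustrative, not anything shipped anywhere):

```python
import unicodedata

precomposed = "\u00e1"   # 'á' as a single codepoint (LATIN SMALL LETTER A WITH ACUTE)
decomposed = "a\u0301"   # 'a' followed by COMBINING ACUTE ACCENT

# Form-insensitive comparison: normalize both sides to the same form
# (NFC here) before comparing, while the stored strings keep their
# original input form.
assert unicodedata.normalize("NFC", precomposed) == \
       unicodedata.normalize("NFC", decomposed)

def strip_accents(s: str) -> str:
    """Lossy 'normalization': NFD-decompose, then drop combining codepoints.
    Not a Unicode normalization function -- it loses information."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c))

# Both representations of 'á' collapse to plain 'a'.
assert strip_accents(precomposed) == "a"
assert strip_accents(decomposed) == "a"
```

The first assertion is the form-insensitive equality a normalized index
would give you; the `strip_accents` part is the soundex-like lossy case.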