Question regarding Unicode normalization

Collin Funk Sun, 24 Aug 2025 22:48:35 -0700

I noticed that mcel does not see the following characters as equal in a
UTF-8 locale:


   è (U+0065 + U+0300)
   è (U+00E8)

This is because mcel_isbasic (U+0065) sees an ASCII character and does
not normalize it using the following U+0300.
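To make the mismatch concrete, here is a minimal sketch of the two encodings at the byte level. The helpers same_bytes and starts_with_ascii are hypothetical names used only for illustration; starts_with_ascii is a stand-in for the kind of ASCII fast path described above and is not gnulib's mcel_isbasic.

```c
#include <stdbool.h>
#include <string.h>

/* "e" followed by U+0300 COMBINING GRAVE ACCENT, encoded as UTF-8.  */
static const char decomposed[] = "e\xCC\x80";

/* U+00E8 LATIN SMALL LETTER E WITH GRAVE, encoded as UTF-8.  */
static const char precomposed[] = "\xC3\xA8";

/* Hypothetical helper: true if both buffers hold identical bytes.  */
static bool
same_bytes (const char *a, size_t alen, const char *b, size_t blen)
{
  return alen == blen && memcmp (a, b, alen) == 0;
}

/* Hypothetical helper: true if the first byte is plain ASCII, so a
   scanner with an ASCII fast path would stop after that one byte.  */
static bool
starts_with_ascii (const char *s)
{
  return (unsigned char) s[0] < 0x80;
}
```

The decomposed form is three bytes and starts with an ASCII 'e', while the precomposed form is two bytes and starts with a lead byte, so a comparison that works one character at a time without normalizing cannot treat them as equal.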

Is this intentional or not?

I had a look at implementing multibyte 'uniq --ignore-case' and it is
fairly easy. If we assume normalized Unicode we can even keep it
optimized in the UTF-8 case by using memcasecmp and memcmp:

static bool
different (char *old, char *new, idx_t oldlen, idx_t newlen)
{
  if (1 < MB_CUR_MAX && ignore_case)
    {
      /* Scan using mcel and c32tolower.  */
      return result;
    }
  if (ignore_case)
    return oldlen != newlen || memcasecmp (old, new, oldlen);
  else
    return oldlen != newlen || memcmp (old, new, oldlen);
}
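For illustration only, the elided multibyte branch could look roughly like the sketch below. It swaps in the standard mbrtowc and towlower in place of mcel and c32tolower, uses size_t where the code above uses idx_t, and the name different_nocase_mb is hypothetical. It performs no normalization, so the two forms of è would still compare as different.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Return true if the two lines differ when case is ignored.
   Assumption: standard mbrtowc/towlower stand in for mcel/c32tolower.  */
static bool
different_nocase_mb (const char *a, size_t alen,
                     const char *b, size_t blen)
{
  mbstate_t sa, sb;
  memset (&sa, 0, sizeof sa);
  memset (&sb, 0, sizeof sb);

  while (alen > 0 && blen > 0)
    {
      wchar_t ca, cb;
      size_t na = mbrtowc (&ca, a, alen, &sa);
      size_t nb = mbrtowc (&cb, b, blen, &sb);

      /* On an invalid or incomplete sequence, fall back to comparing
         the remaining bytes exactly.  */
      if (na == (size_t) -1 || na == (size_t) -2
          || nb == (size_t) -1 || nb == (size_t) -2)
        return alen != blen || memcmp (a, b, alen) != 0;

      /* mbrtowc returns 0 for a NUL character, which occupies one byte.  */
      if (na == 0)
        na = 1;
      if (nb == 0)
        nb = 1;

      if (towlower ((wint_t) ca) != towlower ((wint_t) cb))
        return true;

      a += na;
      alen -= na;
      b += nb;
      blen -= nb;
    }

  /* Equal so far; the lines differ only if one has bytes left over.  */
  return alen != blen;
}
```

A caller would need setlocale (LC_ALL, "") beforehand so that mbrtowc decodes according to the user's locale; without it the "C" locale applies and only ASCII input is reliably decoded.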

Collin
