Re: Question regarding Unicode normalization

Pádraig Brady Mon, 25 Aug 2025 03:44:58 -0700

On 25/08/2025 06:47, Collin Funk wrote:

I noticed that mcel does not see the following characters as equal in a
UTF-8 locale:


    è (U+0065 + U+0300)
    è (U+00E8)

This is because mcel_isbasic (U+0065) sees an ASCII character and does
not normalize it using the following U+0300.

Is this intentional or not?
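
For illustration, the two spellings above are different byte sequences in UTF-8, so any bytewise comparison without normalization reports them as different. A minimal sketch follows; the names decomposed, precomposed and bytes_differ are made up for this example and are not part of mcel or coreutils:

```c
#include <stdbool.h>
#include <string.h>

/* "è" as a base letter plus a combining mark: U+0065 U+0300.  */
static const char decomposed[] = "e\xCC\x80";
/* "è" as a single precomposed code point: U+00E8.  */
static const char precomposed[] = "\xC3\xA8";

/* Return true if the two buffers differ bytewise.
   No normalization is attempted.  */
static bool
bytes_differ (const char *a, size_t alen, const char *b, size_t blen)
{
  return alen != blen || memcmp (a, b, alen) != 0;
}
```

The decomposed form is 3 bytes and the precomposed form is 2 bytes, so the length check alone already separates them.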

I had a look at implementing multibyte 'uniq --ignore-case' and it is
fairly easy. If we assume normalized Unicode we can even keep it
optimized in the UTF-8 case by using memcasecmp and memcmp:

static bool
different (char *old, char *new, idx_t oldlen, idx_t newlen)
{
   if (1 < MB_CUR_MAX && ignore_case)
     {
       /* Scan using mcel and c32tolower.  */
       return result;
     }
   if (ignore_case)
     return oldlen != newlen || memcasecmp (old, new, oldlen);
   else
     return oldlen != newlen || memcmp (old, new, oldlen);
}
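
The multibyte branch elided above could look roughly like the following. This is only a sketch under stated assumptions: it uses the standard mbrtowc and towlower as stand-ins for mcel and c32tolower, the helper names next_char and differ_nocase are hypothetical, and an invalid byte is compared by its raw value, which can collide with a valid character of the same code point.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Decode one character from P, which has N > 0 bytes left, into *WC.
   Return the number of bytes consumed.  An invalid or incomplete
   sequence is treated as a single byte.  */
static size_t
next_char (const char *p, size_t n, mbstate_t *st, wint_t *wc)
{
  wchar_t w;
  size_t len = mbrtowc (&w, p, n, st);
  if (len == (size_t) -1 || len == (size_t) -2)
    {
      memset (st, 0, sizeof *st);   /* reset the undefined shift state */
      *wc = (unsigned char) *p;
      return 1;
    }
  *wc = w;
  return len ? len : 1;             /* mbrtowc returns 0 for a NUL byte */
}

/* Return true if A and B differ, ignoring case, in the current locale.  */
static bool
differ_nocase (const char *a, size_t alen, const char *b, size_t blen)
{
  mbstate_t sa, sb;
  memset (&sa, 0, sizeof sa);
  memset (&sb, 0, sizeof sb);
  while (alen && blen)
    {
      wint_t ca, cb;
      size_t la = next_char (a, alen, &sa, &ca);
      size_t lb = next_char (b, blen, &sb, &cb);
      if (towlower (ca) != towlower (cb))
        return true;
      a += la, alen -= la;
      b += lb, blen -= lb;
    }
  return alen != blen;              /* one line is a prefix of the other */
}
```

Like the memcmp path, this compares code point by code point, so it still treats U+0065 U+0300 and U+00E8 as different.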


Yes, this is the first question posed at:
https://www.pixelbeat.org/docs/coreutils_i18n/
Whatever we decide, we should be consistent across all utils.

I'm inclined to leave normalization to external tools like iconv and uconv.

Note this is also related to how we deal with invalid encodings.
In that regard I'm inclined to think we should fall back to a unibyte
interpretation of invalid multi-byte chars internally.
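
As a rough sketch of what such a fallback could mean for UTF-8: decode one character, and if the sequence is invalid, consume exactly one byte and report its raw value. The struct decoded type and decode_or_byte function are hypothetical names for this example, not the mcel API, and the hand-written decoder only assumes the standard UTF-8 rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; not gnulib's mcel_t.  */
struct decoded
{
  uint32_t ch;   /* code point, or the raw byte value if !valid */
  size_t len;    /* bytes consumed, always at least 1 */
  int valid;     /* nonzero if a well-formed UTF-8 sequence was read */
};

/* Decode one UTF-8 character from P, which has N > 0 bytes.
   On any invalid sequence, fall back to a one-byte interpretation.  */
static struct decoded
decode_or_byte (const unsigned char *p, size_t n)
{
  assert (n > 0);
  struct decoded d = { p[0], 1, 0 };
  unsigned char b = p[0];
  size_t need;
  uint32_t cp, min;

  if (b < 0x80)
    {
      d.valid = 1;                  /* plain ASCII */
      return d;
    }
  else if ((b & 0xE0) == 0xC0)
    need = 2, cp = b & 0x1F, min = 0x80;
  else if ((b & 0xF0) == 0xE0)
    need = 3, cp = b & 0x0F, min = 0x800;
  else if ((b & 0xF8) == 0xF0)
    need = 4, cp = b & 0x07, min = 0x10000;
  else
    return d;                       /* stray continuation or bad lead byte */

  if (n < need)
    return d;                       /* truncated sequence */
  for (size_t i = 1; i < need; i++)
    {
      if ((p[i] & 0xC0) != 0x80)
        return d;                   /* not a continuation byte */
      cp = (cp << 6) | (p[i] & 0x3F);
    }
  /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
  if (cp < min || 0x10FFFF < cp || (0xD800 <= cp && cp <= 0xDFFF))
    return d;

  d.ch = cp;
  d.len = need;
  d.valid = 1;
  return d;
}
```

Because an invalid sequence always advances by one byte, a scan built on this never stalls and never drops input, which is the property the unibyte fallback is after.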

How the existing i18n patch deals with this matters too,
since we want to avoid changes / regressions wrt that.

cheers,
Padraig
