I noticed that mcel does not see the following characters as equal in a
UTF-8 locale:
è (U+0065 + U+0300)
è (U+00E8)
This is because mcel_isbasic (U+0065) sees an ASCII character and does
not normalize it using the following U+0300.
Is this intentional or not?
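
For reference, a minimal sketch of why a plain byte comparison also
treats the two spellings as different. It uses only standard C and does
not touch mcel; the names 'decomposed', 'precomposed' and 'same_bytes'
are made up for this illustration.

```c
#include <stdbool.h>
#include <string.h>

/* "è" as a base letter plus combining grave accent: U+0065 U+0300.
   In UTF-8 that is 'e' followed by the two bytes 0xCC 0x80.  */
static const char decomposed[] = "e\xCC\x80";

/* "è" as one precomposed code point: U+00E8, UTF-8 bytes 0xC3 0xA8.  */
static const char precomposed[] = "\xC3\xA8";

/* Hypothetical helper: true if A and B are byte-for-byte identical.  */
static bool
same_bytes (const char *a, const char *b)
{
  size_t alen = strlen (a);
  return alen == strlen (b) && memcmp (a, b, alen) == 0;
}
```

The two strings have different lengths (3 and 2 bytes), so
same_bytes (decomposed, precomposed) is false. Only a normalization
step (for example to NFC) would make them compare equal.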
I had a look at implementing multibyte 'uniq --ignore-case' and it is
fairly easy. If we assume normalized Unicode we can even keep it
optimized in the UTF-8 case by using memcasecmp and memcmp:
static bool
different (char *old, char *new, idx_t oldlen, idx_t newlen)
{
  if (1 < MB_CUR_MAX && ignore_case)
    {
      /* Scan using mcel and c32tolower. */
      return result;
    }
  if (ignore_case)
    return oldlen != newlen || memcasecmp (old, new, oldlen);
  else
    return oldlen != newlen || memcmp (old, new, oldlen);
}
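
To make the multibyte branch a bit more concrete, here is a rough
sketch of a character-by-character comparison that ignores case. It is
not the mcel-based code: it substitutes standard C mbrtoc32 for mcel and
towlower for c32tolower, which assumes that wchar_t holds Unicode code
points (true on glibc) and that the locale encoding is stateless, as
UTF-8 is. The names 'next_char' and 'different_nocase' are hypothetical.
Like the code above, it does no normalization.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>
#include <wctype.h>

/* Decode one character from S (LEN > 0 bytes available), store its
   lowercase form in *C, and return the number of bytes consumed.  */
static size_t
next_char (const char *s, size_t len, char32_t *c)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t n = mbrtoc32 (c, s, len, &state);
  if (n == 0)
    return 1;   /* A null byte occupies one byte.  */
  if (n > len)
    {
      /* Invalid or truncated sequence: consume one byte and map it
         outside the Unicode range so it never equals a valid
         character.  */
      *c = 0x110000 + (unsigned char) *s;
      return 1;
    }
  *c = (char32_t) towlower ((wint_t) *c);
  return n;
}

/* Return true if the two lines differ when compared one character at
   a time, ignoring case.  */
static bool
different_nocase (const char *old, size_t oldlen,
                  const char *cur, size_t curlen)
{
  while (oldlen > 0 && curlen > 0)
    {
      char32_t co, cc;
      size_t no = next_char (old, oldlen, &co);
      size_t nc = next_char (cur, curlen, &cc);
      if (co != cc)
        return true;
      old += no;
      oldlen -= no;
      cur += nc;
      curlen -= nc;
    }
  /* Equal only if both lines were consumed completely.  */
  return oldlen != curlen;
}
```

A caller would run setlocale (LC_ALL, "") first so that mbrtoc32
decodes according to the user's locale. Note that the byte lengths of
the two lines may differ even when the lines compare equal, since the
upper and lower case forms of a character can have different encoded
lengths, so the early 'oldlen != newlen' test is only valid in the
single-byte branches.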
Collin