On 25/08/2025 06:47, Collin Funk wrote:
> I noticed that mcel does not see the following characters as equal in
> a UTF-8 locale:
>
>     è (U+0065 + U+0300)
>     è (U+00E8)
>
> This is because mcel_isbasic (U+0065) sees an ASCII character and does
> not normalize it using the following U+0300. Is this intentional or not?
>
> I had a look at implementing multibyte 'uniq --ignore-case' and it is
> fairly easy. If we assume normalized Unicode we can even keep it
> optimized in the UTF-8 case by using memcasecmp and memcmp:
>
>     static bool
>     different (char *old, char *new, idx_t oldlen, idx_t newlen)
>     {
>       if (1 < MB_CUR_MAX && ignore_case)
>         {
>           /* Scan using mcel and c32tolower.  */
>           return result;
>         }
>
>       if (ignore_case)
>         return oldlen != newlen || memcasecmp (old, new, oldlen);
>       else
>         return oldlen != newlen || memcmp (old, new, oldlen);
>     }
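To make the mismatch concrete, here is a minimal sketch of what a byte-wise comparison sees for the two spellings of è. The helper name bytes_equal and the two arrays are hypothetical, introduced only for illustration; this is not code from uniq or mcel.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "e" followed by U+0300 COMBINING GRAVE ACCENT, encoded in UTF-8.  */
static char const decomposed[] = "e\xCC\x80";

/* U+00E8 LATIN SMALL LETTER E WITH GRAVE, encoded in UTF-8.  */
static char const precomposed[] = "\xC3\xA8";

/* Hypothetical helper: equality as a memcmp-based comparison sees it.
   No normalization is attempted.  */
static bool
bytes_equal (char const *a, size_t alen, char const *b, size_t blen)
{
  assert (a && b);
  return alen == blen && memcmp (a, b, alen) == 0;
}
```

The decomposed form is three bytes and the precomposed form is two, so bytes_equal reports them as different even though they render identically. A memcmp-based fast path is therefore only safe if the input is assumed to be normalized already.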
Yes, this is the first question posed at:
https://www.pixelbeat.org/docs/coreutils_i18n/

Whatever we decide, we should be consistent across all utils.
I'm inclined to leave normalization to external tools like iconv and uconv.

Note this is also related to how we deal with invalid encodings.
In that regard I'm inclined to fall back to a unibyte interpretation
of invalid multi-byte chars internally.

How the existing i18n patch deals with this matters too,
since we want to avoid changes / regressions wrt that.

cheers,
Padraig
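As an illustration of the unibyte fallback idea, here is a minimal sketch that scans a buffer as UTF-8 and treats each byte of an invalid or truncated sequence as a character of its own. It uses a hand-rolled decoder so that it does not depend on the current locale; the names utf8_valid_len and count_chars_unibyte_fallback are hypothetical, and this is not how mcel or the existing i18n patch is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length in bytes (1..4) of the valid UTF-8 character at
   the start of S, which has N > 0 bytes available.  Return 0 if the
   sequence is invalid, overlong, truncated, or a surrogate.  */
static size_t
utf8_valid_len (unsigned char const *s, size_t n)
{
  assert (0 < n);
  unsigned char c = s[0];
  size_t len;
  unsigned long ch, min;

  if (c < 0x80)
    return 1;
  else if (0xC2 <= c && c <= 0xDF)
    { len = 2; ch = c & 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0)
    { len = 3; ch = c & 0x0F; min = 0x800; }
  else if (0xF0 <= c && c <= 0xF4)
    { len = 4; ch = c & 0x07; min = 0x10000; }
  else
    return 0;  /* Stray continuation byte or invalid lead byte.  */

  if (n < len)
    return 0;  /* Truncated sequence.  */

  for (size_t i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return 0;
      ch = ch << 6 | (s[i] & 0x3F);
    }

  if (ch < min || 0x10FFFF < ch || (0xD800 <= ch && ch <= 0xDFFF))
    return 0;
  return len;
}

/* Count the characters in BUF[0..N-1], falling back to a one-byte
   (unibyte) interpretation wherever the bytes are not valid UTF-8.  */
static size_t
count_chars_unibyte_fallback (char const *buf, size_t n)
{
  unsigned char const *s = (unsigned char const *) buf;
  size_t count = 0;

  for (size_t i = 0; i < n; count++)
    {
      size_t len = utf8_valid_len (s + i, n - i);
      i += len ? len : 1;  /* Invalid: consume exactly one byte.  */
    }
  return count;
}
```

With this approach a scan never fails and always makes progress, so arbitrary binary input can still be processed. Note that "e" followed by U+0300 still counts as two characters here: the fallback only addresses invalid encodings, and normalization remains a separate question.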
