On 23/09/2025 16:46, Collin Funk wrote:
Collin Funk <[email protected]> writes:

We discussed this patch off list and are going to leave it for a future release. But I figured I would post it here for others to try and so I do not lose it.

The patch handles multi-byte characters when invoking 'uniq --ignore-case' while preserving performance in the case of LC_ALL=C and the case without --ignore-case.

    $ yes abcdefghijklmnopqrstuvwxyz | head -n 10000000 > test.txt

    $ export LC_ALL=en_US.UTF-8
    $ time ./src/uniq-new test.txt
    real 0m0.420s
    $ time ./src/uniq-new --ignore-case test.txt
    real 0m0.761s
    $ export LC_ALL=C
    $ time ./src/uniq-new test.txt
    real 0m0.425s
    $ time ./src/uniq-new --ignore-case test.txt
    real 0m0.485s

    $ export LC_ALL=en_US.UTF-8
    $ time ./src/uniq-old test.txt
    real 0m0.420s
    $ time ./src/uniq-old --ignore-case test.txt
    real 0m0.437s
    $ export LC_ALL=C
    $ time ./src/uniq-old test.txt
    real 0m0.416s
    $ time ./src/uniq-old --ignore-case test.txt
    real 0m0.626s

Okay to push this after 'sed s/framework_failure/&_/' in the test to fix syntax-check and a NEWS entry?

It should be the only thing needed for 'uniq' to handle multi-byte characters. The only delimiters used are '\n' and '\0', which cannot occur inside a multi-byte character (assuming a sane encoding). Therefore the linebuffer.h functions work and are efficient.
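The delimiter claim above can be checked for UTF-8 with a small script. This is only an illustrative sketch in Python, not part of the patch: the function name is made up here, and it treats UTF-8 as the "sane encoding" under discussion. It also shows a counterexample encoding (UTF-16) where a newline byte does appear inside a character.

```python
def delimiter_inside_multibyte(delims=(0x0A, 0x00)):
    """Return code points whose multi-byte UTF-8 encoding contains a delimiter byte.

    Hypothetical helper for illustration only; not taken from coreutils.
    """
    offenders = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded in UTF-8
        encoded = chr(cp).encode("utf-8")
        # Only multi-byte sequences matter: '\n' and '\0' themselves are single bytes.
        if len(encoded) > 1 and any(b in delims for b in encoded):
            offenders.append(cp)
    return offenders


if __name__ == "__main__":
    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so nothing is found.
    print(delimiter_inside_multibyte())
    # In UTF-16-LE, U+010A contains the byte 0x0A, so splitting on '\n' bytes
    # would cut that character in half.
    print("\u010a".encode("utf-16-le"))
```

Splitting lines on raw bytes is therefore safe for UTF-8 input, which is why a byte-oriented line reader does not need to decode characters to find line boundaries.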
I know I'm like a broken record, but testing is especially important for any multi-byte changes, even as a way to document what we don't support. Note the downstream I18N patch does not consider uniq, so we don't have to worry about compat in that regard.

In this case (pardon the pun) the tests could be expanded to cover a sampling of cases from https://unicode.org/Public/UNIDATA/SpecialCasing.txt

Perhaps covering:

  locale specific issues (Turkish variants of lower/upper i):

    $ echo istanbul | LC_ALL=tr_TR.UTF-8 sed -e 's/.*/\U&/'
    İSTANBUL
    $ echo istanbul | LC_ALL=fr_FR.UTF-8 sed -e 's/.*/\U&/'
    ISTANBUL

  asymmetric lower/upper:

    Comment that we don't handle string context cases like German Sharp S: lower ß -> upper SS.
    But probably include the Unicode chars ẞ (upper) -> ß (lower).
    This is also interesting because the upper is 3 UTF-8 bytes, while the lower is 2 UTF-8 bytes.
    Also the Greek multiple lower sigmas should be tested:

    $ echo σς | LC_ALL=el_GR.UTF-8 sed -e 's/.*/\U&/'
    ΣΣ

  contextual case comparison:

    Character by character processing does not always suffice. See libunistring's unicase routines.
    At least documenting what we don't compare as equal would be useful if we do move to using libunistring in uniq etc.

It's worth looking at the old discussion re join (where it was mentioned that join/sort/uniq should be treated as a unit so that there is consistent interaction between them):

https://crashcourse.housegordon.org/coreutils-multibyte-support.html
https://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html
https://lists.gnu.org/archive/html/coreutils/2010-09/msg00029.html

thanks!
Padraig
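The asymmetric cases mentioned above can be illustrated with a short Python sketch. Both helper names are hypothetical and only stand in for two strategies: lowercasing one character at a time, and full-string case folding. Python applies locale-independent Unicode default mappings, so this says nothing about how uniq, sed, or libunistring behave, and it does not cover the Turkish locale-specific case.

```python
def eq_per_char_lower(a, b):
    """Compare by lowercasing one character at a time (hypothetical helper)."""
    return len(a) == len(b) and all(x.lower() == y.lower() for x, y in zip(a, b))


def eq_casefold(a, b):
    """Compare using full Unicode case folding of the whole string (hypothetical helper)."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    # The capital sharp S is 3 UTF-8 bytes, the small one is 2.
    print(len("ẞ".encode("utf-8")), len("ß".encode("utf-8")))
    # ẞ -> ß is a one-to-one mapping, so the per-character comparison matches.
    print(eq_per_char_lower("ẞ", "ß"))
    # ß -> SS changes the length; only full case folding treats them as equal.
    print(eq_per_char_lower("ß", "SS"), eq_casefold("ß", "SS"))
    # Both lowercase sigmas uppercase to the same letter, but lowercasing
    # each character leaves them distinct.
    print("σς".upper(), eq_per_char_lower("σ", "ς"), eq_casefold("σ", "ς"))
```

Pairs where the two helpers disagree are candidates for the "what we don't compare as equal" documentation, whichever comparison is eventually chosen.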
