On 23/09/2025 16:46, Collin Funk wrote:
Collin Funk <[email protected]> writes:

We discussed this patch off list and are going to leave it for a future
release. But I figured I would post it here for others to try and so I
do not lose it.

The patch handles multi-byte characters when invoking
'uniq --ignore-case' while preserving performance in the case of
LC_ALL=C and the case without --ignore-case.

     $ yes abcdefghijklmnopqrstuvwxyz | head -n 10000000 > test.txt

     $ export LC_ALL=en_US.UTF-8
     $ time ./src/uniq-new test.txt
     real       0m0.420s
     $ time ./src/uniq-new --ignore-case test.txt
     real       0m0.761s

     $ export LC_ALL=C
     $ time ./src/uniq-new test.txt
     real       0m0.425s
     $ time ./src/uniq-new --ignore-case test.txt
     real       0m0.485s

     $ export LC_ALL=en_US.UTF-8
     $ time ./src/uniq-old test.txt
     real       0m0.420s
     $ time ./src/uniq-old --ignore-case test.txt
     real       0m0.437s

     $ export LC_ALL=C
     $ time ./src/uniq-old test.txt
     real       0m0.416s
     $ time ./src/uniq-old --ignore-case test.txt
     real       0m0.626s

Okay to push this after 'sed s/framework_failure/&_/' in the test to fix
syntax-check and a NEWS entry?

It should be the only thing needed for 'uniq' to handle multi-byte
characters. The only delimiters used are '\n' and '\0' which cannot be
in multi-byte characters (assuming a sane encoding). Therefore the
linebuffer.h functions work and are efficient.
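
To illustrate the "sane encoding" assumption, here is a small Python sketch (not part of the patch; the helper name delimiter_safe is made up for illustration). It checks that no byte of a multi-byte UTF-8 sequence equals '\n' or '\0', and uses UTF-16 purely as a counterexample of an encoding where that does not hold:

```python
def delimiter_safe(ch: str, encoding: str = "utf-8") -> bool:
    # Hypothetical helper: True if a multi-byte encoding of `ch`
    # contains neither a newline byte (0x0A) nor a NUL byte (0x00).
    encoded = ch.encode(encoding)
    if len(encoded) == 1:
        return True  # single-byte character; it cannot be split
    return b"\n" not in encoded and b"\0" not in encoded


samples = "ßẞσςİı€"
print(all(delimiter_safe(c) for c in samples))  # True

# U+010A encodes as the bytes 0x0A 0x01 in UTF-16LE, so a byte-wise
# search for '\n' would split this character.
print(delimiter_safe("\u010a", "utf-16-le"))  # False
```

In UTF-8 every byte of a multi-byte sequence has its high bit set, which is why searching for the delimiter byte by byte stays correct there.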

I know I'm like a broken record, but testing is especially
important for any multi-byte changes, even as a way
to document what we don't support.

Note the downstream I18N patch does not consider uniq,
so we don't have to worry about compat in that regard.

In this case (pardon the pun) the tests could be expanded
to cover a sampling of cases from
https://unicode.org/Public/UNIDATA/SpecialCasing.txt

Perhaps covering:

locale-specific issues (Turkish variants of lower/upper i):

  $ echo istanbul | LC_ALL=tr_TR.UTF-8 sed -e 's/.*/\U&/'
  İSTANBUL
  $ echo istanbul | LC_ALL=fr_FR.UTF-8 sed -e 's/.*/\U&/'
  ISTANBUL
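
For reference, a Python sketch of the default (non-Turkish) Unicode mappings for the four i variants. Python's str.upper() and str.lower() ignore the locale, so this only shows what a non-Turkish locale would be expected to do; it says nothing about what tr_TR.UTF-8 does in libc:

```python
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı

# Under the default mappings, both small letters upper-case to ASCII "I",
# so comparing via upper-casing merges them...
print("i".upper() == DOTLESS_SMALL_I.upper())  # True

# ...while comparing via lower-casing keeps them distinct.
print("i".lower() == DOTLESS_SMALL_I.lower())  # False

# İ lower-cases to "i" followed by U+0307 COMBINING DOT ABOVE,
# so one character becomes two.
lowered = DOTTED_CAPITAL_I.lower()
print(len(lowered), lowered == "i\u0307")  # 2 True
```

So even outside Turkish locales, whether the comparison folds to upper or to lower case changes which lines compare equal.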


asymmetric lower/upper:

  Add a comment that we don't handle string context cases like
  German sharp s: lower ß -> upper SS.
  But probably include the Unicode chars
  ẞ (upper) -> ß (lower).
  This is also interesting because the upper is
  3 UTF-8 bytes, while the lower is 2 UTF-8 bytes.

  Also the Greek multiple lower sigmas should be tested:
  $ echo σς | LC_ALL=el_GR.UTF-8 sed -e 's/.*/\U&/'
  ΣΣ
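
  The asymmetric cases above can be checked with a short Python sketch. It relies on Python's built-in, locale-independent Unicode tables rather than anything in coreutils, and utf8_len is just a helper name chosen here:

```python
def utf8_len(s: str) -> int:
    # Number of bytes `s` occupies when encoded as UTF-8.
    return len(s.encode("utf-8"))


UPPER_SHARP_S = "\u1e9e"  # ẞ LATIN CAPITAL LETTER SHARP S
LOWER_SHARP_S = "\u00df"  # ß LATIN SMALL LETTER SHARP S

print(utf8_len(UPPER_SHARP_S), utf8_len(LOWER_SHARP_S))  # 3 2

# ẞ lower-cases to ß, but ß upper-cases to the two-letter string "SS",
# so the mapping does not round-trip.
print(UPPER_SHARP_S.lower() == LOWER_SHARP_S)  # True
print(LOWER_SHARP_S.upper())                   # SS

# Both lower-case sigmas (medial σ and final ς) share one upper-case form.
print("σς".upper())  # ΣΣ
```

  A comparison that converts case in place therefore cannot assume the two forms of a character have the same byte length.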

contextual case comparison:

  Character-by-character processing does not always suffice.
  See libunistring's unicase routines.
  At least documenting what we don't compare as equal
  would be useful if we do move to using libunistring in uniq etc.
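
  A rough Python sketch of the difference, with str.casefold() standing in for full case folding. This is not libunistring's API, and the per-character function only approximates a towlower()-style loop; both function names are invented for this example:

```python
def equal_per_char(a: str, b: str) -> bool:
    # Approximates a character-by-character comparison: same number of
    # characters, and each pair equal after lower-casing individually.
    return len(a) == len(b) and all(
        x.lower() == y.lower() for x, y in zip(a, b)
    )


def equal_casefold(a: str, b: str) -> bool:
    # Full case folding over the whole string; one character may fold
    # to several (ß -> ss), and final sigma folds to σ.
    return a.casefold() == b.casefold()


for left, right in [("ABC", "abc"), ("STRASSE", "straße"), ("ΟΔΟΣ", "οδος")]:
    print(left, right, equal_per_char(left, right), equal_casefold(left, right))
```

  Only the first pair compares equal under the per-character model, while all three compare equal under full case folding. Pairs like the last two are the kind worth documenting as "not compared equal" in the tests.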

It's worth looking at the old discussion re join
(where it was mentioned that join/sort/uniq should be treated as a unit
so that there is consistent interaction between them):
  https://crashcourse.housegordon.org/coreutils-multibyte-support.html
  https://lists.gnu.org/archive/html/bug-coreutils/2009-03/msg00102.html
  https://lists.gnu.org/archive/html/coreutils/2010-09/msg00029.html

thanks!
Padraig

