This patch set updates cut(1) to be multi-byte aware.
It is also an attempt to reduce interface divergence across implementations.

I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb

Multi-byte awareness was added to the existing -c, -n, and -d options.
Also considered for compatibility are the -w, -F, and -O options,
as these are present on at least two other common implementations.

# Interface / New functionality

    macOS,  i18n, uutils, Toybox, Busybox, GNU
-c    x      x       x      x        x      x
-n    x      x                              x
-w    x              x                      x
-F                          x        x      x
-O                          x        x      x

-c is needed in any case, as it's specified by all implementations and by POSIX.
-n is also needed, as it's specified by i18n, macOS and POSIX.
-w is somewhat less important, but since it's available
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compatibility.
-F and -O are really just aliases for other options,
so they're trivial to add and probably worthwhile for compatibility.
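
A small illustration of the gap -w fills, assuming the FreeBSD/macOS meaning where a run of blanks acts as one separator. awk's default splitting is used here only as a stand-in for that behavior; it is not the new cut code, and the sample line is made up:

```shell
# Columns padded with runs of blanks, as many tools print them.
line='alpha   beta    gamma'

# Single-character delimiter: every space starts a new field,
# so field 2 is the empty string between the first two spaces.
printf '%s\n' "$line" | cut -d' ' -f2

# Runs of whitespace treated as one separator (awk default splitting,
# an approximation of what -w is meant to provide).
printf '%s\n' "$line" | awk '{ print $2 }'    # beta
```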

Interface / functionality notes:

There is a slight divergence between -n implementations.
FreeBSD and i18n already differed, and we've aligned with the
more sensible FreeBSD behavior, which also matches what POSIX specifies.
The i18n -n implementation is buggy in other respects anyway,
so I doubt this will be a practical compatibility concern.
Specifically, our -n will not output a character unless the
byte range encompasses _the end_ of the multi-byte character.
I.e. the end of a -b range is a limit that is never exceeded, which
ensures that separate cut invocations with non-overlapping byte
ranges never output the same character twice.
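
To make the byte positions concrete, here is a small sketch using only plain -b, which already exists everywhere. The -n outcomes in the comments are what the rule above implies for this input; they are expectations, not captured output:

```shell
# "aéb" in UTF-8: a = byte 1, é = bytes 2-3 (0xC3 0xA9), b = byte 4.
s=$(printf 'a\303\251b')

# Plain -b counts bytes, so a range ending at byte 2 splits the é:
printf '%s\n' "$s" | LC_ALL=C cut -b1-2 | od -An -tx1   # bytes 61 c3, then newline 0a

# Under the -n rule described above (in a UTF-8 locale) one would expect:
#   cut -n -b1-2  ->  "a"    (é ends at byte 3, outside the range)
#   cut -n -b3-4  ->  "éb"   (both characters end inside the range)
# so the two disjoint byte ranges never emit é twice.
```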

-d <regex> from toybox is not implemented.
That's edge-case functionality IMHO and not well suited to cut(1).
awk already supports this, and I think regex functionality
is best left to awk.
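
For reference, a minimal example of the awk alternative. awk's -F accepts an extended regular expression; the input line is made up for illustration:

```shell
# Fields separated by one or more commas or semicolons, which a
# single-character cut -d cannot express.
printf 'a,,b;c\n' | awk -F'[,;]+' '{ print $2 }'    # b
```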

cut is a significant part of the i18n patch, so it will be good
to avoid that downstream divergence.  Unfortunately there were
no tests with the cut i18n implementation.


# Performance

General performance notes:

We prefer byte searching (with -d) as that can be much faster
than character-by-character processing, and it's supported
for single-byte and UTF-8 charsets.
We also use byte searching with -w in single-byte locales.
This was seen to give more than a 100x performance increase over the i18n patch.
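
Byte searching is valid for UTF-8 because the encoding of one character never appears inside the encoding of another, so a byte-wise match of the delimiter is also a character-wise match. A shell sketch of that property follows; `first_field` is a hypothetical helper for illustration only, and cut itself does the search in C:

```shell
# Hypothetical helper: print the text before the first occurrence of a
# delimiter, matching byte by byte (LC_ALL=C) rather than per character.
first_field() (
  LC_ALL=C
  printf '%s\n' "${1%%"$2"*}"
)

first_field 'éá,çé' ','    # ASCII delimiter in UTF-8 data: prints éá
first_field 'éáçéá' 'ç'    # multi-byte delimiter, found byte-wise: prints éá
```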

Where we do use per character processing, we avoid conversion
to wide char when processing ASCII data (mcel provides this optimization).
This was seen to give a 14x performance increase over the i18n patch.

We prefer memchr() and strstr() as these are tuned for specific platforms
on glibc, even if memchr2() or memmem() are algorithmically better.

We maintain the important memory behavior of only buffering when necessary.

Performance testing:

There are _lots_ of combinations and optimization opportunities.
I performance tested this patch set with the following setup:

$ yes | head -n10M > sl.in
$ yes $(yes eeeaae | head -n10K | paste -s -d,) | head -n10K > ll.in
$ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
$ yes $(yes éééááé | head -n9 | paste -s -d,) | head -n1M > mb.in

$ for type in sl ll as mb; do
    cat $type.in >/dev/null;
    for imp in '' src/; do  # '' maps to the system i18n variant on Fedora
      echo ============ "${imp:-i18n}" $type ==============;
      for d in -d, -dc -d, -dç -w -b -c; do
        fields='-f1 -f10 -f100'
        test "$d" = "-b" && { fields='-b1 -b10 -b100'; d=''; }
        test "$d" = "-c" && { fields='-c1 -c10 -c100'; d=''; }
        for f in $fields; do
          for loc in C C.UTF-8; do
            # Skip -b for UTF-8 as it's no different
            test "$loc" = C.UTF-8 && echo "$f" | grep -q -- -b && continue
            # Skip multi-byte delimiters for C as they're not allowed
            test "$loc" = C && test $(echo -n "$d" | wc -c) -ge 4 && continue
            LC_ALL=$loc ${imp}cut $f $d /dev/null 2>/dev/null &&
            hyperfine -m2 -M4 \
             "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null" ||
            printf 'Benchmark 1: %s\n  unsupported\n\n' \
             "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null"
          done;
        done;
      done;
    done;
  done

After a little post-processing of the results, we get:

## cut-i18n

| command         |       sl |       ll |       as |       mb |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d,       |  66.3 ms |  1.605 s | 145.9 ms | 366.4 ms |
| UTF8 -f1 -d,    |  65.8 ms |  1.593 s | 145.8 ms | 370.0 ms |
| C -f10 -d,      | 301.4 ms |  1.590 s | 161.8 ms | 126.7 ms |
| UTF8 -f10 -d,   | 303.5 ms |  1.599 s | 161.8 ms | 124.6 ms |
| C -f100 -d,     | 300.6 ms |  1.596 s | 162.1 ms | 126.7 ms |
| UTF8 -f100 -d,  | 301.3 ms |  1.595 s | 162.0 ms | 124.9 ms |
| C -f1 -dc       |  66.6 ms |  1.845 s | 179.1 ms | 365.7 ms |
| UTF8 -f1 -dc    |  73.8 ms |  1.878 s | 179.1 ms | 363.1 ms |
| C -f10 -dc      | 300.7 ms | 349.8 ms |  76.0 ms | 125.3 ms |
| UTF8 -f10 -dc   | 300.4 ms | 347.2 ms |  75.7 ms | 124.8 ms |
| C -f100 -dc     | 300.1 ms | 348.1 ms |  76.5 ms | 125.5 ms |
| UTF8 -f100 -dc  | 300.8 ms | 348.7 ms |  76.4 ms | 125.8 ms |
| UTF8 -f1 -d,   | 563.5 ms | 21.775 s |  1.963 s |  1.665 s |
| UTF8 -f10 -d,  | 833.6 ms | 20.504 s |  2.022 s |  1.612 s |
| UTF8 -f100 -d, | 825.2 ms | 20.448 s |  2.009 s |  1.616 s |
| UTF8 -f1 -dç    | 563.7 ms | 21.827 s |  1.964 s |  2.319 s |
| UTF8 -f10 -dç   | 825.3 ms | 21.713 s |  2.011 s |  2.248 s |
| UTF8 -f100 -dç  | 831.6 ms | 20.505 s |  2.019 s |  2.276 s |
| C -f1 -w        |        - |        - |        - |        - |
| UTF8 -f1 -w     |        - |        - |        - |        - |
| C -f10 -w       |        - |        - |        - |        - |
| UTF8 -f10 -w    |        - |        - |        - |        - |
| C -f100 -w      |        - |        - |        - |        - |
| UTF8 -f100 -w   |        - |        - |        - |        - |
| C -b1           |  60.8 ms |  1.596 s | 154.8 ms | 313.7 ms |
| C -b10          |  51.6 ms |  1.594 s | 154.3 ms | 310.8 ms |
| C -b100         |  51.4 ms |  1.594 s | 153.0 ms | 312.2 ms |
| C -c1           |  60.7 ms |  1.597 s | 153.8 ms | 313.0 ms |
| UTF8 -c1        | 526.5 ms | 14.662 s |  1.362 s |  1.573 s |
| C -c10          |  51.8 ms |  1.591 s | 153.3 ms | 311.4 ms |
| UTF8 -c10       | 436.9 ms | 14.450 s |  1.336 s |  1.563 s |
| C -c100         |  51.0 ms |  1.593 s | 152.7 ms | 313.2 ms |
| UTF8 -c100      | 426.7 ms | 14.429 s |  1.344 s |  1.551 s |

## src/cut

| command         |       sl |       ll |       as |       mb |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d,       |   4.6 ms | 108.2 ms |  45.4 ms |  24.2 ms |
| UTF8 -f1 -d,    |   4.8 ms | 108.4 ms |  45.4 ms |  24.5 ms |
| C -f10 -d,      |   4.5 ms | 109.3 ms | 123.7 ms |  24.3 ms |
| UTF8 -f10 -d,   |   4.9 ms | 114.1 ms | 124.1 ms |  24.5 ms |
| C -f100 -d,     |   4.7 ms | 119.2 ms | 124.1 ms |  24.5 ms |
| UTF8 -f100 -d,  |   4.8 ms | 120.0 ms | 125.1 ms |  24.5 ms |
| C -f1 -dc       |   4.4 ms | 120.5 ms |  11.9 ms |  24.1 ms |
| UTF8 -f1 -dc    |   4.9 ms | 120.5 ms |  12.1 ms |  24.6 ms |
| C -f10 -dc      |   4.7 ms | 125.3 ms |  11.8 ms |  24.1 ms |
| UTF8 -f10 -dc   |   4.8 ms | 126.7 ms |  12.0 ms |  24.4 ms |
| C -f100 -dc     |   4.6 ms | 127.0 ms |  11.9 ms |  24.3 ms |
| UTF8 -f100 -dc  |   4.7 ms | 126.4 ms |  12.0 ms |  24.4 ms |
| UTF8 -f1 -d,   |   6.0 ms | 169.4 ms |  15.6 ms |  67.4 ms |
| UTF8 -f10 -d,  |   6.1 ms | 173.9 ms |  15.6 ms | 237.2 ms |
| UTF8 -f100 -d, |   6.1 ms | 174.0 ms |  15.6 ms | 237.8 ms |
| UTF8 -f1 -dç    |   6.3 ms | 170.8 ms |  15.7 ms |  32.2 ms |
| UTF8 -f10 -dç   |   6.0 ms | 172.9 ms |  15.9 ms |  32.1 ms |
| UTF8 -f100 -dç  |   6.7 ms | 173.1 ms |  15.5 ms |  32.3 ms |
| C -f1 -w        | 159.6 ms | 170.1 ms |  69.1 ms |  98.9 ms |
| UTF8 -f1 -w     | 128.1 ms |  2.525 s | 246.5 ms |  1.086 s |
| C -f10 -w       | 183.3 ms | 199.2 ms |  74.6 ms | 105.0 ms |
| UTF8 -f10 -w    | 130.3 ms |  2.659 s | 276.5 ms |  1.099 s |
| C -f100 -w      | 183.8 ms | 202.5 ms |  74.1 ms | 103.6 ms |
| UTF8 -f100 -w   | 130.1 ms |  2.663 s | 276.6 ms |  1.097 s |
| C -b1           |  65.0 ms | 110.2 ms |  22.4 ms |  35.6 ms |
| C -b10          |  48.7 ms | 109.6 ms |  24.2 ms |  36.7 ms |
| C -b100         |  48.7 ms | 110.6 ms |  19.0 ms |  36.6 ms |
| C -c1           |  65.8 ms | 109.5 ms |  22.4 ms |  35.6 ms |
| UTF8 -c1        |  63.2 ms |  1.130 s | 116.9 ms | 610.2 ms |
| C -c10          |  48.7 ms | 109.8 ms |  24.3 ms |  36.8 ms |
| UTF8 -c10       |  39.7 ms |  1.133 s | 118.7 ms | 610.0 ms |
| C -c100         |  48.3 ms | 110.7 ms |  18.9 ms |  36.7 ms |
| UTF8 -c100      |  39.4 ms |  1.141 s | 115.0 ms | 598.8 ms |
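
The post-processing step isn't included here. Something along these lines could reduce the benchmark output to one command/mean pair per line; this is a sketch, not the script actually used. `summarize` is a hypothetical name, and the parsing assumes hyperfine's usual "Time (mean ± σ):" text layout plus the "unsupported" marker printed by the loop above:

```shell
# Hypothetical helper: read hyperfine's text output on stdin and print
# "<command><TAB><mean time>" per benchmark, or "-" when unsupported.
summarize() {
  awk '
    /^Benchmark [0-9]+:/ { sub(/^Benchmark [0-9]+: /, ""); cmd = $0; next }
    /^ *Time \(mean/     { print cmd "\t" $5 " " $6 }   # $5 = value, $6 = unit
    /^ *unsupported/     { print cmd "\t-" }
  '
}
```

It would be used by piping the output of the benchmark loop into `summarize`; arranging the pairs into the per-input columns shown above is a further step not sketched here.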


In summary, compared to the i18n patch we're now as fast in all cases,
and much faster in most cases.

We can see the -f byte searching performing well,
ranging from 120x faster when no delimiter matches
down to at least 3x faster when delimiters do match.

When we resort to per-character processing we also compare well,
being 14x faster in the ASCII processing case
(due to mcel short-circuiting the wide char conversion).
Note the mb.in results above also show a 2x win
in per-character processing cases, but the i18n patch would
also have picked up that win, as it was achieved separately
from this patch set:
https://lists.gnu.org/r/coreutils/2026-03/msg00117.html

cheers,
Padraig
