This patch set updates cut(1) to be multi-byte aware. It is also an attempt to reduce interface divergence across implementations.
I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb

Multi-byte awareness was added to the existing -c, -n, and -d options.
Also considered for compatibility are the -w, -F, and -O options,
as these are present on at least two other common implementations.

# Interface / New functionality

macOS, i18n, uutils, Toybox, Busybox, GNU
-c x x x x x x
-n x x x
-w x x x
-F x x x
-O x x x

-c is needed anyway as specified by all, including POSIX.
-n is needed also as specified by i18n/macOS/POSIX.
-w is somewhat less important, but seeing as it's on two other
common platforms (and its functionality is provided on two more),
providing it is worthwhile for compat.
-F and -O are really just aliases to other options so trivial to add,
and probably worthwhile for compatibility.

Interface / functionality notes:

There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n,
and we've aligned with the more sensible FreeBSD implementation.
Note the i18n -n implementation is otherwise buggy in any case,
so I doubt this will be a practical compatibility concern.
Actually -n is specified by POSIX, and it matches FreeBSD.
Specifically our -n will not output a character unless the byte range
encompasses _the end_ of the multi-byte character.
I.e. the -b range is a limit that is never exceeded, which ensures we don't
output overlapping characters for separate cut invocations
that do not have overlapping byte ranges.

-d <regex> from toybox is not implemented.
That's edge case functionality IMHO and not well suited to cut(1).
This functionality is supported by awk, and regex functionality
is best restricted to awk I think.

cut is a significant part of the i18n patch,
so it will be good to avoid that downstream divergence.
Unfortunately there were no tests with the cut i18n implementation.
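As a rough illustration of the -n rule described above, here is a small sketch. The sample string and variable names are made up for illustration, and what the last two commands print depends on which cut is installed: implementations that ignore -n will split the character instead.

```shell
# 'aéb' encoded as UTF-8: 'a' is byte 1, 'é' is bytes 2-3, 'b' is byte 4.
# Octal escapes are used so the byte layout does not depend on the editor.
s=$(printf 'a\303\251b')

# Total number of bytes in the string.
nbytes=$(printf '%s' "$s" | wc -c | tr -d ' ')
echo "bytes: $nbytes"

# Without -n, a byte range can split 'é' after its first byte:
# this emits bytes 61 c3 followed by a newline.
printf '%s\n' "$s" | LC_ALL=C cut -b1-2 | od -An -tx1

# With -n as described above, -b1-2 should select only 'a', since the end
# of 'é' is byte 3 and lies outside the range, while -b3-4 should select
# 'éb'. The two non-overlapping ranges then never emit the same character.
printf '%s\n' "$s" | LC_ALL=C.UTF-8 cut -n -b1-2 | od -An -tx1
printf '%s\n' "$s" | LC_ALL=C.UTF-8 cut -n -b3-4 | od -An -tx1
```

Piping through od makes any split character visible as a lone c3 byte, which is easier to spot than a terminal's replacement glyph.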
# Performance

General performance notes:

We prefer byte searching (with -d) as that can be much faster
than character by character processing, and it's supported
on single byte and UTF-8 charsets.
We also use byte searching with -w on uni-byte locales.
This was seen to give up to 100x performance increase over the i18n patch.

Where we do use per character processing, we avoid conversion
to wide char when processing ASCII data (mcel provides this optimization).
This was seen to give a 14x performance increase over the i18n patch.

We prefer memchr() and strstr() as these are tuned for specific platforms
on glibc, even if memchr2() or memmem() are algorithmically better.

We maintain the important memory behavior of only buffering when necessary.

Performance testing:

There are _lots_ of combinations and optimization opportunities.
I performance tested this patch set with the following setup:

  $ yes | head -n10M > sl.in
  $ yes $(yes eeeaae | head -n10K | paste -s -d,) | head -n10K > ll.in
  $ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
  $ yes $(yes éééááé | head -n9 | paste -s -d,) | head -n1M > mb.in

  $ for type in sl ll as mb; do
      cat $type.in >/dev/null;
      for imp in '' src/; do  # '' maps to the system i18n variant on Fedora
        echo ============ "${imp:-i18n}" $type ==============;
        for d in -d, -dc -d, -dç -w -b -c; do
          fields='-f1 -f10 -f100'
          test "$d" = "-b" && { fields='-b1 -b10 -b100'; d=''; }
          test "$d" = "-c" && { fields='-c1 -c10 -c100'; d=''; }
          for f in $fields; do
            for loc in C C.UTF-8; do
              # Skip -b for UTF-8 as no different
              test "$loc" = C.UTF-8 && echo "$f" | grep -q -- -b && continue
              # Skip multi-byte delimiter for C as not allowed
              test "$loc" = C && test $(echo -n "$d" | wc -c) -ge 4 && continue
              LC_ALL=$loc ${imp}cut $f $d /dev/null 2>/dev/null &&
                hyperfine -m2 -M4 \
                  "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null" ||
                printf 'Benchmark 1: %s\n unsupported\n\n' \
                  "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null"
            done;
          done;
        done;
      done;
    done

After a
little post-processing of the results, we get:

## cut-i18n

| command         | sl       | ll       | as       | mb       |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d,       | 66.3 ms  | 1.605 s  | 145.9 ms | 366.4 ms |
| UTF8 -f1 -d,    | 65.8 ms  | 1.593 s  | 145.8 ms | 370.0 ms |
| C -f10 -d,      | 301.4 ms | 1.590 s  | 161.8 ms | 126.7 ms |
| UTF8 -f10 -d,   | 303.5 ms | 1.599 s  | 161.8 ms | 124.6 ms |
| C -f100 -d,     | 300.6 ms | 1.596 s  | 162.1 ms | 126.7 ms |
| UTF8 -f100 -d,  | 301.3 ms | 1.595 s  | 162.0 ms | 124.9 ms |
| C -f1 -dc       | 66.6 ms  | 1.845 s  | 179.1 ms | 365.7 ms |
| UTF8 -f1 -dc    | 73.8 ms  | 1.878 s  | 179.1 ms | 363.1 ms |
| C -f10 -dc      | 300.7 ms | 349.8 ms | 76.0 ms  | 125.3 ms |
| UTF8 -f10 -dc   | 300.4 ms | 347.2 ms | 75.7 ms  | 124.8 ms |
| C -f100 -dc     | 300.1 ms | 348.1 ms | 76.5 ms  | 125.5 ms |
| UTF8 -f100 -dc  | 300.8 ms | 348.7 ms | 76.4 ms  | 125.8 ms |
| UTF8 -f1 -d,    | 563.5 ms | 21.775 s | 1.963 s  | 1.665 s  |
| UTF8 -f10 -d,   | 833.6 ms | 20.504 s | 2.022 s  | 1.612 s  |
| UTF8 -f100 -d,  | 825.2 ms | 20.448 s | 2.009 s  | 1.616 s  |
| UTF8 -f1 -dç    | 563.7 ms | 21.827 s | 1.964 s  | 2.319 s  |
| UTF8 -f10 -dç   | 825.3 ms | 21.713 s | 2.011 s  | 2.248 s  |
| UTF8 -f100 -dç  | 831.6 ms | 20.505 s | 2.019 s  | 2.276 s  |
| C -f1 -w        | -        | -        | -        | -        |
| UTF8 -f1 -w     | -        | -        | -        | -        |
| C -f10 -w       | -        | -        | -        | -        |
| UTF8 -f10 -w    | -        | -        | -        | -        |
| C -f100 -w      | -        | -        | -        | -        |
| UTF8 -f100 -w   | -        | -        | -        | -        |
| C -b1           | 60.8 ms  | 1.596 s  | 154.8 ms | 313.7 ms |
| C -b10          | 51.6 ms  | 1.594 s  | 154.3 ms | 310.8 ms |
| C -b100         | 51.4 ms  | 1.594 s  | 153.0 ms | 312.2 ms |
| C -c1           | 60.7 ms  | 1.597 s  | 153.8 ms | 313.0 ms |
| UTF8 -c1        | 526.5 ms | 14.662 s | 1.362 s  | 1.573 s  |
| C -c10          | 51.8 ms  | 1.591 s  | 153.3 ms | 311.4 ms |
| UTF8 -c10       | 436.9 ms | 14.450 s | 1.336 s  | 1.563 s  |
| C -c100         | 51.0 ms  | 1.593 s  | 152.7 ms | 313.2 ms |
| UTF8 -c100      | 426.7 ms | 14.429 s | 1.344 s  | 1.551 s  |

## src/cut

| command         | sl       | ll       | as       | mb       |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d,       | 4.6 ms   | 108.2 ms | 45.4 ms  | 24.2 ms  |
| UTF8 -f1 -d,    | 4.8 ms   | 108.4 ms | 45.4 ms  | 24.5 ms  |
| C -f10 -d,      | 4.5 ms   | 109.3 ms | 123.7 ms | 24.3 ms  |
| UTF8 -f10 -d,   | 4.9 ms   | 114.1 ms | 124.1 ms | 24.5 ms  |
| C -f100 -d,     | 4.7 ms   | 119.2 ms | 124.1 ms | 24.5 ms  |
| UTF8 -f100 -d,  | 4.8 ms   | 120.0 ms | 125.1 ms | 24.5 ms  |
| C -f1 -dc       | 4.4 ms   | 120.5 ms | 11.9 ms  | 24.1 ms  |
| UTF8 -f1 -dc    | 4.9 ms   | 120.5 ms | 12.1 ms  | 24.6 ms  |
| C -f10 -dc      | 4.7 ms   | 125.3 ms | 11.8 ms  | 24.1 ms  |
| UTF8 -f10 -dc   | 4.8 ms   | 126.7 ms | 12.0 ms  | 24.4 ms  |
| C -f100 -dc     | 4.6 ms   | 127.0 ms | 11.9 ms  | 24.3 ms  |
| UTF8 -f100 -dc  | 4.7 ms   | 126.4 ms | 12.0 ms  | 24.4 ms  |
| UTF8 -f1 -d,    | 6.0 ms   | 169.4 ms | 15.6 ms  | 67.4 ms  |
| UTF8 -f10 -d,   | 6.1 ms   | 173.9 ms | 15.6 ms  | 237.2 ms |
| UTF8 -f100 -d,  | 6.1 ms   | 174.0 ms | 15.6 ms  | 237.8 ms |
| UTF8 -f1 -dç    | 6.3 ms   | 170.8 ms | 15.7 ms  | 32.2 ms  |
| UTF8 -f10 -dç   | 6.0 ms   | 172.9 ms | 15.9 ms  | 32.1 ms  |
| UTF8 -f100 -dç  | 6.7 ms   | 173.1 ms | 15.5 ms  | 32.3 ms  |
| C -f1 -w        | 159.6 ms | 170.1 ms | 69.1 ms  | 98.9 ms  |
| UTF8 -f1 -w     | 128.1 ms | 2.525 s  | 246.5 ms | 1.086 s  |
| C -f10 -w       | 183.3 ms | 199.2 ms | 74.6 ms  | 105.0 ms |
| UTF8 -f10 -w    | 130.3 ms | 2.659 s  | 276.5 ms | 1.099 s  |
| C -f100 -w      | 183.8 ms | 202.5 ms | 74.1 ms  | 103.6 ms |
| UTF8 -f100 -w   | 130.1 ms | 2.663 s  | 276.6 ms | 1.097 s  |
| C -b1           | 65.0 ms  | 110.2 ms | 22.4 ms  | 35.6 ms  |
| C -b10          | 48.7 ms  | 109.6 ms | 24.2 ms  | 36.7 ms  |
| C -b100         | 48.7 ms  | 110.6 ms | 19.0 ms  | 36.6 ms  |
| C -c1           | 65.8 ms  | 109.5 ms | 22.4 ms  | 35.6 ms  |
| UTF8 -c1        | 63.2 ms  | 1.130 s  | 116.9 ms | 610.2 ms |
| C -c10          | 48.7 ms  | 109.8 ms | 24.3 ms  | 36.8 ms  |
| UTF8 -c10       | 39.7 ms  | 1.133 s  | 118.7 ms | 610.0 ms |
| C -c100         | 48.3 ms  | 110.7 ms | 18.9 ms  | 36.7 ms  |
| UTF8 -c100      | 39.4 ms  | 1.141 s  | 115.0 ms | 598.8 ms |

In summary, compared to the i18n patch we're now as fast in all cases,
and much faster in most cases.
We can see the -f byte searching performing well, ranging from
120x faster in the no matching delimiter case,
to at least 3x faster in the matching delimiter case.

When we resort to per character processing we also compare well,
being 14x faster in the ASCII processing case
(due to mcel short-circuiting the wide char conversion).

Note the mb.in results above also show a 2x win
in the per character processing cases, but the i18n patch would also
pick up that win, as it was achieved separately from this patch set:
https://lists.gnu.org/r/coreutils/2026-03/msg00117.html

cheers,
Padraig
