On 3/30/26 20:04, Collin Funk wrote:
Hi Pádraig,

Pádraig Brady <[email protected]> writes:

This patch set updates cut(1) to be multi-byte aware.
It is also an attempt to reduce interface divergence across implementations.

I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb

Thanks for working on this!

# Interface / New functionality

     macOS,  i18n, uutils, Toybox, Busybox, GNU
-c    x      x       x      x        x      x
-n    x      x                              x
-w    x              x                      x
-F                          x        x      x
-O                          x        x      x

Yay compatibility! (The Android maintainer asked me to try to push it for consistency between command implementations some years back...)

-c is needed anyway as specified by all, including POSIX.
-n is needed also as specified by i18n/macOS/POSIX
-w is somewhat less important, but seeing as it's
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.

"man cut" on debian 12 doesn't have -w and -n says "ignored"? Let's see...

https://man.freebsd.org/cgi/man.cgi?cut

Whitespace. So cut -F without specifying -d. Eh, easy enough to add...

-F and -O are really just aliases to other options
so trivial to add, and probably worthwhile for compatibility.

If I'd found other options that did this nine years ago, I wouldn't have bothered...

I guess people like -w since it has been requested at least a few times,
IIRC. I never really cared for it since 'awk' is easy enough to use to
split at multiple blanks.
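
For anyone following along, a minimal sketch of the awk idiom referred to here. It relies only on awk's default field splitting, which treats any run of blanks and tabs as a single separator. The sample input is made up for illustration.

```shell
# Illustrative input with irregular spacing: three spaces, then a tab and a space.
second=$(printf 'one   two\t three\n' | awk '{ print $2 }')
echo "$second"   # prints: two
```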

Using awk pulls in a dependency on an entire programming language. It's not as heavyweight as perl or python, but it's up there. ("The AWK Programming Language" from 1988 is 228 pages: K&R C second edition is only 236.)

You get "cut" in coreutils as part of the standard set, but awk is its own package with multiple _standalone_ implementations. GNU has gawk, Debian's using mawk, Android's using Brian Kernighan's one-true-awk from 1977 (still maintained apparently, although Kernighan seems to have handed it off to Oz Yigit in 2023)...

I got an awk implementation contributed to toybox (which can't use busybox's because licensing) which is twice the size of sed+tar+grep _combined_ (or at least twice the line count).

I don't think -F and -O are that useful, but there is only so much 'cut'
can do. I don't think someone will come up with divergent behavior for
them. So I guess it is okay.

Interface / functionality notes:

There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n, and
we've aligned with the more sensible FreeBSD implementation.

Oh goddess, what did _they_ do about combining characters...

Note the i18n -n implementation is otherwise buggy in any case,
so I doubt this will be a practical compatibility concern.
Actually -n is specified by POSIX, and it matches FreeBSD.
Specifically our -n will not output a character unless the
byte range encompasses _the end_ of the multi-byte character.
I.e. the -b is a limit that is not passed, and thus ensures
we don't output overlapping characters for separate cut
invocations that do not have overlapping byte ranges.

Huh, I read the man page differently:

  -n  Do not split multi-byte characters.  Characters   will  only  be
      output if at least one byte is selected,  and, after a prefix of
      zero  or  more unselected bytes, the rest of the bytes that form
      the character are selected.

I thought "the rest of the bytes that form the character are selected" meant the selection was expanded to include the end of a partially selected character. (But that was a quick glance, not testing the implementation. I need to set up ssh in my FreeBSD vm so I'm not manually typing every test through the graphical window but can actually script and paste stuff...)
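
One way to pin down the two readings is a small model. This is a sketch under stated assumptions, not either implementation: it handles a single contiguous byte range, assumes well-formed UTF-8 input, and the helper name spans_ending_in is made up. It lists the byte span of every character whose final byte falls inside the range, which is the rule described in the quoted text. Read literally, the man page wording (an unselected prefix, then all remaining bytes selected) seems to reduce to the same test for a single range, whereas the "expand the selection" reading would also emit a character whose first byte is selected but whose last byte is not.

```shell
# spans_ending_in LO HI (hypothetical helper): read UTF-8 bytes on stdin and
# print the 1-based "START-END" byte span of every character whose final
# byte lies within LO..HI. Assumes well-formed UTF-8 and one contiguous range.
spans_ending_in() {
  od -An -v -t u1 | awk -v lo="$1" -v hi="$2" '
    # Report the character that ended at byte "last" if that byte is selected.
    function flush(last) {
      if (start && lo + 0 <= last && last <= hi + 0) { printf "%d-%d\n", start, last }
    }
    {
      for (i = 1; i <= NF; i++) {
        pos++
        b = $i + 0
        if (b < 128 || b >= 192) {  # not a continuation byte: a new character starts here
          flush(pos - 1)
          start = pos
        }
      }
    }
    END { flush(pos) }
  '
}

# "a" is byte 1, "é" is bytes 2-3, "€" is bytes 4-6.
for range in "1 2" "3 4" "5 6"; do
  printf 'a\303\251\342\202\254' | spans_ending_in $range
done
```

In this model the ranges 1-2, 3-4 and 5-6 print 1-1, 2-3 and 4-6 respectively, so three disjoint byte ranges each select exactly one character and none is emitted twice.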

What do they mean by "prefix" there, anyway? I thought combining characters in unicode went _after_ the printable character (so you can never be sure you're done until you overshoot or hit EOF, because Microsoft was on the committee).

I hadn't directly opened the multibyte can of worms yet because "does the range specify bytes or characters" and "does that mean visible characters or combining characters" seemed like a design headache requiring multiple new options I wasn't interested in unilaterally declaring. That said, I'd vaguely assumed the regex engine could be aware of that stuff and it handling iswspace() for me in the "[[:space:]]" stuff was part of the appeal of doing it that way. The old -f was bytes, regex could be Unicode aware via libc, and it wasn't MY immediate problem. :)

-d <regex> from toybox is not implemented.
That's edge case functionality IMHO and not well suited to cut(1).
This functionality is supported by awk, and regex functionality
is best restricted to awk I think.
Agreed.

Ok, I'll bite. What do you think -F does?

  $ toybox --help cut | toybox cut -d $'\n' -f 15-18
  -d  Input delimiter (default is TAB for -f, run of whitespace for -F)
  -D  Don't sort/collate selections or match -fF lines without delimiter
  -f  Select fields (words) separated by single DELIM character
  -F  Select fields separated by DELIM regex
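
To make the -D and -F lines of that help text concrete without assuming a toybox install, here is a rough sketch that uses awk as a stand-in. Treating awk as equivalent to -D or -F is an assumption for illustration, not something tested against toybox, and the inputs are made up.

```shell
# Plain cut -f emits the selected fields in input order, each at most once.
sorted=$(printf 'one two three\n' | cut -d ' ' -f 2,2,1)

# awk emits fields in the order requested, repeats included.
asked=$(printf 'one two three\n' | awk '{ print $2, $2, $1 }')

# A regex field separator: one or more whitespace characters between fields.
regex=$(printf 'one  \t two three\n' | awk -F '[[:space:]][[:space:]]*' '{ print $2 }')

printf '%s\n' "$sorted" "$asked" "$regex"
```

The three lines printed are "one two", "two two one" and "two".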

"cut -F" works like "cut -f" except it treats -d's argument as a regex and changes its default value to "[[:space:]][[:space:]]*". (You have to also specify -D to make "echo one two three | cut -D -d ' ' -f 2,2,1" actually do what was asked of it, but that's a separate issue.)

Rob

P.S. is cut -d $'\n' actually documented in the man page?
