Re: cut multi-byte update and interface consolidation

Rob Landley Tue, 07 Apr 2026 13:16:57 -0700

On 3/30/26 20:04, Collin Funk wrote:

Hi Pádraig,


Pádraig Brady writes:

This patch set updates cut(1) to be multi-byte aware.
It is also an attempt to reduce interface divergence across implementations.

I've put the 60 patches here due to the quantity:
https://github.com/pixelb/coreutils/compare/cut-mb


Thanks for working on this!

# Interface / New functionality

     macOS,  i18n, uutils, Toybox, Busybox, GNU
-c    x      x       x      x        x      x
-n    x      x                              x
-w    x              x                      x
-F                          x        x      x
-O                          x        x      x

Yay compatibility! (The Android maintainer asked me to try to push it for consistency between command implementations some years back...)

-c is needed anyway as specified by all, including POSIX.
-n is needed also as specified by i18n/macOS/POSIX
-w is somewhat less important, but seeing as it's
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.


"man cut" on debian 12 doesn't have -w and -n says "ignored"? Let's see...

https://man.freebsd.org/cgi/man.cgi?cut

Whitespace. So cut -F without specifying -d. Eh, easy enough to add...

-F and -O are really just aliases to other options
so trivial to add, and probably worthwhile for compatibility.

If I'd found other options that did this nine years ago, I wouldn't have bothered...

I guess people like -w since it has been requested at least a few times,
IIRC. I never really cared for it since 'awk' is easy enough to use to
split at multiple blanks.

It pulls in a dependency on an entire programming language. It's not as heavyweight as perl or python, but it's up there. ("The AWK Programming Language" from 1988 is 228 pages: K&R C second edition is only 236.)

You get "cut" in coreutils as part of the standard set, but awk is its own package with multiple _standalone_ implementations. Gnu has gawk, debian's using mawk, android's using Brian Kernighan's one-true-awk from 1974 (still maintained apparently, although Kernighan seems to have handed it off to Oz Yigit in 2023)...

I got an awk implementation contributed to toybox (which can't use busybox's because licensing) which is twice the size of sed+tar+grep _combined_ (or at least twice the line count).

I don't think -F and -O are that useful, but there is only so much 'cut'
can do. I don't think someone will come up with divergent behavior for
them. So I guess it is okay.

Interface / functionality notes:

There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n, and
we've aligned with the more sensible FreeBSD implementation.


Oh goddess, what did _they_ do about combining characters...

Note the i18n -n implementation is otherwise buggy in any case,
so I doubt this will be a practical compatibility concern.
Actually -n is specified by POSIX, and it matches FreeBSD.
Specifically our -n will not output a character unless the
byte range encompasses _the end_ of the multi-byte character.
I.e. the -b is a limit that is not passed, and thus ensures
we don't output overlapping characters for separate cut
invocations that do not have overlapping byte ranges.
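
The rule described above can be sketched in a few lines of Python. This is
only an illustration of the stated behaviour, not the coreutils or FreeBSD
code: it assumes UTF-8 input and a single byte range, and the function names
(char_spans, cut_b_n) are made up for the example.

```python
def char_spans(data: bytes):
    """Yield (first_byte, last_byte, char) with 1-based byte positions (UTF-8 assumed)."""
    pos = 1
    for ch in data.decode("utf-8"):
        n = len(ch.encode("utf-8"))
        yield pos, pos + n - 1, ch
        pos += n


def cut_b_n(data: bytes, lo: int, hi: int) -> str:
    """Hypothetical model of 'cut -b lo-hi -n': keep a character only if
    its LAST byte lies inside the selected range."""
    return "".join(ch for _, end, ch in char_spans(data) if lo <= end <= hi)


data = "aéb".encode("utf-8")   # bytes: a=1, é=2..3, b=4
print(cut_b_n(data, 1, 2))     # é ends at byte 3, so it is not emitted here
print(cut_b_n(data, 3, 4))     # é is emitted by the range that holds its end
```

Under this model, adjacent non-overlapping ranges such as 1-2 and 3-4 never
emit the same character twice, which is the property claimed above.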


Huh, I read the man page differently:

  -n  Do not split multi-byte characters.  Characters   will  only  be
      output if at least one byte is selected,  and, after a prefix of
      zero  or  more unselected bytes, the rest of the bytes that form
      the character are selected.

I thought "the rest of the bytes that form the character are selected" meant the selection was expanded to include the end of a partially selected character. (But that was a quick glance, not testing the implementation. I need to set up ssh in my FreeBSD vm so I'm not manually typing every test through the graphical window but can actually script and paste stuff...)

What do they mean by "prefix" there, anyway? I thought combining characters in unicode went _after_ the printable character (so you can never be sure you're done until you overshoot or hit EOF, because Microsoft was on the committee).
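
A small Python sketch of the "overshoot or hit EOF" point: combining marks
follow their base character, so a visible character is only known to be
complete once the next non-combining code point (or end of input) is seen.
The clusters() helper is hypothetical and deliberately simplified; it uses
only unicodedata.combining() and is not a full grapheme-cluster
segmentation.

```python
import unicodedata


def clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    out: list[str] = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # nonzero combining class: attach to previous base
        else:
            out.append(ch)     # starts a new visible character
    return out


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x"
print(clusters("e\u0301x"))
```

Here the first cluster is two code points (three bytes in UTF-8) but one
visible character, which is why "bytes or characters" and "which kind of
character" are separate questions for a range.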

I hadn't directly opened the multibyte can of worms yet because "does the range specify bytes or characters" and "does that mean visible characters or combining characters" seemed like a design headache requiring multiple new options I wasn't interested in unilaterally declaring. That said, I'd vaguely assumed the regex engine could be aware of that stuff and it handling iswspace() for me in the "[[:space:]]" stuff was part of the appeal of doing it that way. The old -f was bytes, regex could be unicode aware via libc, and it wasn't MY immediate problem. :)

-d <regex> from toybox is not implemented.

That's edge case functionality IMHO and not well suited to cut(1).
This functionality is supported by awk, and regex functionality
is best restricted to awk I think.

Agreed.


Ok, I'll bite. What do you think -F does?

  $ toybox --help cut | toybox cut -d $'\n' -f 15-18
  -d  Input delimiter (default is TAB for -f, run of whitespace for -F)
  -D  Don't sort/collate selections or match -fF lines without delimiter
  -f  Select fields (words) separated by single DELIM character
  -F  Select fields separated by DELIM regex

"cut -F" works like "cut -f" except it treats -d's argument as a regex and changes its default value to "[[:space:]][[:space:]]*". (You have to also specify -D to make "echo one two three | cut -D -d ' ' -f 2,2,1" actually do what was asked of it, but that's a separate issue.)

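A rough Python model of the -F behaviour described above, for illustration
only (it is not toybox's code). Assumptions: Python's r"\s+" stands in for
"[[:space:]][[:space:]]*", the default output delimiter is a single space,
selections are sorted and de-duplicated unless keep_order is set (standing
in for -D), and the name cut_F and its parameters are hypothetical.

```python
import re


def cut_F(line, fields, delim=r"\s+", out_delim=" ", keep_order=False):
    """Split 'line' on a regex delimiter and return the selected 1-based fields."""
    parts = re.split(delim, line)
    if not keep_order:
        # Model the default sort/collate of selections (no -D).
        fields = sorted(set(fields))
    return out_delim.join(parts[i - 1] for i in fields if 1 <= i <= len(parts))


print(cut_F("one   two\tthree", [2]))                        # runs of whitespace are one delimiter
print(cut_F("one two three", [2, 2, 1]))                     # collated selection
print(cut_F("one two three", [2, 2, 1], keep_order=True))    # selection in the order given
```
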

Rob

P.S. is cut -d $'\n' actually documented in the man page?
