Package: coreutils
Version: 8.32-4
Severity: normal
File: /usr/bin/cut

Dear Maintainer,

POSIX.1-2008 says:
-- >8 --
-n
    Do not split characters. When specified with the -b option,
        each element in list of the form low-high
        (<hyphen-minus>-separated numbers) shall be modified as follows:

        *       If the byte selected by low is not the first byte
                of a character, low shall be decremented to select
                the first byte of the character originally selected by low.
                If the byte selected by high is not the last byte of a character,
                high shall be decremented to select the last byte of the character
                prior to the character originally selected by high,
                or zero if there is no prior character.
                If the resulting range element has high equal to zero
                or low greater than high, the list element shall be
                dropped from list for that input line without causing an error.

        Each element in list of the form low- shall be treated as above
        with high set to the number of bytes in the current line,
        not including the terminating <newline>.
        Each element in list of the form -high shall be treated as above
        with low set to 1.
        Each element in list of the form num (a single number)
        shall be treated as above with low set to num and high set to num.
-- >8 --
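
For reference, the adjustment quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of any real cut: it assumes the line is valid UTF-8 with the trailing newline already removed, and the names is_continuation, adjust_range and cut_nb are made up for this sketch.

```python
from typing import Optional, Tuple


def is_continuation(byte: int) -> bool:
    """True for a UTF-8 continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def adjust_range(line: bytes, low: int, high: int) -> Optional[Tuple[int, int]]:
    """Round a 1-based inclusive byte range to character boundaries.

    Returns the adjusted (low, high), or None if the element is dropped.
    Assumption: `line` is valid UTF-8 without its terminating newline.
    """
    n = len(line)
    if low < 1 or low > n:
        return None  # nothing on this line is selected by low
    high = min(high, n)

    # low: step back to the first byte of the character it falls in.
    while low > 1 and is_continuation(line[low - 1]):
        low -= 1

    # high: if the byte after it is a continuation byte, high splits a
    # character, so step back to the last byte of the prior character
    # (or to zero if there is no prior character).
    if high < n and is_continuation(line[high]):
        while high > 1 and is_continuation(line[high - 1]):
            high -= 1
        high -= 1

    if high == 0 or low > high:
        return None  # dropped for this line, without an error
    return low, high


def cut_nb(line: bytes, low: int, high: Optional[int] = None) -> bytes:
    """Hypothetical equivalent of `cut -nb low-high` for one line.

    high=None stands for the open-ended form `low-`.
    """
    adjusted = adjust_range(line, low, len(line) if high is None else high)
    if adjusted is None:
        return b""
    lo, hi = adjusted
    return line[lo - 1:hi]


LINE = "яйцо".encode("utf-8")  # 8 bytes, 2 bytes per character

# The same ranges as used in the transcripts of this report.
assert cut_nb(LINE, 1, 5) == "яй".encode("utf-8")
assert cut_nb(LINE, 6) == "цо".encode("utf-8")
assert cut_nb(LINE, 1, 3) == "я".encode("utf-8")
assert cut_nb(LINE, 4) == "йцо".encode("utf-8")
```

The four assertions are the outputs the quoted rules require for those ranges.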


The standard also gives a more succinct example that drives the point home:
-- >8 --
Earlier versions of the cut utility worked in an environment
where bytes and characters were considered equivalent
(modulo <backspace> and <tab> processing in some implementations).
In the extended world of multi-byte characters,
the new -b option has been added.
The -n option (used with -b) allows it to be used to act on bytes
rounded to character boundaries.
The algorithm specified for -n guarantees that:
        cut -b 1-500 -n file > file1
        cut -b 501- -n file > file2
ends up with all the characters in file appearing exactly once
in file1 or file2. (There is, however, a <newline> in both
file1 and file2 for each <newline> in file.)
-- >8 --
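
The guarantee in that passage boils down to both halves agreeing on a single split point: the largest character boundary at or before byte k. A rough Python sketch of that reading follows, assuming valid UTF-8 input and using the made-up names split_point and split_line (neither comes from POSIX or from any cut implementation):

```python
def split_point(line: bytes, k: int) -> int:
    """Largest character boundary at or before byte offset k.

    Assumption: `line` is valid UTF-8 without its terminating newline.
    """
    k = min(k, len(line))
    # line[k] is the byte just after the split; if it is a continuation
    # byte (10xxxxxx) the split would land inside a character, so back up.
    while 0 < k < len(line) and (line[k] & 0xC0) == 0x80:
        k -= 1
    return k


def split_line(line: bytes, k: int):
    """Model of `cut -b 1-k -n` and `cut -b (k+1)- -n` on one line."""
    p = split_point(line, k)
    return line[:p], line[p:]


line = "яйцо".encode("utf-8")
for k in range(1, len(line) + 1):
    head, tail = split_line(line, k)
    # Every character lands in exactly one half ...
    assert head + tail == line
    # ... and neither half contains a partial character.
    head.decode("utf-8")
    tail.decode("utf-8")
```

For k = 5 this gives the 4-byte "яй" and the 4-byte "цо", which is exactly the pair a conforming `cut -nb 1-5` and `cut -nb 6-` should produce.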


So, compare a conforming implementation:
-- >8 --
$ printf 'яйцо\nЯЙЦО' | ./out/cmd/cut -nb 1-5
яй
ЯЙ
$ printf 'яйцо\nЯЙЦО' | ./out/cmd/cut -nb 6-
цо
ЦО
$ printf 'яйцо\nЯЙЦО' | ./out/cmd/cut -nb 1-4
яй
ЯЙ
$ printf 'яйцо\nЯЙЦО' | ./out/cmd/cut -nb 5-
цо
ЦО
$ printf 'яйцо\nЯЙЦО' | ./out/cmd/cut -nb 1-3
я
Я
$ printf 'яйцо\nЯЙЦО' | ./out/cmd/cut -nb 4-
йцо
ЙЦО
-- >8 --

With the garbage that GNU cut spews:
-- >8 --
$ printf 'яйцо\nЯЙЦО' | cut -nb 1-5
яй�
ЯЙ�
$ printf 'яйцо\nЯЙЦО' | cut -nb 6-
�о
�О
$ printf 'яйцо\nЯЙЦО' | cut -nb 1-4
яй
ЯЙ
$ printf 'яйцо\nЯЙЦО' | cut -nb 5-
цо
ЦО
$ printf 'яйцо\nЯЙЦО' | cut -nb 1-3
я�
Я�
$ printf 'яйцо\nЯЙЦО' | cut -nb 4-
�цо
�ЦО
-- >8 --

Or, without the luxury of REPLACEMENT CHARACTER:
-- >8 --
$ printf 'яйцо\nЯЙЦО' | cut -nb 1-5 | hexdump -C
00000000  d1 8f d0 b9 d1 0a d0 af  d0 99 d0 0a              |............|
0000000c
$ printf 'яйцо\nЯЙЦО' | cut -nb 6-  | hexdump -C
00000000  86 d0 be 0a a6 d0 9e 0a                           |........|
00000008
$ printf 'яйцо\nЯЙЦО' | cut -nb 1-4 | hexdump -C
00000000  d1 8f d0 b9 0a d0 af d0  99 0a                    |..........|
0000000a
$ printf 'яйцо\nЯЙЦО' | cut -nb 5-  | hexdump -C
00000000  d1 86 d0 be 0a d0 a6 d0  9e 0a                    |..........|
0000000a
$ printf 'яйцо\nЯЙЦО' | cut -nb 1-3 | hexdump -C
00000000  d1 8f d0 0a d0 af d0 0a                           |........|
00000008
$ printf 'яйцо\nЯЙЦО' | cut -nb 4-  | hexdump -C
00000000  b9 d1 86 d0 be 0a 99 d0  a6 d0 9e 0a              |............|
0000000c
-- >8 --
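
To make the damage explicit, the first-line bytes from the hexdumps above can be fed to a strict UTF-8 decoder. The snippet below is just a quick check; the byte strings are copied from the dumps (up to the first 0a), and the helper name is_valid_utf8 is made up for this sketch.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# First output line of each GNU cut invocation, taken from the hexdumps.
gnu_first_line = {
    "1-5": bytes.fromhex("d1 8f d0 b9 d1"),
    "6-": bytes.fromhex("86 d0 be"),
    "1-4": bytes.fromhex("d1 8f d0 b9"),
    "5-": bytes.fromhex("d1 86 d0 be"),
    "1-3": bytes.fromhex("d1 8f d0"),
    "4-": bytes.fromhex("b9 d1 86 d0 be"),
}

for byte_list, data in gnu_first_line.items():
    verdict = "valid" if is_valid_utf8(data) else "INVALID"
    print(f"-nb {byte_list}: {verdict} UTF-8")
```

Only the two ranges that happen to fall on character boundaries (1-4 and 5-) decode cleanly; the other four end or begin in the middle of a character.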


If we consult the manual, we can see:
-- >8 --
$ man cut | grep -C3 -- -n
              select  only  these fields;  also print any line that contains no delimiter character, unless the -s op‐
              tion is specified

       -n     (ignored)

       --complement
              complement the set of selected bytes, characters or fields
-- >8 --

If I hadn't seen the dog-water output above, I would've assumed this was
a joke, and a bad one. But I have, and I don't think I can classify this
as anything but "actively malicious".

Either don't recognise -n at all, or implement it.
Don't destroy the input while openly flouting the standard.

наб

-- System Information:
Debian Release: 11.0
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: x32 (x86_64)
Foreign Architectures: amd64, i386

Kernel: Linux 5.10.0-8-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages coreutils depends on:
ii  libacl1      2.2.53-10
ii  libattr1     1:2.4.48-6
ii  libc6        2.31-16
ii  libgmp10     2:6.2.1+dfsg-1
ii  libselinux1  3.1-3

coreutils recommends no packages.

coreutils suggests no packages.

-- no debconf information
