On 15/10/2025 13:34, Michael Cornelison wrote:
The Linux shell command: $ cut -c6- de.text > de2.text
outputs 2114 correct lines with first 5 characters removed.
From line 2115, the two characters (hex 80, hex AF) are prepended to every
output line.
The rest of each output line is correct.

I have attached the file "de.text" which triggers this bug.

I am using Ubuntu 25.04 in case that matters.

regards
Mike Cornelison

The issue is that cut(1) does not yet support multi-byte characters,
and treats -c like -b.  As a result, cut(1) can output a partial
multi-byte character. In your case, the following shows that
-c6- starts its output in the middle of the three-byte UTF-8 encoding
(e2 80 af) of the NARROW NO-BREAK SPACE character (U+202F):

  LC_ALL=de_DE.UTF-8 git/coreutils/src/cut -c1-10 de.text |
   head -n2115 | tail -n1 | od -Ax -tx1z -v
  000000 33 31 30 30 e2 80 af c3 9c 62 0a                 >3100.....b.<

This is already on our TODO list.
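
For illustration, here is a small Python sketch (not coreutils code) of
the difference between byte-wise and character-wise selection. The sample
line is an assumption reconstructed from the od dump above, not read from
de.text, and the helper names cut_bytes and cut_chars are hypothetical:

```python
# Sketch only: contrasts byte-wise selection (what cut -c currently does)
# with character-wise selection (what a multi-byte aware cut -c would do).

# Assumed sample line, rebuilt from the od dump: "3100", U+202F, U+00DC, "b".
LINE = "3100\u202f\u00dcb"


def cut_bytes(raw: bytes, start: int) -> bytes:
    """Keep bytes from 1-based position `start` onward, like `cut -b START-`."""
    return raw[start - 1:]


def cut_chars(text: str, start: int) -> str:
    """Keep characters from 1-based position `start` onward."""
    return text[start - 1:]


if __name__ == "__main__":
    raw = LINE.encode("utf-8")
    # The encoded line matches the first 10 bytes shown by od.
    print(raw.hex(" "))
    # Byte 6 is the second byte of U+202F, so the output starts with 80 af.
    print(cut_bytes(raw, 6).hex(" "))
    # Character 6 is U+00DC, so no partial character is emitted.
    print(cut_chars(LINE, 6).encode("utf-8").hex(" "))
```

The byte-wise result begins with the stray 80 af bytes from the report,
while the character-wise result begins at the next whole character.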

thank you,
Padraig



