Re: fold: add the --characters option

Collin Funk Sat, 30 Aug 2025 08:21:44 -0700

Pádraig Brady <[email protected]> writes:

> 1. coreutils printf is more restrictive than bash in what \u characters
> it supports.  I.e. bash supports "unpaired surrogates" and "unicode
> noncharacters".


Good catch. Bash's lib/sh/unicode.c has the following, which encodes
surrogates and values past U+10FFFF without complaint:

    int
    u32toutf8 (u_bits32_t wc, char *s)
    {
      [...]
      else if (wc < 0x10000)
        {
          /* Technically, we could return 0 here if 0xd800 <= wc <= 0x0dfff */
          s[0] = (wc >> 12) | 0xe0;
          s[1] = ((wc >> 6) & 0x3f) | 0x80;
          s[2] = (wc & 0x3f) | 0x80;
          l = 3;
        }
      [...]
      /* Strictly speaking, UTF-8 doesn't have characters longer than 4 bytes */
      else if (wc < 0x04000000)
        {
          s[0] = (wc >> 24) | 0xf8;
          s[1] = ((wc >> 18) & 0x3f) | 0x80;
          s[2] = ((wc >> 12) & 0x3f) | 0x80;
          s[3] = ((wc >>  6) & 0x3f) | 0x80;
          s[4] = (wc & 0x3f) | 0x80;
          l = 5;
        }
      [...]
      return l;
    }

I'll have a look at changing that.
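
For illustration only, a stricter encoder might look roughly like the
sketch below. The name strict_u32toutf8 and the convention of returning
0 on rejection are assumptions made for this example, not the actual
bash code or a proposed patch. It refuses surrogates (U+D800..U+DFFF)
and anything beyond U+10FFFF, and deliberately leaves noncharacters
alone, since they are still encodable in UTF-8.

```c
#include <stdint.h>

/* Hypothetical stricter encoder; not the actual bash function.
   Returns the number of bytes written to S (1 to 4), or 0 if WC is a
   surrogate (U+D800..U+DFFF) or lies beyond U+10FFFF.
   S must have room for at least 4 bytes.  */
static int
strict_u32toutf8 (uint32_t wc, char *s)
{
  if (wc < 0x80)
    {
      s[0] = (char) wc;
      return 1;
    }
  if (wc < 0x800)
    {
      s[0] = (char) ((wc >> 6) | 0xc0);
      s[1] = (char) ((wc & 0x3f) | 0x80);
      return 2;
    }
  if (wc < 0x10000)
    {
      /* Reject surrogates instead of encoding them.  */
      if (0xd800 <= wc && wc <= 0xdfff)
        return 0;
      s[0] = (char) ((wc >> 12) | 0xe0);
      s[1] = (char) (((wc >> 6) & 0x3f) | 0x80);
      s[2] = (char) ((wc & 0x3f) | 0x80);
      return 3;
    }
  if (wc <= 0x10ffff)
    {
      s[0] = (char) ((wc >> 18) | 0xf0);
      s[1] = (char) (((wc >> 12) & 0x3f) | 0x80);
      s[2] = (char) (((wc >> 6) & 0x3f) | 0x80);
      s[3] = (char) ((wc & 0x3f) | 0x80);
      return 4;
    }
  /* UTF-8 has no 5- or 6-byte forms.  */
  return 0;
}
```

With this sketch, strict_u32toutf8 (0xDEAD, buf) returns 0, while
U+0089 still encodes as the two bytes C2 89.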

> So the following should be better:
>
> diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
> index cd17aa176..c29b4bdd6 100755
> --- a/tests/fold/fold-characters.sh
> +++ b/tests/fold/fold-characters.sh
> @@ -83,9 +83,13 @@ compare exp3 out3 || fail=1
>   # Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
>   bad_unicode ()
>   {
> -  printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
> +  # invalid UTF8|unpaired surrogate|NUL|C1 control|noncharacter
> +  env printf '\xC3|\xED\xBA\xAD|\u0000|\u0089|\xED\xA6\xBF\xED\xBF\xBF\n'
>   }
> +bad_unicode > /dev/null || framework_failure_
>   test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
> +# Check bad character at EOF
> +test $(env printf '\xC3' | fold | wc -c) = 1 || fail=1
>
>   # Ensure bounded memory operation
>   vm=$(get_min_ulimit_v_ fold /dev/null) && {

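As a sanity check on the raw bytes in the new printf line: they are
what the three-byte branch quoted above yields for the old \u escapes.
A small sketch, where encode3_lenient is a hypothetical helper that
mirrors only that branch and is not part of bash or coreutils:

```c
#include <stdint.h>

/* Hypothetical helper mirroring only the three-byte branch of the
   quoted u32toutf8, with no surrogate check.
   Meaningful for 0x800 <= WC < 0x10000.  */
static void
encode3_lenient (uint32_t wc, unsigned char s[3])
{
  s[0] = (unsigned char) ((wc >> 12) | 0xe0);
  s[1] = (unsigned char) (((wc >> 6) & 0x3f) | 0x80);
  s[2] = (unsigned char) ((wc & 0x3f) | 0x80);
}
```

U+DEAD comes out as ED BA AD, U+D9BF as ED A6 BF, and U+DFFF as
ED BF BF, matching the \x sequences in the test.
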
Thanks for the help with this. I applied my original patch with these
changes [1].

Collin

[1] https://github.com/coreutils/coreutils/commit/89b9115da67d819e01b4aa541e4672b21e48b250
