Pádraig Brady <[email protected]> writes:
> 1. coreutils printf is more restrictive than bash in what \u characters it
> supports.
> I.e. bash supports "unpaired surrogates" and "unicode noncharacters".
Good catch. Bash's lib/sh/unicode.c has the following:
int
u32toutf8 (u_bits32_t wc, char *s)
{
[...]
  else if (wc < 0x10000)
    {
      /* Technically, we could return 0 here if 0xd800 <= wc <= 0x0dfff */
      s[0] = (wc >> 12) | 0xe0;
      s[1] = ((wc >> 6) & 0x3f) | 0x80;
      s[2] = (wc & 0x3f) | 0x80;
      l = 3;
    }
[...]
  /* Strictly speaking, UTF-8 doesn't have characters longer than 4 bytes */
  else if (wc < 0x04000000)
    {
      s[0] = (wc >> 24) | 0xf8;
      s[1] = ((wc >> 18) & 0x3f) | 0x80;
      s[2] = ((wc >> 12) & 0x3f) | 0x80;
      s[3] = ((wc >> 6) & 0x3f) | 0x80;
      s[4] = (wc & 0x3f) | 0x80;
      l = 5;
    }
[...]
  return l;
}
I'll have a look at changing that.
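
A stricter encoder might look roughly like the sketch below. This is only an illustration of the idea, not bash's code and not a proposed patch: the name u32toutf8_strict and the use of uint32_t in place of u_bits32_t are assumptions made here so the example stands alone. It returns 0 for surrogates (0xD800..0xDFFF) and for anything above 0x10FFFF, and it leaves noncharacters such as U+FFFE alone, since whether to reject those is a separate decision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stricter variant of u32toutf8; not taken from bash.
   Store the UTF-8 encoding of WC in S, which must have room for at
   least 4 bytes, and return the number of bytes written.  Return 0
   if WC is a surrogate or is greater than 0x10FFFF.  */
int
u32toutf8_strict (uint32_t wc, char *s)
{
  assert (s != NULL);

  if (wc < 0x80)
    {
      s[0] = (char) wc;
      return 1;
    }
  if (wc < 0x800)
    {
      s[0] = (char) ((wc >> 6) | 0xc0);
      s[1] = (char) ((wc & 0x3f) | 0x80);
      return 2;
    }
  if (wc < 0x10000)
    {
      /* Surrogates are not valid scalar values; refuse to encode them.  */
      if (0xd800 <= wc && wc <= 0xdfff)
        return 0;
      s[0] = (char) ((wc >> 12) | 0xe0);
      s[1] = (char) (((wc >> 6) & 0x3f) | 0x80);
      s[2] = (char) ((wc & 0x3f) | 0x80);
      return 3;
    }
  if (wc <= 0x10ffff)
    {
      s[0] = (char) ((wc >> 18) | 0xf0);
      s[1] = (char) (((wc >> 12) & 0x3f) | 0x80);
      s[2] = (char) (((wc >> 6) & 0x3f) | 0x80);
      s[3] = (char) ((wc & 0x3f) | 0x80);
      return 4;
    }

  /* Beyond the Unicode range: no 5- or 6-byte forms.  */
  return 0;
}
```

A caller would treat a return value of 0 as "not encodable" and decide separately how to report that.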
> So the following should be better:
>
> diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
> index cd17aa176..c29b4bdd6 100755
> --- a/tests/fold/fold-characters.sh
> +++ b/tests/fold/fold-characters.sh
> @@ -83,9 +83,13 @@ compare exp3 out3 || fail=1
> # Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
> bad_unicode ()
> {
> - printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
> + # invalid UTF8|unpaired surrogate|NUL|C1 control|noncharacter
> + env printf '\xC3|\xED\xBA\xAD|\u0000|\u0089|\xED\xA6\xBF\xED\xBF\xBF\n'
> }
> +bad_unicode > /dev/null || framework_failure_
> test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
> +# Check bad character at EOF
> +test $(env printf '\xC3' | fold | wc -c) = 1 || fail=1
>
> # Ensure bounded memory operation
> vm=$(get_min_ulimit_v_ fold /dev/null) && {
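
For reference, the raw bytes in the new printf line are what the 3-byte branch of u32toutf8 quoted earlier produces when it is handed the surrogate code points directly. The helper below is a made-up name (encode3) used only to show the arithmetic; it is not part of bash or coreutils.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: apply the 3-byte UTF-8 bit pattern to WC with
   no surrogate check, mirroring the arithmetic of the wc < 0x10000
   branch.  S must have room for 3 bytes.  */
void
encode3 (uint32_t wc, unsigned char *s)
{
  assert (0x800 <= wc && wc < 0x10000);
  s[0] = (unsigned char) ((wc >> 12) | 0xe0);
  s[1] = (unsigned char) (((wc >> 6) & 0x3f) | 0x80);
  s[2] = (unsigned char) ((wc & 0x3f) | 0x80);
}
```

With this, 0xDEAD comes out as ED BA AD, and 0xD9BF followed by 0xDFFF comes out as ED A6 BF ED BF BF, which matches the \x escapes in the test.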
Thanks for the help with this. I applied my original patch with these
changes [1].

Collin

[1] https://github.com/coreutils/coreutils/commit/89b9115da67d819e01b4aa541e4672b21e48b250