On 28/08/2025 11:09, Pádraig Brady wrote:
On 28/08/2025 02:43, Collin Funk wrote:
Collin Funk <[email protected]> writes:
000000
LC_ALL=en_US.UTF-8 src/fold
000000
LC_ALL=C /bin/fold
000000 c3 >.<
LC_ALL=en_US.UTF-8 /bin/fold
000000 c3 >.<
I suppose a concrete way to test that might be:
# https://datatracker.ietf.org/doc/rfc9839/
bad_unicode() { printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n'; }
test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
Thanks, I'll have a look at it later today.
Good catch, I see what the issue is.
I kept reading the file if we had a byte that may be an invalid
multi-byte sequence. I did not handle the case of an invalid multi-byte
character being at the end of the file. Therefore, \xC3 was buffered but
never printed.
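A quick way to spot this kind of dropped trailing byte is to compare byte counts before and after the filter. This is only a rough sketch: same_byte_count is a hypothetical helper, not part of the test framework, and it uses an octal escape for \xC3 so that any POSIX printf will do.

```shell
# Hypothetical helper: succeed if the filter given after the first
# argument writes exactly as many bytes as it reads.
# The first argument is used as a printf format, so avoid '%' in it.
same_byte_count() {
  input=$1; shift
  in_count=$(printf "$input" | wc -c)
  out_count=$(printf "$input" | "$@" | wc -c)
  # Unquoted on purpose: some wc implementations pad with spaces.
  test $in_count -eq $out_count
}

# '\303' is the lone byte 0xC3, an incomplete UTF-8 sequence at EOF.
same_byte_count '\303' fold || echo 'fold dropped the trailing byte' >&2
```

With the bug described above, fold would emit zero bytes here and the message would be printed.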
This patch fixes it. The test feels a bit small for its own file, but
maybe it should get one anyway so we can add more test cases?
WDYT?
Looks good. I'd include these test tweaks:
thanks!
Padraig
diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index cd17aa176..e8facb224 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -83,9 +83,11 @@ compare exp3 out3 || fail=1
# Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
bad_unicode ()
{
- printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
+ env printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
}
test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
+# Check bad character at EOF
+test $(env printf '\xC3' | fold | wc -c) = 1 || fail=1
# Ensure bounded memory operation
vm=$(get_min_ulimit_v_ fold /dev/null) && {
Actually the above shows that:
1. coreutils printf is more restrictive than bash's builtin printf in which
\u escapes it supports.
I.e., bash accepts "unpaired surrogates" and "Unicode noncharacters".
2. framework_failure_ is ineffective within a function.
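Both points can be illustrated with a small sketch. It assumes the reason for point 2 is that the function runs inside a pipeline, so an exit only leaves that pipeline's subshell. fake_failure and producer are hypothetical stand-ins, not the framework's own functions.

```shell
# Point 1: spell the surrogate U+DEAD as raw bytes with octal escapes,
# which POSIX printf accepts, instead of relying on \u support.
surrogate_bytes() { printf '\355\272\255'; }   # bytes ED BA AD

# Point 2: an exit inside a function that runs in a pipeline only
# leaves that pipeline's subshell, so the caller carries on.
# fake_failure stands in for framework_failure_ (assumed to exit).
fake_failure() { exit 99; }
producer() { false || fake_failure; }

survived=$({ producer | cat; echo still-running; })
echo "$survived"
```

The echo after the pipeline still runs, which is why checking the function once outside the pipeline is the more reliable approach.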
So the following should be better:
diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index cd17aa176..c29b4bdd6 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -83,9 +83,13 @@ compare exp3 out3 || fail=1
# Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
bad_unicode ()
{
- printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
+ # invalid UTF8|unpaired surrogate|NUL|C1 control|noncharacter
+ env printf '\xC3|\xED\xBA\xAD|\u0000|\u0089|\xED\xA6\xBF\xED\xBF\xBF\n'
}
+bad_unicode > /dev/null || framework_failure_
test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
+# Check bad character at EOF
+test $(env printf '\xC3' | fold | wc -c) = 1 || fail=1
# Ensure bounded memory operation
vm=$(get_min_ulimit_v_ fold /dev/null) && {