Re: fold: add the --characters option

Pádraig Brady Thu, 28 Aug 2025 03:57:14 -0700

On 28/08/2025 11:09, Pádraig Brady wrote:

On 28/08/2025 02:43, Collin Funk wrote:

Collin Funk <[email protected]> writes:

000000
LC_ALL=en_US.UTF-8 src/fold
000000
LC_ALL=C /bin/fold
000000 c3                                               >.<
LC_ALL=en_US.UTF-8 /bin/fold
000000 c3                                               >.<


I suppose a concrete way to test that might be:

    # https://datatracker.ietf.org/doc/rfc9839/
    bad_unicode() { printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n'; }
    test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1


Thanks, I'll have a look at it later today.


Good catch, I see what the issue is.

I kept reading the file if we had a byte that may be an invalid
multi-byte sequence. I did not handle the case of an invalid multi-byte
character being at the end of the file. Therefore, \xC3 was buffered but
never printed.
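
A minimal way to observe the symptom described above, assuming only a POSIX
shell and a fold on PATH (the octal escape \303 is the same byte as \xC3 and
is used here because not every printf understands \x):

```shell
# Feed fold one byte that starts, but never completes, a UTF-8 sequence,
# and compare how many bytes go in with how many come out.
in_bytes=$(printf '\303' | wc -c)
out_bytes=$(printf '\303' | fold | wc -c)

# A fold that flushes its pending bytes at EOF passes the byte through.
if [ $((in_bytes)) -eq $((out_bytes)) ]; then
  echo "byte preserved"
else
  echo "byte dropped at EOF"
fi
```

With the buffering bug, the second branch would be taken because the
pending byte is never written.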

This patch fixes it. The test feels a bit small for its own file, but
maybe it should be done anyway so we can add more test cases?
WDYT?


Looks good. I'd include these test tweaks:

thanks!
Padraig

diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index cd17aa176..e8facb224 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -83,9 +83,11 @@ compare exp3 out3 || fail=1
 # Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
 bad_unicode ()
 {
-  printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
+  env printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
 }
 test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
+# Check bad character at EOF
+test $(env printf '\xC3' | fold | wc -c) = 1 || fail=1

 # Ensure bounded memory operation
 vm=$(get_min_ulimit_v_ fold /dev/null) && {


Actually the above shows that:

1. coreutils printf is more restrictive than bash in which \u characters it
   supports, i.e. bash supports "unpaired surrogates" and "Unicode noncharacters".

2. framework_failure_ is ineffective within a function.
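
The second point can be reproduced in isolation. This is only a sketch, not
the coreutils test framework: abort_ is a hypothetical stand-in for a helper
that reports a failure and then calls exit, which is assumed here to be how
framework_failure_ behaves. The function is invoked the same way bad_unicode
is in the test, on the left side of a pipeline inside a command substitution:

```shell
# abort_ is a hypothetical stand-in for a helper that aborts via exit.
abort_ () { echo "set-up failure" >&2; exit 99; }

# gen fails to produce its output and tries to abort the whole script.
gen () { false || abort_; }

# gen runs in a subshell here, so exit 99 ends only that subshell;
# the command substitution's status is that of wc, and the script goes on.
count=$(gen | wc -l)
survived=yes
echo "script still running: survived=$survived lines=$((count))"

# Checking the function's exit status outside the pipeline does detect the
# failure. The parentheses only keep this demo itself from exiting.
if (gen) 2>/dev/null; then direct=ok; else direct=failed; fi
echo "direct call: $direct"
```

This is why checking the function once from the main shell, outside any
pipeline, is the more reliable place to report a set-up failure.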

So the following should be better:

diff --git a/tests/fold/fold-characters.sh b/tests/fold/fold-characters.sh
index cd17aa176..c29b4bdd6 100755
--- a/tests/fold/fold-characters.sh
+++ b/tests/fold/fold-characters.sh
@@ -83,9 +83,13 @@ compare exp3 out3 || fail=1
 # Sequence derived from <https://datatracker.ietf.org/doc/rfc9839>.
 bad_unicode ()
 {
-  printf '\xC3|\u0000|\u0089|\uDEAD|\uD9BF\uDFFF\n' || framework_failure_
+  # invalid UTF8|unpaired surrogate|NUL|C1 control|noncharacter
+  env printf '\xC3|\xED\xBA\xAD|\u0000|\u0089|\xED\xA6\xBF\xED\xBF\xBF\n'
 }
+bad_unicode > /dev/null || framework_failure_
 test $({ bad_unicode | fold; bad_unicode; } | uniq | wc -l) = 1 || fail=1
+# Check bad character at EOF
+test $(env printf '\xC3' | fold | wc -c) = 1 || fail=1

 # Ensure bounded memory operation
 vm=$(get_min_ulimit_v_ fold /dev/null) && {
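
The explicit byte sequences in the patch are what a three-byte UTF-8 encoder
would emit if it did not reject surrogate code points. A small sketch to
derive them, assuming a POSIX shell with arithmetic expansion; utf8_3byte is
a hypothetical helper name used for illustration, not part of coreutils:

```shell
# utf8_3byte (hypothetical helper): print the \xHH escapes that a naive
# three-byte UTF-8 encoding yields for a code point in 0x0800..0xFFFF.
# No validity check is done, so surrogates are encoded like any other value.
utf8_3byte ()
{
  cp=$(($1))
  printf '\\x%02X\\x%02X\\x%02X\n' \
    $((0xE0 | (cp >> 12))) \
    $((0x80 | ((cp >> 6) & 0x3F))) \
    $((0x80 | (cp & 0x3F)))
}

utf8_3byte 0xDEAD   # \xED\xBA\xAD
utf8_3byte 0xD9BF   # \xED\xA6\xBF
utf8_3byte 0xDFFF   # \xED\xBF\xBF
```

Writing the bytes out this way lets the test feed the same invalid input to
fold regardless of which \u values a given printf implementation accepts.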
