On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
Hi,
wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:
$ printf "foo\u2003bar" | ./wc -w
1
I should get a word count of 2, but instead the space is ignored while
counting words. Meanwhile, wc v9.4 gives the correct answer:
$ printf "foo\u2003bar" | wc -w
2
It looks like the regression has been introduced by [f40c6b5] and
would be fixed by something like the following change:
diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
if (width > 0)
linepos += width;
}
- in_word2 = !iswnbspace (wide_char);
+ in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
}
/* Count words by counting word starts, i.e., each
Nice one.
Great to catch this before release.
I've augmented your patch with a test,
and will push the attached later.
Marking this as done.
thanks!
Pádraig
From ced8c64c986b79c0bfa74028a9581e07d5df1974 Mon Sep 17 00:00:00 2001
From: Aearil <aea...@paranoici.org>
Date: Sat, 24 Feb 2024 21:44:24 +0100
Subject: [PATCH] wc: fix -w with breaking space over UCHAR_MAX
* src/wc.c (wc): Fix regression introduced in commit v9.4-48-gf40c6b5cf.
* tests/wc/wc-nbsp.sh: Add test cases for "standard" spaces.
Fixes https://bugs.gnu.org/69369
---
src/wc.c | 2 +-
tests/wc/wc-nbsp.sh | 5 +++++
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
if (width > 0)
linepos += width;
}
- in_word2 = !iswnbspace (wide_char);
+ in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
}
/* Count words by counting word starts, i.e., each
diff --git a/tests/wc/wc-nbsp.sh b/tests/wc/wc-nbsp.sh
index 371cc8b5b..39a8baccc 100755
--- a/tests/wc/wc-nbsp.sh
+++ b/tests/wc/wc-nbsp.sh
@@ -38,10 +38,15 @@ fi
export LC_ALL=en_US.UTF-8
if test "$(locale charmap 2>/dev/null)" = UTF-8; then
+ #non breaking space class
check_word_sep '\u00A0'
check_word_sep '\u2007'
check_word_sep '\u202F'
check_word_sep '\u2060'
+
+ #sampling of "standard" space class
+ check_word_sep '\u0020'
+ check_word_sep '\u2003'
fi
export LC_ALL=ru_RU.KOI8-R
--
2.43.0