On 18/02/19 00:12, vampyre...@gmail.com wrote:
> $ wc --version
> wc (GNU coreutils) 8.29
> Packaged by Gentoo (8.29-r1 (p1.0))
>
> The man page for wc states: "A word is a... sequence of characters delimited
> by white space."
>
> But its concept of white space only seems to include ASCII white space.
> U+00A0 NO-BREAK SPACE, for instance, is not recognized.
>
> If your terminal displays UTF-8 encoding:
>
>     printf 'how are\xC2\xA0you\n'
>
> or if your terminal displays ISO 8859-1 encoding:
>
>     printf 'how are\xA0you\n'
>
> the visible output of this printf is "how are you". In either case, wc does
> not recognize the second space as white space, resulting in an incorrect word
> count:
>
> $ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
> 2
> $ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
> 2

wc does support multi-byte locales well, and we use iswspace() to test
whether a character is a separator or not. On glibc, though, NBSP is not
considered a space. I wrote a little prog to output what is considered
a space in glibc locales:

  0009 HORIZONTAL TAB
  000A NEW LINE (not blank)
  000B VERTICAL TAB (not blank)
  000C FORM FEED (not blank)
  000D CARRIAGE RETURN (not blank)
  0020 SPACE
  1680 OGHAM SPACE MARK
  2000 EN QUAD
  2001 EM QUAD
  2002 EN SPACE
  2003 EM SPACE
  2004 THREE-PER-EM SPACE
  2005 FOUR-PER-EM SPACE
  2006 SIX-PER-EM SPACE
  2008 PUNCTUATION SPACE
  2009 THIN SPACE
  200A HAIR SPACE
  2028 LINE SEPARATOR (not blank)
  2029 PARAGRAPH SEPARATOR (not blank)
  205F MEDIUM MATHEMATICAL SPACE
  3000 IDEOGRAPHIC SPACE

In the non-breaking space class we have:

  00A0 NON BREAKING SPACE
  2007 FIGURE SPACE
  202F NARROW NO-BREAK SPACE
  2060 WORD JOINER

Maybe we should consider these as word separators?
I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'` into LibreOffice Writer
and it treated all but the last as a word separator in its word count tool.

There is some discussion of POSIX and Unicode classes at:
http://unicode.org/L2/L2003/03139-posix-classes.htm

I guess POSIX is defining lower level functionality and has to be
compatible with all uses of iswspace(), which might be used for line
reformatting etc., but wc(1), being higher level, perhaps should consider
the non-breaking variants as word separators?
The following change would do that:

diff --git a/src/wc.c b/src/wc.c
index 179abbe..ca990b4 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
   exit (status);
 }
 
+static int _GL_ATTRIBUTE_PURE
+iswnbspace (wint_t wc)
+{
+  return wc == L'\u00A0' || wc == L'\u2007' \
+         || wc == L'\u202F' || wc == L'\u2060';
+}
+
 /* FILE is the name of the file (or NULL for standard input)
    associated with the specified counters.  */
 static void
@@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                       if (width > 0)
                         linepos += width;
                     }
-                  if (iswspace (wide_char))
+                  if (iswspace (wide_char) || iswnbspace (wide_char))
                     goto mb_word_separator;
                   in_word = true;
                 }

Note that general word boundary handling is complicated:
https://www.unicode.org/reports/tr29/#Word_Boundaries

Consider this number with a figure space:

  $ printf "1\u2007234,56\n"
  1 234,56

That should arguably be considered as one word rather than two,
which is what the change above would count.
For more sophisticated contextual processing we would need to use
some of the word break functionality from libunistring.

cheers,
Pádraig
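
To make the effect of the proposed check concrete, here is a minimal standalone sketch. It is not the wc.c code: count_words() and its nbsp_separates flag are hypothetical names made up for illustration, is_nbspace() mirrors the iswnbspace() test from the patch using numeric code points, and the loop ignores details of the real implementation such as the printable-character check and multi-byte decoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Same four code points as iswnbspace() in the patch:
   NO-BREAK SPACE, FIGURE SPACE, NARROW NO-BREAK SPACE, WORD JOINER.  */
static bool
is_nbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Hypothetical helper (not part of coreutils): count the words in the
   NUL-terminated wide string S.  A word is a maximal run of characters
   that are not separators.  iswspace() always separates; the
   non-breaking variants separate only if NBSP_SEPARATES is true.  */
static size_t
count_words (const wchar_t *s, bool nbsp_separates)
{
  assert (s != NULL);

  size_t words = 0;
  bool in_word = false;

  for (; *s != L'\0'; s++)
    {
      wint_t wc = (wint_t) *s;

      if (iswspace (wc) || (nbsp_separates && is_nbspace (wc)))
        in_word = false;        /* separator: close any open word */
      else if (!in_word)
        {
          in_word = true;       /* first character of a new word */
          words++;
        }
    }

  return words;
}
```

With the string from the report, count_words (L"how are\u00A0you\n", false) gives 2 on glibc, matching the current wc -w output, and passing true gives 3. The figure-space example, count_words (L"1\u2007234,56\n", true), gives 2, which shows the cost of the simple per-character check for numbers grouped with U+2007.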