Re: character classification on FreeBSD

Bruno Haible via GNU coreutils General Discussion Mon, 22 Sep 2025 09:38:45 -0700

[CCing bug-gnulib]
Pádraig Brady wrote in
<https://lists.gnu.org/archive/html/coreutils/2025-09/msg00201.html>:
> It seems to be treating the non-breaking space char as non printable.
> I'd already noted that for FreeBSD 11 in tests/wc/wc-nbsp.sh


and later in
<https://lists.gnu.org/archive/html/coreutils/2025-09/msg00207.html>:
> Anyway this is not an issue for current FreeBSD,
> so let's be conservative before the release
> and just apply the test fix which is needed in either case
> (since wc -L treats all non printable as zero width).
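
To make the "zero width" remark concrete, here is a minimal sketch of such a width computation. It is not the coreutils code; display_width is a hypothetical helper, and it assumes that characters for which wcwidth() returns -1, as well as undecodable bytes, contribute zero columns.

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not the coreutils implementation: returns the
   display width of the multibyte string S in the current locale.
   Characters for which wcwidth() returns -1 (non-printable) and bytes
   that cannot be decoded count as zero columns.  */
static int
display_width (const char *s)
{
  mbstate_t st;
  size_t remaining;
  int width = 0;

  assert (s != NULL);
  remaining = strlen (s);
  memset (&st, 0, sizeof st);
  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &st);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: skip one byte, zero width.  */
          memset (&st, 0, sizeof st);
          n = 1;
        }
      else
        {
          int w = wcwidth (wc);
          if (w > 0)
            width += w;
        }
      s += n;
      remaining -= n;
    }
  return width;
}
```

With such a scheme, a platform whose wcwidth() returns -1 for U+00A0 reports a shorter line than one that returns 1, which is why the test result depends on the platform.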

Character classification (and wcwidth()) of non-ASCII characters
is pretty broken in FreeBSD.

With the attached test program I get the following results on
FreeBSD 11 and 12:

============ U+00A0 in fr_FR.ISO8859-1 locale
ret = 1, wc = 0xa0
iscntrl -> 0
iswcntrl -> 0
isgraph -> 0
iswgraph -> 0
isprint -> 0
iswprint -> 0
wcwidth -> -1
============ U+00A0 in fr_FR.UTF-8 locale
ret = 2, wc = 0xa0
iswcntrl -> 0
iswgraph -> 0
iswprint -> 0
wcwidth -> -1
============ U+00DF in fr_FR.ISO8859-1 locale
ret = 1, wc = 0xdf
iscntrl -> 0
iswcntrl -> 0
isgraph -> 0
iswgraph -> 1
isprint -> 0
iswprint -> 1
wcwidth -> 1
============ U+00DF in fr_FR.UTF-8 locale
ret = 2, wc = 0xdf
iswcntrl -> 0
iswgraph -> 1
iswprint -> 1
wcwidth -> 1

and on FreeBSD 13 and 14:

============ U+00A0 in fr_FR.ISO8859-1 locale
ret = 1, wc = 0xa0
iscntrl -> 0
iswcntrl -> 0
isgraph -> 0
iswgraph -> 0
isprint -> 0
iswprint -> 0
wcwidth -> -1
============ U+00A0 in fr_FR.UTF-8 locale
ret = 2, wc = 0xa0
iswcntrl -> 0
iswgraph -> 0
iswprint -> 1
wcwidth -> 1
============ U+00DF in fr_FR.ISO8859-1 locale
ret = 1, wc = 0xdf
iscntrl -> 0
iswcntrl -> 0
isgraph -> 0
iswgraph -> 1
isprint -> 0
iswprint -> 1
wcwidth -> 1
============ U+00DF in fr_FR.UTF-8 locale
ret = 2, wc = 0xdf
iswcntrl -> 0
iswgraph -> 1
iswprint -> 1
wcwidth -> 1

This means:
  1) In the fr_FR.ISO8859-1 locale, the <ctype.h> functions
     are inconsistent with the <wctype.h> functions.
  2) U+00A0 is considered an invalid character in the
     fr_FR.ISO8859-1 locale and, in FreeBSD 12 and older,
     also in the fr_FR.UTF-8 locale.
  3) U+00A0 is considered a space character in the fr_FR.UTF-8 locale,
     in FreeBSD 13 and newer.
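
Point 1) can be checked over a whole single-byte locale with a sketch like the following. The helper names classes_agree and count_print_mismatches are hypothetical; the code only uses the ISO C functions btowc(), isprint() and iswprint(), and it assumes that the caller has already selected the locale with setlocale().

```c
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if isprint() and iswprint() agree on
   the byte C (0..255) in the current locale, or if btowc() cannot
   convert C at all; returns 0 if they disagree.  */
static int
classes_agree (int c)
{
  wint_t wc = btowc (c);
  if (wc == WEOF)
    return 1;
  return !!isprint (c) == !!iswprint (wc);
}

/* Hypothetical helper: counts the bytes 0..255 on which the <ctype.h>
   and <wctype.h> notions of "printable" differ.  */
static int
count_print_mismatches (void)
{
  int mismatches = 0;
  for (int c = 0; c < 256; c++)
    if (!classes_agree (c))
      mismatches++;
  return mismatches;
}
```

Calling setlocale (LC_ALL, "fr_FR.ISO8859-1") and then count_print_mismatches () gives the number of affected bytes on a given system; a consistent implementation would be expected to yield 0.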

Generally, Gnulib does not try to fix these kinds of issues because
  - POSIX does not mandate specific values for specific characters,
    that is, it is a "quality of implementation" issue.
  - There are many locales, and as such, the amount of data to
    maintain and to package within Gnulib would be large.

So it is best handled by adjusting the unit tests accordingly,
as gnulib/tests/test-c32isgraph.c does:

        #if !((defined __APPLE__ && defined __MACH__) || defined __FreeBSD__ \
              || defined __DragonFly__ || defined __NetBSD__ || defined __sun \
              || defined __CYGWIN__ || (defined _WIN32 && !defined __CYGWIN__))
          /* U+00A0 NO-BREAK SPACE */
          is = for_character ("\302\240", 2);
          ASSERT (is != 0);
        #endif

If anyone wants functions with decent quality on all platforms,
they can use the <unictype.h> functions (libunistring).

Should I create an analogue of mbsnwidth (with flags) in
<uniwidth.h>, as a generalization of u8_strwidth?

Bruno

#define _GNU_SOURCE 1
#include <locale.h>
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>
#include <stdlib.h>
#include <stdio.h>
int main ()
{
 {
  printf ("============ U+00A0 in fr_FR.ISO8859-1 locale\n");
  if (setlocale (LC_ALL, "fr_FR.ISO8859-1") == NULL)
    return 1;
  mbstate_t st = { 0 };
  wchar_t wc;
  int ret = (int) mbrtowc (&wc, "\xa0", 2, &st);
  printf ("ret = %d, wc = 0x%x\n", ret, (unsigned int) wc);
  printf ("iscntrl -> %d\n", !!iscntrl (0xA0));
  printf ("iswcntrl -> %d\n", !!iswcntrl (wc));
  printf ("isgraph -> %d\n", !!isgraph (0xA0));
  printf ("iswgraph -> %d\n", !!iswgraph (wc));
  printf ("isprint -> %d\n", !!isprint (0xA0));
  printf ("iswprint -> %d\n", !!iswprint (wc));
  printf ("wcwidth -> %d\n", wcwidth (wc));
 }
 {
  printf ("============ U+00A0 in fr_FR.UTF-8 locale\n");
  if (setlocale (LC_ALL, "fr_FR.UTF-8") == NULL)
    return 1;
  mbstate_t st = { 0 };
  wchar_t wc;
  int ret = (int) mbrtowc (&wc, "\xc2\xa0", 2, &st);
  printf ("ret = %d, wc = 0x%x\n", ret, (unsigned int) wc);
  printf ("iswcntrl -> %d\n", !!iswcntrl (wc));
  printf ("iswgraph -> %d\n", !!iswgraph (wc));
  printf ("iswprint -> %d\n", !!iswprint (wc));
  printf ("wcwidth -> %d\n", wcwidth (wc));
 }
 {
  printf ("============ U+00DF in fr_FR.ISO8859-1 locale\n");
  if (setlocale (LC_ALL, "fr_FR.ISO8859-1") == NULL)
    return 1;
  mbstate_t st = { 0 };
  wchar_t wc;
  int ret = (int) mbrtowc (&wc, "\xdf", 2, &st);
  printf ("ret = %d, wc = 0x%x\n", ret, (unsigned int) wc);
  /* Classify byte 0xDF (U+00DF in ISO-8859-1); passing 0xA0 here would
     only repeat the U+00A0 results.  */
  printf ("iscntrl -> %d\n", !!iscntrl (0xDF));
  printf ("iswcntrl -> %d\n", !!iswcntrl (wc));
  printf ("isgraph -> %d\n", !!isgraph (0xDF));
  printf ("iswgraph -> %d\n", !!iswgraph (wc));
  printf ("isprint -> %d\n", !!isprint (0xDF));
  printf ("iswprint -> %d\n", !!iswprint (wc));
  printf ("wcwidth -> %d\n", wcwidth (wc));
 }
 {
  printf ("============ U+00DF in fr_FR.UTF-8 locale\n");
  if (setlocale (LC_ALL, "fr_FR.UTF-8") == NULL)
    return 1;
  mbstate_t st = { 0 };
  wchar_t wc;
  int ret = (int) mbrtowc (&wc, "\xc3\x9f", 2, &st);
  printf ("ret = %d, wc = 0x%x\n", ret, (unsigned int) wc);
  printf ("iswcntrl -> %d\n", !!iswcntrl (wc));
  printf ("iswgraph -> %d\n", !!iswgraph (wc));
  printf ("iswprint -> %d\n", !!iswprint (wc));
  printf ("wcwidth -> %d\n", wcwidth (wc));
 }
}
