Hello Ingo,
On Tue, Aug 19, 2025 at 05:39:13PM +0200, Ingo Schwarze wrote:
> Hi Walter,
>
> Walter Alejandro Iglesias wrote on Mon, Aug 18, 2025 at 06:40:04PM +0200:
>
> > Question for the experts. Let's take the following example:
> >
> > ----->8------------->8--------------------
> > #include <stdio.h>
> > #include <string.h>
> > #include <wchar.h>
> >
> > #define period 0x2e
> > #define question 0x3f
> > #define exclam 0x21
> > #define ellipsis L'\u2026'
> >
> > const wchar_t p[] = { period, question, exclam, ellipsis };
>
> In addition to what otto@ said, this is bad style for more than one
> reason.
>
> First of all, the data type of the constant "0x2e" is "int",
> see for example C11 6.4.4.1 (Integer constants). Casting "int"
> to "wchar_t" doesn't really make sense. On OpenBSD, it only
> works because UTF-8 is the only supported character encoding *and*
> wchar_t stores Unicode codepoints. But neither of these choices
> is portable. What you want is (C11 6.4.4.4 Character constants):
>
> #define period L'.'
> #define question L'?'
> #define exclam L'!'
As I explain below, I did that in a program I wrote to work with UTF-8
only. But I'll follow your advice and adopt this practice from now on.
>
> > int
> > main()
> > {
> > const wchar_t s[] = L". Hello.";
> >
> > printf("%ls\n", s);
> > printf("%lu\n", wcsspn(s, p));
>
> The return value of wcsspn(3) is size_t, so this should use %zu.
Yeah, the compiler warned me about this. I wrote the example
carelessly.
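
For the record, a minimal sketch of the corrected call. format_span()
is a hypothetical helper made up for this mail, not something from my
program; it only shows the z length modifier that matches size_t:

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Hypothetical helper: write the wcsspn(3) result for s into buf as
 * decimal text and return what snprintf(3) returns.
 */
int
format_span(char *buf, size_t buflen, const wchar_t *s, const wchar_t *set)
{
	size_t n = wcsspn(s, set);

	/* size_t takes %zu; %lu only matches where size_t is unsigned long. */
	return snprintf(buf, buflen, "%zu", n);
}
```

With %zu the format stays correct on platforms where size_t is not the
same type as unsigned long.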
>
> Besides, given that the second argument of wcsspn(3)
> takes "const wchar_t *", why not simply:
>
> const wchar_t *p = L".?!\u2026";
I'd tried this:

    const wchar_t p[] = L".?!\u2026";

and saw that it solved the problem, *but I didn't understand why*. My
mistake was assuming that, since this syntax didn't require specifying
the length in the brackets, neither did the one I used.
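
If I understand it correctly now, the difference is the terminator: a
string literal gets a trailing L'\0' appended by the compiler, while a
brace-initialized array holds exactly the elements listed, so wcsspn(3)
reads past the end of it. A minimal sketch of the two forms, with the
terminator written out in the first one (all names here are made up for
illustration):

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Brace-initialized array: only the listed elements are stored, so the
 * terminating L'\0' has to be written out explicitly.
 */
const wchar_t punct_braces[] = { L'.', L'?', L'!', L'\u2026', L'\0' };

/* String literal: the compiler appends the terminating L'\0' itself. */
const wchar_t punct_literal[] = L".?!\u2026";

/* Hypothetical helpers: count leading punctuation characters in s. */
size_t
span_braces(const wchar_t *s)
{
	return wcsspn(s, punct_braces);
}

size_t
span_literal(const wchar_t *s)
{
	return wcsspn(s, punct_literal);
}
```

Giving the brace-initialized array an explicit size one larger than the
number of initializers would also work, since the remaining element is
zero-initialized.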

By the way, the program where I experienced the failures is this:

    https://en.roquesor.com/Downloads/fmtroff.c

As you can see in the code, my intention was to define all the
characters in a legible, clear, and practical way but, after
encountering this problem, I seriously wondered if I'd made my life
complicated by writing it like this.
>
> And finally, if you want wchar_t to store UTF-8 strings, you need
> something like
>
> #include <err.h>
> #include <locale.h>
>
> if (setlocale(LC_CTYPE, "C.UTF-8") == NULL)
> errx(1, "setlocale failed");
>
> Otherwise, the C library functions operating on wide strings
> assume that wchar_t only stores ASCII character numbers.
> Even printf(3) %ls won't work for UTF-8 characters without
> setting the locale properly.
Yes, it was an oversight on my part not to include setlocale() in the
example. By the way, if you take a look at fmtroff.c you'll see this
line:

    setlocale(LC_CTYPE, "");

My intention with fmtroff was to have it work only with UTF-8, so at
first I used the UTF-8 specification in setlocale() as in your example.
Later I decided to leave that field empty because, after testing under
Linux, I found that the program also does its job with other locales,
except that it doesn't take advantage of the hardcoded UTF-8
punctuation.

As usually happens with wide-character functions, the problem arises
when, under a UTF-8 locale, you edit a file containing invalid UTF-8
sequences. My previous version of the program was written without
wide-char functions and, like fmt(1) from base, it doesn't have this
problem. Each version has its pros and cons. I use it as a more
suitable version of fmt(1) to edit my novels in Spanish with Groff.
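
To illustrate the invalid-input problem, here is a minimal sketch; it
is not code from fmtroff.c. decodes_cleanly() is a hypothetical helper
that walks a byte string with mbrtowc(3) and reports whether the
current LC_CTYPE locale accepts it, and init_locale() is an equally
hypothetical wrapper around the setlocale() call quoted above:

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Pick up the user's locale from the environment; 1 on success. */
int
init_locale(void)
{
	return setlocale(LC_CTYPE, "") != NULL;
}

/*
 * Hypothetical helper: return 1 if s decodes cleanly in the current
 * LC_CTYPE locale, 0 if mbrtowc(3) rejects some byte sequence.
 */
int
decodes_cleanly(const char *s)
{
	mbstate_t st;
	size_t left = strlen(s);
	size_t n;
	wchar_t wc;

	memset(&st, 0, sizeof(st));	/* start in the initial shift state */
	while (left > 0) {
		n = mbrtowc(&wc, s, left, &st);
		if (n == (size_t)-1 || n == (size_t)-2)
			return 0;	/* invalid or truncated sequence */
		if (n == 0)
			break;		/* defensive: decoded a NUL */
		s += n;
		left -= n;
	}
	return 1;
}
```

Under a UTF-8 locale, a stray Latin-1 byte such as 0xE9 makes this
return 0, which is the kind of input a byte-oriented formatter never
has to decode in the first place.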
>
> Yours,
> Ingo
>
--
Walter