On Wed, Aug 20, 2025 at 04:33:47PM +0200, Ingo Schwarze wrote:
> Hello Walter,
>
> Walter Alejandro Iglesias wrote on Wed, Aug 20, 2025 at 09:18:52AM +0200:
> > On Tue, Aug 19, 2025 at 05:39:13PM +0200, Ingo Schwarze wrote:
> >> Walter Alejandro Iglesias wrote on Mon, Aug 18, 2025 at 06:40:04PM +0200:
>
> >>> #define period 0x2e
> >>> #define question 0x3f
> >>> #define exclam 0x21
> >>> #define ellipsis L'\u2026'
> >>> const wchar_t p[] = { period, question, exclam, ellipsis };
>
> >> In addition to what otto@ said, this is bad style for more than one
> >> reason.
> >>
> >> First of all, the data type of the constant "0x2e" is "int",
> >> see for example C11 6.4.4.1 (Integer constants). Casting "int"
> >> to "wchar_t" doesn't really make sense. On OpenBSD, it only
> >> works because UTF-8 is the only supported character encoding *and*
> >> wchar_t stores Unicode codepoints. But neither of these choices
> >> is portable. What you want is (C11 6.4.4.4 Character constants):
> >>
> >> #define period L'.'
> >> #define question L'?'
> >> #define exclam L'!'
>
> > As I made this change to my code (https://en.roquesor.com/fmtroff.html)
> > the following reminded me why, at some point, I decided to switch to
> > hexadecimal notation.
> >
> > #define backslash L'\\'
> > #define apostrophe L'\''
> >
> > It isn't very confusing there, but among the arguments of a function or
> > a conditional...
>
> Making code look nice is nice to have and can even make code more
> readable and hence reduce the likelihood of bugs. But even if you
> are coding with narrow strings for ASCII only, whether
>
> char mychar = 0x5c;
> char mychar = 92;
> char mychar = 0134;
>
> is more readable than
>
> char mychar = '\\';
>
> is debatable; at least i would find reading the latter easier than
> the former, even in a conditional or function call argument.
If it weren't that I don't like using UTF-8 characters in the code (I
use vi(1) from base to code), I would write the characters themselves
directly, both narrow and wide. That's undoubtedly the most
human-readable option. :-)
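
To make the readability comparison concrete, here is a minimal sketch of
how the escaped constants read inside a conditional once they sit behind
the names from my definitions. The helper count_special() is hypothetical
and exists only for illustration; it is not taken from fmtroff.c.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Keep the awkward escapes in one place, behind readable names. */
#define backslash  L'\\'
#define apostrophe L'\''

/*
 * Hypothetical helper: count the backslashes and apostrophes
 * in a NUL-terminated wide string.
 */
size_t
count_special(const wchar_t *s)
{
	size_t n = 0;

	assert(s != NULL);
	for (; *s != L'\0'; s++) {
		/* Same test as: *s == L'\\' || *s == L'\'' */
		if (*s == backslash || *s == apostrophe)
			n++;
	}
	return n;
}
```

With the names in place, the conditional no longer contains any escape
sequences, which was my main objection to the character notation.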
>
> For narrow characters, the portability argument is weak; writing
> code that is portable to EBCDIC machines is the kind of excessive
> portability that provokes bugs rather than prevents them. But still,
> i'd recommend against specifying narrow characters numerically.
> Even mandoc_char(7) says:
>
> NUMBERED CHARACTERS
> For backward compatibility with existing manuals, mandoc(1)
> also supports the
> \N'number' and \[charnumber]
> escape sequences, inserting the character number from the
> current character set into the output. Of course, this is
> inherently non-portable and is already marked as deprecated
> in the Heirloom roff manual; on top of that, the second form
> is a GNU extension. For example, do not use \N'34' or
> \[char34], use \(dq, or even the plain `"' character where
> possible.
In my Groff files, for Spanish, except for a definition I added to my
macros for the UTF-8 ellipsis (out of the reach of preconv(1)), I write
all UTF-8 characters as is.
>
> A similar recommendation makes sense for C code.
>
> What *is* portable is specifying wide characters by Unicode
> codepoint numbers, for example:
>
> wchar_t mywide = L'\u2026'; /* horizontal ellipsis */
>
> But note that the C standard (C11 6.4.3p2, Universal character names)
> explicitly requires the argument to \u to be at least 00A0,
> with only three exceptions:
>
> L'\u0024' == L'$'
> L'\u0040' == L'@'
> L'\u0060' == L'`'
>
> Being so specific is a weird quirk of the standard, but it means
> you had better not abuse \u to obfuscate ASCII codepoints -
> apart from being very ugly, it may not even work. For example,
> current base clang dies like this:
>
> error: character 'A' cannot be specified by a universal character name
> 13 | wchar_t mywide = L'\u0041';
> 1 error generated.
>
> So there is no real alternative to L'\\'. While L'\x5c' and L'\134'
> work for UTF-8 (and hence on OpenBSD), even that is not guaranteed
> to be portable, and what those two produce may depend both on the
> implementation and on the locale.
I already changed all my ASCII character definitions to the notation you
advise and left the UTF-8 ones with the L'\u????' code:
https://en.roquesor.com/Downloads/fmtroff.c
Here I mention your help:
https://en.roquesor.com/fmtroff.html
Live and learn. :-)
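
For the archive, a minimal sketch of what the definitions from the start
of the thread look like after the change. The macro names and the array
p[] are the ones quoted above; the helper is_sentence_end() is
hypothetical, and it assumes that p[] is meant as a list of
sentence-ending punctuation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* ASCII characters as wide character constants (C11 6.4.4.4). */
#define period   L'.'
#define question L'?'
#define exclam   L'!'
/* Non-ASCII character by Unicode codepoint: U+2026 HORIZONTAL ELLIPSIS. */
#define ellipsis L'\u2026'

static const wchar_t p[] = { period, question, exclam, ellipsis };

/* Hypothetical helper: true if c is one of the characters listed in p[]. */
bool
is_sentence_end(wchar_t c)
{
	size_t i;

	for (i = 0; i < sizeof(p) / sizeof(p[0]); i++) {
		if (p[i] == c)
			return true;
	}
	return false;
}
```

No numeric ASCII codes remain, and the only \u escape is for a codepoint
above 00A0, so the restriction Ingo mentions does not apply to it.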
>
> Yours,
> Ingo
>
--
Walter