On Mon, 26 Oct 2020 at 00:35:45 -0400, Nick Black wrote:
> Thanks for the quick response, Felix. You say that "[you] will
> probably start setting $LANG in that part of Lintian." what LANG
> will you be using? Attempting to set LANG=en_US.UTF-8 in my
> salsa ci variables resulted in setlocale(3) failing all over the
> place, presumably due to the locale not having been generated.

C.UTF-8 is available on all Debian systems. It's the standard C/POSIX
locale, except that in the C locale the meaning of bytes 0x80-0xFF is
undefined, while in C.UTF-8 they are assumed/defined to be part of a
character encoded in UTF-8.

If you care about portability to non-Debian systems, note that C.UTF-8
is a somewhat popular extension (I think it originated in the
Fedora/Red Hat family before it was adopted by Debian and other distros)
but is far from universally available. In particular, I'm aware of Arch
Linux specifically *not* having it. The glibc maintainers consider the
implementation used in e.g. Fedora and Debian to be a hack rather than
something they want to maintain forever, but my understanding is that
they would be willing to accept a better implementation.

en_US.UTF-8 is indeed not portable. Some OSs (Fedora, I think?) always
generate the en_US.UTF-8 locale regardless of any other configuration
that might exist, but Debian does not: if you chose a non-English locale
like fr_FR.UTF-8 or a non-American English locale like en_GB.UTF-8
during installation, then you will normally only have three locales,
your chosen national locale plus the international locales C and C.UTF-8.

Minimal container/chroot environments, and in particular the official
Debian buildds, will normally only have C and C.UTF-8. See src:gtk+4.0
for an example of how to generate additional locales on-demand if your
unit tests need them.

Third-party software from outside Debian frequently assumes that the
en_US.UTF-8 locale does exist - in particular, it's common enough for
Steam games to want it to exist that Steam's diagnostic tool now checks
for it. This is mostly because it's semi-frequently (ab)used as a way to
parse and serialize C-syntax floating point in programming languages or
configuration files without getting confused by non-English decimal
points (e.g. 1.23 in English locales is 1,23 in French locales, which
means a naive implementation might write {"x": 1,23, "y": 4,56} into a
JSON file, which is of course a syntax error).

The portable way to read/write configuration files and C-like source
code is to avoid the POSIX locale-sensitive functions completely, and
use something like GLib's g_ascii_strtod() or CPython's
PyOS_string_to_double() (lots of libraries and frameworks will have an
equivalent, those are just the ones I'm most familiar with). This also
has the advantage of being thread-safe, unlike temporarily switching
POSIX locales, which is normally process-wide and therefore not
thread-safe.

Another correct way to do this since POSIX.1-2008 is to use POSIX
uselocale() and the C locale, but that's unlikely to be portable to
Windows or to exotic Unix implementations, so widely-portable software
generally ends up having to reinvent something equivalent to
g_ascii_strtod() anyway.

    smcv
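
For illustration, here is a minimal sketch of the uselocale() approach
mentioned above, assuming a POSIX.1-2008 system such as glibc (on some
other systems newlocale() is declared in a different header). The helper
names parse_c_double() and format_c_double() are hypothetical, made up
for this example; they are not part of GLib or any other library. Each
one switches only the calling thread to the C locale, so the decimal
point is always '.' and other threads are unaffected.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse C-syntax floating point ("1.23") whatever
 * the current locale is. Returns 0 on success, -1 on failure. */
static int parse_c_double(const char *text, double *out)
{
    assert(text != NULL && out != NULL);

    /* A fresh locale object for the plain C locale. */
    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t) 0);
    if (c_locale == (locale_t) 0)
        return -1;

    /* uselocale() changes the locale of the calling thread only. */
    locale_t previous = uselocale(c_locale);
    char *end = NULL;
    double value = strtod(text, &end);
    uselocale(previous);
    freelocale(c_locale);

    /* Reject empty input and trailing junk such as "1,23". */
    if (end == text || *end != '\0')
        return -1;

    *out = value;
    return 0;
}

/* Hypothetical helper: write a double using '.' as the decimal point.
 * Returns 0 on success, -1 on failure or truncation. */
static int format_c_double(char *buf, size_t size, double value)
{
    assert(buf != NULL);

    locale_t c_locale = newlocale(LC_ALL_MASK, "C", (locale_t) 0);
    if (c_locale == (locale_t) 0)
        return -1;

    locale_t previous = uselocale(c_locale);
    int n = snprintf(buf, size, "%.17g", value);
    uselocale(previous);
    freelocale(c_locale);

    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}
```

A caller that has done setlocale(LC_ALL, "") in a French locale could
still use parse_c_double("1.23", &x) to read a JSON number. Creating the
locale object on every call keeps the sketch short; real code would
probably create it once and reuse it.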