On 03/19/2018 07:49 AM, Ingo Schwarze wrote:
Hi,

Stefan answered this completely, except for one minor detail
which I'm adding:

Stefan Sperling wrote on Mon, Mar 19, 2018 at 12:16:20PM +0100:

Invalid input can only occur in the first argument:
   setlocale(0xdeadbeef, NULL) returned 'NULL'

Not true.  In addition to that, the CHARSET (i.e. the part after
the last dot if any) is also checked for validity - ".UTF-8" is
valid, any other suffix is invalid and causes NULL to be returned.

Besides, setlocale(3) can return NULL in case of ENOMEM.

Yours,
   Ingo

Thanks for your quick replies. I'm cc'ing Andrew Fresh, who expressed an interest in this after seeing the scrollback of an IRC chat I participated in on this issue.

I can understand, given your philosophy, why it mostly works the way it does.

But your man page doesn't describe any of this. It doesn't say that UTF-8 is a legal locale, for example. It does say that LC_CTYPE is the only category that can be other than C or POSIX, but it doesn't say the only other possible one is UTF-8. I think it should. If your replies to me were slightly repackaged and placed into the man page, that would help a lot.

I still believe that setlocale() returning "C" for LC_ALL in my program is a bug. LC_CTYPE should have been successfully set to Romanian UTF-8, so LC_ALL isn't C. Instead, it is a combination of C for all the other categories and UTF-8 for LC_CTYPE. A return of just "C" doesn't reflect that complexity. There is a footnote in the ANSI/ISO 9899-1990 C standard saying that the returned string must support that heterogeneity, and that the return value can be used in a future setlocale() call to get back to the original state. Your setlocale() therefore violates the standard and harms your portability goal. (And to be consistent, LC_ALL should be the heterogeneous composition of all the locale strings that the program sets, even if they just boil down to C or UTF-8. Programs that run on other OSes expect this, and again your portability goal is compromised.)
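
The round-trip property described above can be checked with a small helper. This is only a sketch: lc_all_roundtrips() and copy_string() are hypothetical names made up for illustration, the helper changes the process locale as a side effect, and it has not been run on OpenBSD.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Copy a setlocale() result; later setlocale() calls may overwrite it. */
static char *
copy_string(const char *s)
{
        char *copy;

        if (s == NULL)
                return NULL;
        copy = malloc(strlen(s) + 1);
        if (copy != NULL)
                strcpy(copy, s);
        return copy;
}

/*
 * Hypothetical helper: returns 1 if the string reported for LC_ALL
 * restores the LC_CTYPE setting that was in effect, 0 if it does not,
 * and -1 if the locale could not be set or memory ran out.
 */
int
lc_all_roundtrips(const char *ctype_locale)
{
        char *saved_all, *saved_ctype;
        const char *now;
        int ok = 0;

        /* Start from C everywhere, then change only LC_CTYPE. */
        if (setlocale(LC_ALL, "C") == NULL ||
            setlocale(LC_CTYPE, ctype_locale) == NULL)
                return -1;

        saved_all = copy_string(setlocale(LC_ALL, NULL));
        saved_ctype = copy_string(setlocale(LC_CTYPE, NULL));
        if (saved_all == NULL || saved_ctype == NULL) {
                free(saved_all);
                free(saved_ctype);
                return -1;
        }

        /* Disturb the state, then try to get back via the saved string. */
        if (setlocale(LC_ALL, "C") != NULL &&
            setlocale(LC_ALL, saved_all) != NULL) {
                now = setlocale(LC_CTYPE, NULL);
                ok = (now != NULL && strcmp(now, saved_ctype) == 0);
            }

        free(saved_all);
        free(saved_ctype);
        return ok;
}
```

If the LC_ALL query returns plain "C" while LC_CTYPE is really a UTF-8 locale, I would expect lc_all_roundtrips("ro_RO.UTF-8") to return 0, because restoring from "C" loses the LC_CTYPE setting.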

I don't know what would happen if one were to call setlocale(LC_ALL, "ro_RO.UTF-8"), as I do not have direct access to an OpenBSD machine. I help support the Perl 5 programming language that runs on it, and one of our users reported the problem. I hate to ask him to run another experiment just for this issue.

BTW, there is actually some variance among real UTF-8 locales, which you may not have considered. Unicode, contrary to their claims, is not completely locale-independent in LC_CTYPE. Some Turkish UTF-8 locales use alternate casing rules for the dotless and dotted i characters. And some UTF-8 locales, especially earlier ones, do not treat as punctuation various ASCII characters that POSIX mandates be ispunct(). The affected characters are ones that Unicode considers to be symbols, for which POSIX has no equivalent classification: things like '$', '<', ....
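
A rough way to probe a given locale for these two deviations is sketched here. The function name ctype_deviations(), the two flag macros, and the exact symbol list are my own choices for illustration: the list holds the ASCII characters that, as far as I know, Unicode classifies as symbols rather than punctuation. Wide-character case mapping is used because the Turkish dotted capital I does not fit in a single byte in UTF-8.

```c
#include <ctype.h>
#include <locale.h>
#include <string.h>
#include <wctype.h>

/* Hypothetical flag names, for illustration only. */
#define DEV_SYMBOL_NOT_PUNCT    1  /* some ASCII symbol fails ispunct() */
#define DEV_I_CASING            2  /* i/I do not map to each other */

/*
 * Returns a bitmask of the deviations seen in the given LC_CTYPE locale,
 * or -1 if that locale cannot be set.  The previous LC_CTYPE setting is
 * restored before returning.
 */
int
ctype_deviations(const char *locale)
{
        /* Assumption: the ASCII characters Unicode treats as symbols. */
        static const char symbols[] = "$+<=>^`|~";
        char previous[256];
        const char *cur;
        size_t i;
        int result = 0;

        /* Keep a copy; the returned string may be overwritten later. */
        cur = setlocale(LC_CTYPE, NULL);
        if (cur == NULL || strlen(cur) >= sizeof(previous))
                return -1;
        strcpy(previous, cur);

        if (setlocale(LC_CTYPE, locale) == NULL)
                return -1;

        for (i = 0; symbols[i] != '\0'; i++)
                if (!ispunct((unsigned char)symbols[i]))
                        result |= DEV_SYMBOL_NOT_PUNCT;

        if (towupper(L'i') != L'I' || towlower(L'I') != L'i')
                result |= DEV_I_CASING;

        setlocale(LC_CTYPE, previous);
        return result;
}
```

In the C locale this should return 0. What it returns for a particular Turkish or older UTF-8 locale depends on that system's locale data.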

I urge you to update your man page. If it had set out what you've stated in your replies, it would have saved our project a bunch of hours of work.




  $ make setlocale
cc -O2 -pipe    -o setlocale setlocale.c
  $ LC_CTYPE=en_US.UTF-8 ./setlocale
en_US.UTF-8
  $ LC_CTYPE=en_US.UTF-9 ./setlocale
setlocale: setlocale failed
  $ cat setlocale.c                                     
#include <err.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
        char *retval;

        retval = setlocale(LC_CTYPE, "");
        if (retval == NULL)
                errx(1, "setlocale failed");
        puts(retval);
        return 0;
}

