Re: setlocale() bugs

Karl Williamson Tue, 20 Mar 2018 07:45:16 -0700

On 03/19/2018 07:49 AM, Ingo Schwarze wrote:

Hi,


Stefan answered this completely, except for one minor detail
which i'm adding:

Stefan Sperling wrote on Mon, Mar 19, 2018 at 12:16:20PM +0100:

Invalid input can only occur in the first argument:
   setlocale(0xdeadbeef, NULL) returned 'NULL'


Not true.  In addition to that, the CHARSET (i.e. the part after
the last dot if any) is also checked for validity - ".UTF-8" is
valid, any other suffix is invalid and causes NULL to be returned.

Besides, setlocale(3) can return NULL in case of ENOMEM.

Yours,
   Ingo

Thanks for your quick replies. I'm cc'ing in Andrew Fresh, who expressed an interest in this after seeing scrollback of an #irc chat I participated in on this issue.

I can understand, given your philosophy, why it mostly works the way it does.

But your man page doesn't describe any of this. It doesn't say that UTF-8 is a legal locale, for example. It does say that LC_CTYPE is the only category that can be other than C or POSIX, but it doesn't say the only other possible one is UTF-8. I think it should. If your replies to me were slightly repackaged and placed into the man page, that would help a lot.

I still believe that in my program the setlocale() returning C for LC_ALL is a bug. LC_CTYPE should have successfully been set to Romanian UTF-8, and so LC_ALL isn't C. Instead, it is a combination of C for all the other categories, and UTF-8 for LC_CTYPE. A return of just "C" doesn't reflect that complexity. There is a footnote in the ANSI/ISO 9899-1990 C standard that the returned string must support that heterogeneity, and that the return value be able to be used in a future setlocale to get back to the original state. Your setlocale therefore violates the standard, and harms your portability goal. (And to be consistent, LC_ALL should have been the heterogeneous composition of all the locale strings that the program sets, even if they just boil down to C or UTF-8. Programs that run on other OSes are expecting this, and again your portability goal is compromised.)
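
To make the round-trip expectation concrete, here is a minimal sketch of the check I have in mind. The helper names copy_string() and locale_round_trip() are made up for illustration; only setlocale() and the standard string and memory functions are real. It sets LC_CTYPE, records what setlocale(LC_ALL, NULL) reports, resets everything to "C", and then feeds the recorded string back in.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Copy a setlocale() result; later setlocale() calls may overwrite it. */
static char *
copy_string(const char *s)
{
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);

	if (copy != NULL)
		memcpy(copy, s, len);
	return copy;
}

/*
 * Hypothetical helper: set LC_CTYPE to ctype_name, record the LC_ALL
 * string, reset everything to "C", then pass the recorded string back
 * to setlocale().  Returns 1 if LC_CTYPE comes back as it was, 0 if it
 * does not, and -1 if the check could not be run (unsupported locale
 * name or out of memory).
 */
int
locale_round_trip(const char *ctype_name)
{
	const char *s;
	char *all = NULL, *ctype = NULL;
	int result = -1;

	if (setlocale(LC_CTYPE, ctype_name) == NULL)
		return -1;
	if ((s = setlocale(LC_CTYPE, NULL)) != NULL)
		ctype = copy_string(s);
	if ((s = setlocale(LC_ALL, NULL)) != NULL)
		all = copy_string(s);

	if (ctype != NULL && all != NULL && setlocale(LC_ALL, "C") != NULL) {
		/* The saved LC_ALL string should restore the mixed state. */
		if (setlocale(LC_ALL, all) != NULL &&
		    (s = setlocale(LC_CTYPE, NULL)) != NULL)
			result = strcmp(s, ctype) == 0;
		else
			result = 0;
	}
	free(ctype);
	free(all);
	return result;
}
```

If LC_ALL really is reported as plain "C" while LC_CTYPE is a UTF-8 locale, I would expect locale_round_trip("ro_RO.UTF-8") to return 0 there, but I have not been able to run it on OpenBSD myself.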

I don't know what would happen if one were to call setlocale(LC_ALL, "ro_RO.UTF-8"); I do not have direct access to an OpenBSD machine. I help support the perl5 programming language that runs on it, and one of our users reported the problem. I hate to ask him to run another experiment, just for this issue.
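
If someone with access to a machine is willing to try it, something along these lines would answer the question. This is only a sketch; probe_locale() is a name I made up, and it reports just the categories defined by the C standard.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: call setlocale(LC_ALL, name), then print what
 * each standard category reports.  Returns 0 if the initial call
 * succeeded and -1 if it returned NULL.
 */
int
probe_locale(const char *name, FILE *out)
{
	static const struct {
		int id;
		const char *label;
	} cats[] = {
		{ LC_ALL, "LC_ALL" },
		{ LC_COLLATE, "LC_COLLATE" },
		{ LC_CTYPE, "LC_CTYPE" },
		{ LC_MONETARY, "LC_MONETARY" },
		{ LC_NUMERIC, "LC_NUMERIC" },
		{ LC_TIME, "LC_TIME" },
	};
	const char *s;
	size_t i;
	int ok;

	s = setlocale(LC_ALL, name);
	ok = (s != NULL);
	fprintf(out, "setlocale(LC_ALL, \"%s\") returned %s\n",
	    name, ok ? s : "NULL");

	/* Query each category without changing it. */
	for (i = 0; i < sizeof(cats) / sizeof(cats[0]); i++) {
		s = setlocale(cats[i].id, NULL);
		fprintf(out, "%-12s %s\n", cats[i].label,
		    s != NULL ? s : "NULL");
	}
	return ok ? 0 : -1;
}
```

Calling probe_locale("ro_RO.UTF-8", stdout) from main() would show both the return value of the LC_ALL call and what each category is set to afterwards.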

BTW, there is actually some variance in real UTF-8 locales, which you may not have considered. Unicode, contrary to their claims, is not completely locale-independent in LC_CTYPE. Some Turkish locales that are UTF-8 use alternate casing rules for the dotless and dotted i characters. And some, especially earlier, UTF-8 locales consider various ASCII characters that are mandated by POSIX to be ispunct() to not be punctuation. The affected characters are things Unicode considers to be symbols, for which POSIX has no equivalent classification. Things like '$', '<', ....
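
The punctuation difference is easy to test for. A minimal sketch, with a function name of my own choosing: it walks the 32 ASCII characters that are punctuation in the POSIX locale and counts how many the current LC_CTYPE does not classify that way.

```c
#include <ctype.h>

/*
 * Hypothetical helper: count the ASCII characters that are punctuation
 * in the POSIX locale but for which ispunct() returns false under the
 * LC_CTYPE currently in effect.
 */
int
count_ascii_punct_mismatches(void)
{
	/* The 32 printable ASCII characters that are neither alphanumeric nor space. */
	static const char punct[] = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
	const char *p;
	int misses = 0;

	for (p = punct; *p != '\0'; p++)
		if (!ispunct((unsigned char)*p))
			misses++;
	return misses;
}
```

A program would call setlocale(LC_CTYPE, "") first and then this function; a nonzero result means the locale treats some of those characters as something other than punctuation.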

I urge you to update your man page. If it had set out what you've stated in your replies, it would have saved our project a bunch of hours of work.



  $ make setlocale
cc -O2 -pipe    -o setlocale setlocale.c
  $ LC_CTYPE=en_US.UTF-8 ./setlocale
en_US.UTF-8
  $ LC_CTYPE=en_US.UTF-9 ./setlocale
setlocale: setlocale failed
  $ cat setlocale.c                                     
#include <err.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
        char *retval;

        retval = setlocale(LC_CTYPE, "");
        if (retval == NULL)
                errx(1, "setlocale failed");
        puts(retval);
        return 0;
}
