On Sun, Mar 18, 2018 at 07:46:03PM -0600, Karl Williamson wrote:
> On this system:
> 
> $ uname -a
> OpenBSD cjg-openbsd6 6.2 GENERIC#132 amd64
> 
> 
> compiling and running the attached file yields buggy results.
> 
> setlocale(LC_ALL, "C") returned 'C'
> setlocale(LC_CTYPE, "cguevara_and_khw") returned 'cguevara_and_khw'
> setlocale(LC_CTYPE, NULL) returned 'cguevara_and_khw'
> setlocale(LC_TIME, "en_US.UTF-8") returned 'en_US.UTF-8'
> setlocale(LC_TIME, NULL) returned 'en_US.UTF-8'
> setlocale(LC_NUMERIC, "es_ES.UTF-8") returned 'es_ES.UTF-8'
> setlocale(LC_NUMERIC, NULL) returned 'es_ES.UTF-8'
> setlocale(LC_MONETARY, "it_IT.UTF-8") returned 'it_IT.UTF-8'
> setlocale(LC_MONETARY, NULL) returned 'it_IT.UTF-8'
> setlocale(LC_COLLATE, "nl_NL.UTF-8") returned 'nl_NL.UTF-8'
> setlocale(LC_COLLATE, NULL) returned 'nl_NL.UTF-8'
> setlocale(LC_MESSAGES, "de_DE.UTF-8") returned 'de_DE.UTF-8'
> setlocale(LC_MESSAGES, NULL) returned 'de_DE.UTF-8'
> setlocale(LC_CTYPE, "ro_RO.UTF-8") returned 'ro_RO.UTF-8'
> setlocale(LC_CTYPE, NULL) returned 'ro_RO.UTF-8'
> setlocale(LC_ALL, NULL) returned 'C'
> 
> 
> All locales but 'cguevara_and_khw' were listed in the output of 'locale -a'.
> That one was used to be a deliberately bad locale name.
> 
> The man page of setlocale says that it returns NULL on invalid input.

OpenBSD's setlocale() returns NULL if the LC_* category argument is invalid.

As far as the locale name is concerned, OpenBSD's setlocale() only cares
about whether a locale name ends in ".UTF-8". If so, then LC_CTYPE will
support UTF-8, otherwise LC_CTYPE only supports ASCII.
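
A minimal sketch of that rule, for illustration only. The helper name
name_selects_utf8() is hypothetical and this is not OpenBSD's actual source;
it just tests the one property of the name described above:

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not taken from OpenBSD's libc: report whether a
 * locale name ends in ".UTF-8".  Nothing before that suffix is inspected.
 */
static bool
name_selects_utf8(const char *name)
{
    static const char suffix[] = ".UTF-8";
    size_t slen = sizeof(suffix) - 1;
    size_t nlen;

    if (name == NULL)
        return false;
    nlen = strlen(name);
    /* Compare only the last slen bytes of the name with the suffix. */
    return nlen >= slen && strcmp(name + nlen - slen, suffix) == 0;
}
```

Under this sketch, "ro_RO.UTF-8" selects UTF-8, while "cguevara_and_khw" and
"C" both select ASCII; none of them is rejected.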

Under these rules there is no invalid input in your examples.
Invalid input can only occur in the first argument:
  setlocale(0xdeadbeef, NULL) returned 'NULL'
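
A program that wants to detect this one failure mode could do something
like the following sketch. The helper name category_is_valid() is made up
for this example. Passing NULL as the second argument only queries the
current setting, so the call does not change the locale:

```c
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: query (never change) the locale of one category.
 * With a NULL locale name, a NULL return value can only mean that the
 * category argument itself was rejected.
 */
static bool
category_is_valid(int category)
{
    return setlocale(category, NULL) != NULL;
}
```

With the behaviour described above, category_is_valid(LC_CTYPE) is true and
category_is_valid((int)0xdeadbeef) is false.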

> Furthermore, it says only LC_CTYPE can be non-C, non-POSIX.  That means that
> almost all the setlocale calls to change things should have returned NULL,
> and similarly almost all the ones with NULL as the parameter should have
> returned "C".

There is no good reason to make such setlocale() calls fail, because
setlocale() is not the only user of these environment variables: other
code reads the same variables and interprets them by different rules.

For instance, GNU gettext(1) from ports might interpret the part before
".UTF-8" to decide which language to print. If it had support for the
language "cguevara_and_khw" in the ASCII encoding, then setting the value
LC_MESSAGES="cguevara_and_khw" would make sense.

As humans, we know that "cguevara_and_khw" is not a language and that
this setting isn't useful. But setlocale() has no way of knowing that.
This is the reason why the locale(1) man page states:
     The list of supported locales is perpetually incomplete.

Furthermore, programs might expect setlocale() and gettext() to agree.
So if setlocale() failed where gettext() does not, then programs coming
from systems like Linux would not work as expected on OpenBSD and would
have to be patched in order to run correctly.

> The Romanian locale should have been legal, but then why doesn't the final
> call to see what LC_ALL is, list LC_CTYPE as being Romanian with the other
> categories being "C".

> setlocale(LC_ALL, NULL) returned 
> 'LC_CTYPE=ro_RO.UTF-8;LC_NUMERIC=es_ES.UTF-8;LC_TIME=en_US.UTF-8;LC_COLLATE=nl_NL.UTF-8;LC_MONETARY=it_IT.UTF-8;LC_MESSAGES=de_DE.UTF-8;LC_PAPER=C;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=C;LC_IDENTIFICATION=C'
> 
> which looks to me like proper functioning

Locale names are not standardized. Implementations can do whatever they want
with these names. Comparing locale names between systems isn't meaningful in
any way.

OpenBSD's current behaviour aims to be a good compromise: the base system
supports only ASCII and UTF-8, while code written for systems like Linux
can still run without modifications on top of that base system.

So I don't think there is any bug to fix here.
