Hi Hans,

Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:

> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
> -a' list. Then some software interprets this as though the locale
> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
> (high bit set) char's into octal escape sequences. What is the
> correct interpretation here?

The correct interpretation of "LC_CTYPE=UTF-8" is whatever the
documentation of the respective operating system says.
All POSIX says is:

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

  The locale argument is a pointer to a character string containing
  the required setting of category.  The contents of this string are
  implementation-defined.

POSIX only specifies the meaning of the strings "C" and "POSIX"
(and of the empty string, which selects the implementation-defined
native environment); any others are implementation-defined.

For example, the OpenBSD manual page says:

  https://man.openbsd.org/setlocale.3

  The syntax and semantics of the locale argument are not standardized
  and vary among operating systems.  On OpenBSD, if the locale string
  ends with ".UTF-8", the UTF-8 locale is selected; otherwise, the
  "C" locale is selected, which uses the ASCII character set.  If
  the locale contains a dot but does not end with ".UTF-8", setlocale()
  fails.

Which is indeed true here:

   $ uname -a
  OpenBSD isnote.usta.de 6.7 GENERIC.MP#224 amd64
   $ LC_CTYPE=FOOBAR.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap  
  US-ASCII
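
The rule quoted from the manual page boils down to a few lines.
Here is a rough sketch in sh, only to illustrate the documented
behaviour; it is not the actual libc code, openbsd_charmap is a
name made up for this mail, and it ignores the empty string, which
setlocale() treats specially:

```shell
# Sketch of the rule quoted from the OpenBSD setlocale(3) manual.
# Prints the charmap that would be selected, or returns 1 in the
# case where the manual says setlocale() fails.
# openbsd_charmap is a hypothetical helper, not a real utility.
openbsd_charmap() {
	case $1 in
	*.UTF-8) echo UTF-8 ;;     # string ends with ".UTF-8"
	*.*)     return 1 ;;       # contains a dot, other suffix
	*)       echo US-ASCII ;;  # anything else: the "C" locale
	esac
}

openbsd_charmap FOOBAR.UTF-8
openbsd_charmap UTF-8
openbsd_charmap en_US.ISO8859-1 || echo "setlocale() fails"
```

The first two calls print UTF-8 and US-ASCII, matching the output
shown above; the third takes the failure branch.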

To the best of my knowledge, we are POSIX-compliant in this respect.
Other systems are of course free to make different choices.

Even though POSIX says this is implementation-defined, which implies
that operating systems are expected to document their specific rules,
some fail to do so, for example:

  https://man.bsd.lv/FreeBSD-12.0/setlocale.3
  https://man.bsd.lv/NetBSD-8.1/setlocale.3

Some do specify it.  For example, according to

  https://man.bsd.lv/Linux-5.06/setlocale.3

the string "UTF-8" would be invalid because it lacks the "language"
part which is mandatory on Linux.

For example, on a very old Linux system I have access to:

   $ uname -a
  Linux donnerwolke.asta.kit.edu 4.9.0-0.bpo.3-686 #1 SMP \
    Debian 4.9.30-2+deb9u5~bpo8+1 (2017-09-28) i686 GNU/Linux
   $ LC_CTYPE=en_US.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap
  locale: Cannot set LC_CTYPE to default locale: No such file or directory
  locale: Cannot set LC_ALL to default locale: No such file or directory
  ANSI_X3.4-1968
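
To find out what a particular system does with a given string, a
small loop along the following lines should be enough.  It is only
a sketch: charmap_for is a name made up for this mail, and it
assumes that locale(1) is installed.  LC_ALL is cleared because,
when set to a non-empty value, it overrides LC_CTYPE.

```shell
# Print the charmap the system selects for a given LC_CTYPE value.
# charmap_for is a hypothetical helper name.
charmap_for() {
	# An empty LC_ALL counts as unset, so LC_CTYPE takes effect.
	LC_ALL= LC_CTYPE="$1" locale charmap 2>/dev/null
}

# Try a few candidate strings; results differ between systems.
for loc in C POSIX UTF-8 en_US.UTF-8; do
	printf '%s\t%s\n' "$loc" "$(charmap_for "$loc")"
done
```

The output is deliberately not shown here because it depends on
the operating system and on which locales are installed.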

Yours,
  Ingo
