On 31/12/2018 10:18, Stephane Chazelas wrote:
2018-12-31 06:51:13 +0000, Harald van Dijk:
[...]
Changing the value of LC_CTYPE after the shell has started
shall not affect the lexical processing of shell commands in
the current shell execution environment or its subshells.
[...]

Good find.

I wonder where that requirement comes from given that it is in
no historic implementation. I can see it was already there in
SUSv2.

I can't see how that can work.

For instance, does that mean that in

   export LC_CTYPE=zh_TW.big5
   case 'ε' in ([[:alpha:]]) ...; esac

the [[:alpha:]] is matched as per zh_TW.big5, but the 'ε'
is decoded as per the locale the shell was started in (so £` if
started in a locale using the iso8859-1 charmap)?

Most shells store variables as byte sequences, but they can be character sequences too. I think POSIX does not specify this. In the former case, the parser would see £` and produce a string consisting of the bytes 0xA3 and 0x60, but during the evaluation of the case statement it would be reinterpreted as a single ε character, so it would match the pattern. In the latter case, the parser sees £` and produces a string consisting of the characters £ and `, so it does not match the pattern.
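To make the byte view concrete, here is a minimal sketch that prints the two bytes in question. The helper name hex_bytes is made up for this illustration, and the bytes are written as octal escapes so that the snippet does not depend on the encoding of the file it is saved in.

```shell
# hex_bytes is a hypothetical helper, not part of any shell:
# it prints the bytes read from standard input as space-separated hex.
hex_bytes() {
  set -- $(od -An -v -tx1)
  printf '%s\n' "$*"
}

# The two bytes that a byte-oriented parser stores for the quoted word.
printf '\243\140' | hex_bytes    # a3 60

# An ASCII backquote is the single byte 0x60, the same as the second byte.
printf '`' | hex_bytes           # 60
```

Whether those two bytes are later treated as one character or as two depends only on which locale is in effect when they are interpreted, which is the distinction drawn above.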

What about the "read"/"getopts" utilities? They are different
utilities from "sh", so surely they're affected by the change of
LC_CTYPE. But still, they're filling up "sh" variables.

You have to know whether variables are stored as byte sequences or character sequences, but once you know that, I do not see the problem in setting them no matter which of those it is.

"read" also does special backslash processing, and the encoding of
backslash happens to occur in a number of characters in a number
of charsets. The specification of "read" also refers to the
specification of the shell's word splitting...

The shell's word splitting is not "the *lexical* processing of shell commands", I'm pretty sure, so that should pick up any LC_CTYPE changes. The read command can then behave the same way, so also act on the current LC_CTYPE.
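As a rough illustration of why the locale matters for read, the sketch below feeds it the bytes 0xB3 0x5C 0x6E while the C locale is in effect, so the input is handled byte by byte. The helper names hex_bytes and read_sample are made up for this example. As far as I know, 0xB3 0x5C is a single character in Big5, but that is my assumption and is not taken from this thread.

```shell
# hex_bytes is a hypothetical helper: print stdin as space-separated hex.
hex_bytes() {
  set -- $(od -An -v -tx1)
  printf '%s\n' "$*"
}

# read_sample is a hypothetical helper. It runs in a subshell with
# LC_ALL=C, feeds the bytes 0xB3 0x5C 0x6E plus a newline to read, and
# prints what ends up in the variable. $1 is '' or -r.
read_sample() (
  LC_ALL=C
  export LC_ALL
  printf '\263\134\156\n' | {
    IFS= read $1 line
    printf '%s' "$line" | hex_bytes
  }
)

read_sample ''    # b3 6e     (0x5C was consumed as a backslash)
read_sample -r    # b3 5c 6e  (raw mode keeps every byte)
```

A read that acts on the current LC_CTYPE would be expected to leave such a character intact in a Big5 locale even without -r, since the 0x5C byte is then not a backslash character.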

One could argue that it would be the same for the "." utility.

The . utility is sort of special in that it is not really specified as parsing any file though, it is just specified that when the . utility is invoked, the shell will parse the file.

That would also mean that you can't set the locale in your $ENV
file, like:

   case $(stty -a) in
     (*-iutf8*) export LANG=en_GB.iso885915;;
     (*)        export LANG=en_GB.UTF-8;;
   esac

That kind of implies that if you ssh into a system and the
locale's charset ends up being different from that of your
terminal, you need to do something like:

   export LANG=right-one
   exec "$0"

to fix it.

Right, if you need the parsing to be done in the specified locale. If you use a shell that operates on bytes (that's most of them) and is tolerant of invalid bytes (that too is most of them), and you don't use character sets such as Big5, then you can generally get away with having parsing be done in the default locale though.

Having modifications to LC_CTYPE affect parsing of the current shell
environment is hard to combine with the requirement of several commands that
output shall be "suitable for reinput to the shell" -- and indeed, that is
broken in bash, bosh and ksh:

   LC_CTYPE=zh_TW.big5 $SHELL -c '
   export foo=ε
   echo $foo
   export -p >file
   unset LC_CTYPE foo
   . ./file
   echo $foo
   '

yash is the only shell I can find which is able to handle this.
[...]

In any case, the spec should say "suitable for reinput *in the
same locale*". Even in shells like yash that don't support
changing the charset midway through its lifetime, you'll still
have problems if you replace ". ./file" with "sh ./file" above.

Agreed, provided "suitable for reinput in the same locale" implies "suitable for reinput in the current shell" (as POSIX currently specifies).

That would make the output of "locale" utility not useful. The
spec should probably say something (about "locale") along the
lines of:

Implementations of "locale" should make sure its output only
contains characters of the portable character set (assuming it's
invariant across all locales on a system).

LC_*/LANG variables may contain any character though. Unless the shell somehow rejects assignments of invalid locale values, the locale utility must be prepared to deal with them, and I do not see POSIX provide a basis for such rejections. In theory, other characters could be written using $'\nnn' notation once it's officially part of the standard, but in practice, it will take a while until it is supported in enough shells for external utilities to rely on it.

In practice, there are implementations that can output
characters that are not in the portable character set, leading
to command injection vulnerabilities (when the user uses forged
values of the $LANG variable for instance with those characters
that contain the encoding of backslash or backtick).

And this is probably made worse by the fact that no shell implements locale as a built-in command. Because of that, the utility cannot be written to match the exact rules of the shell and cannot know which locale was in use for parsing.
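For what it is worth, the $'\nnn' idea mentioned above could look roughly like the sketch below. The function name quote_octal is made up for this illustration. It is not what any existing locale implementation does, and its output is only usable in shells that already support $'...'.

```shell
# quote_octal is a hypothetical helper: it prints its argument as a
# $'...' word in which every byte is written as a three-digit octal
# escape, so the output contains only portable characters.
quote_octal() {
  set -- $(printf '%s' "$1" | od -An -v -to1)
  printf "\$'"
  for o in "$@"; do
    printf '\\%s' "$o"
  done
  printf "'\n"
}

# The bytes 0xA3 0x60 come out without any raw backquote byte.
quote_octal "$(printf '\243\140')"    # $'\243\140'
```

Because the result contains no raw backslash or backquote bytes from the value itself, it reads the same way no matter which locale the shell used for parsing.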

Cheers,
Harald van Dijk
