Re: Alias implementations being invalidated by proposed new wording?

Harald van Dijk Mon, 31 Dec 2018 11:07:50 -0800

On 31/12/2018 10:18, Stephane Chazelas wrote:

2018-12-31 06:51:13 +0000, Harald van Dijk:
[...]

Changing the value of LC_CTYPE after the shell has started
shall not affect the lexical processing of shell commands in
the current shell execution environment or its subshells.

[...]


Good find.

I wonder where that requirement comes from given that it is in
no historic implementation. I can see it was already there in
SUSv2.

I can't see how that can work.

For instance does that mean that

in

   export LC_CTYPE=zh_TW.big5
   case 'ε' in ([[:alpha:]]) ...; esac

The [[:alpha:]] is matched as per the zh_TW.big5, but that 'ε'
is decoded as per the locale that shell was started as (so £` if
started in a locale using the iso8859-1 charmap)?

Most shells store variables as byte sequences, but they can be character
sequences too. I think POSIX does not specify this. In the former case,
the parser would see £` and produce a string consisting of the bytes
0xA3 and 0x60, but during the evaluation of the case statement it would
be reinterpreted as a single ε character, so it would match the pattern.
In the latter case, the parser sees £` and produces a string consisting
of the characters £ and `, so it does not match the pattern.
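
To make the byte-level view concrete, here is a minimal sketch that prints
the two bytes discussed above. The helper names hex_bytes and epsilon_big5
are made up for this illustration; the byte values are the ones given in
this message.

```sh
# Print stdin as lowercase hex digits with no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# The bytes of ε in Big5 as given above: 0xA3 0x60.
epsilon_big5() {
    printf '\243\140'
}

epsilon_big5 | hex_bytes; echo     # prints a360

# The second byte on its own is an ASCII backquote, which is why a
# byte-wise or ISO 8859-1 reading sees two characters, £ and `.
printf '\140' | hex_bytes; echo    # prints 60
```

A shell that stores the value as bytes keeps a3 60 unchanged, so a later
pattern match under zh_TW.big5 can still see a single ε.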

What about the "read"/"getopts" utilities. They are different
utilities from "sh", so surely they're affected by the change of
LC_CTYPE. But still, they're filling up "sh" variables.

You have to know whether variables are stored as byte sequences or
character sequences, but once you know that, I do not see the problem in
setting them no matter which of those it is.

                                                        "read"
also does special backslash processing and the encoding of
backslash happens to occur in a number of characters in a number
of charsets. The specification of "read" also refers to the
specification of the shell's word splitting...

The shell's word splitting is not "the *lexical* processing of shell
commands", I'm pretty sure, so that should pick up any LC_CTYPE changes.
The read command can then behave the same way, so also act on the
current LC_CTYPE.
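
The backslash concern can be sketched as follows. This assumes that α in
Big5 is encoded as 0xA3 0x5C (so its second byte is an ASCII backslash),
and that the shell running the sketch does not decode those two bytes as
one character, for example a byte-oriented shell or one running in a
non-Big5 locale. The function names are hypothetical.

```sh
# Print stdin as lowercase hex digits with no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# One input line: the assumed Big5 bytes of α (0xA3 0x5C), then "n".
sample_line() {
    printf '\243\134n\n'
}

# read without -r: 0x5C is taken as an escape and removed.
read_plain() {
    sample_line | { IFS= read line; printf '%s' "$line" | hex_bytes; }
}

# read -r: backslash is not special, so all bytes survive.
read_raw() {
    sample_line | { IFS= read -r line; printf '%s' "$line" | hex_bytes; }
}

read_plain; echo    # a36e when 0xA3 0x5C is not decoded as one character
read_raw; echo      # a35c6e
```

A read that acts on the current LC_CTYPE, run in a Big5 locale, would be
expected to keep the 0x5C byte in both cases, since it is then part of a
character rather than a backslash.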

One could argue that it would be the same for the "." utility.

The . utility is sort of special in that it is not really specified as
parsing any file though, it is just specified that when the . utility is
invoked, the shell will parse the file.

That would also mean that you can't set the locale in your $ENV
file.

like:

   case $(stty -a) in
     (*-iutf8*) export LANG=en_GB.iso885915;;
     (*)        export LANG=en_GB.UTF-8;;
   esac

That kind of implies that if you ssh into a system and the
locale's charset ends up being different from that of your
terminal, you need to do something like:

   export LANG=right-one
   exec "$0"

To fix it.

Right, if you need the parsing to be done in the specified locale. If
you use a shell that operates on bytes (that's most of them) and is
tolerant of invalid bytes (that too is most of them), and you don't use
character sets such as Big5, then you can generally get away with having
parsing be done in the default locale though.

Having modifications to LC_CTYPE affect parsing of the current shell
environment is hard to combine with the requirement of several commands that
output shall be "suitable for reinput to the shell" -- and indeed, that is
broken in bash, bosh and ksh:

   LC_CTYPE=zh_TW.big5 $SHELL -c '
   export foo=ε
   echo $foo
   export -p >file
   unset LC_CTYPE foo
   . ./file
   echo $foo
   '

yash is the only shell I can find which is able to handle this.

[...]

In any case, the spec should say "suitable for reinput *in the
same locale*". Even in shells like yash that don't support
changing the charset midway through its lifetime, you'll still
have problems if you replace ". ./file" with "sh ./file" above.

Agreed, provided "suitable for reinput in the same locale" implies
"suitable for reinput in the current shell" (as POSIX currently specifies).

That would make the output of "locale" utility not useful. The
spec should probably say something (about "locale") along the
lines of:

Implementations of "locale" should make sure its output only
contains characters of the portable character set (assuming it's
invariant across all locales on a system).

LC_*/LANG variables may contain any character though. Unless the shell
somehow rejects assignments of invalid locale values, the locale utility
must be prepared to deal with them, and I do not see POSIX provide a
basis for such rejections. In theory, other characters could be written
using $'\nnn' notation once it's officially part of the standard, but in
practice, it will take a while until it is supported in enough shells
for external utilities to rely on it.

In practice, there are implementations that can output
characters that are not in the portable character set, leading
to command injection vulnerabilities (when the user uses forged
values of the $LANG variable for instance with those characters
that contain the encoding of backslash or backtick).

And this is probably made worse by the fact that no shell implements
locale as a built-in command. Because of that, the utility cannot be
written to match the exact rules of the shell and cannot know which
locale was in use for parsing.
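
As an illustration of the injection concern, a script that feeds
locale-derived text back into the shell could first check that a value
contains only characters it expects. This is only a sketch: the function
name is_safe_locale_name and the accepted character list are hypothetical
choices, not something POSIX or any locale implementation specifies.

```sh
# Succeed only if $1 is non-empty and made up of ASCII letters, digits,
# and the punctuation . _ @ - (listed explicitly, because ranges such
# as a-z are only well defined in the POSIX locale).
is_safe_locale_name() {
    case $1 in
        '') return 1 ;;
        *[!abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._@-]*)
            return 1 ;;
    esac
    return 0
}

for candidate in 'en_GB.UTF-8' 'zh_TW.big5' 'x`id`'; do
    if is_safe_locale_name "$candidate"; then
        printf 'accept: %s\n' "$candidate"
    else
        printf 'reject: %s\n' "$candidate"
    fi
done
```

This rejects a backquote or backslash whether the shell sees it as a
character of its own or as a stray byte, but it also rejects values that
are legitimate under the point made above that LC_*/LANG may contain any
character.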


Cheers,
Harald van Dijk
