Christoph Anton Mitterer wrote, on 26 Jan 2022:
>
> On Tue, 2022-01-25 at 09:25 +0000, Geoff Clare via austin-group-l at
> The Open Group wrote:
> > You are correct, and a common method of preserving trailing newlines
> > is to append a non-newline character and then strip it, e.g.:
> >
> > output=$(some_command && printf x)
> > output=${output%x}
>
> ... which is exactly where I was starting from.
>
> But it seems to be more tricky than simply that:
>
>
> Several sources[0] explain, that this alone may not work, namely when
> some_command outputs e.g. bytes that are an invalid character encoding
> in the current locale, but together with the 'x', they do in fact
> become valid and subsequently, stripping off the 'x' fails.
>
> A common example given is \x78 alone followed by \x88 in
> zh_HK.big5hkscs.
>
> The workaround given for that is to set LC_ALL=C after the sentinel was
> appended but before it's stripped off again,... and to restore the
> previous state of LC_ALL (old value / unset) afterwards.
>
>
>
> In [1] I was pointed to the fact that POSIX requires:
> - “The encoded values associated with <period>, <slash>, <newline>, and
>   <carriage-return> shall be invariant across all locales supported by
>   the implementation.”
>   => which means AFAIU, that these will have the same binary
>   representation in any locale/encoding.
> - “Likewise, the byte values used to encode <period>, <slash>,
>   <newline>, and <carriage-return> shall not occur as part of any
>   other character in any locale.”
>   => which means AFAIU that it cannot happen that an invalidly
>   encoded character + the sentinel together form a valid character
>   and thus the sentinel cannot be stripped off, as no partial byte
>   sequence could be completed by these bytes/characters to a valid
>   character in any locale/encoding.
>   (see 6.1 Portable Character Set [1])
>
>
> So my first thought was, that with either . or / as sentinel value,
> none of the LC_* stuff would be necessary and it would be still
> guaranteed that it works as expected (in any conforming shell).
>
>
> However, that hope may have been dashed again by [2] and [3].
> Koichi Murase’s point seems to be that the following could happen:
> '<some invalid character encoding>.'
> with '.' being the sentinel value.
>
> While that sequence doesn't make up a valid character (because of
> POSIX' requirements) it may still cause the decoder of the encoding to
> fail to strip off the sentinel, e.g. if it simply stops working after
> the invalid encoding is encountered or whatever.
>
>
> So my questions would be:
>
> 1) What's the opinion on that from the POSIX side?
> Are locales/encodings required to still handle the above case
> gracefully?
> Or is the POSIX point of view, that one really has to do the LC_ALL
> trick to be 100% sure?

It seems to me that there are three cases to consider:

* The command's output is expected to contain byte sequences that
  might not form valid characters.

  In this case LC_ALL=C should be used during all handling of the
  output.

* The command's output is expected to contain valid characters,
  but could be truncated mid-character.

  In this case, the encoding issue is only one small aspect of the
  potential consequences of truncation. It is more important to
  detect and handle any kind of truncation, not just the kind that
  causes an encoding error.

* The command's output is expected to contain valid characters,
  but the concern is that there could be corruption.

  This is similar to the second case but more extreme, as the
  consequences of corruption could be many things, including valid
  (but wrong) characters. Again, there is no point trying to deal
  with one very small aspect of those potential consequences in
  isolation.

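For illustration only, the first case can be combined with the sentinel workaround quoted earlier (switch to LC_ALL=C before stripping, then restore the previous state). The sketch below is one possible way to write that, not a method taken from the thread: the function name capture_output, the result variable captured, and the co_* helper variables are all hypothetical. It also appends the sentinel unconditionally and passes the exit status through, which is an assumption on my part about what is usually wanted. Note that the advice above goes further and suggests LC_ALL=C for all handling of such output, not only for the stripping step.

```shell
# capture_output: hypothetical helper, not from the thread.
# Runs "$@", stores its output (trailing newlines included) in the
# variable `captured`, and returns the command's exit status.
capture_output() {
    # Remember whether LC_ALL was set so it can be restored exactly.
    if [ "${LC_ALL+set}" = set ]; then
        co_had_lc_all=1
        co_old_lc_all=$LC_ALL
    else
        co_had_lc_all=0
    fi

    # Append the sentinel even if the command fails, and pass the
    # command's exit status out of the command substitution.
    captured=$("$@"; co_rc=$?; printf x; exit "$co_rc")
    co_status=$?

    # Strip the sentinel while the shell treats the value as bytes.
    LC_ALL=C
    captured=${captured%x}

    # Restore the previous LC_ALL state (old value or unset).
    if [ "$co_had_lc_all" -eq 1 ]; then
        LC_ALL=$co_old_lc_all
    else
        unset LC_ALL
    fi
    return "$co_status"
}
```

A call such as `capture_output printf 'a\n\n'` should leave the two trailing newlines in `$captured`.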
> 2) Koichi also pointed out earlier[4]:
> > In theory, ISO/IEC 2022 encoding allows to change the meaning of
> > C0 (\x00-\x1F), GL (\x21-\x7E), C1 (\x80-\x9F), and GR
> > (\xA0-\xFF) by locking shift escape sequences. In particular, all
> > the bit combinations (i.e. bytes) in GL which contain ASCII "."
> > and "x" can be used for trailing bytes of 94^n character sets
> > (such as LC_CTYPE=ja_JP.ISO-2022-JP). The only two bit-
> > combinations that are unaffected by the ISO/IEC 2022 shifts
> > are SP (space \x20) and DEL (^? or \x7F). But actually, the
> > encodings that are fully ISO/IEC 2022 have hardly been used as user
> > locales because most utilities have problems in dealing with such
> > context-dependent encoding schemes.
>
> I'd assume that POSIX' provisions simply forbid that shifting then
> (at least with respect to NUL . / LF and CR)?

Sort of. The way you have stated it could be taken as meaning that
conforming implementations cannot have locales with ISO/IEC 2022
encodings installed, but I don't think that's true. They are permitted
as an extension, and one of the things a user needs to do in order to
obtain a conforming environment is not use locales which have that
encoding.

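A script that relies on the sentinel technique could check for such a locale up front. The following is only a rough sketch: `locale charmap` is a standard way to query the current encoding name, but charmap names are implementation-defined, so the patterns matched here are an assumption, and the function name uses_stateful_charmap is hypothetical.

```shell
# uses_stateful_charmap: hypothetical helper, not from the thread.
# Succeeds (status 0) if the given charmap name looks like an
# ISO/IEC 2022 encoding. The name patterns are an assumption, since
# charmap names are implementation-defined.
uses_stateful_charmap() {
    case $1 in
        ISO-2022*|iso-2022*|ISO2022*|iso2022*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: warn when the current locale appears to use such an encoding.
if uses_stateful_charmap "$(locale charmap)"; then
    printf '%s\n' "warning: locale encoding may be ISO/IEC 2022" >&2
fi
```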
> 3) Does POSIX define anywhere which values a shell variable is required
> to be able to store?
> I only found that NUL is excluded, but that alone doesn't mean that
> any other byte value is required to work.

Kind of circular, but POSIX clearly requires that a variable can be
assigned any value obtained from a command substitution that does not
include a NUL byte, and specifies utilities that can be used to
generate arbitrary byte values, therefore a variable can contain any
sequence of bytes that does not include a NUL byte.

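As a small illustration of that argument, the sketch below uses printf to generate every byte value from 0x01 to 0xFF, captures them through a command substitution, and counts the bytes that come back out of the variable. The variable names all_bytes and byte_count are hypothetical, and setting LC_ALL=C for the whole sketch is an assumption made so that the shell treats the value as plain bytes.

```shell
# Assumption: handle everything as bytes for the whole sketch.
LC_ALL=C

# Emit bytes 0x01..0xFF followed by a sentinel. The sentinel is not
# strictly needed here (the last byte is 0xFF, not a newline), but it
# keeps the pattern the same as for arbitrary output.
all_bytes=$(
    i=1
    while [ "$i" -le 255 ]; do
        # %b interprets \0ooo as the byte with that octal value.
        printf '%b' "\\0$(printf '%03o' "$i")"
        i=$((i + 1))
    done
    printf x
)
all_bytes=${all_bytes%x}

# Count the bytes stored in the variable; the arithmetic expansion
# drops any padding that wc may add around the number.
byte_count=$(( $(printf '%s' "$all_bytes" | wc -c) ))
printf '%s\n' "$byte_count"
```

If the shell stores the value faithfully, the count printed should equal the number of byte values generated.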
--
Geoff Clare <[email protected]>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England