Re: how do to cmd subst with trailing newlines portable (was: does POSIX mandate whether the output…)

Christoph Anton Mitterer via austin-group-l at The Open Group Tue, 08 Feb 2022 15:20:12 -0800

Hey Eric.

On Tue, 2022-02-08 at 15:21 -0600, Eric Blake wrote:
> Yes. And another fallout of that requirement: you cannot have a
> single POSIX system supporting both ASCII and EBCDIC locales.

What does that mean in practice... does e.g. Linux/glibc ship these
locales just for the purpose of iconv and others... and apart from that
*any* glibc system will *always* be based on ASCII and *never* on
EBCDIC...
...or could one implementation (like glibc) actually support both, as
long as it doesn't switch from one to the other on one concrete host
while that host is running?

> > Doesn't that also mean that POSIX effectively forbids UTF16 or
> > UTF32 and actually any >1-byte fixed-encoding?
> > Cause there it would have to be "padded" with 0x00?
> 
> Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
> its basis.  Again, iconv and wide-character library calls (such as
> wprintf) can support conversion of files into and out of those
> encodings, but that is only file contents; all file names, syscalls,
> and other aspects of the POSIX environment for cross-process
> communication outside of file contents will use multi-byte encodings
> where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
> character sequences that would represent UTF16 or UTF32 characters
> directly.

I had suspected that, especially also because of:
"The encoded values associated with the members of the portable
character set are each represented in a single byte."
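
For illustration, a minimal sketch of why a UTF16-encoded <period>
cannot show up in file names or scripts: it carries a 0x00 byte. The
codeset names ASCII and UTF-16LE are assumptions (glibc's iconv accepts
them, but POSIX does not standardize codeset names), and the output
comment assumes an ASCII-based system:

```shell
# Encode a single period as UTF-16LE and dump the resulting bytes in hex.
# od -An drops the offset column, -tx1 prints one hex byte per unit;
# tr removes the whitespace od puts around the bytes.
utf16_hex=$(printf . | iconv -f ASCII -t UTF-16LE | od -An -tx1 | tr -d '[:space:]')
printf 'UTF-16LE period: %s\n' "$utf16_hex"
# On an ASCII-based system with these codeset names: UTF-16LE period: 2e00
```

The trailing 00 is the NUL byte that rules out such an encoding as the
basis of a POSIX locale.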

But then I stumbled over 3.251 Null Wide-Character Code:
"A wide-character code with all bits set to zero."

What's that good for, then? Just for wchars, which *may* very well use
fixed size encodings (with multiple bytes) and in fact are 32 bits
(UCS-4) in glibc?

But even then, for all syscalls etc... these wide chars would need to
get converted to/from "normal" multibyte chars, which use one byte for
the portable character set chars, and have the invariance of . / CR and
LF.

Does that sound right?

> POSIX states that <period> will be printed.  If that is the byte
> 0x2E, then your POSIX locale is probably ASCII-based.  But it is also
> possible to have a POSIX conforming environment where the POSIX
> locale is EBCDIC based, in which case it would print byte 0x4B, but
> that would still be <period> for all file names and syscalls
> observable from that POSIX environment.

I see... and again, within one implementation of these two, POSIX
wouldn't allow switching from one to the other.

So that also means that if I have e.g. my shell script (say in UTF8)
which prints the sentinel via 'printf .', I can always be sure, on any
ASCII-based POSIX system, that regardless of the locale (which would
then need to be ASCII-based as well), '.' would give me 0x2E.

Whereas, when I'd use the same script *as is* on an EBCDIC system, it
would anyway not work out of the box and I'd have to iconv it first to
EBCDIC... and once done, my '.' in there, would always (regardless of
which locale - all of which would need to be EBCDIC-based, too) yield
0x4B?!

And effectively I could *never* run into the situation that the script
itself is parsed with e.g. ASCII and '.' = 0x2E .. while the shell's
internal LC_ALL has changed to something where '.' would be something
else?!

> > b) print the character 'x' according to the currently set locale,
> >    e.g. if that was using UTF16, it would print the bytes 0x2e 0x00
> 
> It is not possible to have a POSIX locale based on the UTF16
> encoding.  So this answer is not possible.  While you can write a
> file with characters encoded in UTF16, which when recoded to a
> multibyte locale form a shell script, it is only after you use iconv
> or fscanf or similar to perform that encoding conversion that it
> actually becomes a shell script (since sh is documented as being
> able to reject files containing NUL bytes as not being a shell
> script).  POSIX does not allow you to execute a file encoded in
> UTF16 as a shell script.

Okay, clear now... at least for UTF16/32 ...

But say we have a multibyte based locale foo ... in which some
character X (symbolic name, not the literal X) has one encoding A'...
and another multibyte based locale bar in which X has another encoding
A''.

I seem to remember reading somewhere that in that case the encoding in
which the shell parses the file (i.e. the one in which the shell itself
was started) would be used.
So if the shell was started in A', even if it then switches to A'' and
its variables and so on would be interpreted according to A'', the
literals would continue to get A'.

But I cannot really find it in POSIX itself.

> 
> > d) Would it in some weird encodings like IBM905 cause the byte 0x4B
> >    to be printed?
> 
> If you are running on an IBM machine where the POSIX locale is based
> on EBCDIC, then it will indeed print the byte 0x4B.  But it will
> still be <period>, as detected by all other processes reached from
> that POSIX environment (and that system will necessarily be unable to
> have an ASCII or UTF8 encoding in any of its locales; you are back to
> having to use an extension outside of POSIX if you want to start a
> new subtree of processes based on an ASCII base encoding).

Ok clear now... *and* I would have to have my script converted to some
EBCDIC encoding... in order to be able to run it at all.

> > 3) With respect to the command substitution with trailing newlines
> > question:
> > 
> > Because of (2) ... would it be in any way safer to e.g.
> >   printf '\056'
> > (octal for . in ASCII/etc.)
> > and also strip that off... rather than using '.'?
> 
> Actually, it is less portable.  \056 is a particular byte value, but
> unless you know your POSIX locale is ASCII-based, you don't know
> whether that byte value is <period>, or some other character, and
> there are some POSIX-feasible locales where some single-byte
> characters (such as 'A') may also appear in a multibyte-character
> sequence.

So you mean it would be less portable with respect to the property of
. / LF and CR:
"Likewise, the byte values used to encode <period>, <slash>, <newline>,
and <carriage-return> shall not occur as part of any other character in
any locale."

Because the \056 may simply not be that <period> ... right?
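
A rough way to check that at run time would be to compare the bytes
produced by the literal '.' with those produced by '\056'. This is only
a sketch; hex_of is a hypothetical helper name, not something from
POSIX or from this thread:

```shell
# hex_of (hypothetical helper): dump stdin as one contiguous hex string.
hex_of() { od -An -tx1 | tr -d '[:space:]'; }

literal=$(printf . | hex_of)      # whatever <period> is in this environment
octal=$(printf '\056' | hex_of)   # always the single byte with octal value 056

if [ "$literal" = "$octal" ]; then
    printf '%s\n' 'here \056 is <period>'
else
    printf '%s\n' "here \\056 is not <period> (period is 0x$literal)"
fi
```

On an ASCII-based system the first branch is taken; on an EBCDIC-based
one, going by Eric's explanation, $literal would be 4b and the second
branch would be taken.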

But at least, it should still work portably, when doing the LC_ALL=C
game, because then one would be back to *always* just stripping off
bytes.
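
Roughly what I have in mind, as a sketch only: the function name
capture and the variable names are made up for illustration, and it
assumes the shell honours an assignment to LC_ALL for its own pattern
matching from then on:

```shell
# capture CMD [ARGS...]
# Runs CMD and stores its stdout in $captured with trailing newlines
# preserved; returns CMD's exit status. POSIX sh has no local
# variables, so all names used here are global.
capture() {
    # Append a sentinel so that the command substitution finds no
    # trailing newlines to strip; pass the command's status through.
    captured=$("$@"; rc=$?; printf .; exit "$rc")
    rc=$?

    # Remember whether LC_ALL was set, and to what.
    saved_lc_all=${LC_ALL-}
    saved_lc_all_set=${LC_ALL+x}

    # Strip the sentinel in the C locale, i.e. as a plain byte.
    LC_ALL=C
    captured=${captured%.}

    # Restore the previous LC_ALL state.
    if [ -n "$saved_lc_all_set" ]; then
        LC_ALL=$saved_lc_all
    else
        unset LC_ALL
    fi
    return "$rc"
}

capture printf 'foo\n\n'
printf 'captured %d characters\n' "${#captured}"
# prints: captured 5 characters
```

With the '.' sentinel the LC_ALL=C step should not even be needed, given
the guarantee quoted above; it matters for sentinels that lack it.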

Thanks :-)
Chris.
