Hey Eric.
On Tue, 2022-02-08 at 15:21 -0600, Eric Blake wrote:
> Yes. And another fallout of that requirement: you cannot have a single
> POSIX system supporting both ASCII and EBCDIC locales.

What does that mean in practice... does e.g. Linux/glibc ship these
locales just for the purpose of iconv and others... and apart from that
*any* glibc system will *always* be based on ASCII and *never* on
EBCDIC...
...or could one implementation (like glibc) actually support both, as
long as it doesn't switch from one to the other on one concrete host
while that host is running?

> > Doesn't that also mean that POSIX effectively forbids UTF16 or UTF32
> > and actually any >1-byte fixed-encoding?
> > Cause there it would have to be "padded" with 0x00?
>
> Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
> its basis. Again, iconv and wide-character library calls (such as
> wprintf) can support conversion of files into and out of those
> encodings, but that is only file contents; all file names, syscalls,
> and other aspects of the POSIX environment for cross-process
> communication outside of file contents will use multi-byte encodings
> where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
> character sequences that would represent UTF16 or UTF32 characters
> directly.

I had suspected that, especially also because of:
"The encoded values associated with the members of the portable
character set are each represented in a single byte."

But then I stumbled over 3.251 Null Wide-Character Code:
"A wide-character code with all bits set to zero."

What's that good for, then? Just for wchars, which *may* very well use
fixed-size encodings (with multiple bytes) and in fact are 32 bits
(UCS-4) in glibc?
But even then, for all syscalls etc... these wide chars would need to
get converted to/from "normal" multibyte chars, which use one byte for
the portable character set chars, and have the invariance of . / CR
and LF.

Does that sound right?

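For what it's worth, the embedded-0x00 problem is easy to see from the
shell. A minimal sketch, assuming an ASCII-based system whose iconv
knows the codeset names UTF-8 and UTF-16LE (those names are
implementation-defined); hexbytes is just a helper name I made up:

```shell
# Hypothetical helper: dump stdin as space-separated hex bytes.
hexbytes() {
    od -An -tx1 | tr -s ' \n' '  ' | sed 's/^ //; s/ $//'
}

# "a." as written in this script (assumed ASCII-compatible encoding):
# one byte per portable character, no NUL byte anywhere.
printf 'a.' | hexbytes

# The same two characters recoded to UTF-16LE: every second byte is
# 0x00, which is what rules this encoding out as a locale encoding.
printf 'a.' | iconv -f UTF-8 -t UTF-16LE | hexbytes
```

On such a system I'd expect "61 2e" for the first call and
"61 00 2e 00" for the second.
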
> POSIX states that <period> will be printed. If that is the byte
> 0x2E, then your POSIX locale is probably ASCII-based. But it is also
> possible to have a POSIX conforming environment where the POSIX locale
> is EBCDIC based, in which case it would print byte 0x4B, but that
> would still be <period> for all file names and syscalls observable
> from that POSIX environment.

I see... and again, within one implementation of these two, POSIX
wouldn't allow switching from one to the other.

So that also means that if I have e.g. my shell script (say in UTF8)
which prints the sentinel via 'printf .', I'm always sure - on any
ASCII-based POSIX system - that regardless of the locale (which would
then need to be ASCII-based as well), '.' would give me 0x2E.

Whereas, if I used the same script *as is* on an EBCDIC system, it
would not work out of the box anyway and I'd have to iconv it to EBCDIC
first... and once done, my '.' in there would always (regardless of
which locale - all of which would need to be EBCDIC-based, too) yield
0x4B?!

And effectively I could *never* run into the situation that the script
itself is parsed with e.g. ASCII and '.' = 0x2E ... while the shell's
internal LC_ALL has changed to something where '.' would be something
else?!

> > b) print the character 'x' according to the currently set locale,
> > e.g.
> > if that was using UTF16, it would print the bytes 0x2e 0x00
>
> It is not possible to have a POSIX locale based on the UTF16 encoding.
> So this answer is not possible. While you can write a file with
> characters encoded in UTF16, which when recoded to a multibyte locale
> form a shell script, it is only after you use iconv or fscanf or
> similar to perform that encoding conversion before it actually becomes
> a shell script (since sh is documented as being able to reject files
> containing NUL bytes as not being a shell script). POSIX does not
> allow you to execute a file encoded in UTF16 as a shell script.

Okay, clear now...
at least for UTF16/32 ...

But say we have a multibyte based locale foo ... in which some character
X (symbolic name, not the literal X) has one encoding A'... and another
multibyte based locale bar in which X has another encoding A''.

I seem to remember reading somewhere that in that case the encoding in
which the shell parses the file (i.e. the one in which it was started
itself) would be used.
So if the shell was started in A', even if it then switches to A'' and
its variables and so on would be interpreted according to A'',... the
literals would continue to get A'.

But I cannot really find it in POSIX itself.

> > d) Would it in some weird encodings like IBM905 cause the byte 0x4B
> > to be printed?
>
> If you are running on an IBM machine where the POSIX locale is based
> on EBCDIC, then it will indeed print the byte 0x4B. But it will still
> be <period>, as detected by all other processes reached from that
> POSIX environment (and that system will necessarily by unable to have
> an ASCII or UTF8 encoding in any of its locales; you are back to
> having to use an extension outside of POSIX if you want to start a new
> subtree of processes based on an ASCII base encoding).

Ok, clear now... *and* I would have to have my script converted to some
EBCDIC encoding... in order to be able to run it at all.

> > 3) With respect to the command substitution with trailing newlines
> > question:
> >
> > Because of (2) ... would it be in any way safer to e.g.
> > printf '\056'
> > (octal for . in ASCII/etc.)
> > and also strip that off... rather than using '.'?
>
> Actually, it is less portable. \056 is a particular byte value, but
> unless you know your POSIX locale is ASCII-based, you don't know
> whether that byte value is <period>, or some other character, and
> there are some POSIX-feasible locales where some single-byte
> characters (such as 'A') may also appear in a multibyte-character
> sequence.

So you mean it would be less portable with respect to the property of .
/ LF and CR:
"Likewise, the byte values used to encode <period>, <slash>, <newline>,
and <carriage-return> shall not occur as part of any other character in
any locale."
Because the \056 may simply not be that <period> ... right?

But at least, it should still work portably when doing the LC_ALL=C
game, because then one would be back to *always* just stripping off
bytes.


Thanks :-)
Chris.
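
PS: Just so we are talking about the same thing, this is roughly what I
mean by the sentinel plus "LC_ALL=C game". It is only a minimal sketch;
capture, captured and saved_lc_all are names I made up, and the
save/restore of LC_ALL is my own assumption about how to do it, not
something taken from POSIX:

```shell
# Hypothetical helper: run a command and store its output, trailing
# newlines included, in the variable "captured".
capture() {
    # The '.' sentinel keeps command substitution from eating the
    # trailing newlines of the command's output.
    captured=$("$@"; printf .)
    # Strip the sentinel under LC_ALL=C so that ${captured%.} removes
    # exactly one byte, then put LC_ALL back the way it was.
    if [ "${LC_ALL+set}" = set ]; then
        saved_lc_all=$LC_ALL
        LC_ALL=C
        captured=${captured%.}
        LC_ALL=$saved_lc_all
    else
        LC_ALL=C
        captured=${captured%.}
        unset LC_ALL
    fi
}

capture printf 'foo\n\n'
# Both trailing newlines should still be present in the dump.
printf '%s' "$captured" | od -An -tx1
```

Note that the exit status of the captured command is lost in this
sketch, since the printf of the sentinel runs after it.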
