Christoph Anton Mitterer via austin-group-l at The Open Group dixit:
>On Tue, 2022-02-08 at 15:21 -0600, Eric Blake wrote:
>>On Tue, Feb 08, 2022 at 06:53:50AM +0100, Christoph Anton Mitterer via
>>austin-group-l at The Open Group wrote:
>What does that mean in practise... does e.g. Linux/glibc ship these
>locales just for the purpose of iconv and others... and apart from that
>*any* glibc system will *always* be based ASCII and *never* on
>EBCDIC...
You can have nōn-POSIX locales. For example, in mksh, I have a UTF-8
mode, but I specify that only the "C" locale attempts POSIX conformance.
(If POSIX specifies a C.UTF-8, it’ll most likely want [[:alpha:]]
match ä and Й, but mksh’s POSIX character classes hardcode the C
locale as specified in the current standard.)
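The difference is easy to see with a character-class match; a small sketch, assuming a UTF-8 source file and a shell whose classes honour the locale:

```shell
# In the C locale, [[:alpha:]] matches exactly one character, and the
# two UTF-8 bytes of ä do not form one alphabetic character there, so
# the first branch is not taken. Under a UTF-8 locale (in shells that
# honour it for classes), ä would match.
case ä in
([[:alpha:]]) echo matched ;;
(*)           echo unmatched ;;
esac
```

Run under LC_ALL=C this prints "unmatched" in byte-oriented shells; which locale the classes follow is exactly the point of contention above.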
>> > Doesn't that also mean that POSIX effectively forbids UTF16 or
>> > UTF32
>> > and actually any >1-byte fixed-encoding?
>> > Cause there it would have to be "padded" with 0x00?
Yes, but that’s nothing new.
>But then stumbled over 3.251 Null Wide-Character Code:
>"A wide-character code with all bits set to zero."
That’s wchar_t.
>What's that then good for? Just for wchars, which *may* very well use
>fixed size encodings (with multiple bytes) and in fact are 32 Bits
>(UCS-4) in glibc?
wchar_t must use a fixed-size integer (not “multiple bytes”) encoding.
It is never written out directly because its representation is
endian-dependent.
>But even then, for all syscalls etc... these wide chars would need to
>get converted to/from "normal" multibyte chars, which use one byte for
Right, all I/O is eventually done on multibyte characters; wide
characters are just one possible (and optional) way of handling
“characters” internally, sometimes convenient, sometimes less so.
For a pure ASCII or UTF-8 system, they may well be less convenient.
>So that means also, that if I have e.g. my shell script (say in UTF8)
>which prints the sentinel via 'printf .', I'm always sure - on any
>ASCII-based POSIX system, that regardless of the locale (which would
>then need to be ASCII-based as well), '.' would give me 0x2E.
>
>Whereas, when I'd use the same script *as is* on an EBCDIC system, it
I’ve been working with someone to port mksh to an EBCDIC-based
mainframe system (who hopes to be able to eventually bootstrap
an ASCII-based environment on it), so I’ve done *quite* some
thinking and being educated in that area.
When you transfer a file from your system to an EBCDIC-based
system in text mode, it’s ALWAYS iconv’d; otherwise you could
not run it.
The shell on an EBCDIC system expects its keywords in a suitable
EBCDIC encoding, which usually must be, at runtime, compatible with
the one used at compile time. So your script would be…
00000000 70 72 69 6e 74 66 20 2e 0a |printf ..|
… in UTF-8 (ASCII) before copying, but end up on the host as,
for example…
00000000 97 99 89 95 a3 86 40 4b 15 |......@K.|
… and the shell would read that and then output the byte X'4B'
(which is EBCDIC/mainframe parlance for \x4B).
A shell on an EBCDIC system could not possibly use scripts not
encoded in a suitable EBCDIC encoding.
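That text-mode recoding can be sketched with iconv on a glibc system, using its IBM1047 table as the EBCDIC codepage (an assumption; note glibc maps newline to X'25', whereas the dump above ends in the X'15' his mainframe uses):

```shell
# Sketch of the ingress conversion, assuming glibc iconv with IBM1047.
printf 'printf .\n' | iconv -f ASCII -t IBM1047 | od -An -tx1
# → 97 99 89 95 a3 86 40 4b 25

# The mapping is bijective, so it can be undone the same way:
printf '\227\231\211\225\243\206\100\113\045' | iconv -f IBM1047 -t ASCII
# → printf .
```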
>And effectively I could *never* run in the situation, that the script
>itself is parsed with e.g. ASCII and '.' = 0x2E .. while the shell's
Exactly.
(UTF-8 is a bit weird. The mainframe world has UTF-EBCDIC, but
nobody uses it. They mostly use UCS-2 in word-oriented I/O, which
is basically the same as writing wchar_t arrays to disc.
For mksh’s UTF-8 mode we decided on “nega-UTF-8”: UTF-8 from the
“extended ASCII” side is converted to EBCDIC byte by byte, as if
the “extended ASCII” side were using not UTF-8 but a single-byte
encoding (the one available from the OS’ __etoa_l() function, and
the one the SSH, FTP, etc. transfer methods also use). When
mksh/EBCDIC is then switched to “UTF-8” mode and asked to print
\u20AC, for example, it outputs X'42' X'22' X'B0', which your
ssh/telnet/… session iconvs byte by byte to \xE2\x82\xAC (try
iconv -f cp1047 -t latin1, but mind that the newline character
differs: glibc gives X'25', whereas X'15' is used by “his”
mainframe), as if you were using ISO-8859-1; your UTF-8 terminal
then reinterprets that as UTF-8 and shows you €. Confused yet?
But it’s the “least not-making-any-sense” way to do this on
EBCDIC systems. “Normal” scripts would simply never enable
UTF-8 mode there.)
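The € round trip can be spot-checked on a glibc box; a sketch, with IBM1047 standing in for cp1047 (glibc’s name for it):

```shell
# The nega-UTF-8 bytes for U+20AC, run through the SBCS conversion
# the terminal side would apply; out come the UTF-8 bytes of €.
printf '\102\042\260' | iconv -f IBM1047 -t LATIN1 | od -An -tx1
# → e2 82 ac
```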
>> containing NUL bytes as not being a shell script). POSIX does not
>> allow you to execute a file encoded in UTF16 as a shell script.
What I wrote above: your scripts must be encoded in an encoding
compatible with the current POSIX locale (or with what the shell
thinks that is, anyway, considering mksh’s bilocality).
>So if the shell was started in A', even if it then switches to A'' and
>it's variables and so would be interpreted according to A'',... the
>literals would continue to get A'.
Switching the locale during shell runtime is not allowed to change
the way the script is parsed, so the variables etc. are all that is
permitted to “change”, by means of reinterpretation.
But here you’re lucky again: <period> must have the exact same
encoding across *all* locales supported in one POSIX “universe”,
and it must not occur as part of any multibyte character’s encoding
in a supported locale in the same universe.
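That guarantee is trivial to spot-check; a sketch:

```shell
# <period> is the single byte 0x2E in every ASCII-superset encoding,
# and no byte of a valid UTF-8 multibyte sequence equals 0x2E (all
# such bytes have the high bit set), so stripping it is safe.
printf . | od -An -tx1
# → 2e
```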
>> If you are running on an IBM machine where the POSIX locale is based
>> on EBCDIC, then it will indeed print the byte 0x4B.
Interestingly enough not because the shell reads . and outputs 0x4B
but because the script itself would have been recoded so the shell
reads 0x4B and outputs that. It’s the interpretation that changes.
>> But it will still be <period>, as detected by all other processes
>> reached from that POSIX environment (and that system will necessarily
>> by unable to have an ASCII or UTF8 encoding in any of its locales;
>> you are back to having to use an extension outside of POSIX if you
>> want to start a new subtree of processes based on an ASCII base
>> encoding).
Exactly.
>Ok clear now... *and* I would had to have my script converted to some
>EBCDIC encoding... in order to be able to run it at all.
Yes, but the file transfer utility does that on ingress for you,
using the system’s global EBCDIC and “extended ASCII” codepages.
If you need something else, transfer in binary mode (or undo the
conversion, it’s bijective) and use iconv. But the default “DWIMs”.
>> > Because of (2) ... would it be in any way safer to e.g.
>> > printf '\056'
>> > (octal for . in ASCII/etc.)
>> > and also strip that off... rather than using '.'?
>>
>> Actually, it is less portable. \056 is a particular byte value, but
>> unless you know your POSIX locale is ASCII-based, you don't know
>> whether that byte value is <period>, or some other character, and
Right, that would be fatal. This would output the ACKNOWLEDGE control
character on an EBCDIC system. (Adjusting mksh’s regression testsuite
for EBCDIC was really fun!!!!!!!1111einself)
>But at least, it should still work portably, when doing the LC_ALL=C
No, absolutely not.
In all supporta̲b̲l̲e̲ scenarios (i.e. those in which you’re not entering
unspecified behaviour already anyway), you’ll be safe with:
x=$(command; echo .); x=${x%.}
(Or a variant that carries over $?, of course.)
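One such variant, a sketch: append the exit status after the sentinel, so both survive the command substitution’s newline stripping (mycmd here is a stand-in for the real command):

```shell
# Capture output (trailing newlines intact) and exit status in one go.
out=$(mycmd; rc=$?; echo ".$rc")
rc=${out##*.}    # the digits after the final <period>: mycmd's $?
out=${out%.*}    # everything before it: mycmd's output, unmangled
```

Since the status is all digits, the final <period> is always the one we appended, so both expansions cut in the right place.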
>I'm still trying to write up some in-depth documentation (e.g. for
>StackOverflow) why things (i.e. the command substitution with trailing
>newlines) work the way (and have to be the way), as it was described
Might be best to link to the archive of this thread as well…
[…]
>So (a) means, when executing a shell script (or interactively entering
>the commands),... all it's content must consist of validly encoded
>characters.. and that (because of (c)) with respect to the locale in
>which the shell itself was started.
Right, unless you’re relying on shell extensions. Most shells (except
yash, I’m told) accept arbitrary binary content except NUL, which
POSIX actively forbids. But that’s at the liberty of the shell; it’s
up to the script author not to use it, or to deal with the
unspecified/implementation-defined behaviour arising from it.
>Which in turn means, the script itself must be converted (e.g. iconv)
>should it's encoding not match the encoding used with the shell that
>executes it.)
Match or be compatible. Some encodings are generally compatible; e.g.
pairs of EBCDIC codepages pre/post Euro, or latin1 and latin9 (a.k.a.
ISO-8859-15). For shell purposes, all ASCII-based SBCS codepages are
compatible (characters with the high bit set are passed through
unchanged). But given how yash converts to wide characters on entry…
iff that’s valid, then, yes, you’d have to use the exact same encoding
(or a superset, say latin1 → cp1252).
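The latin1 → cp1252 superset relationship can be sketched with iconv (glibc names assumed; cp1252 only reassigns latin1’s C1 control range):

```shell
# The byte 0xE4 is ä in latin1 and in cp1252 alike; both decode to
# the same UTF-8 sequence, so latin1 text survives a cp1252 reading.
printf '\344' | iconv -f LATIN1 -t UTF-8 | od -An -tx1
printf '\344' | iconv -f CP1252 -t UTF-8 | od -An -tx1
# → c3 a4 (both)
```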
>It should also mean, that regardless of what's chosen as sentinel (e.g.
>'.', 'bbb' or even a multibyte '∈')... as long as these are valid
>characters with respect to the locale/encoding in which the shell
>parses, they should yield the same bytes all over:
Using <period> is more robust because it *additionally* covers the
case in which you happen upon other-encoding data.
>> tmp="$(command; printf ∈)"
>> LC_ALL='C'
>> tmp="${tmp%∈}"
>
>So the printf gets the very same bytes as sentinel (whether it's '.',
>'bbb' or '∈') ... as does the pattern in the parameter expansion, that
>strips off the sentinel... at least from the lexical PoV.
From the lexical PoV, sure… but do consider:
LC_ALL=$value1
foo() {
	tmp=$(command; echo ∈)
	tmp=${tmp%∈}
}
LC_ALL=$value2
foo
In this scenario, the %∈ pattern is parsed in the $value1 locale,
but the command is run in the $value2 locale. On the other hand,
the echo will still get just a string… unless ∈ suddenly contains
backslashes (or a percent sign, in your printf case… please don’t
overuse printf(1) like that when echo suffices).
So it’s most robust to use <period> (or <slash>, but…).
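For contrast, the <period> version of the same shape is immune to the locale switch; a sketch, with mycmd standing in for the real command:

```shell
foo() {
	tmp=$(mycmd; echo .)   # protect trailing newlines
	tmp=${tmp%.}           # 0x2E is 0x2E in every supported locale
}
LC_ALL=C                       # run-time locale differs from parse time
foo
```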
[100 lines later]
>- one or more trailing bytes of the original $tmp plus one or more
> (leading) bytes (or all) of the sentinel could form a new valid
> character in the current locale, therefore effectively making the
That’s why you use <period>, not a fancy Unicode character.
[…]
>And that seems a bit ambiguous (well, to me at least).
[…]
It’s got to be characters for ${x%?}, so if you pass it ∈ but
∈ isn’t valid in the current locale any more, you’re no longer
a conforming shell script, and the shell dictates whether you
live or die :þ That’s why you use <period>.
>a) What is it now, bytes or characters (in the current locale)?
In Real Life™ you’ll find that the answer doesn’t matter, because
you’ll always want to be compatible with shells that are super‐
ficially POSIX-compatible but not fully compliant with whatever
the *current* version of the standard is. We here try to change our
shells to match the *next* version of it already (thinking of the
${var op pattern/word} discussions, for example), whereas some
random Certified™ UNIX®©™ vendor system will probably implement
the *previous* version.
That’s why you don’t write literally hundreds of what-if lines
but simply use the most compatible value, <period> ☻
(As an aside, I’m pretty sure that, except for yash, all shells
cheat and work on bytes internally as much as they can get away
with anyway. Anything else would be too slow… as, for example,
the Python3 developers learnt, at some point.)
Good luck,
//mirabilos
--
Support mksh as /bin/sh and RoQA dash NOW!
‣ src:bash (406 (433) bugs: 0 RC, 275 (295) I&N, 131 (138) M&W, 0 F&P) + 208
‣ src:dash (91 (106) bugs: 0 RC, 51 (55) I&N, 40 (51) M&W, 0 F&P) + 63 ubu
‣ src:mksh (1 bug: 0 RC, 0 I&N, 1 M&W, 0 F&P)
dash has two RC bugs they just closed because they don’t care about quality…