Re: how do to cmd subst with trailing newlines portable (was: does POSIX mandate whether the output…)

Eric Blake via austin-group-l at The Open Group Tue, 08 Feb 2022 13:22:31 -0800

On Tue, Feb 08, 2022 at 06:53:50AM +0100, Christoph Anton Mitterer via 
austin-group-l at The Open Group wrote:
> Hey.
> 
> I'm afraid but some more questions came up on my side:
> 
> 
> 1) POSIX says:
> "The encoded values associated with <period>, <slash>, <newline>, and
> <carriage-return> shall be invariant across all locales supported by
> the implementation."
> 
> When now, for example, <period> is encoded as the byte 0x2E ... the
> consequence would be that it had to be 0x2E in all locales and their
> encodings, right?


Yes. And another fallout of that requirement: you cannot have a single
POSIX system supporting both ASCII and EBCDIC locales.  You can have
iconv and dd support for converting files between the two encodings,
but only one of those two encodings can match your current locale (all
syscalls, all filenames, and so forth, are tied to the current
encoding in use by the POSIX locale, whether that encoding be ASCII,
EBCDIC, or something else).  Any means for choosing which of those two
encodings is treated as the basis of the POSIX locale when starting a
subtree of processes that interact as a POSIX environment would be
vendor-specific interfaces outside of POSIX.

> 
> Doesn't that also mean that POSIX effectively forbids UTF16 or UTF32
> and actually any >1-byte fixed-encoding?
> Cause there it would have to be "padded" with 0x00?

Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
its basis.  Again, iconv and wide-character library calls (such as
wprintf) can support conversion of files into and out of those
encodings, but that is only file contents; all file names, syscalls,
and other aspects of the POSIX environment for cross-process
communication outside of file contents will use multi-byte encodings
where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
character sequences that would represent UTF16 or UTF32 characters
directly.

> 2) When I have a shell script in some encoding, and it contains e.g.:
>   printf '.'
> would POSIX demand that this:
> a) always cause the byte 0x2E to be printed

POSIX states that <period> will be printed.  If that is the byte
0x2E, then your POSIX locale is probably ASCII-based.  But it is also
possible to have a POSIX conforming environment where the POSIX locale
is EBCDIC based, in which case it would print byte 0x4B, but that
would still be <period> for all file names and syscalls observable
from that POSIX environment.
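
A quick way to see which byte your own environment uses for <period> is
to dump what printf emits.  This is only a sketch; the helper name
period_bytes is made up for illustration:

```shell
# Show the byte value(s) that the character '.' encodes to in the
# environment this script runs in.  period_bytes is a hypothetical helper.
period_bytes() {
    printf '.' | od -An -tx1 | tr -d ' \n'
}

# On an ASCII-based system this prints 2e.
period_bytes
echo
```

On an EBCDIC-based system the same script would be expected to report 4b,
yet every process in that environment would still treat the byte as
<period>.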

> b) print the character 'x' according to the currently set locale, e.g.
>    if that was using UTF16, it would print the bytes 0x2e 0x00

It is not possible to have a POSIX locale based on the UTF16 encoding.
So this answer is not possible.  You can write a file whose characters
are encoded in UTF16 and which, once recoded to a multibyte locale,
forms a shell script, but it only becomes a shell script after you use
iconv or fscanf or similar to perform that encoding conversion (sh is
documented as being able to reject files containing NUL bytes as not
being a shell script).  POSIX does not allow you to execute a file
encoded in UTF16 as a shell script.
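
To illustrate the conversion step, here is a rough sketch.  It assumes an
ASCII-based environment and an iconv that accepts the codeset names
UTF-16LE and UTF-8 (those names are implementation-defined); utf16_script
is a made-up helper that emits the text "echo hi" in UTF-16LE:

```shell
# Emit "echo hi\n" as UTF-16LE: each ASCII character is followed by a
# NUL byte.  Hypothetical helper, assumes an ASCII-based environment.
utf16_script() {
    printf 'e\000c\000h\000o\000 \000h\000i\000\n\000'
}

# The raw bytes contain NULs, so sh is allowed to reject them as a script.
# After recoding to a multibyte encoding without embedded NULs, the text
# is an ordinary shell script that sh can read from standard input.
utf16_script | iconv -f UTF-16LE -t UTF-8 | sh
```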

> c) print the character 'x' according to the locale in which the shell
>    parses the script (but there again, if it was UTF16... the bytes
>    0x2e 0x00)

The shell is not required to parse UTF16, because the POSIX locale
cannot be based on UTF16.

> d) Would it in some weird encodings like IBM905 cause the byte 0x4B to
>    be printed?

If you are running on an IBM machine where the POSIX locale is based
on EBCDIC, then it will indeed print the byte 0x4B.  But it will still
be <period>, as detected by all other processes reached from that
POSIX environment (and that system will necessarily be unable to have
an ASCII or UTF8 encoding in any of its locales; you are back to
having to use an extension outside of POSIX if you want to start a new
subtree of processes based on an ASCII base encoding).

> 
> 3) With respect to the command substitution with trailing newlines
> question:
> 
> Because of (2) ... would it be in any way safer to e.g.
>   printf '\056'
> (octal for . in ASCII/etc.)
> and also strip that off... rather than using '.'?

Actually, it is less portable.  \056 is a particular byte value, but
unless you know your POSIX locale is ASCII-based, you don't know
whether that byte value is <period>, or some other character, and
there are some POSIX-feasible locales where some single-byte
characters (such as 'A') may also appear in a multibyte-character
sequence.
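
For reference, a minimal sketch of the sentinel idiom under discussion,
written with the literal character rather than an octal escape (the
sample command and variable name are placeholders):

```shell
# Capture command output including its trailing newlines.
# The sentinel is the literal character '.', not '\056', so it is
# <period> in whatever encoding the environment is based on.
out=$(printf 'line\n\n\n'; printf '.')
out=${out%.}    # strip the sentinel again

# Dump the result to confirm the trailing newlines survived.
printf '%s' "$out" | od -An -c
```

Without the sentinel, command substitution would strip all three trailing
newlines and leave only "line".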

> 
> Especially also with respect to a hypothetical UTF16/32 locale?

There is no such locale.

> 
> 4) Doesn't strictly belong here, but maybe someone knows:
> On my Debian (=> glibc) I was trying this:
> /usr/share/i18n/charmaps$ zgrep "[xX]2[eEfF]" * | grep -Ev '[[:space:]](SOLIDUS|FULL STOP)$'
> 
> i.e. searching for any entries that are 0x2E or 0x2f ( . and / ),
> filtering out any who really are considered as that.
> 
> That gave quite some matches:
> BRF.gz:<U2828>     /x2e         BRAILLE PATTERN DOTS-46
> BRF.gz:<U280C>     /x2f         BRAILLE PATTERN DOTS-34
> EBCDIC-AT-DE-A.gz:<U0006>     /x2e         ACKNOWLEDGE (ACK)
> EBCDIC-AT-DE-A.gz:<U0007>     /x2f         BELL (BEL)

Charmaps are useful to iconv for converting file contents between
encodings, and iconv can handle more encodings than are permitted in
locales.

> IBM918.gz:<U0007>     /x2f         BELL (BEL)
> INIS-CYRILLIC.gz:<U2192>     /x2e         RIGHTWARDS ARROW
> INIS-CYRILLIC.gz:<U222B>     /x2f         INTEGRAL
> ISO_10646.gz:<I;>     /x01/x2E        LATIN CAPITAL LETTER I WITH OGONEK
> ISO_10646.gz:<i;>     /x01/x2F        LATIN SMALL LETTER I WITH OGONEK
> ISO_10646.gz:<JU>     /x04/x2E        CYRILLIC CAPITAL LETTER YU
> ISO_10646.gz:<JA>     /x04/x2F        CYRILLIC CAPITAL LETTER YA
> ISO_10646.gz:<x+>     /x06/x2E        ARABIC LETTER KHAH
> ISO_10646.gz:<d+>     /x06/x2F        ARABIC LETTER DAL
> ISO_10646.gz:<I:'>    /x1E/x2E        LATIN CAPITAL LETTER I WITH DIAERESIS AND ACUTE
> ISO_10646.gz:<i:'>    /x1E/x2F        LATIN SMALL LETTER I WITH DIAERESIS AND ACUTE
> ISO_10646.gz:<Io>     /x22/x2E        CONTOUR INTEGRAL
> ISO_10646.gz:<dlR>    /x25/x2E        BOX DRAWINGS RIGHT HEAVY AND LEFT DOWN LIGHT
> ISO_10646.gz:<dH->    /x25/x2F        BOX DRAWINGS DOWN LIGHT AND HORIZONTAL HEAVY
> ISO_11548-1.gz:<U282E>     /x2e BRAILLE PATTERN DOTS-2346
> ISO_11548-1.gz:<U282F>     /x2f BRAILLE PATTERN DOTS-12346
> JIS_C6220-1969-JP.gz:<YO>                   /x2E   <U30E7> KATAKANA LETTER SMALL YO
> JIS_C6220-1969-JP.gz:<TU>                   /x2F   <U30C3> KATAKANA LETTER SMALL TU
> 
> Since all these (well except perhaps ISO_10646) use 0x2E and 0x2F for
> other characters than . and /  ... doesn't that already mean that
> they're invalid with respect to POSIX?

Not quite.  You didn't ALSO check whether those charmaps define
<period> as something that overlaps with a multibyte character.  But
you are right that there are some charmaps which iconv can support but
which cannot be used as a locale in a given POSIX environment.
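
The missing check could look roughly like the sketch below.  It assumes
the glibc charmap layout shown in the quoted output (symbolic name, byte
sequence, description); period_encoding is a made-up helper, and the
FULL STOP line in the sample is an illustrative assumption consistent
with the 0x4B value mentioned earlier, not copied from a real charmap:

```shell
# Hypothetical helper: read charmap text on stdin and print the byte
# sequence assigned to <U002E> (FULL STOP).
period_encoding() {
    awk '$1 == "<U002E>" { print $2 }'
}

# Illustrative sample in the layout of an EBCDIC charmap.
period_encoding <<'EOF'
<U0006>     /x2e         ACKNOWLEDGE (ACK)
<U0007>     /x2f         BELL (BEL)
<U002E>     /x4b         FULL STOP
EOF
```

Against a real file it could be fed with something like
zcat EBCDIC-AT-DE-A.gz | period_encoding from the charmaps directory.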

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
