On Tue, Feb 08, 2022 at 06:53:50AM +0100, Christoph Anton Mitterer via austin-group-l at The Open Group wrote:
> Hey.
>
> I'm afraid but some more questions came up on my side:
>
>
> 1) POSIX says:
> "The encoded values associated with <period>, <slash>, <newline>, and
> <carriage-return> shall be invariant across all locales supported by
> the implementation."
>
> When now, for example, <period> is encoded as the byte 0x2E ... the
> consequence would be that it had to be 0x2E in all locales and their
> encodings, right?

Yes. And another fallout of that requirement: you cannot have a single
POSIX system supporting both ASCII and EBCDIC locales. You can have
iconv and dd support for converting files between the two encodings,
but only one of those two encodings can match your current locale (all
syscalls, all filenames, and so forth, are tied to the current encoding
in use by the POSIX locale, whether that encoding be ASCII, EBCDIC, or
something else). Any means for choosing which of those two encodings
is treated as the basis of the POSIX locale when starting a subtree of
processes that interact as a POSIX environment would be vendor-specific
interfaces outside of POSIX.

>
> Doesn't that also mean that POSIX effectively forbids UTF16 or UTF32
> and actually any >1-byte fixed-encoding?
> Cause there it would have to be "padded" with 0x00?

Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
its basis. Again, iconv and wide-character library calls (such as
wprintf) can support conversion of files into and out of those
encodings, but that is only file contents; all file names, syscalls,
and other aspects of the POSIX environment for cross-process
communication outside of file contents will use multi-byte encodings
where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
character sequences that would represent UTF16 or UTF32 characters
directly.

> 2) When I have a shell script in some encoding, and it contains e.g.:
> printf '.'
> would POSIX demand that this:
> a) always cause the byte 0x2E to be printed

POSIX states that <period> will be printed. If that is the byte 0x2E,
then your POSIX locale is probably ASCII-based. But it is also
possible to have a POSIX conforming environment where the POSIX locale
is EBCDIC based, in which case it would print byte 0x4B, but that would
still be <period> for all file names and syscalls observable from that
POSIX environment.

> b) print the character 'x' according to the currently set locale, e.g.
> if that was using UTF16, it would print the bytes 0x2e 0x00

It is not possible to have a POSIX locale based on the UTF16 encoding.
So this answer is not possible. While you can write a file with
characters encoded in UTF16, which when recoded to a multibyte locale
form a shell script, it only becomes a shell script after you use iconv
or fscanf or similar to perform that encoding conversion (since sh is
documented as being able to reject files containing NUL bytes as not
being a shell script). POSIX does not allow you to execute a file
encoded in UTF16 as a shell script.

> c) print the character 'x' according to the locale in which the shell
> parses the script (but there again, if it was UTF16... the bytes
> 0x2e 0x00)

The shell is not required to parse UTF16, because the POSIX locale
cannot be based on UTF16.

> d) Would it in some weird encodings like IBM905 cause the byte 0x4B to
> be printed?

If you are running on an IBM machine where the POSIX locale is based on
EBCDIC, then it will indeed print the byte 0x4B. But it will still be
<period>, as detected by all other processes reached from that POSIX
environment (and that system will necessarily be unable to have an
ASCII or UTF8 encoding in any of its locales; you are back to having to
use an extension outside of POSIX if you want to start a new subtree of
processes based on an ASCII base encoding).

>
> 3) With respect to the command substitution with trailing newlines
> question:
>
> Because of (2) ... would it be in any way safer to e.g.
> printf '\056'
> (octal for . in ASCII/etc.)
> and also strip that off... rather than using '.'?

Actually, it is less portable. \056 is a particular byte value, but
unless you know your POSIX locale is ASCII-based, you don't know
whether that byte value is <period>, or some other character, and there
are some POSIX-feasible locales where some single-byte characters (such
as 'A') may also appear in a multibyte-character sequence.
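
As a minimal sketch of the trailing-newline idiom under discussion,
assuming a POSIX sh: the sentinel is written as the literal character
'.', so the shell emits whatever byte encodes <period> in the codeset
in use, rather than a hard-coded octal value. The names capture and
captured are hypothetical, and the sketch makes no attempt to preserve
the command's exit status or to cover every locale concern.

```shell
# Hypothetical helper: run a command and keep its trailing newlines.
capture() {
    # Append a literal '.' after the command's output, inside the
    # substitution, so the trailing newlines are no longer trailing.
    captured=$("$@"; printf '.')
    # Remove exactly one trailing '.'; the newlines before it survive.
    captured=${captured%.}
}

capture printf 'line\n\n'
# Character count of the captured text: "line" plus two newlines.
printf '%s\n' "${#captured}"    # prints 6
```

A plain x=$(printf 'line\n\n') would instead leave only "line" in x,
because command substitution strips all trailing newlines.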
>
> Especially also with respect to a hypothetical UTF16/32 locale?

There is no such locale.

>
> 4) Doesn't strictly belong here, but maybe someone knows:
> On my Debian (=> glibc) I was trying this:
> /usr/share/i18n/charmaps$ zgrep "[xX]2[eEfF]" * | grep -Ev
> '[[:space:]](SOLIDUS|FULL STOP)$'
>
> i.e. searching for any entries that are 0x2E or 0x2f ( . and / ),
> filtering out any who really are considered as that.
>
> That gave quite some matches:
> BRF.gz:<U2828> /x2e BRAILLE PATTERN DOTS-46
> BRF.gz:<U280C> /x2f BRAILLE PATTERN DOTS-34
> EBCDIC-AT-DE-A.gz:<U0006> /x2e ACKNOWLEDGE (ACK)
> EBCDIC-AT-DE-A.gz:<U0007> /x2f BELL (BEL)

Charmaps are useful to iconv for converting file contents between
encodings, and more encodings are possible there than are permitted in
locales.

> IBM918.gz:<U0007> /x2f BELL (BEL)
> INIS-CYRILLIC.gz:<U2192> /x2e RIGHTWARDS ARROW
> INIS-CYRILLIC.gz:<U222B> /x2f INTEGRAL
> ISO_10646.gz:<I;> /x01/x2E LATIN CAPITAL LETTER I WITH OGONEK
> ISO_10646.gz:<i;> /x01/x2F LATIN SMALL LETTER I WITH OGONEK
> ISO_10646.gz:<JU> /x04/x2E CYRILLIC CAPITAL LETTER YU
> ISO_10646.gz:<JA> /x04/x2F CYRILLIC CAPITAL LETTER YA
> ISO_10646.gz:<x+> /x06/x2E ARABIC LETTER KHAH
> ISO_10646.gz:<d+> /x06/x2F ARABIC LETTER DAL
> ISO_10646.gz:<I:'> /x1E/x2E LATIN CAPITAL LETTER I WITH DIAERESIS
> AND ACUTE
> ISO_10646.gz:<i:'> /x1E/x2F LATIN SMALL LETTER I WITH DIAERESIS AND
> ACUTE
> ISO_10646.gz:<Io> /x22/x2E CONTOUR INTEGRAL
> ISO_10646.gz:<dlR> /x25/x2E BOX DRAWINGS RIGHT HEAVY AND LEFT DOWN
> LIGHT
> ISO_10646.gz:<dH-> /x25/x2F BOX DRAWINGS DOWN LIGHT AND HORIZONTAL
> HEAVY
> ISO_11548-1.gz:<U282E> /x2e BRAILLE PATTERN DOTS-2346
> ISO_11548-1.gz:<U282F> /x2f BRAILLE PATTERN DOTS-12346
> JIS_C6220-1969-JP.gz:<YO> /x2E <U30E7> KATAKANA LETTER
> SMALL YO
> JIS_C6220-1969-JP.gz:<TU> /x2F <U30C3> KATAKANA LETTER
> SMALL TU
>
> Since all these (well except perhaps ISO_10646) use 0x2E and 0x2F for
> other characters than . and / ... doesn't that already mean that
> they're invalid with respect to POSIX?

Not quite. You didn't ALSO check whether those charmaps define
<period> as something that overlaps with a multibyte character. But
you are right that there are some charmaps which iconv can support but
which cannot be used as a locale in a given POSIX environment.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
