[posting it here as austingroupbugs.net rejects it with a 403
HTTP error, presumably because it contains non-ASCII characters]

Description
-----------

This is an objection to the resolution of bugid:1560 and a
follow-up on the (now withdrawn) bugid:1920.

bugid:1560 changed the way word splitting is meant to work: from
splitting strings of characters on the characters of $IFS to
splitting strings of bytes on the byte encodings of the
characters in $IFS.

While having operators that can safely deal with arbitrary
strings of (non-null) bytes is a worthwhile endeavour, the new
*required* behaviour is inappropriate in locales whose character
encoding is not self-synchronising, that is, where the encoding
of some characters contains the encoding of others, including
the single-byte encodings of portable character set members such
as \, *, ?, [, which are themselves involved in or after word
splitting (backslash processing by the read utility, globbing by
sh). With such encodings, word splitting can cut characters in
the middle, and new, different characters, including \*?[]'",
can be introduced or removed by the splitting.

This is not a theoretical problem. There are still locales on
real-life systems that use character encodings such as BIG5,
BIG5-HKSCS or GB18030, which have dozens of characters whose
encoding contains the encoding of \, [ or ], and thousands whose
encoding contains those of decimal digits.
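
That claim can be checked against Python's built-in gb18030 codec (a sketch; it assumes the codec's tables match the locale's charmap):

```python
# Scan the BMP for characters whose GB18030 byte sequence contains the
# single-byte encoding of \, [, ] or of a decimal digit.
glob_bytes = set(b'\\[]')
digit_bytes = set(b'0123456789')
glob_hits, digit_hits = [], 0
for cp in range(0x80, 0x10000):
    if 0xd800 <= cp <= 0xdfff:
        continue                      # surrogates are not characters
    try:
        enc = chr(cp).encode('gb18030')
    except UnicodeEncodeError:
        continue
    if glob_bytes & set(enc):
        glob_hits.append(chr(cp))
    if digit_bytes & set(enc):
        digit_hits += 1
print(len(glob_hits), digit_hits)     # dozens and thousands, respectively
```

The digit hits mostly come from GB18030's four-byte forms, whose second and fourth bytes are always in the 0x30..0x39 (ASCII digit) range.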

For instance, as already mentioned in bugid:1920, the new word
splitting wording would require 'Stéphane' to be split into
$'St\x88' (an invalid encoding) and 'phane' with IFS=m in a
locale using BIG5-HKSCS, as é is encoded there as 0x88 0x6d (and
m as 0x6d, as in ASCII). And how would word splitting even work
with IFS='mé'?
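
A sketch of that example using Python's big5hkscs codec as a stand-in for such a locale (assuming its tables match the locale's charmap):

```python
# Byte-wise splitting, as bugid:1560 requires, cuts é in the middle.
enc = 'Stéphane'.encode('big5hkscs')
print(enc.hex(' '))       # é encodes as 88 6d, and 6d is the byte for 'm'
print(enc.split(b'm'))    # [b'St\x88', b'phane']: neither field is valid text
```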

In a locale using the GB18030 encoding, with IFS='芠', '∑[0-9]'
would be split into $'\xa1' and '0-9]', turning a glob into two
non-globs, one of which is invalid text, as the encoding of 芠
is found inside the encoding of ∑[.
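
That overlap can be reconstructed from the codec itself rather than from hard-coded byte values (a sketch using Python's gb18030 codec):

```python
# The two-byte IFS character is the trailing byte of ∑'s encoding
# followed by the byte for '['.
sigma = '∑'.encode('gb18030')         # two bytes, first one 0xa1
ifs_bytes = sigma[1:] + b'['
print(ifs_bytes.decode('gb18030'))    # 芠, per the bug text, if tables match
s = '∑[0-9]'.encode('gb18030')
print(s.split(ifs_bytes))             # [b'\xa1', b'0-9]']: the glob is destroyed
```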

The EUC-JP character encoding does not (AFAIK) have characters
whose encoding contains that of others, but it is still not
self-synchronising.

There, for example, 与 is encoded as 0xcd 0xbf and 人 as 0xbf
0xcd, so with IFS='与', 人人人 would be split into $'\xbf', ''
and $'\xcd'.
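
The same, sketched with Python's euc_jp codec (again assuming its tables match the locale's charmap):

```python
# 人 encodes as 0xbf 0xcd; the reversed byte pair 0xcd 0xbf is itself a
# character (与, per the bug text), so splitting 人人人 byte-wise on the
# encoding of 与 cuts every 人 in the middle.
hito = '人'.encode('euc_jp')
print(hito.hex(' '))                           # bf cd
ifs = hito[::-1]                               # cd bf
print(ifs.decode('euc_jp'))                    # 与, if the tables match
print(('人' * 3).encode('euc_jp').split(ifs))  # [b'\xbf', b'', b'\xcd']
```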

Even in locales using UTF-8, the algorithm that POSIX now
requires sh/read to implement (and that, AFAIK, no shell
implements other than the ones that don't support multibyte
encodings) is arguably not the best one if we remove the
constraint that $IFS must contain characters.

As a contrived example (contrived because 0xc0, 0xc1,
0xf5..0xff would be better choices, as those bytes cannot appear
in valid modern UTF-8): since a lone 0x80 byte cannot occur in
valid UTF-8, one could want to join valid UTF-8 strings with
that byte value and expect word splitting to split the result
back with IFS=$'\x80'. But with the new POSIX algorithm, that
wouldn't work if the strings contained characters whose encoding
contains that byte.
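
A minimal sketch of that failure mode (the example strings are mine; 'À', U+00C0, encodes in UTF-8 as 0xc3 0x80):

```python
# Joining UTF-8 strings with a lone 0x80 byte round-trips as long as no
# character's encoding contains that byte...
sep = b'\x80'
print((b'foo' + sep + b'bar').split(sep))   # [b'foo', b'bar']
# ...but byte-wise splitting destroys characters like À whose encoding
# contains 0x80.
broken = 'À'.encode('utf-8') + sep + b'bar'
print(broken.split(sep))                    # [b'\xc3', b'', b'bar']
```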

Several systems, when converting strings expected to be UTF-8
encoded into wide-character strings, convert valid UTF-8-encoded
characters into the corresponding Unicode code point value, and
map each byte that cannot be decoded as part of a character to a
special range of values outside the 0x0..0xd7ff,
0xe000..0x10ffff ranges covered by Unicode. Many of them use the
range reserved for the second half of UTF-16 surrogate pairs
(0xdc00 to 0xdfff), as that means the result can also be
represented in UTF-16.

That's the case of Python, zsh (in some areas only, like pattern
matching) and, I believe, Java at least.

<pre>
$ python3 -c 'import sys; print("{!r}".format(sys.argv[1]))' $'\x80'
'\udc80'
$ a=$'\x80' zsh -c $'case $a in ([\ud7ff-\ue000]) echo yes; esac'
yes
</pre>

That approach also allows splitting arbitrary strings; if $IFS
contains valid UTF-8, it produces results identical to those of
the bugid:1560 algorithm, and if not, it would arguably be
preferable as it wouldn't cut valid characters in the middle.

That approach also allows pattern matching on potentially
invalid strings or ${#var} to work better.
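
A sketch of the splitting that approach yields, in Python, which exposes this decoding scheme as the 'surrogateescape' error handler (the helper name is mine; real IFS splitting additionally treats IFS whitespace specially, which is omitted here):

```python
def split_ifs(data, ifs):
    """Split bytes on each character of ifs, never cutting a valid
    UTF-8 character.  Stray bytes 0xXY become lone surrogates U+DCXY
    on decoding and are restored verbatim on encoding."""
    text = data.decode('utf-8', 'surrogateescape')
    seps = set(ifs.decode('utf-8', 'surrogateescape'))
    fields, cur = [], []
    for ch in text:
        if ch in seps:
            fields.append(''.join(cur))
            cur = []
        else:
            cur.append(ch)
    fields.append(''.join(cur))
    return [f.encode('utf-8', 'surrogateescape') for f in fields]

print(split_ifs(b'foo\x80bar', b'\x80'))          # [b'foo', b'bar']
# À (0xc3 0x80) survives intact, unlike with byte-wise splitting:
print(split_ifs('À'.encode() + b'\x80x', b'\x80'))  # [b'\xc3\x80', b'x']
```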

Desired Action
--------------

First, at least the wording should make it clear that
shells/read implementations are not required to implement that
algorithm, just that whatever algorithm they use must produce
the same result as long as $IFS contains only properly encoded
characters ("shall be split *as if* by looking for the encoding
of characters of $IFS...").

In any case, that method should be constrained to locales with
self-synchronising encodings (in practice today, probably just
single byte encodings and UTF-8), and *must not* be used
otherwise as it can produce incorrect splitting of perfectly
correct text.

In those other locales (the ones with non-self-synchronising
encodings), the pre-bugid:1560 wording is probably the best:
split on characters, with behaviour unspecified if the subject
or $IFS contains sequences of bytes that can't be decoded into
characters (the STDIN section of the read utility would need to
be updated accordingly).

Potentially better algorithms, such as the one described above,
which would allow pattern matching, for instance, to work better
on arbitrary sequences of bytes, should probably be explored.

Another option would be to put all non-self-synchronising
character encodings out of the scope of POSIX (as was already
done for locking-shift encodings), but that's probably a step
too far, as it would make irrelevant a lot of the POSIX
specification designed to deal with them.

Best regards,
Stephane
