On 15/04/2022 00:03, Robert Elz via austin-group-l at The Open Group wrote:
> Date: Thu, 14 Apr 2022 09:42:37 +0100
> From: "Geoff Clare via austin-group-l at The Open Group"
> <[email protected]>
> Message-ID: <20220414084237.GA15370@localhost>
>
> | That is how things are at present. The suggested changes just make it
> | explicit.
>
> Yes, I know, but that's what I am suggesting that we not do in this one case.
>
> | Do you have an alternative proposal?
>
> Only to the extent of "do nothing". I am certainly not suggesting that
> we attempt to solve the problem.
Hmm, I would.
> Except perhaps it might be worth adding something to the Rationale (but
> about what, i.e. where there, I have no idea) along the lines of:
>
>     It is often unclear whether a string is to be interpreted as
>     characters in some locale, or as an arbitrary byte string.
>     While it would have been possible to arbitrarily make the various
>     cases more explicit, or explicitly unspecified, it was considered
>     better, in this version of <however the doc refers to itself> to
>     make no changes, as it is believed that much additional work is
>     required to make a standards-worthy specification possible.
>     This work is beyond the scope of this standard.
>
> The problem I see is that any specification at all of any of this
> allows implementors to just say "that is what posix requires" and do
> nothing at all, where we really need some innovation, by someone who
> actually understands the issues and how to deal with them in a rational
> way - or at least who can come up with some kind of plan, and without
> any possibility of being considered a non-conformant implementation
> because of it.
For the most part(*), those shells that support locales appear to
already be in agreement that single bytes that do not form a valid
multi-byte character are interpreted as single characters that can be
matched with *, with ?, and with those single bytes themselves. Shells
are not in agreement on whether such single bytes can be matched with
[...], nor in those shells where they can be, whether multiple bracket
expressions can be used to match the individual bytes of a valid
multi-byte character.
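
To make the agreed part concrete, here is a minimal Python sketch of the
model described above, assuming a UTF-8 locale: each valid multi-byte
character is one unit, and each byte that is not part of a valid character
is a unit of its own. The names split_chars and match are hypothetical and
not taken from any shell. Only *, ? and literal units are handled; bracket
expressions and backslash escapes are left out on purpose, since [...] is
exactly where shells disagree.

```python
def split_chars(data: bytes) -> list:
    """Split bytes into units: one per valid UTF-8 character,
    and one per byte that does not belong to a valid character."""
    units = []
    i = 0
    while i < len(data):
        # A UTF-8 character is 1 to 4 bytes long; take the first
        # prefix that decodes cleanly.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            break
        else:
            # No valid character starts here: treat the byte on its own.
            chunk = data[i:i + 1]
        units.append(chunk)
        i += len(chunk)
    return units


def match(pattern: bytes, string: bytes) -> bool:
    """Match string against a pattern containing only *, ? and literals."""
    p = split_chars(pattern)
    s = split_chars(string)

    def go(pi: int, si: int) -> bool:
        if pi == len(p):
            return si == len(s)
        if p[pi] == b"*":
            # * matches any run of units, including an empty one.
            return any(go(pi + 1, k) for k in range(si, len(s) + 1))
        if si == len(s):
            return False
        # ? matches exactly one unit; a literal must equal the unit.
        if p[pi] == b"?" or p[pi] == s[si]:
            return go(pi + 1, si + 1)
        return False

    return go(0, 0)


print(match(b"?", b"\xff"))            # True: a stray byte counts as one character
print(match(b"?", b"\xc3\xa9"))        # True: a valid two-byte character is one unit
print(match(b"??", b"\xc3\xa9"))       # False: ? does not split a valid character
print(match(b"*\xff*", b"ab\xffcd"))   # True: the stray byte matches itself
```

This only models the behaviour on which shells appear to agree; it says
nothing about what any particular shell does in the cases that differ.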
The cases with [...] only come up when scripts themselves use patterns
that are not valid character strings; they are unlikely to affect
existing scripts, and I imagine there is not much harm in leaving those
unspecified. The cases with * and ? do come up in existing scripts, but
if shells are in agreement, as they appear to be, there is no need to
coordinate with shell authors on whether they would be willing to change
their implementations: POSIX can simply be changed to describe the
shells' current behaviour.
If there is interest in getting this standardised, I can spend some more
time on creating some hopefully comprehensive tests for this to confirm
in what cases shells agree and disagree, and use that as a basis for
proposing wording to cover it.
Cheers,
Harald van Dijk
(*) The notable exception here is yash, which internally processes
strings as wide strings and cannot handle any string that cannot be
converted to a wide string. This was already said to be contrary to what
POSIX intends in other cases.