On 19/05/2022 02:46, Christoph Anton Mitterer wrote:
On Sun, 2022-05-15 at 16:14 +0100, Harald van Dijk wrote:
Please see the tests and results here.

So dash/ash/mksh/posh/pdksh,... and every other shell that doesn't
handle locales at all (and thus works in the C locale)... is anyway
always right (except for bugs), since any (non-NUL) byte is treated as
a character.

Correct.
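
As a rough illustration of the byte-oriented behaviour, here is a minimal Python sketch. It relies on the fact that Python's fnmatch.fnmatchcase, when given bytes objects, matches byte by byte with no locale involved. This is only a stand-in for what dash and friends do, not their actual code, and match_bytes is a name made up for this example.

```python
import fnmatch

def match_bytes(string: bytes, pattern: bytes) -> bool:
    """Byte-wise match: every byte counts as one character, as in the C locale."""
    return fnmatch.fnmatchcase(string, pattern)

# \303\244 is the UTF-8 encoding of a-umlaut, but here it is just two bytes.
print(match_bytes(b"\303\244", b"?"))           # False: one '?' covers one byte
print(match_bytes(b"\303\244", b"??"))          # True: two '?' cover two bytes
print(match_bytes(b"\303\244", b"[\303\244]"))  # False: a bracket matches one byte
print(match_bytes(b"\303", b"[\303]"))          # True: a lone byte matches itself
```

The octal escapes in the bytes literals are the same ones used for the strings and patterns in this thread.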

For the other shells (and fnmatch):

String         Pattern            dash*     fnmatch   bash      bosh      gwsh      ksh       zsh
\303\244       [\303\244]         no match  match     match     match     match     match     match
\303\244       ?                  no match  match     match     match     match     match     match
\303           [\303]             match     match     match     match     match     match     match
\303           ?                  match     match     match     match     match     match     match

(dash* = dash, busybox ash, mksh, posh, pdksh; fnmatch = glibc fnmatch)

The above, AFAIU, means that every shell/fnmatch matches a valid multibyte
character... but also a byte that is not a character in the locale.

Correct, though as I wrote later on, the way they go about it is different.

String         Pattern            dash*     fnmatch   bash      bosh      gwsh      ksh       zsh
\303.\303\244  [\303].[\303\244]  no match  no match  no match  match     match     match     match
\303.\303\244  ?.?                no match  no match  no match  match     match     match     match
\303\303\244   [\303][\303\244]   no match  no match  no match  match     match     match     match
\303\303\244   ??                 no match  no match  no match  match     match     match     match
\303\244.\303  [\303\244].[\303]  no match  no match  no match  match     match     match     match
\303\244.\303  ?.?                no match  no match  no match  match     match     match     match
\303\244\303   [\303\244][\303]   no match  no match  no match  match     match     match     match
\303\244\303   ??                 no match  no match  no match  match     match     match     match

(dash* = dash, busybox ash, mksh, posh, pdksh; fnmatch = glibc fnmatch)


The above, I'm not quite sure what these tell/prove...

I assume the ones with '?' show that, for all except bash/fnmatch, '?'
matches both valid characters and a single byte that is not a character.

Correct.

And the ones with bracket expression, that these also work when the BE
has either a valid character or a byte (that is not a character) and
vice-versa?

Correct.

If Chet is reading along, is the above intended in bash, or considered
a bug?


IMO it would have been interesting to see whether ? would also match
multiple bytes that are each for themselves and together no valid
character... cause for '*' one can kinda assume that it has this "match
anything" meaning... one could also say that is more or less reasonable
that '?' matches a single invalid byte... but why not several of them?

I tested this now. In that same list of shells, and in glibc fnmatch(), ? only matches a single invalid byte. I tested in a UTF-8 locale with the string \200\200 and the patterns ? and ??: the string does not match ?, but it does match ??.
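
The "one ? per invalid byte" result can be modelled with a short Python sketch. It is only a model, not what any of the shells literally do: it assumes that each byte that does not decode as UTF-8 is turned into one stand-in code point (Python's "surrogateescape" error handler does exactly that), after which matching proceeds by character. The names as_units and question_marks_needed are hypothetical.

```python
import fnmatch

def as_units(raw: bytes) -> str:
    """Decode UTF-8; each undecodable byte becomes one stand-in code point."""
    return raw.decode("utf-8", errors="surrogateescape")

def question_marks_needed(raw: bytes, limit: int = 4) -> list:
    """Return every n up to limit for which a pattern of n '?' matches raw."""
    text = as_units(raw)
    return [n for n in range(1, limit + 1)
            if fnmatch.fnmatchcase(text, "?" * n)]

print(question_marks_needed(b"\200\200"))  # [2]: two invalid bytes
print(question_marks_needed(b"\303\244"))  # [1]: one valid two-byte character
print(question_marks_needed(b"\303"))      # [1]: one incomplete, invalid byte
```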

String         Pattern            dash*     fnmatch   bash      bosh      gwsh      ksh       zsh
\303\244       \303*              match     match     match     match     no match  match     no match
\303\244       \303?              match     match     match     no match  no match  match     no match
\303\244       [\303]*            match     match     match     match     no match  match     no match
\303\244       [\303]?            match     match     match     no match  no match  match     no match
\303\244       *\204              match     match     match     no match  no match  no match  match
\303\244       ?\204              match     match     match     no match  no match  no match  no match
\303\244       *[\204]            match     match     match     no match  no match  no match  no match
\303\244       ?[\204]            match     match     match     no match  no match  no match  no match

(dash* = dash, busybox ash, mksh, posh, pdksh; fnmatch = glibc fnmatch)



So unlike before, in the above bash/fnmatch do seem to let '?' match a
single byte that is not a character... and the remaining ones have
quite mixed feelings

Not quite: all of them always let ? match a single invalid byte, but here we have a single byte that is invalid on its own, valid as part of a character, and appears in the string as part of that character. When processing \303\244, most shells do not treat it as the single byte \303 followed by the single byte \244. They preprocess it, so that by the time they actually check for a match they just see the character U+00E4, and a pattern that looks for \303 on its own will not find it.
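
The difference between the two approaches can be sketched in Python. This is a simplified model under stated assumptions, not the logic of any particular shell: match_bytes compares raw bytes, while match_chars first decodes both string and pattern as UTF-8 (keeping each undecodable byte as one stand-in code point via "surrogateescape") and only then matches. Both function names are made up for the illustration.

```python
import fnmatch

def match_bytes(string: bytes, pattern: bytes) -> bool:
    """No preprocessing: \\303\\244 stays two separate bytes."""
    return fnmatch.fnmatchcase(string, pattern)

def match_chars(string: bytes, pattern: bytes) -> bool:
    """Preprocess first: \\303\\244 becomes the single character U+00E4."""
    def decode(raw: bytes) -> str:
        return raw.decode("utf-8", errors="surrogateescape")
    return fnmatch.fnmatchcase(decode(string), decode(pattern))

STRING = b"\303\244"
for pattern in (b"\303*", b"\303?", b"[\303]*", b"[\303]?"):
    # Byte-wise, the leading \303 of the pattern is found in the string;
    # character-wise, the string no longer contains a separate \303.
    print(pattern, match_bytes(STRING, pattern), match_chars(STRING, pattern))
```

For these four patterns the byte-wise model reports a match and the preprocessing model does not, which lines up with the dash and gwsh results reported for the same rows.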

String         Pattern            dash*     fnmatch   bash      bosh      gwsh      ksh       zsh
\243]          [\243]]            match     match     match     match     match     match     match
\243]          ?                  no match  match     match     match     match     match     match
\243           ?                  match     match     match     match     match     match     match
\243           [\243]             match     match     match     match     no match  no match  error
\243           [\243!]            match     match     match     match     match     match     match
\243]          [\243!]]           match     match     no match  no match  no match  match     no match
\243]          ?]                 match     match     no match  no match  no match  no match  no match
\243]          *]                 match     match     no match  no match  no match  no match  match

(dash* = dash, busybox ash, mksh, posh, pdksh; fnmatch = glibc fnmatch)

The tests involving \243 are run in a Big5 environment. In Big5,
\243\135 is the representation of β, a single valid character, even
though \135 on its own is still the single character ].

This also seems a bit strange to me... all shells match \243 against ?,
i.e. ? matches a single byte that is not a character... but later on it
doesn't work again with \243] and ?]

Remember that \243] is a single character β. \243] is not supposed to match when given a pattern ?]. The pattern ?] means "any character, followed by ]". "β" is a character not followed by ]. This is similar to how in UTF-8 environments, ä should not match against the pattern ?? even though both of the bytes that make up ä individually do match against the pattern ?.
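
A small Python sketch of the same point, assuming Python's built-in "big5" and "utf-8" codecs as a stand-in for the locale's character handling. The helper match_in is hypothetical; it decodes both operands and then matches character by character, which is the behaviour described above rather than the code of any particular shell.

```python
import fnmatch

def match_in(encoding: str, string: bytes, pattern: bytes) -> bool:
    """Decode both operands in the given encoding, then match by character."""
    return fnmatch.fnmatchcase(string.decode(encoding), pattern.decode(encoding))

# Big5: the bytes \243\135 (\243 followed by ']') encode one character.
print(len(b"\243]".decode("big5")))        # 1
print(match_in("big5", b"\243]", b"?"))    # True: any single character
print(match_in("big5", b"\243]", b"?]"))   # False: no ']' follows the character

# UTF-8 analogue: one two-byte character against one or two question marks.
print(match_in("utf-8", b"\303\244", b"?"))   # True
print(match_in("utf-8", b"\303\244", b"??"))  # False
```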

The other shells, when either the string or the pattern are not valid
in the current locale, are not in agreement on whether parts of the
rest of the string or the pattern are still interpreted according to
the current locale, and if so, which parts.

I assume this effectively puts an end to any efforts of standardising
this for byte strings, or what is your conclusion?

I think there is still value in standardising this, but there is a bit more variation than I expected. For the non-bosh implementations, I can think of a general idea for wording that would cover everything except the bits that I wrote should be regarded as implementation bugs; that shouldn't be too difficult. For bosh, though, I would really want to know what logic it uses; without that, I would not feel comfortable saying whether any proposed wording covers what bosh attempts to do.

Cheers,
Harald van Dijk

Thanks for your efforts in this.


Cheers,
Chris.
