The following issue has been SUBMITTED.
======================================================================
https://www.austingroupbugs.net/view.php?id=1564
======================================================================
Reported By: calestyo
Assigned To:
======================================================================
Project: Issue 8 drafts
Issue ID: 1564
Category: Shell and Utilities
Type: Clarification Requested
Severity: Editorial
Priority: normal
Status: New
Name: Christoph Anton Mitterer
Organization:
User Reference:
Section: 2.13 Pattern Matching Notation
Page Number: 2351
Line Number: 76099
Final Accepted Text:
======================================================================
Date Submitted: 2022-02-23 01:54 UTC
Last Modified: 2022-02-23 01:54 UTC
======================================================================
Summary: clarify on what (character/byte) strings pattern
matching notation should work
Description:
On the mailing list, the question arose (from my side) what the current
wording in the standard implies as to whether pattern matching works on
byte or character strings.
- In some earlier discussion it was pointed out that shell variables
should be strings (of bytes, other than NUL)
=> which could lead one to think that pattern
matching must work on any such strings
- 2.6.2 Parameter Expansion
doesn't seem to say what the #, ##, % and %% special forms of
expansion work on (bytes or characters); it just
refers to the pattern matching chapter
- 2.13. Pattern Matching Notation says:
"The pattern matching notation described in this section is used to
specify patterns for matching strings in the shell."
=> strings... would mean bytes (as per 3.375 String)
- 2.13.1 Patterns Matching a Single Character however says:
"The following patterns matching a single character shall match a
single character: ordinary characters,..."
I questioned whether one could deduce from that, that pattern matching is
required to cope with any non-characters in the string it operates upon.
This was however rejected on the list, and Geoff Clare pointed out that,
since no behaviour is specified (i.e. how the implementation would need to
handle such invalidly encoded characters), the use of pattern matching on
arbitrary byte strings is undefined behaviour.
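To illustrate the question, here is a minimal sketch (not taken from the
standard or from the list discussion). It assumes the C locale, where every
byte is a single character. The helper name count_chars is hypothetical, and
\303\251 are the two bytes that encode U+00E9 in UTF-8.

```shell
# Assumption: force the C (POSIX) locale, where each byte is one character.
LC_ALL=C
export LC_ALL

# Two bytes: a single character in UTF-8, two characters in the C locale.
s=$(printf '\303\251')

# Hypothetical helper: report how many '?' patterns match the whole string.
count_chars() {
    case $1 in
        ?)  echo 1 ;;
        ??) echo 2 ;;
        *)  echo other ;;
    esac
}

n=$(count_chars "$s")
echo "matched as $n character(s)"

# The # form of parameter expansion uses the same pattern notation:
# remove whatever a single '?' matches at the start of the string.
rest=${s#?}
```

In the C locale each '?' matches one byte, so the string needs "??" and
${s#?} leaves one byte behind. In a UTF-8 locale, a shell with multibyte
support would be expected to match the same two bytes with a single '?'.
What should happen when the bytes do not form a valid character in the
current locale is the case this report asks to have clarified.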
Desired Action:
Either:
1) - In line 76099, replace "strings" with "character strings" and perhaps
mention that, when this is done on strings that contain any byte
sequence that is not a character in the current locale, the results are
undefined.
Perhaps also clarify this in fnmatch() (page 879), which doesn't seem to
mention locales at all. If the above assumption is true and pattern
matching operates on characters only, wouldn't it then need to be subject
to the current LC_CTYPE?
2) Alternatively, some expert could check whether there are any
shell/fnmatch() implementations which do not simply carry on any bytes that
do not form characters. Probably there are (yash?). But if there weren't,
POSIX might even choose to standardise that behaviour, which would probably
be better than leaving it unspecified?!
======================================================================
Issue History
Date Modified Username Field Change
======================================================================
2022-02-23 01:54 calestyo New Issue
2022-02-23 01:54 calestyo Name => Christoph Anton Mitterer
2022-02-23 01:54 calestyo Section => 2.13 Pattern Matching Notation
2022-02-23 01:54 calestyo Page Number => 2351
2022-02-23 01:54 calestyo Line Number => 76099
======================================================================