A NOTE has been added to this issue.
======================================================================
https://www.austingroupbugs.net/view.php?id=1924
======================================================================
Reported By:                stephane
Assigned To:
======================================================================
Project:                    1003.1(2024)/Issue8
Issue ID:                   1924
Category:                   Shell and Utilities
Tags:                       tc1-2024
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     Resolved
Name:                       Stephane Chazelas
Organization:
User Reference:
Section:                    Shell word splitting and "read" utility
Page Number:                various
Line Number:                various
Interp Status:              ---
Final Accepted Text:        https://www.austingroupbugs.net/view.php?id=1924#c7183
Resolution:                 Accepted As Marked
Fixed in Version:
======================================================================
Date Submitted:             2025-05-05 19:02 UTC
Last Modified:              2025-05-16 06:25 UTC
======================================================================
Summary:                    New word splitting requirements inappropriate in locales with non-self-synchronising character encodings
======================================================================
----------------------------------------------------------------------
 (0007186) stephane (reporter) - 2025-05-16 06:25
 https://www.austingroupbugs.net/view.php?id=1924#c7186
----------------------------------------------------------------------
Re: https://www.austingroupbugs.net/view.php?id=1924#c7183

Thanks for that. A few comments:

> After page 79 line 2388 section 3 Definitions, add:
>
> 3.328 Self-synchronizing Character Encoding
>
> A character encoding in which no contiguous subset of bytes
> from the encoding of any one character or two adjacent
> characters can also represent the encoding of any valid
> character on its own.
[...]

Not sure that wording works. There's necessarily "A subset of bytes from the encoding of two adjacent characters" that can "represent the encoding of any valid character on its own", since it contains the encoding of each of those two characters. Maybe a "subset (other than the encoding of each character)".

> On page 2481 line 80454 section 2.5.3 Shell Variables (IFS), after:
>
> If the value of IFS includes any bytes that do not form part
> of a valid character, the results of field splitting,
> expansion of '*', and use of the read utility are
> unspecified.
>
> add a sentence:
>
> If the character encoding used for the characters in IFS is
> not self-synchronizing and the value of IFS includes any
> character for which the byte encoding can overlap with the
> byte encoding of any other sequence of characters, the
> results of field splitting, expansion of '*', and use of the
> read utility are unspecified. (Note: the UTF-8 encoding is
> self-synchronizing, meaning that no character's encoding can
> be confused with any other sequence of characters, and thus
> does not trigger this exception.)

"encoding used for the characters in IFS" is not clear to me.
In the shells I know (with the possible exception of yash), when IFS is assigned a value, it's assigned a sequence of bytes which may or may not form characters in the locale (as determined by ${LC_ALL:-${LC_CTYPE:-$LANG}}) at the time, but what matters wrt word splitting is the locale (specifically ${LC_ALL:-${LC_CTYPE:-$LANG}}) in effect at the time splitting is performed, and the characters that those bytes form (and potentially whether they're classified as iswspace()). So it would be about whether the *locale's character encoding* is self-synchronizing or not, not "the character encoding used for the characters in IFS" (whatever that means).

Now, those concerns aside, AFAICT that resolution addresses this issue (other than the request to add an "as if by") and it's nice that it makes it clear that character encodings such as BIG5/GB18030 and other non-self-synchronising encodings are not usable (at least reliably), but I fear it's not going to be very useful to a portable application writer.

How is someone to know which characters may or may not be used in IFS? In practice, on systems that have locales that use GB18030 or BIG5-HKSCS charsets (which are many), we're basically telling them that they can't use characters other than U+0001..U+002F (control characters, space and !"#$%&'()*+,-./), U+003A..U+003F (:;<=>?) and U+007F (DEL) in IFS. They can't use IFS='|', IFS=_, IFS='~', IFS=X or non-ASCII characters, for instance, if they want their script to be usable on user input in any of the system's locales.

Word splitting is meant to be about splitting *text* on *characters* of IFS. We should be able to tell application writers that if they have valid text in IFS and in the subject being split (as input by the user in their own locale for instance), it will be split correctly. AFAICT, and bugs aside, shells that support multibyte encodings (bash, zsh, AT&T ksh, bosh, yash at least) do that; they do not "split on the encoding of characters of IFS" like bug:1560 requires.
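
The character ranges quoted above can be checked empirically. The following is only a rough sketch: it uses Python's "gb18030" and "big5hkscs" codecs as stand-ins for the system locales' charsets (an assumption, since a libc's conversion tables may differ in detail), looks at BMP characters only, and the helper names are made up for the illustration. It collects every ASCII byte value that can appear after the first byte of a multibyte character and reports the ASCII characters that remain.

```python
# Sketch only: Python's codecs stand in for the locale charsets (assumption).
CODECS = ("gb18030", "big5hkscs")


def trailing_ascii_bytes(codec):
    """Return the ASCII byte values found after the first byte of any
    multibyte character of `codec` (BMP characters only)."""
    seen = set()
    for cp in range(0x80, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:      # surrogates are not characters
            continue
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:      # not representable in this charset
            continue
        seen.update(b for b in encoded[1:] if b < 0x80)
    return seen


def as_ranges(values):
    """Collapse a set of ints into sorted (first, last) runs."""
    runs = []
    for v in sorted(values):
        if runs and v == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], v)
        else:
            runs.append((v, v))
    return runs


# ASCII bytes that can occur inside the encoding of some other character.
unsafe = set()
for codec in CODECS:
    unsafe |= trailing_ascii_bytes(codec)

# ASCII characters (NUL excluded) whose byte never appears inside the
# encoding of another character in either charset.
safe = set(range(0x01, 0x80)) - unsafe

for first, last in as_ranges(safe):
    print(f"U+{first:04X}..U+{last:04X}")
```

With those codec tables, the bytes for '|', '_', '~' and 'X' all end up in `unsafe`, which is why none of them can be relied upon as an IFS character under the new wording.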
It's a welcome addition to mandate that in locales using a self-synchronising character encoding (and IFS containing valid text as per that encoding), implementations must be able to split arbitrary sequences of bytes *as if* by splitting on the encoding of characters of IFS. But then, IMO, it should say that. As in:

- split on characters of IFS (essentially revert bug:1560)
- and also: in locales using a self-synchronising character encoding (and IFS containing valid text as per that encoding), implementations must be able to split arbitrary sequences of bytes even if they don't form valid characters *as if* by splitting on the encoding of characters of IFS.

(same for read -d delimiter with the added constraint that the delimiter must be a single-byte character).

With non-self-synchronising encoding, behaviour unspecified on non-text subject.

Also, why make $* unspecified? $* unquoted is not useful, so I don't really care what POSIX says about it but I can't see why "$*" can't be just the concatenation of positional parameters with the first *character* of IFS (at byte level) regardless of what that character may be (assuming IFS contains valid text), even if the positional parameters don't contain valid text (which may result in character recombination, but why would we care at that point?).

[btw, one still can't use $IFS or ${IFS} in this bug tracker, any way that particular rule could be disabled?]
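
To illustrate the difference between "split on characters" and "split on the encoding of characters" discussed above, here is a rough sketch. It is not shell field splitting (no IFS white space handling, a single delimiter only); it again assumes Python's "utf-8" and "big5hkscs" codecs as stand-ins for locale charsets, and the function names are hypothetical. It picks an ideograph whose second byte in BIG5-HKSCS equals the byte for '|' and compares the two splitting approaches in each encoding.

```python
def find_char_with_trail_byte(codec, byte):
    """First CJK ideograph whose 2-byte encoding in `codec` ends in `byte`."""
    for cp in range(0x4E00, 0xA000):
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue
        if len(encoded) == 2 and encoded[1] == byte:
            return chr(cp)
    return None


def split_on_characters(text, ifs_char, codec):
    """Split decoded text on the IFS character, then re-encode each field."""
    return [field.encode(codec) for field in text.split(ifs_char)]


def split_on_encoding(data, ifs_char, codec):
    """Split raw bytes on the byte encoding of the IFS character."""
    return data.split(ifs_char.encode(codec))


IFS_CHAR = "|"
victim = find_char_with_trail_byte("big5hkscs", ord(IFS_CHAR))
text = "a" + victim + "b" + IFS_CHAR + "c"   # exactly one real '|' character

same = {}
for codec in ("utf-8", "big5hkscs"):
    data = text.encode(codec)
    by_char = split_on_characters(text, IFS_CHAR, codec)
    by_bytes = split_on_encoding(data, IFS_CHAR, codec)
    same[codec] = (by_char == by_bytes)
    print(codec, len(by_char), "field(s) by character,",
          len(by_bytes), "field(s) by encoding")

# In UTF-8, bytes that do not form valid characters can still be split
# on the encoding of the delimiter.
invalid = b"\xff" + IFS_CHAR.encode("utf-8") + b"\xfe"
print(split_on_encoding(invalid, IFS_CHAR, "utf-8"))
```

In UTF-8 the two approaches agree on valid text, so splitting arbitrary bytes "as if" on the encoding is a pure extension. In BIG5-HKSCS the byte-level split cuts the ideograph in two, which is the behaviour a character-based definition would rule out.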
Issue History
Date Modified    Username       Field                    Change
======================================================================
2025-05-05 19:02 stephane       New Issue
2025-05-15 15:14 geoffclare     Note Added: 0007183
2025-05-15 15:16 geoffclare     Status                   New => Resolved
2025-05-15 15:16 geoffclare     Resolution               Open => Accepted As Marked
2025-05-15 15:16 geoffclare     Interp Status             => ---
2025-05-15 15:16 geoffclare     Final Accepted Text       => https://www.austingroupbugs.net/view.php?id=1924#c7183
2025-05-15 15:16 geoffclare     Tag Attached: tc1-2024
2025-05-16 06:25 stephane       Note Added: 0007186
======================================================================
