[1003.1(2024)/Issue8 0001924]: New word splitting requirements inappropriate in locales with non-self-synchronising character encodings

Austin Group Issue Tracker via austin-group-l at The Open Group Thu, 15 May 2025 23:32:21 -0700

A NOTE has been added to this issue. 
====================================================================== 
https://www.austingroupbugs.net/view.php?id=1924 
====================================================================== 
Reported By:                stephane
Assigned To:                
====================================================================== 
Project:                    1003.1(2024)/Issue8
Issue ID:                   1924
Category:                   Shell and Utilities
Tags:                       tc1-2024
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     Resolved
Name:                       Stephane Chazelas 
Organization:                
User Reference:              
Section:                    Shell word splitting and "read" utility 
Page Number:                various 
Line Number:                various 
Interp Status:              --- 
Final Accepted Text:       
https://www.austingroupbugs.net/view.php?id=1924#c7183 
Resolution:                 Accepted As Marked
Fixed in Version:           
====================================================================== 
Date Submitted:             2025-05-05 19:02 UTC
Last Modified:              2025-05-16 06:25 UTC
====================================================================== 
Summary:                    New word splitting requirements inappropriate in locales with non-self-synchronising character encodings
======================================================================


---------------------------------------------------------------------- 
 (0007186) stephane (reporter) - 2025-05-16 06:25
 https://www.austingroupbugs.net/view.php?id=1924#c7186 
---------------------------------------------------------------------- 
Re: https://www.austingroupbugs.net/view.php?id=1924#c7183

Thanks for that. A few comments:

> After page 79 line 2388 section 3 Definitions, add:
> 
> 3.328 Self-synchronizing Character Encoding
> 
>     A character encoding in which no contiguous subset of bytes
>     from the encoding of any one character or two adjacent
>     characters can also represent the encoding of any valid
>     character on its own.
[...]

Not sure that wording works. There's necessarily "A subset of
bytes from the encoding of two adjacent characters" that can
"represent the encoding of any valid character on its own",
since it contains the encoding of each of those two characters.

Maybe a "subset (other than the encoding of each character)".
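
To illustrate the wording problem, here is a rough Python sketch (not
part of the proposed text). The helper stray_characters() and its
exclude_whole flag are hypothetical names, and the Big5 byte pair
0xA4 0x7C is assumed to be a valid character in Python's "big5" codec.
It lists every contiguous run of bytes from two adjacent characters
that decodes as a single character on its own.

```python
def stray_characters(first, second, encoding, exclude_whole=True):
    """List contiguous byte runs of first+second that decode as one character.

    With exclude_whole=True, the runs that are exactly the encoding of
    `first` or of `second` (at their own positions) are ignored.
    """
    a = first.encode(encoding)
    b = second.encode(encoding)
    data = a + b
    whole = {(0, len(a)), (len(a), len(data))}
    found = []
    for start in range(len(data)):
        for end in range(start + 1, len(data) + 1):
            if exclude_whole and (start, end) in whole:
                continue
            try:
                decoded = data[start:end].decode(encoding)
            except UnicodeDecodeError:
                continue  # not a valid sequence on its own
            if len(decoded) == 1:
                found.append((start, end, decoded))
    return found


# Literal reading of the definition: even UTF-8 "fails", because the
# characters themselves are found.
literal_utf8 = stray_characters("a", "b", "utf-8", exclude_whole=False)

# With the suggested exclusion, UTF-8 passes for this pair...
fixed_utf8 = stray_characters("a", "b", "utf-8")

# ...while Big5 still fails: the trail byte 0x7C is also '|'.
big5_char = b"\xa4\x7c".decode("big5")  # assumed valid Big5 character
fixed_big5 = stray_characters(big5_char, "x", "big5")
```

Checking one pair of characters obviously does not prove an encoding
self-synchronizing; the point is only that the exclusion is needed for
the definition to be satisfiable at all.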

> On page 2481 line 80454 section 2.5.3 Shell Variables (IFS), after:
> 
>     If the value of IFS includes any bytes that do not form part
>     of a valid character, the results of field splitting,
>     expansion of '*', and use of the read utility are
>     unspecified.
> 
> add a sentence:
> 
>     If the character encoding used for the characters in IFS is
>     not self-synchronizing and the value of IFS includes any
>     character for which the byte encoding can overlap with the
>     byte encoding of any other sequence of characters, the
>     results of field splitting, expansion of '*', and use of the
>     read utility are unspecified. (Note: the UTF-8 encoding is
>     self-synchronizing, meaning that no character's encoding can
>     be confused with any other sequence of characters, and thus
>     does not trigger this exception.)

"encoding used for the characters in IFS" is not clear to me.

In the shells I know (with the possible exception of yash), when
IFS is assigned a value, it's assigned a sequence of bytes which
may or may not form characters in the locale (as determined by
${LC_ALL:-${LC_CTYPE:-$LANG}}) at the time, but what matters wrt
word splitting is the locale (specifically
${LC_ALL:-${LC_CTYPE:-$LANG}}) in effect at the time splitting
is performed, and the characters that those bytes form (and
potentially whether they're classified as iswspace()).

So it would be about whether the *locale's character encoding*
is self-synchronizing or not, not "the character encoding used
for the characters in IFS" (whatever that means).
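
A small Python sketch of that point, using Python codecs as a stand-in
for locale charsets (an assumption; ifs_characters() and IFS_BYTES are
hypothetical names): the same bytes stored in IFS form different
characters depending on the encoding in effect when they are
interpreted.

```python
# Bytes as they might have been assigned to IFS (hypothetical value).
IFS_BYTES = b"\xa4\x7c"


def ifs_characters(ifs_bytes, locale_encoding):
    """Characters the IFS bytes form under the encoding in effect at split time."""
    return list(ifs_bytes.decode(locale_encoding))


under_big5 = ifs_characters(IFS_BYTES, "big5")       # one 2-byte character
under_latin1 = ifs_characters(IFS_BYTES, "latin-1")  # two 1-byte characters
```

Under the Latin-1 interpretation '|' is a delimiter; under the Big5
interpretation it is not, even though IFS was never reassigned.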

Now, those concerns aside, AFAICT that resolution addresses this
issue (other than the request to add an "as if by") and it's nice
that it makes it clear that character encodings such as
BIG5/GB18030 and other non-self-synchronising encodings are not
usable (at least reliably), but I fear it's not going to be very
useful to a portable application writer.

How is someone to know which character may or may not be used in
IFS? In practice, on systems that have locales that use GB18030
or BIG5-HKSCS charsets (which are many), we're basically telling
them that they can't use characters other than U+0001..U+002F
(control characters, space and !"#$%&'()*+,-./), U+003A..U+003F
(:;<=>?) and U+007F (DEL) in IFS. They can't use IFS='|',
IFS=_, IFS='~', IFS=X or non-ASCII characters for instance if
they want their script to be usable on user input in any of the
system's locales.
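
The list of usable characters above can be cross-checked with a rough
Python sketch. It assumes Python's "gb18030" and "big5hkscs" codecs are
representative of the system charsets; unsafe_ascii_bytes() is a
hypothetical helper. It collects every ASCII byte value that shows up
as a non-initial byte of some multibyte character, and may take a few
seconds since it walks all code points.

```python
def unsafe_ascii_bytes(encoding):
    """ASCII byte values that occur as a non-first byte of a multibyte character."""
    unsafe = set()
    for cp in range(0x80, 0x110000):
        try:
            encoded = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding (e.g. surrogates)
        unsafe.update(b for b in encoded[1:] if b < 0x80)
    return unsafe


unsafe = unsafe_ascii_bytes("gb18030") | unsafe_ascii_bytes("big5hkscs")

# ASCII characters (NUL excluded) whose byte never appears inside
# another character in either encoding.
safe = sorted(set(range(0x01, 0x80)) - unsafe)
```

Digits drop out because of the 4-byte GB18030 sequences, and 0x40..0x7E
because of the 2-byte trail bytes in both encodings.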

Word splitting is meant to be about splitting *text* on
*characters* of IFS, we should be able to tell application
writers that if they have valid text in IFS and the subject
being split (as input by the user in their own locale for
instance), it will be split correctly.

AFAICT, and bugs aside, shells that support multibyte encodings
(bash, zsh, AT&T ksh, bosh, yash at least) do that, they do not
"split on the encoding of characters of IFS" like bug:1560
requires.
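
The difference between the two approaches can be sketched in Python
(an illustration only, not how any of those shells is implemented; the
Big5 byte pair 0xA4 0x7C is assumed to be a valid character and the
variable names are made up):

```python
ENCODING = "big5"
DELIM = "|"

# A valid Big5 character whose second byte happens to be 0x7C ('|').
han = b"\xa4\x7c".decode(ENCODING)

# Valid text with exactly one real delimiter character in it.
text = "a" + han + "b" + DELIM + "c"
raw = text.encode(ENCODING)

# Splitting on *characters*: decode first, then split.
char_fields = [field.encode(ENCODING) for field in text.split(DELIM)]

# Splitting on the *encoding* of the delimiter: plain byte search.
byte_fields = raw.split(DELIM.encode(ENCODING))
```

The character-level split yields two fields; the byte-level split
yields three and cuts the 2-byte character in half.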

It's a welcome addition to mandate that in locales using a
self-synchronising character encoding (and IFS containing
valid text as per that encoding), implementations must be able
to split arbitrary sequences of bytes *as if* by splitting on
the encoding of characters of IFS.

But then, IMO, it should say that. As in:

- split on characters of IFS (essentially revert bug:1560)
- and also: in locales using a self-synchronising character
  encoding (and IFS containing valid text as per that
  encoding), implementations must be able to split arbitrary
  sequences of bytes even if they don't form valid characters
  *as if* by splitting on the encoding of characters of IFS.
  (same for read -d delimiter with the added constraint that the
  delimiter must be a single-byte character).  With
  non-self-synchronising encoding, behaviour unspecified on
  non-text subject.
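
For the self-synchronising case, a minimal Python sketch of the "as if"
equivalence, assuming UTF-8 and using Python's surrogateescape error
handler to carry invalid bytes through a character-level split (both
function names are hypothetical):

```python
def split_as_characters(raw, delimiter, encoding="utf-8"):
    """Decode (keeping invalid bytes as escapes), split on the character, re-encode."""
    text = raw.decode(encoding, errors="surrogateescape")
    return [f.encode(encoding, errors="surrogateescape")
            for f in text.split(delimiter)]


def split_as_bytes(raw, delimiter, encoding="utf-8"):
    """Split on the byte encoding of the delimiter, ignoring character boundaries."""
    return raw.split(delimiter.encode(encoding))


# Subject that is not valid UTF-8 text: stray 0xFF and 0x80 bytes.
raw = b"a\xffb\xc3\xa9c\x80d"
delimiter = "\u00e9"  # e-acute, encoded as 0xC3 0xA9

by_chars = split_as_characters(raw, delimiter)
by_bytes = split_as_bytes(raw, delimiter)
```

Both calls give the same two fields here, which is the behaviour the
second bullet would mandate; one example is of course not a proof.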

Also, why make $* unspecified?

$* unquoted is not useful, so I don't really care what POSIX
says about it but I can't see why "$*" can't be just the
concatenation of positional parameters with the first
*character* of IFS (at byte level) regardless of what that
character may be (assuming IFS contains valid text), even if
the positional parameters don't contain valid text (which may
result in character recombination, but why would we care at that
point?).
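
A sketch of what that would mean, in Python with Big5 as the assumed
locale charset; join_positional() is a hypothetical helper that does
not model an unset IFS:

```python
def join_positional(params, ifs, encoding):
    """Join byte-string parameters with the encoding of the first character of IFS.

    An empty IFS gives plain concatenation.
    """
    separator = ifs[:1].encode(encoding)
    return separator.join(params)


# First parameter ends in a dangling Big5 lead byte (not valid text).
params = [b"a\xa4", b"b"]
joined = join_positional(params, "|", "big5")

# Character recombination: the lead byte and the '|' separator now
# form a single 2-byte character when the result is read as Big5.
decoded = joined.decode("big5")
```

The join itself is well defined at byte level; only a later
character-level reading of the result no longer sees the separator.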

[btw, one still can't use $IFS or ${IFS} in this bug
tracker, any way that particular rule could be disabled?]

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2025-05-05 19:02 stephane       New Issue                                    
2025-05-15 15:14 geoffclare     Note Added: 0007183                          
2025-05-15 15:16 geoffclare     Status                   New => Resolved     
2025-05-15 15:16 geoffclare     Resolution               Open => Accepted As Marked
2025-05-15 15:16 geoffclare     Interp Status             => ---
2025-05-15 15:16 geoffclare     Final Accepted Text       => https://www.austingroupbugs.net/view.php?id=1924#c7183
2025-05-15 15:16 geoffclare     Tag Attached: tc1-2024                       
2025-05-16 06:25 stephane       Note Added: 0007186                          
======================================================================
