[1003.1(2024)/Issue8 0001924]: New word splitting requirements inappropriate in locales with non-self-synchronising character encodings

Austin Group Issue Tracker via austin-group-l at The Open Group Fri, 16 May 2025 08:47:02 -0700

A NOTE has been added to this issue. 
====================================================================== 
https://www.austingroupbugs.net/view.php?id=1924 
====================================================================== 
Reported By:                stephane
Assigned To:                
====================================================================== 
Project:                    1003.1(2024)/Issue8
Issue ID:                   1924
Category:                   Shell and Utilities
Tags:                       tc1-2024
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     Resolved
Name:                       Stephane Chazelas 
Organization:                
User Reference:              
Section:                    Shell word splitting and "read" utility 
Page Number:                various 
Line Number:                various 
Interp Status:              --- 
Final Accepted Text:       
https://www.austingroupbugs.net/view.php?id=1924#c7183 
Resolution:                 Accepted As Marked
Fixed in Version:           
====================================================================== 
Date Submitted:             2025-05-05 19:02 UTC
Last Modified:              2025-05-16 15:31 UTC
====================================================================== 
Summary:                    New word splitting requirements inappropriate in
locales with non-self-synchronising character encodings
======================================================================


---------------------------------------------------------------------- 
 (0007190) hvd (reporter) - 2025-05-16 15:31
 https://www.austingroupbugs.net/view.php?id=1924#c7190 
---------------------------------------------------------------------- 
> I don't really like the idea of specifying implementation algorithms unless
that's the obvious one and it can't be perfectible (maybe because I'm on the
user and not implementor side).

Although I specified it as an implementation algorithm, it's from the user's
perspective that I'm suggesting it. The reason for the spec changes is to be
able to hold arbitrary file names, including ones containing bytes that do not
form valid characters in the current locale. I have such files myself, and it's
from that perspective that I care about this. In some cases we need to have
multiple file names joined by something other than '\0', and being able to
resume the conversion on a "best effort" basis after invalid bytes have been
encountered is important for that.
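
To make "resume the conversion after invalid bytes" concrete, here is a
minimal sketch in C. It is only one possible reading of the idea, not the
wording proposed in this issue: count_fields() is a hypothetical helper, and
the policy of skipping exactly one byte after EILSEQ and restarting from the
initial shift state is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the fields in buf[0..len) that are separated
 * by the wide character delim.  Bytes that do not decode in the current
 * locale are kept as data and never treated as delimiters. */
size_t count_fields(const char *buf, size_t len, wchar_t delim)
{
    mbstate_t st;
    size_t pos = 0, fields = 1;

    assert(buf != NULL);
    memset(&st, 0, sizeof st);          /* initial conversion state */

    while (pos < len) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, buf + pos, len - pos, &st);

        if (r == (size_t)-1) {
            /* Invalid sequence: the state is now undefined, so reset it
             * and resume at the next byte (assumed best-effort policy). */
            memset(&st, 0, sizeof st);
            pos++;
        } else if (r == (size_t)-2) {
            /* Truncated character at the end: the rest is plain data. */
            break;
        } else {
            if (wc == delim)
                fields++;
            pos += (r == 0) ? 1 : r;    /* r == 0 means a null character */
        }
    }
    return fields;
}
```

Because the delimiter is only recognised when mbrtowc() reports a complete
character, a byte that merely has the same value as the delimiter inside a
multi-byte character does not split the field.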

> The 0001561 algorithm is perfectly fine in locales using single-byte or
self-synchronising (UTF-8) encodings, likely the most efficient and the ones
implementations / systems that don't intend to support other encodings (and
don't otherwise decode all input à la PYTHONIOENCODING=utf-8:surrogateescape)
may want to use.

It's not fine even there, in my opinion. I recall invalid bytes being
interpreted by some shells in some situations as the same characters as other
valid bytes, and I can imagine scenarios where that would make sense (e.g.
interpreting a single 0xA0 byte, which represents U+00A0 in ISO-8859-1, as
U+00A0 even in a UTF-8 locale). That should IMO be permitted so that shell
implementors can figure out what works best for them. The current wording does
not permit it; my suggested wording does.

> In those cases IMO, the best thing for POSIX to do is leave the behaviour
unspecified,

I don't mind if the behaviour is unspecified for bytes that do not form valid
characters; I do mind if the required behaviour is contrary to previously
required behaviour for bytes that do form valid characters. That is the basis
for my suggestion: it limits the unspecified behaviour to those cases.

> I don't know if shell implementations use the algorithm you describe with
mbrtowc() and handling of EILSEQ, but for the record, I reported 0001920 and
this follow-up bug after having been made aware of the bash bug described at:

Thanks for the pointer. It looks like bash didn't do it this way in this
specific case, but it was acknowledged as a bug and will be fixed for the next
version?

> some may want to treat input and IFS as single-byte characters when the input
or IFS can't be decoded into characters

This, however, I am less sure about. This is neither permitted by the current
wording nor by my suggested wording, but is a valid idea of what constitutes
"best effort" and it seems reasonable to find some way of allowing it.

> Also bear in mind, that on non-seekable (and non-peekable) input at least,
read has to read one byte at a time (and try to decode what has been read at
every step) so as not to read past the delimiter, which complicates things
further.

I'm aware of that. It is easy to handle portably, since mbrtowc() allows
processing a single byte at a time, and better-optimised implementations for
specific locales would be able to do the same even more easily.
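
As an illustration of the byte-at-a-time approach, the following C sketch
feeds mbrtowc() one byte per call so that nothing beyond the delimiter is ever
consumed from the input. It is a sketch under stated assumptions, not how any
particular shell implements read: read_until_delim() is a hypothetical helper,
the delimiter is assumed to be a single-byte character such as newline, and
retrying a byte from the initial state after EILSEQ is an assumed best-effort
policy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <wchar.h>

/* Hypothetical helper: copy raw bytes from fd into out (capacity cap,
 * including the terminating NUL) until the wide character delim has been
 * decoded or end of input is reached.  The delimiter is consumed but not
 * stored; excess bytes are silently dropped.  Returns the number of bytes
 * stored, or -1 on a read error. */
long read_until_delim(int fd, wchar_t delim, char *out, size_t cap)
{
    mbstate_t st;
    size_t len = 0, pending = 0;  /* pending: bytes of an unfinished char */

    assert(out != NULL && cap > 0);
    memset(&st, 0, sizeof st);

    for (;;) {
        char byte;
        wchar_t wc;
        ssize_t got = read(fd, &byte, 1);  /* never read past the delimiter */
        size_t r;

        if (got < 0)
            return -1;
        if (got == 0)
            break;                         /* end of input */

        r = mbrtowc(&wc, &byte, 1, &st);
        if (r == (size_t)-1 && pending > 0) {
            /* The earlier bytes were a false start: retry this byte alone. */
            memset(&st, 0, sizeof st);
            pending = 0;
            r = mbrtowc(&wc, &byte, 1, &st);
        }
        if (r == (size_t)-1) {             /* undecodable byte: keep as data */
            memset(&st, 0, sizeof st);
            pending = 0;
        } else if (r == (size_t)-2) {      /* character not complete yet */
            pending++;
        } else {                           /* a complete character */
            pending = 0;
            if (wc == delim)
                break;
        }
        if (len + 1 < cap)
            out[len++] = byte;
    }
    out[len] = '\0';
    return (long)len;
}
```

A return value of (size_t)-2 simply means "feed me the next byte", which is
what makes this usable on pipes and other non-seekable input.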

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2025-05-05 19:02 stephane       New Issue                                    
2025-05-15 15:14 geoffclare     Note Added: 0007183                          
2025-05-15 15:16 geoffclare     Status                   New => Resolved     
2025-05-15 15:16 geoffclare     Resolution               Open => Accepted As Marked
2025-05-15 15:16 geoffclare     Interp Status             => ---
2025-05-15 15:16 geoffclare     Final Accepted Text       => https://www.austingroupbugs.net/view.php?id=1924#c7183
2025-05-15 15:16 geoffclare     Tag Attached: tc1-2024                       
2025-05-16 06:25 stephane       Note Added: 0007186                          
2025-05-16 06:28 stephane       Note Added: 0007187                          
2025-05-16 09:39 hvd            Note Added: 0007188                          
2025-05-16 14:13 stephane       Note Added: 0007189                          
2025-05-16 15:31 hvd            Note Added: 0007190                          
======================================================================
