[1003.1(2024)/Issue8 0001924]: New word splitting requirements inappropriate in locales with non-self-synchronising character encodings

Austin Group Issue Tracker via austin-group-l at The Open Group Fri, 16 May 2025 07:17:19 -0700

A NOTE has been added to this issue. 
====================================================================== 
https://www.austingroupbugs.net/view.php?id=1924 
====================================================================== 
Reported By:                stephane
Assigned To:                
====================================================================== 
Project:                    1003.1(2024)/Issue8
Issue ID:                   1924
Category:                   Shell and Utilities
Tags:                       tc1-2024
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     Resolved
Name:                       Stephane Chazelas 
Organization:                
User Reference:              
Section:                    Shell word splitting and "read" utility 
Page Number:                various 
Line Number:                various 
Interp Status:              --- 
Final Accepted Text:       
https://www.austingroupbugs.net/view.php?id=1924#c7183 
Resolution:                 Accepted As Marked
Fixed in Version:           
====================================================================== 
Date Submitted:             2025-05-05 19:02 UTC
Last Modified:              2025-05-16 14:13 UTC
====================================================================== 
Summary:                    New word splitting requirements inappropriate in
locales with non-self-synchronising character encodings
======================================================================


---------------------------------------------------------------------- 
 (0007189) stephane (reporter) - 2025-05-16 14:13
 https://www.austingroupbugs.net/view.php?id=1924#c7189 
---------------------------------------------------------------------- 
Re: https://www.austingroupbugs.net/view.php?id=1924#c7188

I don't really like the idea of specifying implementation
algorithms unless the algorithm is the obvious one and cannot be
improved upon (maybe because I'm on the user side rather than
the implementor side).

The https://www.austingroupbugs.net/view.php?id=1561 algorithm
is perfectly fine in locales using single-byte or
self-synchronising (UTF-8) encodings. It is likely the most
efficient one, and the one that implementations / systems that
don't intend to support other encodings (and don't otherwise
decode all input à la PYTHONIOENCODING=utf-8:surrogateescape)
may want to use.

But it's plain wrong for other encodings.
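
As a rough illustration of the kind of failure at stake, here is a
minimal Python sketch. It is not a rendering of the bug 1561
algorithm; the byte-wise splitter is a hypothetical stand-in, and
Big5 with a backslash delimiter is just one convenient example of an
encoding where a delimiter byte also occurs inside other characters.

```python
# Two characters whose Big5 encodings both end in byte 0x5C,
# which is also the encoding of the ASCII backslash.
text = "許功"
encoded = text.encode("big5")

# Hypothetical byte-wise splitter: every 0x5C byte is a delimiter.
bytewise_fields = encoded.split(b"\\")

# Character-wise splitting: decode first, then split on the
# backslash character, which does not occur in the text.
charwise_fields = encoded.decode("big5").split("\\")

print(bytewise_fields)   # fragments that are no longer valid Big5
print(charwise_fields)   # the text, left in one piece
```

In UTF-8 the same experiment is harmless, because no byte of a
multi-byte character can be mistaken for an ASCII character.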

In locales using non-self-synchronising encodings, with
sequences of bytes that can't be decoded into text, I don't
think a perfect solution exists.

That's the point: if you lose synchronisation, there's no sure
way to tell where it was lost and where to resume it, or whether
the corruption was caused by left or right truncation, byte
deletion/insertion, a bit flip, input that was actually
encoded using a different encoding or a different version of the
encoding, or an attacker trying to trick you or
exploit a bug. Whatever you do, you may very well be
hallucinating new characters, missing perfectly encoded
characters...
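
A small Python sketch of that effect, under the assumption that
Python's "gb2312" and "utf-8" codecs with errors="replace" are an
acceptable stand-in for a decoder that tries to carry on after an
error. The sample text and the choice of left truncation are
illustrative only.

```python
text = "中文"

# Simulate left truncation: drop the first byte of each encoding.
gb_damaged = text.encode("gb2312")[1:]
utf8_damaged = text.encode("utf-8")[1:]

gb_decoded = gb_damaged.decode("gb2312", errors="replace")
utf8_decoded = utf8_damaged.decode("utf-8", errors="replace")

# GB2312: the trail byte of the first character pairs up with the
# lead byte of the second, so the decoder produces a character
# that was never in the input and misses the intact one.
print(repr(gb_decoded))

# UTF-8: the stray continuation bytes are flagged, and decoding
# resynchronises on the next lead byte, recovering the second
# character.
print(repr(utf8_decoded))
```

Nothing in the damaged GB2312 byte string reveals that a byte went
missing at the front rather than at the end or in the middle.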

In those cases IMO, the best thing for POSIX to do is leave the
behaviour unspecified, letting implementations decide what they
think is best in their specific context. For instance, some may
want to detect that the input is actually encoded in UTF-8 and
treat it as such (because that's the most likely cause on those
systems for instance), some may want to treat input and IFS as
single-byte characters when the input or IFS can't be decoded
into characters (like bash does for pattern matching when
subject or pattern cannot be decoded as text¹).

I don't know if shell implementations use the algorithm you
describe with mbrtowc() and handling of EILSEQ, but for the
record, I reported https://www.austingroupbugs.net/view.php?id=1920
and this follow-up bug after having been made aware of the bash
bug described at:
https://mywiki.wooledge.org/BashPitfalls#pf65
https://lists.gnu.org/archive/html/bug-bash/2025-04/msg00065.html
where read can read past the delimiter (even if it is newline or
null, which can't be found in the encoding of other characters)
when reading sequences of bytes that don't form valid characters.

That suggests bash may not do it that way, or that it's not as
simple as that. Also bear in mind that, on non-seekable (and
non-peekable) input at least, read has to read one byte at a
time (and try to decode what has been read at every step) so as
not to read past the delimiter, which complicates things
further.
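
The byte-at-a-time constraint can be sketched in Python as follows.
This is only an illustration, not how any shell implements read:
read_record is a hypothetical helper, Python's incremental decoder
with errors="replace" stands in for an mbrtowc() loop, and io.BytesIO
stands in for a pipe (the function never seeks or peeks).

```python
import codecs
import io

def read_record(stream, delim="\n", encoding="utf-8"):
    """Read up to and including delim, consuming no bytes beyond it."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    chars = []
    while True:
        byte = stream.read(1)          # never request more than one byte
        if not byte:                   # end of input: flush pending bytes
            chars.append(decoder.decode(b"", final=True))
            break
        text = decoder.decode(byte)    # "" while a character is incomplete
        head, found, _ = text.partition(delim)
        chars.append(head)
        if found:                      # delimiter decoded: stop here
            break
    return "".join(chars)

# 0xE4 starts a three-byte UTF-8 sequence but is followed by newline.
stream = io.BytesIO(b"ab\xe4\ncd\n")
record = read_record(stream)
rest = stream.read()                   # what is left for the next reader
print(repr(record), repr(rest))
```

Even in this simple form, the decoder has to decide what to do with
the dangling lead byte at the moment the delimiter byte arrives, which
is exactly where behaviour can differ between implementations.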

---
¹ Which I personally consider a bug, see
https://lists.gnu.org/archive/html/bug-bash/2021-02/msg00054.html 

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2025-05-05 19:02 stephane       New Issue                                    
2025-05-15 15:14 geoffclare     Note Added: 0007183                          
2025-05-15 15:16 geoffclare     Status                   New => Resolved     
2025-05-15 15:16 geoffclare     Resolution               Open => Accepted As Marked
2025-05-15 15:16 geoffclare     Interp Status             => ---             
2025-05-15 15:16 geoffclare     Final Accepted Text       => https://www.austingroupbugs.net/view.php?id=1924#c7183
2025-05-15 15:16 geoffclare     Tag Attached: tc1-2024                       
2025-05-16 06:25 stephane       Note Added: 0007186                          
2025-05-16 06:28 stephane       Note Added: 0007187                          
2025-05-16 09:39 hvd            Note Added: 0007188                          
2025-05-16 14:13 stephane       Note Added: 0007189                          
======================================================================
