A NOTE has been added to this issue.
======================================================================
https://www.austingroupbugs.net/view.php?id=1924
======================================================================
Reported By:                stephane
Assigned To:
======================================================================
Project:                    1003.1(2024)/Issue8
Issue ID:                   1924
Category:                   Shell and Utilities
Tags:                       tc1-2024
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     Resolved
Name:                       Stephane Chazelas
Organization:
User Reference:
Section:                    Shell word splitting and "read" utility
Page Number:                various
Line Number:                various
Interp Status:              ---
Final Accepted Text:        https://www.austingroupbugs.net/view.php?id=1924#c7183
Resolution:                 Accepted As Marked
Fixed in Version:
======================================================================
Date Submitted:             2025-05-05 19:02 UTC
Last Modified:              2025-05-16 14:13 UTC
======================================================================
Summary:                    New word splitting requirements inappropriate in
locales with non-self-synchronising character encodings
======================================================================
----------------------------------------------------------------------
(0007189) stephane (reporter) - 2025-05-16 14:13
https://www.austingroupbugs.net/view.php?id=1924#c7189
----------------------------------------------------------------------
Re: https://www.austingroupbugs.net/view.php?id=1924#c7188

I don't really like the idea of specifying implementation algorithms unless the algorithm is the obvious one and can't be improved upon (maybe because I'm on the user and not the implementor side).

The https://www.austingroupbugs.net/view.php?id=1561 algorithm is perfectly fine in locales using single-byte or self-synchronising (UTF-8) encodings. It is likely the most efficient, and the one that implementations / systems that don't intend to support other encodings (and don't otherwise decode all input à la PYTHONIOENCODING=utf-8:surrogateescape) may want to use. But it's plain wrong for other encodings.

In locales using non-self-synchronising encodings, with sequences of bytes that can't be decoded into text, I don't think a perfect solution exists. That's the point: if you lose synchronisation, there's no sure way to tell where it was lost and where to resume it, or whether the corruption was caused by left or right truncation, byte deletion/insertion, a bit flip, input that was actually encoded using a different encoding or a different version of the encoding, or an attacker trying to trick you or exploit a bug. Whatever you do, you may very well be hallucinating new characters or missing perfectly well encoded characters.

In those cases, IMO, the best thing for POSIX to do is leave the behaviour unspecified, letting implementations decide what they think is best in their specific context.
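
As a rough illustration of the point about non-self-synchronising encodings, here is a minimal Python sketch. It does not reproduce the algorithm from issue 1561; the helper names `split_bytes` and `split_chars` are hypothetical, and cp932 (a Shift_JIS variant) with a backslash delimiter is only an assumed example of an encoding in which a delimiter's byte can also occur as the trailing byte of a multi-byte character.

```python
def split_bytes(data, delimiter):
    """Split on the delimiter's byte sequence, ignoring character boundaries."""
    return data.split(delimiter)


def split_chars(data, delimiter, encoding):
    """Decode to characters first, then split on the delimiter character."""
    return data.decode(encoding).split(delimiter)


# In cp932, KATAKANA LETTER SO is encoded as 0x83 0x5C,
# and the byte 0x5C on its own is the backslash.
text = "a\u30bdb\\c"           # a, KATAKANA LETTER SO, b, backslash, c
data = text.encode("cp932")

# Byte-oriented matching finds a "delimiter" inside the two-byte character.
byte_fields = split_bytes(data, "\\".encode("cp932"))

# Character-oriented matching only sees the real backslash.
char_fields = split_chars(data, "\\", "cp932")

# UTF-8 is self-synchronising: 0x5C never occurs inside a multi-byte
# character, so byte-oriented matching is safe there.
utf8_fields = split_bytes(text.encode("utf-8"), b"\\")

print(byte_fields)
print(char_fields)
print(utf8_fields)
```

The byte-oriented split cuts the two-byte character in half, while the same approach on UTF-8 input yields the expected two fields. The sketch only covers well-formed input; it says nothing about what to do with bytes that cannot be decoded, which is the case the note argues is best left unspecified.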
For instance, some may want to detect that the input is actually encoded in UTF-8 and treat it as such (because that's the most likely cause on those systems), and some may want to treat the input and IFS as single-byte characters when the input or IFS can't be decoded into characters (like bash does for pattern matching when the subject or pattern cannot be decoded as text¹).

I don't know if shell implementations use the algorithm you describe with mbrtowc() and handling of EILSEQ, but for the record, I reported https://www.austingroupbugs.net/view.php?id=1920 and this follow-up bug after having been made aware of the bash bug described at:

https://mywiki.wooledge.org/BashPitfalls#pf65
https://lists.gnu.org/archive/html/bug-bash/2025-04/msg00065.html

where read can read past the delimiter (even if it is newline or null, which can't be found in the encoding of other characters) when reading sequences of bytes that don't form valid characters. That suggests it may not be how bash does it, or that it's not as simple as that.

Also bear in mind that, on non-seekable (and non-peekable) input at least, read has to read one byte at a time (and try to decode what has been read at every step) so as not to read past the delimiter, which complicates things further.
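
The byte-at-a-time constraint can be sketched in Python as follows. This is not how bash or any other shell implements read; `read_field` is a hypothetical helper, cp932 is an assumed example encoding, and raising an error on undecodable input is just one arbitrary choice among the behaviours discussed above.

```python
import codecs
import io


def read_field(stream, encoding, delimiter="\n"):
    """Read characters up to (and excluding) the delimiter.

    Bytes are fetched one at a time, so nothing beyond the delimiter is
    consumed from a stream that cannot be rewound.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    chars = []
    while True:
        byte = stream.read(1)
        if not byte:
            # End of input: flush so that a truncated character is reported.
            chars.append(decoder.decode(b"", final=True))
            break
        # Returns "" while a multi-byte character is still incomplete;
        # raises UnicodeDecodeError on an invalid sequence (sketch choice).
        text = decoder.decode(byte)
        if text == delimiter:
            break
        chars.append(text)
    return "".join(chars)


# 0x83 0x5C is KATAKANA LETTER SO in cp932; the later lone 0x5C is a backslash.
stream = io.BytesIO(b"a\x83\x5cb\\c")
field = read_field(stream, "cp932", delimiter="\\")
rest = stream.read()

print(repr(field), rest)
```

Here the 0x5C byte inside the two-byte character is not taken for the delimiter, and only the byte after the real delimiter is left in the stream. The cost is one read call and one decoding attempt per byte.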
---
¹ Which I personally consider a bug, see https://lists.gnu.org/archive/html/bug-bash/2021-02/msg00054.html

Issue History
Date Modified    Username       Field                    Change
======================================================================
2025-05-05 19:02 stephane       New Issue
2025-05-15 15:14 geoffclare     Note Added: 0007183
2025-05-15 15:16 geoffclare     Status                   New => Resolved
2025-05-15 15:16 geoffclare     Resolution               Open => Accepted As Marked
2025-05-15 15:16 geoffclare     Interp Status             => ---
2025-05-15 15:16 geoffclare     Final Accepted Text       => https://www.austingroupbugs.net/view.php?id=1924#c7183
2025-05-15 15:16 geoffclare     Tag Attached: tc1-2024
2025-05-16 06:25 stephane       Note Added: 0007186
2025-05-16 06:28 stephane       Note Added: 0007187
2025-05-16 09:39 hvd            Note Added: 0007188
2025-05-16 14:13 stephane       Note Added: 0007189
======================================================================
