A NOTE has been added to this issue.
======================================================================
https://www.austingroupbugs.net/view.php?id=1924
======================================================================
Reported By:                stephane
Assigned To:
======================================================================
Project:                    1003.1(2024)/Issue8
Issue ID:                   1924
Category:                   Shell and Utilities
Tags:                       tc1-2024
Type:                       Error
Severity:                   Objection
Priority:                   normal
Status:                     Resolved
Name:                       Stephane Chazelas
Organization:
User Reference:
Section:                    Shell word splitting and "read" utility
Page Number:                various
Line Number:                various
Interp Status:              ---
Final Accepted Text:        https://www.austingroupbugs.net/view.php?id=1924#c7183
Resolution:                 Accepted As Marked
Fixed in Version:
======================================================================
Date Submitted:             2025-05-05 19:02 UTC
Last Modified:              2025-05-16 15:31 UTC
======================================================================
Summary:                    New word splitting requirements inappropriate in
locales with non-self-synchronising character encodings
======================================================================

----------------------------------------------------------------------
 (0007190) hvd (reporter) - 2025-05-16 15:31
 https://www.austingroupbugs.net/view.php?id=1924#c7190
----------------------------------------------------------------------
> I don't really like the idea of specifying implementation algorithms unless that's the obvious one and it can't be perfectible (maybe because I'm on the user and not implementor side).

Although I specified it as an implementation algorithm, it's from the user's perspective that I'm suggesting it. The reason for the spec changes is to be able to hold arbitrary file names that are not valid characters according to the current locale. I have such files myself; it's from that perspective that I care about this. In some cases we need to have multiple file names joined by something other than '\0', and being able to resume the conversion after invalid bytes have been encountered on a "best effort" basis is important for that.

> The 0001561 algorithm is perfectly fine in locales using single-byte or self-synchronising (UTF-8) encodings, likely the most efficient and the ones implementations / systems that don't intend to support other encodings (and don't otherwise decode all input à la PYTHONIOENCODING=utf-8:surrogateescape) may want to use.

It's not fine even there, in my opinion. I recall invalid bytes being interpreted by some shells in some situations as the same characters as other valid bytes, and I can imagine scenarios where that would make sense (e.g. interpreting a single 0xA0 byte, which represents U+00A0 in ISO-8859-1, as U+00A0 even in a UTF-8 locale). That should IMO be permitted so that shell implementors can figure out what works best for them. The current wording does not permit it; my suggested wording does.
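
For illustration only, the "best effort" resumption and the 0xA0 example above can be sketched as follows. This is not any shell's actual implementation and not wording from the proposal: Python's incremental decoder stands in for whatever decoding primitive a shell would use, and the names split_chars and latin1_fallback are hypothetical. The sketch assumes that, on an invalid or truncated sequence, only the first offending byte is set aside and decoding resumes at the next byte.

```python
import codecs


def split_chars(data: bytes, encoding: str = "utf-8") -> list:
    """Split data into decoded characters (str) and undecodable bytes (int)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    items = []
    start = 0
    while start < len(data):
        decoder.reset()
        end = start
        char = ""
        try:
            # Feed one byte at a time until a full character is produced.
            while end < len(data) and not char:
                char = decoder.decode(data[end:end + 1])
                end += 1
        except UnicodeDecodeError:
            char = ""
        if char:
            items.append(char)
            start = end
        else:
            # Invalid or truncated sequence: set aside only its first byte,
            # then resume decoding at the very next byte.
            items.append(data[start])
            start += 1
    return items


def latin1_fallback(items: list) -> str:
    """One possible policy (an assumption): read stray bytes as ISO-8859-1."""
    return "".join(i if isinstance(i, str) else chr(i) for i in items)
```

With this policy, a lone 0xA0 byte in UTF-8 input comes out of latin1_fallback as U+00A0, while bytes that do form valid characters are decoded exactly as before. Other policies for the integer items are equally possible; the sketch only shows that the choice can be confined to the undecodable bytes.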

> In those cases IMO, the best thing for POSIX to do is leave the behaviour unspecified,

I don't mind if the behaviour is unspecified for bytes that do not form valid characters. I do mind if the required behaviour is contrary to previously required behaviour for bytes that do form valid characters. That is the basis for my suggestion: it limits the unspecified behaviour to those cases.

> I don't know if shell implementations use the algorithm you describe with mbrtowc() and handling of EILSEQ, but for the record, I reported 0001920 and this follow-up bug after having been made aware of the bash bug described at:

Thanks for the pointer. It looks like bash didn't do it this way in this specific case, but it was acknowledged as a bug and will be fixed for the next version?

> some may want to treat input and IFS as single-byte characters when the input or IFS can't be decoded into characters

This, however, I am less sure about. This is neither permitted by the current wording nor by my suggested wording, but it is a valid idea of what constitutes "best effort" and it seems reasonable to find some way of allowing it.

> Also bear in mind, that on non-seekable (and non-peekable) input at least, read has to read one byte at a time (and try to decode what has been read at every step) so as not to read past the delimiter, which complicates things further.

I'm aware of that. It is easy to handle portably, since mbrtowc() allows processing one single byte at a time, and better optimised implementations for specific locales would be able to do the same even more easily.
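
As a rough sketch of the byte-at-a-time constraint discussed above (again only an illustration, not the algorithm of any particular shell): the function below consumes a stream one byte per read, tries to decode after every byte, and stops as soon as the delimiter character has been decoded, so nothing past the delimiter is consumed. Python's incremental decoder stands in for an mbrtowc() loop, a decoding error plays the role of EILSEQ, and the name read_field and its return shape are hypothetical.

```python
import codecs
import io
from collections import deque


def read_field(stream, delim: str, encoding: str = "utf-8"):
    """Return (items, found_delim); items mixes str chars and invalid bytes (int)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pushback = deque()     # bytes already read that must be re-examined
    pending = bytearray()  # bytes of the character currently being decoded
    items = []
    while True:
        if pushback:
            byte = pushback.popleft()
        else:
            chunk = stream.read(1)  # never read more than one byte ahead
            if not chunk:
                break
            byte = chunk[0]
        pending.append(byte)
        try:
            char = decoder.decode(bytes([byte]))
        except UnicodeDecodeError:
            # Analogue of EILSEQ: set aside the first pending byte, reset
            # the decoder state and re-examine the bytes that followed it.
            decoder.reset()
            items.append(pending[0])
            pushback.extendleft(reversed(pending[1:]))
            pending.clear()
            continue
        if char:
            pending.clear()
            if char == delim:
                return items, True
            items.append(char)
    items.extend(pending)  # truncated sequence at end of input
    return items, False


stream = io.BytesIO(b"a\xa0b:rest")
field, found = read_field(stream, ":")
remaining = stream.read()
```

In this usage example the bytes after the delimiter are still unread when read_field returns, which is the property a read implementation needs on non-seekable input.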

Issue History
Date Modified    Username       Field                    Change
======================================================================
2025-05-05 19:02 stephane       New Issue
2025-05-15 15:14 geoffclare     Note Added: 0007183
2025-05-15 15:16 geoffclare     Status                   New => Resolved
2025-05-15 15:16 geoffclare     Resolution               Open => Accepted As Marked
2025-05-15 15:16 geoffclare     Interp Status             => ---
2025-05-15 15:16 geoffclare     Final Accepted Text       => https://www.austingroupbugs.net/view.php?id=1924#c7183
2025-05-15 15:16 geoffclare     Tag Attached: tc1-2024
2025-05-16 06:25 stephane       Note Added: 0007186
2025-05-16 06:28 stephane       Note Added: 0007187
2025-05-16 09:39 hvd            Note Added: 0007188
2025-05-16 14:13 stephane       Note Added: 0007189
2025-05-16 15:31 hvd            Note Added: 0007190
======================================================================
