A NOTE has been added to this issue.
======================================================================
https://www.austingroupbugs.net/view.php?id=1564
======================================================================
Reported By:                calestyo
Assigned To:
======================================================================
Project:                    Issue 8 drafts
Issue ID:                   1564
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    2.13 Pattern Matching Notation
Page Number:                2351
Line Number:                76099
Final Accepted Text:
======================================================================
Date Submitted:             2022-02-23 01:54 UTC
Last Modified:              2022-02-25 20:54 UTC
======================================================================
Summary:                    clariy on what (character/byte) strings pattern matching notation should work
======================================================================
----------------------------------------------------------------------
 (0005719) mirabilos (reporter) - 2022-02-25 20:54
 https://www.austingroupbugs.net/view.php?id=1564#c5719
----------------------------------------------------------------------
「would a "ls *" in principle be expected to only match filenames who are character strings in the current locale?」

That’d be the logical consequence of treating it as characters in the current encoding.

When I added locale “things” to my projects, I extended the definition of character. Instead of accepting only the characters that are valid in the current encoding (which is either 8-bit-transparent 7-bit ASCII or UTF-8; back then BMP-only, but I’m moving to full 21-bit), so-called “raw octets” are also mapped into the wide character range. Every time a conversion fails, the first octet of the failed sequence is handled as a raw octet, then the conversion restarts at the next octet. (This can obviously be optimised for illegal UTF-8 sequences if one is careful about where the next possibly valid sequence begins.)

In 16-bit wchar_t times (basically “until 2022Q1”), raw octets were mapped into a PUA range reserved by the CSUR for this purpose: U+EF80‥U+EFFF. This is not quite optimal. (What happens when you encounter \xEE\xBE\x80, the valid UTF-8 encoding of U+EF80 itself, can only be described as fun.)

In the new scheme, I’m mapping them to U-10000080‥U-100000FF, which lies outside the Unicode code point range, so collisions are not a problem. (Except that now I’m wondering what to set WCHAR_MAX to, but I think 0x10FFFFU still, because only these are, strictly speaking, valid?)

There’s a complication that has to do with the idiotic Standard C API for mbrtowc(3), in that “return value == 0” is the sole test for “*pwc == L'\0'” and so cannot be used to signal that 0 octets have been eaten, which means I might need to use even higher numbers for 2‑, 3‑ and 4-byte raw octet sequences. (The latter of which has wcwidth() == 4…) But that’s detail.
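
The raw-octet mapping described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the projects mentioned in the note: the names decode_with_raw_octets, RAW_BASE_OLD and RAW_BASE_NEW are hypothetical, Python's strict UTF-8 decoder stands in for the locale's conversion function, and the result is a list of integers because the new range lies beyond what a Python str can hold.

```python
# Hypothetical base offsets: adding a raw octet 0x80..0xFF to the base
# yields the ranges named in the note.
RAW_BASE_OLD = 0xEF00        # -> U+EF80..U+EFFF (CSUR PUA range)
RAW_BASE_NEW = 0x10000000    # -> U-10000080..U-100000FF


def _decode_one(data, i):
    """Try to decode one UTF-8 character starting at offset i.

    Returns (code_point, length) on success, None if no valid
    character starts at this offset.
    """
    for n in range(1, 5):
        chunk = data[i:i + n]
        if len(chunk) < n:          # ran off the end of the input
            return None
        try:
            return ord(chunk.decode("utf-8")), n
        except UnicodeDecodeError:
            continue                # try a longer sequence
    return None


def decode_with_raw_octets(data, raw_base=RAW_BASE_NEW):
    """Map bytes to a list of integer 'wide characters'.

    Valid UTF-8 sequences become their code point; when conversion
    fails, the first octet is emitted as a raw octet and conversion
    restarts at the next octet.
    """
    out = []
    i = 0
    while i < len(data):
        decoded = _decode_one(data, i)
        if decoded is None:
            out.append(raw_base + data[i])
            i += 1
        else:
            code_point, length = decoded
            out.append(code_point)
            i += length
    return out
```

With the new base, b"a\xff\xc3\xa9" becomes the three units 0x61, 0x100000FF and 0xE9. With the old base, both the raw octet b"\x80" and the valid sequence b"\xee\xbe\x80" come out as 0xEF80, which is the collision alluded to above.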

The thing relevant here is that this is, or at least could be, a middle ground between characters and bytes: bytes that are treated as characters where possible, with character semantics applied, but that need not be characters. Anything else either discards the notion of character here (which would be hard to reconcile with the existence of character classes) or is an active and certainly harmful disservice to users (the “only match filenames that are valid” reading I quoted above). This is currently unspecified.

I’d like to (continue to) treat things in a way that means that, for example, ? is either a character or a single byte from an invalid multibyte sequence (of length 1 or more), with a subsequent ? catching a possible second byte, and so on.

Raw octets are displayed as � with a wcwidth() of 1 each (or in some suitable application-local encoding, where that is possible).

Issue History
Date Modified    Username       Field                    Change
======================================================================
2022-02-23 01:54 calestyo       New Issue
2022-02-23 01:54 calestyo       Name                      => Christoph Anton Mitterer
2022-02-23 01:54 calestyo       Section                   => 2.13 Pattern Matching Notation
2022-02-23 01:54 calestyo       Page Number               => 2351
2022-02-23 01:54 calestyo       Line Number               => 76099
2022-02-25 04:57 calestyo       Note Added: 0005716
2022-02-25 20:54 mirabilos      Note Added: 0005719
======================================================================
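
The ? semantics proposed in note 0005719 (one character, or one byte of an invalid multibyte sequence) can be tried out with a small Python sketch. It does not use the mapping from the note: as a stand-in, it relies on Python's "surrogateescape" error handler, which maps each undecodable byte to a lone surrogate in U+DC80‥U+DCFF, and on the standard fnmatch module. The helper name glob_match_bytes is hypothetical.

```python
import fnmatch


def glob_match_bytes(name: bytes, pattern: str) -> bool:
    """Match a byte-string filename against a glob pattern.

    Valid UTF-8 sequences are matched as characters; every byte that
    cannot be decoded becomes its own matchable unit (a lone surrogate,
    via Python's surrogateescape handler, not the note's own ranges).
    """
    units = name.decode("utf-8", errors="surrogateescape")
    return fnmatch.fnmatchcase(units, pattern)
```

Under these assumptions, ? matches the two-byte character b"\xc3\xa9" as a whole and also the single invalid byte b"\xff", while the truncated sequence b"\xe2\x82" needs ?? because each of its bytes counts as one unit.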
