[Issue 8 drafts 0001564]: clarify on what (character/byte) strings pattern matching notation should work

Austin Group Bug Tracker via austin-group-l at The Open Group Fri, 25 Feb 2022 12:55:44 -0800


A NOTE has been added to this issue. 
====================================================================== 
https://www.austingroupbugs.net/view.php?id=1564 
====================================================================== 
Reported By:                calestyo
Assigned To:                
====================================================================== 
Project:                    Issue 8 drafts
Issue ID:                   1564
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer 
Organization:                
User Reference:              
Section:                    2.13 Pattern Matching Notation 
Page Number:                2351 
Line Number:                76099 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2022-02-23 01:54 UTC
Last Modified:              2022-02-25 20:54 UTC
====================================================================== 
Summary:                    clarify on what (character/byte) strings pattern
matching notation should work
======================================================================


---------------------------------------------------------------------- 
 (0005719) mirabilos (reporter) - 2022-02-25 20:54
 https://www.austingroupbugs.net/view.php?id=1564#c5719 
---------------------------------------------------------------------- 
「would a "ls *" in principle be expected to only match filenames who are
character strings in the current locale?」

That’d be the logical consequence of treating it as characters in the
current encoding.

When I added locale “things” to my projects, I extended the definition
of character. Rather than accepting only the characters that are valid in
the current encoding (which is either 8-bit-transparent 7-bit ASCII or
UTF-8; back then BMP-only, but I’m moving to full 21-bit), I also map
so-called “raw octets” into the wide character range.

Every time a conversion fails, its first octet is handled as a raw octet,
then the conversion restarts at the next octet. (This can obviously be
optimised for illegal UTF-8 sequences if one is careful about the beginning
of the next possibly valid sequence.)
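
A minimal sketch of that fallback loop, in Python for brevity. The note
itself shows no code, so the function name decode_units and the
"char"/"raw" tags are illustrative assumptions, and the encoding is
assumed to be UTF-8:

```python
def decode_units(data: bytes):
    """Split data into units: a decoded character or a single raw octet."""
    units = []
    i = 0
    while i < len(data):
        # Try the shortest prefix at position i that is one valid character.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            units.append(("char", ord(ch)))
            i += n
            break
        else:
            # Conversion failed: take only the first octet as a raw octet
            # and restart the conversion at the next one.
            units.append(("raw", data[i]))
            i += 1
    return units
```

With this sketch, a truncated sequence such as b"\xe2\x82" yields two raw
octets rather than one, because the conversion restarts after every
failed octet.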

In 16-bit wchar_t times (basically “until 2022Q1”), raw octets are mapped
into a PUA range reserved by the CSUR for this purpose: U+EF80‥U+EFFF.
(Not quite optimal. What happens when you encounter \xEE\xBE\x80 can only
be described as fun.)
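
To illustrate the PUA mapping and why \xEE\xBE\x80 is awkward, here is a
hedged Python sketch. It assumes UTF-8 and assumes raw octet 0xNN maps to
U+EF00 + 0xNN (so 0x80 maps to U+EF80 and 0xFF to U+EFFF); the handler
name "rawoctet-pua" and the function to_wide are made up for this example
and are not taken from any existing project:

```python
import codecs

RAW_BASE = 0xEF80  # first code point of the PUA range named in the text


def _raw_octet_handler(exc):
    """Map the first offending octet into the PUA, resume at the next one."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    octet = exc.object[exc.start]  # always >= 0x80 for UTF-8 decode errors
    return chr(RAW_BASE + (octet - 0x80)), exc.start + 1


codecs.register_error("rawoctet-pua", _raw_octet_handler)


def to_wide(data: bytes) -> str:
    """Decode UTF-8, turning undecodable octets into U+EF80..U+EFFF."""
    return data.decode("utf-8", errors="rawoctet-pua")
```

\xEE\xBE\x80 is the valid UTF-8 encoding of U+EF80, so in this sketch
to_wide(b"\xee\xbe\x80") and to_wide(b"\x80") produce the same wide
character. That ambiguity is presumably the “fun” in question.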

In the new scheme, I’m mapping them to U-10000080‥U-100000FF, which is
outside the Unicode code point range (it ends at U+10FFFF), so collisions
are not a problem. (Except now I’m wondering what to set WCHAR_MAX to; I
think still 0x10FFFFU, because strictly speaking only those values are
valid?)

There’s a complication caused by the idiotic Standard C API for
mbrtowc(3): “return value == 0” is the sole test for “*pwc == L'\0'”, so
a return value of 0 cannot be used to signal that 0 octets have been eaten.
This means I might need to use even higher numbers for 2‑, 3‑ and 4-byte
raw octet sequences. (The last of which has wcwidth() == 4…)

But that’s detail. The relevant point here is that this is a middle ground
between characters and bytes: bytes that are treated as characters, with
character semantics applied, where possible, but that need not be
characters. (It is only one possible approach, but the alternatives are
either discarding the notion of character here, which would be hard to
reconcile with the existence of character classes, or an active and
certainly harmful disservice to users: the “only match filenames that are
valid” behaviour I quoted above.)

This is currently unspecified. I’d like to (continue) treat(ing) things
in a way that means that, for example, ? matches either a character or a
single byte from an invalid multibyte sequence (of length 1 or more), with
a subsequent ? catching a possible second byte, and so on. Raw octets are
displayed as � with a wcwidth() of 1 each (or in some application-local
suitable encoding, where that is possible).
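
The proposed ? semantics could be sketched as follows, in Python and
assuming UTF-8. The helper names split_units and match are hypothetical,
and only ?, * and literal characters are handled (no bracket expressions
or escapes), so this is an illustration rather than a full implementation
of the pattern matching notation:

```python
def split_units(data: bytes):
    """Split into units: one valid UTF-8 character, or one raw octet."""
    units, i = [], 0
    while i < len(data):
        for n in (1, 2, 3, 4):
            try:
                data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            units.append(data[i:i + n])
            i += n
            break
        else:
            units.append(data[i:i + 1])  # raw octet from an invalid sequence
            i += 1
    return units


def match(pattern: str, name: bytes) -> bool:
    """Match name against pattern, where ? consumes exactly one unit."""
    units = split_units(name)

    def go(p: int, u: int) -> bool:
        if p == len(pattern):
            return u == len(units)
        c = pattern[p]
        if c == "*":
            return any(go(p + 1, k) for k in range(u, len(units) + 1))
        if u == len(units):
            return False
        if c == "?":
            return go(p + 1, u + 1)
        return units[u] == c.encode("utf-8") and go(p + 1, u + 1)

    return go(0, 0)
```

Under these assumptions, match("?", b"\xff") holds, while the truncated
sequence b"\xe2\x82" needs "??", one ? per raw octet. A pattern like "*"
therefore still matches filenames that are not valid character strings in
the current encoding.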

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2022-02-23 01:54 calestyo       New Issue                                    
2022-02-23 01:54 calestyo       Name                      => Christoph Anton Mitterer
2022-02-23 01:54 calestyo       Section                   => 2.13 Pattern Matching Notation
2022-02-23 01:54 calestyo       Page Number               => 2351            
2022-02-23 01:54 calestyo       Line Number               => 76099           
2022-02-25 04:57 calestyo       Note Added: 0005716                          
2022-02-25 20:54 mirabilos      Note Added: 0005719                          
======================================================================
