Re: More issues with pattern matching

Robert Elz Thu, 26 Sep 2019 14:13:25 -0700

    Date:        Thu, 26 Sep 2019 17:54:21 +0100
    From:        Geoff Clare <g...@opengroup.org>
    Message-ID:  <20190926165421.GA32280@lt2.masqnet>


  | In the case of [x[:bogus:]], the use of both colons clearly indicates
  | the intention to use the new character-class feature.  If the name
  | between the colons is not a valid class name, that is likely due to
  | an error on the user or application writer's part when typing the name.

I had been waiting for that argument, it is the only one that is
half way rational, and supports that position.  But half way is as
far as it gets.

POSIX allows locales to define new char classes, it says so, XBD 7.3,
page 141, lines 4218-4226.

Since a locale is allowed to define a new char class name, the shell
(or regcomp() for the RE case) cannot know whether the user here:

  | For example, if a user types:
  |
  | grep '[[:alhpa:]]' file

made a typo for alpha (the standard posix defined char class), or really
intended alhpa a locale specific char class in some locale which is not
the current one.

Making this some kind of error, in either REs, or shell patterns (whatever
the effect of that is) makes it impossible for users to ever safely, and
simply, use the locale specific class name.

They cannot even test which locale is in use, as (aside from it being
impossible to be sure which locales have added this new char class to their
definitions) even if we know that LC_CTYPE=EN_dislexic contains the alhpa
character class in some implementations, there is no sane way to know
whether the current implementation does.

That is, unless you're requiring that before a locale specific char class
can be used, the user (on the command line) or script, is required to
query the locale and test whether the char class is defined there or not.

Requiring that would be absurd.   Disallowing users from using locale specific
char classes even though the locale is free to provide them would be absurd.
A non-absurd outcome is achieved only when unknown char class names are
treated as classes that are missing from the current locale, and hence
empty.  That works, is easy, and clean.
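As a rough illustration only, the behaviour being argued about can be
probed in whatever shell is at hand.  The function name below is made up
for this sketch, and which branch is taken depends on the implementation;
"matched" would suggest the unknown class was treated as empty, leaving
the rest of the bracket expression to match the x.

```shell
# Hypothetical probe: report how the running shell treats an unknown
# character class name inside an otherwise valid bracket expression.
probe_unknown_class() {
    case x in
        [x[:bogus:]]) echo "matched" ;;      # unknown class treated as empty
        *)            echo "not matched" ;;  # bracket expression not used as such
    esac
}
```

Running probe_unknown_class in several shells is one way to see how far
current implementations agree.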

And yes, it means that what are really user errors cannot be trivially
diagnosed, but simply produce unexpected results.   But this is far from
the only case where that happens - sh is a very forgiving language, vast
numbers of obvious user errors are allowed to pass undiagnosed, because the
shell cannot know that what the user entered is not actually what they
intended to enter, and preventing genuine work in order to improve error
diagnosis is not the direction that the shell has ever taken.   Regular
expressions are more strictly interpreted, and do have detectable error
cases, so in theory those could give errors for the case you describe
(using the char class in the pattern arg to grep, for example) but the
same arguments as above apply here as well, so that is not a desirable
outcome.  Further, even if it were, it would be difficult to achieve,
given that POSIX has merged the definitions of bracket expressions in
shell patterns and regular expressions into one definition, and we really
want unknown char class names to be usable (if not matching anything) in
shell patterns (as they don't generate error messages if invalid, they
simply match different things - or sometimes nothing - which is even harder
to diagnose than noticing a mistyped class name).

So to:

  | It is not absurd, it makes perfect sense.

I could not disagree more.


  | This is the same reason we added item 8,

I have no problems with item 8, and I understand it (even though I
don't implement it), it simply is not relevant to anything we're currently
talking about.

However:

  | but POSIX was preventing them from behaving in a
  | way that is more useful to the user.

was a very good argument.   So let's adopt the same one for locale defined
class names, and make sure they work well, and are useful to the user.

  | > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
  | > not allowed to be treated the same, explicitly unspecified, or simply
  | > never considered (previously) ?
  |
  | I believe the intention is that it be treated the same as [[:alpha:]].

Good, that is what I would have hoped.   Now maybe we should add something
to make that explicit.

  | This is the only reasonable conclusion if you consider the similarity to:
  | ls *"a"*

That is a good analogy.

  | Clearly the intention here is that the quotes are not treated as part of
  | the pattern, even though pathname expansion is done before quote removal.

Yes, agreed.   That is what I do, and I wish everyone would do the same.

An interlude before we continue:

konrad.schw...@siemens.com said:
  | An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]]
  | is that it would allow character-class names with white space, e.g.,
  | "title case". 

That would be a nice argument, except that it isn't possible, there actually
is a syntax for character class names, see XBD 7.4.1, page 166, lines
5345-5348:

CHARCLASS  A string of alphanumeric characters from the portable character set,
           the first of which is not a digit, consisting of at least one and
           at most {CHARCLASS_NAME_MAX} bytes, and optionally surrounded by
           double-quotes.
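A minimal sketch of the name syntax quoted above, written as a shell check.
The function name is invented for illustration, the {CHARCLASS_NAME_MAX}
length limit is deliberately not checked, and the optional double-quotes
(which belong to the locale grammar) are ignored.

```shell
# Hypothetical helper: succeed only if $1 looks like a CHARCLASS name,
# that is, non-empty, all alphanumerics, and not starting with a digit.
# Runs in a subshell with LC_ALL=C so the ranges mean plain ASCII.
is_valid_charclass_name() (
    LC_ALL=C
    case $1 in
        '' | [0-9]* | *[!A-Za-z0-9]*) exit 1 ;;  # empty, leading digit, or bad char
        *) exit 0 ;;
    esac
)
```

With this check, "alpha" and "tonemark" are accepted, while "title case",
"tone-mark", "tone_mark", "<" and "alpha]" are all rejected.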

Now the part about double quotes is relevant only to the grammar for
locales, so we can't gain any comfort from that for the issue under
discussion, but the restricted syntax means that "title case" cannot be
a char class name, and is why I used "tonemark" rather than "tone-mark"
or "tone_mark" and certainly not "tone mark" in my example in an earlier
message.   The only argument that requires quoting in char class names
is
        CLASS=alpha
        case $word in
        [[:${CLASS}:]]) printf %s\\n "Found $CLASS in $word";;
        esac

that works fine, without quotes, but

        ls [[:${CLASS}:]]

does not if we have IFS=a, instead we need

        ls [[:"${CLASS}":]]

and if the quotes were to make the class name invalid, we'd have a real
problem.
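The field splitting behind that can be seen with a small sketch.  The
function names are made up for illustration, and pathname expansion is
switched off with set -f so that only the effect of IFS on the unquoted
expansion shows.

```shell
# Print how many arguments were received.
count_fields() {
    printf '%s\n' "$#"
}

CLASS=alpha

# Unquoted: with IFS=a the expansion of ${CLASS} is split at each 'a',
# so the pattern falls apart into several words.
unquoted_fields() (
    set -f
    IFS=a
    count_fields [[:${CLASS}:]]
)

# Quoted: the expansion is not split, the pattern stays one word.
quoted_fields() (
    set -f
    IFS=a
    count_fields [[:"${CLASS}":]]
)
```

Here unquoted_fields reports 3 words ("[[:", "lph" and ":]]") while
quoted_fields reports 1, which is why the quoted form has to remain a
valid way of naming the class.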


And from your other message

konrad.schw...@siemens.com said:
  | POSIX should disallow `:' and `]' in character class names. 

It does.  See above.


Further, this is what also handles:

stephane.chaze...@gmail.com said:
  | In practice, some implementations support [[:<:]] as the equivalent of the
  | standard ex utility regexp \< operator. 

since '<' cannot be a character class name, [:<:] cannot be a character
class, and thus implementations are free to use this as an extension.
Like character classes themselves, extensions to patterns require potentially
breaking valid code, as everything is valid, but using as an extension
a bracket expression containing duplicated char elements is one fairly safe
way of making extensions, since using such a thing to actually mean
the set of chars [ : and < would be very unlikely, one would use [[:<]
instead (and of course, quote the < in the shell).

And

stephane.chaze...@gmail.com said:
  | One could choose to implement a [[:[<:>]:]] to match on smileys for instance
  | :-) independantly of whether there's a class by that name. 

one could indeed, but there cannot be a character class of that name,
so we don't need to worry about that issue.

This also really handles the [[:alpha]:]] example:

a...@gigawatt.nl said:
  | If this is the whole pattern, then agreed, but if this is only part of  the
  | pattern, I am not sure. [[:alpha]:]] is interpreted by many shells  (bash,
  | bosh, mksh, zsh) as a character class containing an invalid  character class
  | name "alpha]".

The part about the invalid class name is certainly correct, but the 
interpretation cannot be, XBD 9.3.5 page 185, lines 6136-6138:

        A character class expression is expressed as a character class
        name enclosed within bracket-<colon> ("[:" and ":]") delimiters.

Since "alpha]" is not (cannot be) a character class name, we do not
have a character class expression at all, as a character class name
is required to exist between the delimiters for a character class
expression to exist.

What's more, that is the correct definition, it is all part of the effort
to minimize the fallout from stealing what would otherwise be a perfectly
valid bracket expression (which, in the days before locales, and char class
exprs, could have been used, perfectly validly (if extremely unlikely) as
a simple bracket expression containing '[' and ':' characters, as well as
the more usual letters and numbers).

Now back to our regularly scheduled message from Geoff:

  | The word "may" has a strict usage.  See XBD 1.5 - it "Describes a
  | feature or behavior that is optional for an implementation that
  | conforms to POSIX.1-2017."
  |
  | However, there have been cases in the past where incorrect uses "may"
  | have been found and changed to "can".
  |
  | In any case, the "shall" in XCU 2.13.1 overrides it.

Only for shell patterns, we still need to decide whether it was the
defined "may" or an erroneous use which should be replaced by "can"
for regular expressions.   Given the shell imperative, and the desire
to make bracket expressions in sh patterns and REs as equivalent as
possible, I suspect the latter.

kre

ps: I am not sure whether there has already been a bug report, and perhaps
a fix, for this, so I will leave it for someone who can find out (and
hopefully avoid irrelevant work if this is already fixed) to check (I still
cannot sanely search in mantis and get any kind of useful results).

But whereas, as above, XBD 7.4.1, page 166 defines CHARCLASS as a token,
and it is used in the grammar in XBD 7.4.2 (page 168, lines 5427, 5428 and
5433) the list of tokens at the beginning of the grammar (page 167, lines
5369-5376) does not include it.  If some earlier bug report has not already
corrected this, we need a new one to add CHARCLASS, probably in line 5373,
though that might cause that line to exceed a reasonable length - whatever -
but it should be added.
