Date: Thu, 26 Sep 2019 17:54:21 +0100
From: Geoff Clare <g...@opengroup.org>
Message-ID: <20190926165421.GA32280@lt2.masqnet>

| In the case of [x[:bogus:]], the use of both colons clearly indicates
| the intention to use the new character-class feature. If the name
| between the colons is not a valid class name, that is likely due to
| an error on the user or application writer's part when typing the name.

I had been waiting for that argument; it is the only one that is halfway
rational and supports that position. But halfway is as far as it gets.

POSIX allows locales to define new char classes; it says so in XBD 7.3,
page 141, lines 4218-4226.

Since a locale is allowed to define a new char class name, the shell (or
regcomp() for the RE case) cannot know whether the user here:

| For example, if a user types:
|
| grep '[[:alhpa:]]' file

made a typo for alpha (the standard POSIX-defined char class), or really
intended alhpa, a locale specific char class in some locale which is not
the current one.

Making this some kind of error, in either REs or shell patterns (whatever
the effect of that is), makes it impossible for users to ever safely, and
simply, use the locale specific class name. They cannot even test which
locale is in use, as (aside from it being impossible to be sure which
locales have added this new char class to their definitions) even if we
know that LC_CTYPE=EN_dislexic contains the alhpa character class in some
implementations, there is no sane way to know whether the current
implementation does.

That is, unless you're requiring that before a locale specific char class
can be used, the user (on the command line) or script is required to query
the locale and test whether the char class is defined there or not.

Requiring that would be absurd. Disallowing users from using locale
specific char classes even though the locale is free to provide them would
be absurd.

A non-absurd outcome is achieved only when unknown char class names are
treated as though they were empty class definitions in the current locale.
That works, is easy, and clean.

And yes, it means that what are really user errors cannot be trivially
diagnosed, but simply produce unexpected results. But this is far from the
only case where that happens - sh is a very forgiving language, vast
numbers of obvious user errors are allowed to pass undiagnosed, because the
shell cannot know that what the user entered is not actually what they
intended to enter, and preventing genuine work in order to improve error
diagnosis is not the direction that the shell has ever taken.

Regular expressions are more strictly interpreted, and do have detectable
error cases, so in theory those could give errors for the case you describe
(using the char class in the pattern arg to grep, for example), but the
same arguments as above apply here as well, so that is not a desirable
outcome. Further, even if it were, it would be difficult to achieve, given
that POSIX has merged the definitions of bracket expressions in shell
patterns and regular expressions into one definition, and we really want
unknown char class names to be usable (if not matching anything) in shell
patterns (as they don't generate error messages if invalid, they simply
match different things - or sometimes nothing - which is even harder to
diagnose than noticing a mistyped class name).

So to:

| It is not absurd, it makes perfect sense.

I could not disagree more.

| This is the same reason we added item 8,

I have no problems with item 8, and I understand it (even though I don't
implement it); it simply is not relevant to anything we're currently
talking about. However:

| but POSIX was preventing them from behaving in a
| way that is more useful to the user.

was a very good argument. So let's adopt the same one for locale defined
class names, and make sure they work well, and are useful to the user.

| > So, is [[:"alpha":]] required to be treated the same as [[:alpha:]] ,
| > not allowed to be treated the same, explicitly unspecified, or simply
| > never considered (previously) ?
|
| I believe the intention is that it be treated the same as [[:alpha:]].

Good, that is what I would have hoped. Now maybe we should add something
to make that explicit.

| This is the only reasonable conclusion if you consider the similarity to:
| ls *"a"*

That is a good analogy.

| Clearly the intention here is that the quotes are not treated as part of
| the pattern, even though pathname expansion is done before quote removal.

Yes, agreed. That is what I do, and I wish everyone would do the same.

An interlude before we continue:

konrad.schw...@siemens.com said:
| An argument for requiring [[:"alpha":]] to be the same as [[:alpha:]]
| is that it would allow character-class names with white space, e.g.,
| "title case".

That would be a nice argument, except that it isn't possible: there
actually is a syntax for character class names, see XBD 7.4.1, page 166,
lines 5345-5348:

    CHARCLASS  A string of alphanumeric characters from the portable
               character set, the first of which is not a digit,
               consisting of at least one and at most
               {CHARCLASS_NAME_MAX} bytes, and optionally surrounded
               by double-quotes.

Now the part about double quotes is relevant only to the grammar for
locales, so we can't gain any comfort from that for the issue under
discussion, but the restricted syntax means that "title case" cannot be a
char class name, and is why I used "tonemark" rather than "tone-mark" or
"tone_mark" and certainly not "tone mark" in my example in an earlier
message.

The only argument that requires quoting in char class names is

    CLASS=alpha
    case $word in
    [[:${CLASS}:]]) printf %s\\n "Found $CLASS in $word";;
    esac

that works fine, without quotes, but

    ls [[:${CLASS}:]]

does not if we have IFS=a; instead we need

    ls [[:"${CLASS}":]]

and if the quotes were to make the class name invalid, we'd have a real
problem.

And from your other message

konrad.schw...@siemens.com said:
| POSIX should disallow `:' and `]' in character class names.

It does. See above.

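The IFS=a case above can be observed without touching the file system. This is only a minimal sketch, assuming a POSIX sh: count_words, unquoted_count and quoted_count are hypothetical helpers introduced for illustration, and "set -f" is used so that pathname expansion against whatever files happen to exist cannot change the word count.

```shell
#!/bin/sh
# Hypothetical helper: print how many words it was called with.
count_words() { printf '%s\n' "$#"; }

# Each demonstration runs in a subshell so IFS and set -f do not leak out.
unquoted_count() (
    set -f        # no pathname expansion; only field splitting matters here
    IFS=a
    CLASS=alpha
    count_words [[:${CLASS}:]]      # the expanded value is split at each 'a'
)

quoted_count() (
    set -f
    IFS=a
    CLASS=alpha
    count_words [[:"${CLASS}":]]    # quoting suppresses field splitting
)

echo "unquoted: $(unquoted_count) word(s)"
echo "quoted:   $(quoted_count) word(s)"
```

The unquoted form is broken into more than one word before pattern matching is ever attempted, while the quoted form remains a single word. So the quotes are needed for a reason that has nothing to do with the class name itself.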
Further, this is what also handles:

stephane.chaze...@gmail.com said:
| In practice, some implementations support [[:<:]] as the equivalent of the
| standard ex utility regexp \< operator.

Since '<' cannot be a character class name, [:<:] cannot be a character
class, and thus implementations are free to use this as an extension.

Like character classes themselves, extensions to patterns require
potentially breaking valid code, as everything is valid, but using as an
extension a bracket expression containing duplicated char elements is one
fairly safe way of making extensions, since using such a thing to actually
mean the set of chars [ : and < would be very unlikely; one would use [[:<]
instead (and of course, quote the < in the shell).

And

stephane.chaze...@gmail.com said:
| One could choose to implement a [[:[<:>]:]] to match on smileys for instance
| :-) independantly of whether there's a class by that name.

One could indeed, but there cannot be a character class of that name, so we
don't need to worry about that issue.

This also really handles the [[:alpha]:]] example:

a...@gigawatt.nl said:
| If this is the whole pattern, then agreed, but if this is only part of the
| pattern, I am not sure. [[:alpha]:]] is interpreted by many shells (bash,
| bosh, mksh, zsh) as a character class containing an invalid character class
| name "alpha]".

The part about the invalid class name is certainly correct, but the
interpretation cannot be; see XBD 9.3.5, page 185, lines 6136-6138:

    A character class expression is expressed as a character class name
    enclosed within bracket-<colon> ("[:" and ":]") delimiters.

Since "alpha]" is not (cannot be) a character class name, we do not have a
character class expression at all, as a character class name is required to
exist between the delimiters for a character class expression to exist.

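The syntax rule quoted from XBD 7.4.1 is mechanical enough to check in the shell. The following is only a sketch, assuming a POSIX sh: is_charclass_name is a hypothetical helper (nothing POSIX defines), and it deliberately ignores both the {CHARCLASS_NAME_MAX} upper bound and the optional double-quotes of the locale grammar.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 has the form of a CHARCLASS name,
# i.e. only alphanumerics from the portable character set, with a first
# character that is not a digit.
# Assumption: the {CHARCLASS_NAME_MAX} length limit is NOT checked here.
is_charclass_name() (
    LC_ALL=C      # make the bracket ranges below mean ASCII only
    case $1 in
    '' | [0-9]* | *[!A-Za-z0-9]*) exit 1 ;;
    esac
    exit 0
)

for name in alpha tonemark 'title case' tone-mark tone_mark '<' '[<:>]' 'alpha]'
do
    if is_charclass_name "$name"
    then printf '%s\n' "possible class name: $name"
    else printf '%s\n' "not a class name:    $name"
    fi
done
```

Only alpha and tonemark pass. Everything else fails on syntax alone, regardless of what any locale defines, which is why [[:<:]] and [[:alpha]:]] cannot contain character class expressions.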
What's more, that is the correct definition; it is all part of the effort
to minimize the fallout from stealing what would otherwise be a perfectly
valid bracket expression (which, in the days before locales and char class
exprs, could have been used, perfectly validly (if extremely unlikely), as
a simple bracket expression containing '[' and ':' characters, as well as
the more usual letters and numbers).

Now back to our regularly scheduled message from Geoff:

| The word "may" has a strict usage. See XBD 1.5 - it "Describes a
| feature or behavior that is optional for an implementation that
| conforms to POSIX.1-2017."
|
| However, there have been cases in the past where incorrect uses "may"
| have been found and changed to "can".
|
| In any case, the "shall" in XCU 2.13.1 overrides it.

Only for shell patterns; we still need to decide whether it was the defined
"may" or an erroneous use which should be replaced by "can" for regular
expressions. Given the shell imperative, and the desire to make bracket
expressions in sh patterns and REs as equivalent as possible, I suspect the
latter.

kre

ps: I am not sure whether there has already been a bug report, and perhaps
a fix, for this, so I will leave it for someone who can find out (and
hopefully avoid irrelevant work if this is already fixed) to check (I still
cannot sanely search in mantis and get any kind of useful results). But
whereas, as above, XBD 7.4.1, page 166 defines CHARCLASS as a token, and it
is used in the grammar in XBD 7.4.2 (page 168, lines 5427, 5428 and 5433),
the list of tokens at the beginning of the grammar (page 167, lines
5369-5376) does not include it. If some earlier bug report has not already
corrected this, we need a new one to add CHARCLASS, probably in line 5373,
though that might cause that line to exceed a reasonable length - whatever
- but it should be added.