Stephane Chazelas <stephane.chaze...@gmail.com> wrote, on 04 May 2018:
>
> 2018-05-04 09:30:56 +0100, Geoff Clare:
> [...]
> > > That's the point: we should allow \ to be an escaping operator
> > > inside brackets. In awk and anything else. Technically, that
> > > means a portable application has to double the \ inside
> > > brackets.
> > 
> > The point of awk's extra level of \ interpretation is to allow the
> > use of \t for TAB, etc.  And that's the only reason to use it inside
> > a bracket expression, since the XBD ERE rules say \ is not special
> > in a bracket expression, 
> 
> It says that, but it doesn't match the reality of current awk
> implementations where \ is special within brackets (and also in
> many shell wildcards and in some sed, ex, vi implementations,
> and in most REs outside of POSIX).

It was a deliberate choice made by the original POSIX.2 developers.
See XRAT A.9.3.5:

    Current practice in awk and lex is to accept escape sequences in
    bracket expressions as per XBD Table 5-1 (on page 121), while the
    normal ERE behavior is to regard such a sequence as consisting of
    two characters. Allowing the awk/lex behavior in EREs would change
    the normal behavior in an unacceptable way; it is expected that
    awk and lex will decode escape sequences in EREs before passing
    them to regcomp() or comparable routines.

(The Table 5-1 reference is the \a, \b, \f, etc. sequences.)
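
As a small illustration of the XRAT text above (a sketch only; the
variable name tab_match is made up for this example), awk decodes \t
in the ERE token before the bracket expression is compiled, so [\t]
matches a TAB rather than a backslash or a "t":

```shell
# Two input lines: one containing a TAB, one without.
# awk turns \t into a TAB first, so only the first line is selected.
tab_match=$(printf 'a\tb\nab\n' | awk '/[\t]/')
printf '%s\n' "$tab_match"
```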

> It's not only about ^. /[\*]/ fails to match backslash
> as well for instance (in all but busybox and Solaris
> /usr/xpg4/bin/awk).
> 
> > This alone is enough to mean portable applications have to use \\
> > inside brackets to include \ in the list of characters to match.
> > So I think we should just make \^ inside a bracket expression
> > undefined.
> 
> That would not be enough to match the current reality, I'd say
> \<anything-but-the-C-escapes> (\n, \ooo, \b...) at least should
> be undefined inside bracket expressions.

I'd be okay with that.
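
For reference, the doubled form mentioned above can be tried like this
(a sketch; the variable name bs_match is hypothetical). It is the
spelling that should select a literal backslash regardless of whether
the implementation treats \ as special inside brackets:

```shell
# %s keeps the backslash in 'a\b' literal on the input side.
# /[\\]/ is the doubled form: awk reduces \\ to a single backslash.
bs_match=$(printf '%s\n' 'a\b' 'ab' | awk '/[\\]/')
printf '%s\n' "$bs_match"
```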

> Does the change address the differences in behaviour for 
> 
> PATTERN='\f' awk '$0 ~ ENVIRON["PATTERN"]'
> or
> awk '$0 ~ "\\f"'

This was discussed in a conference call and we decided no change
was needed as the standard already says the following in the paragraph
after the table that the bug resolution modifies:

    If the right-hand operand is any expression other than the lexical
    token ERE, the string value of the expression shall be interpreted
    as an extended regular expression, including the escape conventions
    described above. Note that these same escape conventions shall also
    be applied in determining the value of a string literal (the lexical
    token STRING), and thus shall be applied a second time when a
    string literal is used in this context.
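
A minimal sketch of that double processing, using an escaped <period>
rather than \f so the result is easy to see (the variable names
str_match and ere_match are invented for this example):

```shell
# String literal: "a\\.b" has the string value a\.b after the first
# round of escape processing; as an ERE that is a, literal period, b.
str_match=$(printf '%s\n' 'a.b' 'acb' | awk '$0 ~ "a\\.b"')

# ERE token: only one level of escaping is needed for the same pattern.
ere_match=$(printf '%s\n' 'a.b' 'acb' | awk '/a\.b/')

printf '%s\n' "$str_match" "$ere_match"
```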

> You said earlier that,
> 
> echo b | awk '/[a\55c]/'
> 
> is required to match (on ASCII-based systems), but that's not
> the case with several implementations (like nawk on Solaris,
> bwk's awk, mawk, FreeBSD awk...).
> 
> Should those be considered non-compliant?

The previous discussion was about \056 matching any character because
it becomes a <period>, but this case should match too (assuming both
ASCII encoding and the POSIX locale; the latter because it's a range).
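
Since implementations differ here, a probe along these lines can show
what a given awk does (a sketch only; the function name
probe_octal_range and the labels "range" and "literal" are made up for
this example, and no particular result is assumed):

```shell
# Reports how the local awk treats \55 (octal for '-') in brackets.
probe_octal_range() {
    # POSIX locale, because a range expression is only defined there.
    if [ -n "$(printf 'b\n' | LC_ALL=C awk '/[a\55c]/')" ]; then
        echo 'range'      # b matched: \55 became '-' and formed a-c
    else
        echo 'literal'    # b did not match: no a-c range was formed
    fi
}

result=$(probe_octal_range)
printf '%s\n' "$result"
```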

-- 
Geoff Clare <g.cl...@opengroup.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
