Re: RegExp syntax suggestion: allow CharacterClassEscape in CharacterRange

Gavin Barraclough Tue, 28 Dec 2010 17:56:52 -0800

Hi Lasse,

Thank you for the comprehensive list!

In all these cases we should perhaps first attempt to determine whether these 
could just be considered spec compliance issues that ought to be fixed in 
implementations.  If retaining compatibility with existing code is going to 
require continued support for syntax beyond that currently covered by the spec, 
then it would seem beneficial to try to add this.  I'd be interested in 
discussing some proposals to address the issues you've listed.

On Dec 9, 2010, at 12:33 PM, Lasse Reichstein wrote:
> // Invalid ControlEscape/IdentityEscape character treated as literal.
>  /\z/;  // Invalid escape, same as /z/

> To match the current behavior, IdentityEscape shouldn't exclude all of 
> IdentifierPart,
> but only the characters that already mean something else.

Agreed - I expect this change will be necessary to retain compatibility with 
existing code.  String literals define all source characters not used in escape 
sequences - bar decimal digits - to be valid identity escapes 
(NonEscapeCharacter), so it would only seem consistent to do the same for 
RegExps.  A minimal but slightly messy fix would be:

IdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor 
CharacterClassEscape nor any of:
        b B c x u

ClassEscape ::
        DecimalEscape
        b
        B
        CharacterEscape
        CharacterClassEscape

Adding semantics such that ClassEscape :: B generates the character set { B }.  
A slightly cleaner solution may be to restructure this bit of BNF to give 
ClassEscapes and AtomEscapes separate copies of IdentityEscape such that 
special handling for /[\B]/ is not necessary:

AtomEscape ::
        DecimalEscape
        CharacterEscape
        CharacterClassEscape
        AtomIdentityEscape

AtomIdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape 
nor CharacterClassEscape nor any of:
        b B c x u

CharacterEscape ::
        ControlEscape
        c ControlLetter
        HexEscapeSequence
        UnicodeEscapeSequence

ClassEscape ::
        DecimalEscape
        b
        CharacterEscape
        CharacterClassEscape
        ClassIdentityEscape

ClassIdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape 
nor CharacterClassEscape nor any of:
        b c x u

Even if we did not need to liberalize IdentityEscape to this extent, the 
current definition in the spec seems a little unworkable, particularly since it 
appears to prohibit /\$/.  Defining IdentityEscape in terms of IdentifierPart 
also appears to have the consequence of requiring users to remember which set 
of punctuation falls within the UnicodeConnectorPunctuation set to correctly 
determine which punctuation marks they can validly escape this way!  I would 
suggest a still restrictive, but more practical, definition for IdentityEscape 
would be:

IdentityEscape ::
        SourceCharacter but not ControlLetter or DecimalDigit

> // Incomplete/Invalid ControlEscape treated as either "\\c" or "c"
>  /\c/;  // same as /c/ or /\\c/
>  /\c2/;  // same as /c2/ or /\\c2/
> // Incomplete HexEscapeSequence escape treated as either "\\x" or "x".
>  /\x/;  // incomplete x-escape
>  /\x1/;  // incomplete x-escape
>  /\x1z/;  // incomplete x-escape
> // Incomplete UnicodeEscapeSequence escape treated as either "\\u" or "u".
>  /\u/;  // incomplete u-escape
>  /\uz/;  // incomplete u-escape
>  /\u1/;  // incomplete u-escape
>  /\u1z/;  // incomplete u-escape
>  /\u12/;  // incomplete u-escape
>  /\u12z/;  // incomplete u-escape
>  /\u123/;  // incomplete u-escape
>  /\u123z/;  // incomplete u-escape

> Most of the RegExps treat a malformed (start of a multi-character) escape 
> sequence
> as a simple identity escape or octal escape, and extends identity escapes to 
> all characters
> that doesn't already have another meaning (ControlEscape, 
> CharacterClassEscape or
> one of c, x, u, or b, and B outside a CharacterClass).

Agreed - though I think it would be preferable to try generating the syntax 
error in these cases, if this is feasible.

Given that these invalid sequences are currently being handled with some 
inconsistency across browsers, this might a good hint that we can treat this as 
a compliance bug that we can just fix in implementations.  Hex and unicode 
escapes are also present in string literals, and malformed escapes in string 
literals should also generate a syntax error (browser behaviour appears 
inconsistent here), and it would seem preferable to retain consistency between 
string & RegExp literals.

If we do need to be permissive in these cases, then I guess this requires up to 
four characters of lookahead.  I believe invalid escapes - if not a syntax 
error - should be handled as per IdentityEscapes (i.e. the backslash is not 
retained - /\c/ as /c/, not /\\c/).

> Allowing /\c2/ to match "c2", but requiring /\CB/ to match "\x02" seems like 
> it would
> be better explained in prose than in the BNF.

Well, if we're going to allow /\c2/ to match "c2" then it seems we should also 
allow /\c/ to match "c", and I think that will require changes to the BNF.  
Also, a prose explanation may be a little trickier than it appears at first 
blush, when you take quantifiers into account (if one trivially were to write 
semantics that would generate a nested pair of matchers for /\c2/ (to first 
match for { c }, then {2}), then I think this would result in /\c2*/ being 
handled as /(?:c2)*/ when the quantifier is applied - I'm sure there would be 
ways to avoid this, but would probably make a prose approach more complex).  
And this might not actually be too bad to deal with in BNF anyway - I believe 
this ought to work?:

CharacterEscape ::
        ControlEscape
        c [lookahead ∉ ControlLetter]
        c ControlLetter
        HexEscapeSequence
        UnicodeEscapeSequence
        IdentityEscape

Then we would just need to add semantics to define the result of c [lookahead ∉ 
ControlLetter] to be the character set { c }.

> // Bad quantifier range:
>  /x{z/;  // same as /x\{z/
>  /x{1z/;  // same as /x\{1z/
>  /x{1,z/;  // same as /x\{1,z/
>  /x{1,2z/;  // same as /x\{1,2z/
>  /x{10000,20000z/;  // same as /x\{10000,20000z/
> // Notice: It needs arbitrary lookahead to determine the invalidity,
> // except Mozilla that limits the numbers.

Permitting open braces that don't form valid range quantifiers (per current 
browser implementations) leads to a couple of odd quirks:
- it seems inconsistent, given that syntax errors are generated for unclosed 
brackets and parentheses.
- it means that a correctly formed quantifier without an atom will throw a 
syntax error, but a malformed one will not.  E.g. /{0}/ throws a syntax error, 
/{0{/ does not.

If we can't change implementations to adopt spec behavior here, then one 
possible simple solution (assuming it is web compatible) might be to add the 
following pattern to the 'Atom' rule:

Atom ::
        PatternCharacter
        .
        \ AtomEscape
        CharacterClass
        ( Disjunction )
        ( ? : Disjunction )
        { [lookahead ∉ DecimalDigit]

This would allow an unclosed brace in many situations, without requiring 
arbitrary lookahead, and would not be ambiguous with range quantifiers (since 
these always require the open brace to be followed by a decimal digit).
The drawback I see with this approach would be that it might appear arbitrary 
to accept /{a/ but reject /{1/.  This said, the spec already is more prone to 
generate syntax errors around decimal digits in literals, for example escaped 
decimal digits being a syntax error in string literals and RegExp character 
classes, so perhaps this would be okay.

> // Zero-initialized Octal escapes.
>  /\012/;    // same as /\x0a/

I've only tested this one on a couple of browsers, but those tested permit any 
number of zeros, followed by any octal sequence with a decimal value up to 255. 
 Here's a possible set of grammar rules that I think might be okay:

DecimalEscape ::
        Zeros OptionalLatin1OctalValue
        NonZeroDigit DecimalDigits 

Zeros ::
        0
        Zeros 0

OptionalLatin1OctalValue ::
        [lookahead ∉ OctalDigit]
        OneToThree [lookahead ∉ OctalDigit]
        FourToSeven [lookahead ∉ OctalDigit]
        OneToThree OctalDigit [lookahead ∉ OctalDigit]
        FourToSeven OctalDigit
        OneToThree OctalDigit OctalDigit

OneToThree :: one of
        1 2 3

FourToSeven :: one of
        4 5 6 7

(Alternatively this syntax could be tightened up a little to restrict octal 
values to a single leading zero, if implementations can do so too).

> // Nonexisting back-references treated as octal escapes:
>  /\5/;  // same as /\x05/

I really hope we can remove this 'feature' from implementations.  I'm going to 
hold off for now on attempting to propose grammar.  This one does look a little 
complex to write BNF for, I guess we can attampt a prose solution, though again 
this will run into problems in the case of quantifiers.

(OOI there is some inconsistency in how this is handled in browsers - for 
example, is /\777/ equivalent to /\x3f/ or /\x3f7/).

> // Invalid PatternCharacter accepted unescaped
>  /]/;
>  /{/;
>  /}/;

'{' is addressed above, fixing the spec to permit ']' and '}' would be easy, 
and would mean modifying this rule:

PatternCharacter :: SourceCharacter but not any of:
        ^ $ \ . * + ? ( ) [ ] { } |

However it would introduce something of an inconsistency if we permit unmatched 
']' and '}" but not ')'.

* One possibility would be to relax both the spec and implementations, and 
allow all types of close brackets - ']' and '}" and ')' (though I think 
permitting ')' might require a little ugly duplication in the BNF).
* Depending on the resolution to bad quantifier ranges, above, another option 
might be to permit an unmatched close brace, but start generating a syntax 
error for unmatched close brackets (and retain the error for unmatched closed 
parentheses) - i.e. if we do permit unmatched open brace, then perhaps we 
should also be permitting unmatched close brace.

> // Bad escapes also inside CharacterClass.
>  /[\z]/;
>  /[\c]/;
>  /[\c2]/;
>  /[\x]/;
>  /[\x1]/;
>  /[\x1z]/;
>  /[\u]/;
>  /[\uz]/;
>  /[\u1]/;
>  /[\u1z]/;
>  /[\u12]/;
>  /[\u12z]/;
>  /[\u123]/;
>  /[\u123z]/;
>  /[\012]/;
>  /[\5]/;
> // And in addition:
>  /[\B]/;
>  /()()[\2]/;  // Valid backreference should be invalid.

In the case of hex and unicode escapes, I believe handling of ClassEscapes 
matches that for AtomEscapes, and this should remain the case.

I'm not sure about other browsers, but I am aware that Safari and Firefox both 
extend the syntax of control letters in character classes to include 
DecimalDigits and '_', so /[\c2]/ is accepted and is regarded as a control 
letter, unlike /\c2/ (which neither generates a syntax error, nor is treated as 
a control letter) - I believe that is both browsers /\c2/ is equivalent to /c2/ 
but /[\c2]/ is equivalent to /[\x12]/ (or /\x12/).  Hopefully this extension 
can simply be removed – having incompatible syntax here between ClassEscapes 
and AtomEscapes seems unhelpful.

Since most browsers seem to permit decimal characters as an IdentityEscape, 
though these are permitted in string literals, so it would be a shame to add 
support if not strictly necessary.  Additionally implementations permit octal 
escapes in ClassEscapes - we should keep this in sync with whatever decision we 
make for octal in AtomEscapes.  Assuming some of the additional grammar 
proposed above (the BNF for octal, and splitting IdentityEscape for AtomEscape 
and ClassEscape) the following may be a workable solution if we need to support 
these features:

ClassEscape ::
        Zeros OptionalLatin1OctalValue
        b
        CharacterEscape
        CharacterClassEscape
        ClassIdentityEscape

ClassIdentityEscape :: SourceCharacter but not ControlEscape nor 
CharacterClassEscape nor any of:
        0 b c x u

I'd be keen to hear your thoughts on all this.  I'll continue investigating 
what parts of the extended syntax are truly necessary from a compatibility 
perspective, and expect I should start moving this conversation onto a wiki 
page at some point.

cheers,
G.

p.s. One unrelated missing syntax error I've noticed is on quantified 
parenthetical assertions – browsers seem to permit /(?=)*/, which should be a 
syntax error.

_______________________________________________
es-discuss mailing list
es-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

Re: RegExp syntax suggestion: allow CharacterClassEscape in CharacterRange

Reply via email to