Sorry for delaying my response on this. Wanted to get some public
opinions on it:
<uri:http://web-graphics.com/2007/09/05/a-quick-js-quizz-for-anybody-who-think-they-know-regex/>

Sadly I only got responses from two developers, but they both seem to
agree with me (and I was careful to not state my opinion there)...


> On 9/2/07, liorean <[EMAIL PROTECTED]> wrote:
> >     var
> >         re=/(a\1)+|b/,
> >         a=[],
> >         key,
> >         aab=re.exec('aab')
> >         aaab=re.exec('aaab');
> >     for(key in aab)
> >          a.push('aab["'+key+'"]: '+aab[key]);
> >     for(key in aaab)
> >          a.push('aaab["'+key+'"]: '+aaab[key]);
> >     a.join('\r\n');

On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote:
> From the spec it's pretty clear (to me anyhow) that SpiderMonkey and
> Opera are correct here (and the ES4 reference implementation, whose
> RegEx engine was modelled directly on the spec, agrees).  The
> reasoning is that (a) the **undefined** value in the captures array
> matches any input without advancing the input pointer [15.10.2.9 step
> 8 substep 3], (b) every element in the captures array starts out as
> **undefined** [15.10.2.2 step 2 substep 4], (c) every element in the
> captures array affected by a particular set of capturing parens is
> reset to **undefined** every time those parens are entered [15.10.2.5
> case "RepeatMatcher" step 4], and (d) the captures array is only
> updated after the capturing parens are exited during matching
> [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5].
> Therefore the pattern above is always the same as /(a)+|b/.  Therefore
> the results are what they are.

Seems I've been looking in entirely the wrong places. Yes, that seems correct.

> On 9/2/07, liorean <[EMAIL PROTECTED]> wrote:
> > It seems to me that, looking at the situation from the perspective of
> > a user of regex, that there are two ways of tackling this problem that
> > makes sense, and they are hinged on whether a captured submatch is
> > considered to be undefined or the empty string by default.

On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote:
> In ES3 they are clearly **undefined** by default.

This was, as stated, looking from the perspective of a *user* or
regex, not an implementer. The ES3 spec is actually the solution that
makes least sense from the user perspective, if you ask me. I'd even
call it a weird choice, given the alternatives. To throw out a
submatch before trying to match it the next time instead of just
replacing the submatch value after the next match is unintuitive
behaviour.

> On 9/2/07, liorean <[EMAIL PROTECTED]> wrote:
> > If the capture is considered to be undefined by default, then the ES3
> > solution makes most sense... except for one thing: a capturing
> > submatch containing a backreference to the containing capture means a
> > guaranteed failure to match in that path, and is pretty easy to detect
> > at compile time.

On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote:
> The way I understand the spec it means a guaranteed success.  Still
> pretty silly, to be sure.

This is what I was misunderstanding - that the behaviour for when the
capture was undefined actually meant a match for the empty string.

> On 9/2/07, liorean <[EMAIL PROTECTED]> wrote:
> > Throwing a syntax error at compile time seems more
> > appropriate since I hardly think developers want their regex to spend
> > time in constructs that cannot possibly match. Alternatively, engines
> > can simply throw out that entire path at compile time and just set all
> > the contained captures to undefined - behaviour would still be
> > identical to what ES3 specifies. I.e. it's a choice between loud or
> > silent failure.

On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote:
> Or a vacuous and silent success, as in this case.

Which makes less sense than either the JavaScriptCore or the JScript
behaviour, IMO.

On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote:
> I'm not sure what problem you're trying to solve, but my hope is that
> once ES4 is widely implemented programmers will switch to named
> submatches to avoid the problem with numeric backreferences entirely.

The problem is the same whether it's named or not, I think. Consider:

    /(?P<as>a(?P=as))+|b/.exec('aaab');

Is the behaviour of that any different from the behaviour of these:

    /(a\1)+|b/.exec('aaab');
    /(?P<as>a\1)+|b/.exec('aaab');

Since in this case, unless I've misunderstood the proposal, (?P=as) is
just an alias for \1.

On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote:
> In the mean time, there's obviously room for a good implementation to
> warn about ineffectual submatches like these.  Changing the semantics
> to an error is not very appealing, on the one hand I doubt it will
> break a lot of sites, on the other hand I doubt it will fix many
> either.

I don't believe there is just about any live code that actually uses
backreferences within the capturing match they refer to. If there
were, I don't think a behaviour as divergent from the other browsers
as that of JavaScriptCore would have been able to fly under the radar.



What I was really asking for here is: Which behaviour should I
implement. From the user perspective, I think it's best if a
backreference to the containing submatch (independent of whether it's
a named backreference or an indexed one) is a failure, like in
JavaScriptCore. The reason is that it's simple to understand using the
following reasoning based on preexpanding the backreference:

- The pattern (\1) matches zero characters plus an expansion of
itself, which is zero characters. Thus that's equal to
((?:){Infinity}), which is still zero characters.
- However, (a\1) matches one character plus an expansion of itself
(exactly which character is irrelevant, which should be obvious soon),
which is one character plus an expansion of itself, which is two
characters... ad inifinitum. Thus that is equal to (a{Infinity}) for
which there cannot possibly exist a matching sting, since strings are
finite.

In other ways, a backreference to the containing capture is logically
just plain nonsense.

>From an implementor perspective, it doesn't matter much whether this
is a silent or loud failure, I think. Making it silent would just mean
we can ignore that whole code path anyway.

I agree making it a syntax error is probably not the best idea, since
that could kill an entire script, even if it's never even used.
Emitting a warning is probably a good idea, though.



Second best, from a user perspective, is to not reset the captures,
like in JScript. This fits well into the following iterative reasoning
about what the pattern does:

- The pattern (\1)+ is equivalent to ((?:)) first repetition, a match
for the first capture which is the empty string. The capturing
submatch is the new value for the first capture.
- The pattern (\1)+ is equivalent to (?:)((?:)) second repetition, a
match for first repetition which is the empty string plus zero
characters plus the first capture which is the empty string. The
capturing submatch is the new value for the first capture.

- The pattern (a\1)+ is equivalent to (a(?:)) first repetition, which
matches 'a' plus the first capture which is the empty string. The
capturing submatch is the new value for the first capture.
- The pattern (a\1)+ is equivalent to (?:a)(a(?:a)) second repetition,
which matches the first repetition which is 'a' plus 'a' plus the
first capture which is 'a'. The capturing submatch is the new value
for the first capture.
- The pattern (a\1)+ is equivalent to (?:aaa)(a(?:aa)) third
repetition, which matches the second repetition which is 'aaa' plus
'a' plus the first capture which is 'a'a. The capturing submatch is
the new value for the first capture.

I'm sure you recognise the pattern:)



>From an implementor perspective, I hardly think not having to reset
part of the capture array for each repetition is bad.




How does it work with capturing submatches within the submatch that is repeated?

    /(?:(a)(b)?)+/.exec('aaba');

JavaScriptCore and JScript: ['aaba','a','b']
SpiderMonkey and futhark: ['aaba','a','']


    /(?:(\2|a)(b)?)+/.exec('abba');

JavaScriptCore: ['abba','a','b']
JScript: ['ab','a','b']
SpiderMonkey and futhark: ['abba','a','']

ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the
correct results would be  ['ab','a','b'] since the \2 capture would be
set to undefined before the second repetition. Thus futhark and
SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy
if it turned out I'm wrong here, however.)




I repeat: I think ES3 (and SpiderMonkey and futhark) has taken the
wrong choice here. It doesn't make sense to throw away submatches
after they have been captured, and it's not what regex users expect.
And a reasonable handling of a backreference to a containing submatch
is that it's nonsense since it logically can't match anything. But
even if they're not deemed nonsense, they should refer to last
repetition's capture, not be reset with each repetition.





So I guess what I'm asking you is if you think the ES3 algorithm is
appropriate, and if a change to a more sensible interpretation could
be done in ES4. Since we have three totally different interpretations
in browser hosted engines, I bet there's precious little code out
there  that relies on the ES3 behaviour, or for that matter the
particular implementation in any of the browser hosted engines.
-- 
David "liorean" Andersson
_______________________________________________
Es4-discuss mailing list
Es4-discuss@mozilla.org
https://mail.mozilla.org/listinfo/es4-discuss

Reply via email to