Sorry for delaying my response on this. Wanted to get some public opinions on it: <uri:http://web-graphics.com/2007/09/05/a-quick-js-quizz-for-anybody-who-think-they-know-regex/>
Sadly I only got responses from two developers, but they both seem to agree with me (and I was careful to not state my opinion there)... > On 9/2/07, liorean <[EMAIL PROTECTED]> wrote: > > var > > re=/(a\1)+|b/, > > a=[], > > key, > > aab=re.exec('aab') > > aaab=re.exec('aaab'); > > for(key in aab) > > a.push('aab["'+key+'"]: '+aab[key]); > > for(key in aaab) > > a.push('aaab["'+key+'"]: '+aaab[key]); > > a.join('\r\n'); On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote: > From the spec it's pretty clear (to me anyhow) that SpiderMonkey and > Opera are correct here (and the ES4 reference implementation, whose > RegEx engine was modelled directly on the spec, agrees). The > reasoning is that (a) the **undefined** value in the captures array > matches any input without advancing the input pointer [15.10.2.9 step > 8 substep 3], (b) every element in the captures array starts out as > **undefined** [15.10.2.2 step 2 substep 4], (c) every element in the > captures array affected by a particular set of capturing parens is > reset to **undefined** every time those parens are entered [15.10.2.5 > case "RepeatMatcher" step 4], and (d) the captures array is only > updated after the capturing parens are exited during matching > [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5]. > Therefore the pattern above is always the same as /(a)+|b/. Therefore > the results are what they are. Seems I've been looking in entirely the wrong places. Yes, that seems correct. > On 9/2/07, liorean <[EMAIL PROTECTED]> wrote: > > It seems to me that, looking at the situation from the perspective of > > a user of regex, that there are two ways of tackling this problem that > > makes sense, and they are hinged on whether a captured submatch is > > considered to be undefined or the empty string by default. On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote: > In ES3 they are clearly **undefined** by default. This was, as stated, looking from the perspective of a *user* or regex, not an implementer. The ES3 spec is actually the solution that makes least sense from the user perspective, if you ask me. I'd even call it a weird choice, given the alternatives. To throw out a submatch before trying to match it the next time instead of just replacing the submatch value after the next match is unintuitive behaviour. > On 9/2/07, liorean <[EMAIL PROTECTED]> wrote: > > If the capture is considered to be undefined by default, then the ES3 > > solution makes most sense... except for one thing: a capturing > > submatch containing a backreference to the containing capture means a > > guaranteed failure to match in that path, and is pretty easy to detect > > at compile time. On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote: > The way I understand the spec it means a guaranteed success. Still > pretty silly, to be sure. This is what I was misunderstanding - that the behaviour for when the capture was undefined actually meant a match for the empty string. > On 9/2/07, liorean <[EMAIL PROTECTED]> wrote: > > Throwing a syntax error at compile time seems more > > appropriate since I hardly think developers want their regex to spend > > time in constructs that cannot possibly match. Alternatively, engines > > can simply throw out that entire path at compile time and just set all > > the contained captures to undefined - behaviour would still be > > identical to what ES3 specifies. I.e. it's a choice between loud or > > silent failure. On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote: > Or a vacuous and silent success, as in this case. Which makes less sense than either the JavaScriptCore or the JScript behaviour, IMO. On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote: > I'm not sure what problem you're trying to solve, but my hope is that > once ES4 is widely implemented programmers will switch to named > submatches to avoid the problem with numeric backreferences entirely. The problem is the same whether it's named or not, I think. Consider: /(?P<as>a(?P=as))+|b/.exec('aaab'); Is the behaviour of that any different from the behaviour of these: /(a\1)+|b/.exec('aaab'); /(?P<as>a\1)+|b/.exec('aaab'); Since in this case, unless I've misunderstood the proposal, (?P=as) is just an alias for \1. On 03/09/07, Lars T Hansen <[EMAIL PROTECTED]> wrote: > In the mean time, there's obviously room for a good implementation to > warn about ineffectual submatches like these. Changing the semantics > to an error is not very appealing, on the one hand I doubt it will > break a lot of sites, on the other hand I doubt it will fix many > either. I don't believe there is just about any live code that actually uses backreferences within the capturing match they refer to. If there were, I don't think a behaviour as divergent from the other browsers as that of JavaScriptCore would have been able to fly under the radar. What I was really asking for here is: Which behaviour should I implement. From the user perspective, I think it's best if a backreference to the containing submatch (independent of whether it's a named backreference or an indexed one) is a failure, like in JavaScriptCore. The reason is that it's simple to understand using the following reasoning based on preexpanding the backreference: - The pattern (\1) matches zero characters plus an expansion of itself, which is zero characters. Thus that's equal to ((?:){Infinity}), which is still zero characters. - However, (a\1) matches one character plus an expansion of itself (exactly which character is irrelevant, which should be obvious soon), which is one character plus an expansion of itself, which is two characters... ad inifinitum. Thus that is equal to (a{Infinity}) for which there cannot possibly exist a matching sting, since strings are finite. In other ways, a backreference to the containing capture is logically just plain nonsense. >From an implementor perspective, it doesn't matter much whether this is a silent or loud failure, I think. Making it silent would just mean we can ignore that whole code path anyway. I agree making it a syntax error is probably not the best idea, since that could kill an entire script, even if it's never even used. Emitting a warning is probably a good idea, though. Second best, from a user perspective, is to not reset the captures, like in JScript. This fits well into the following iterative reasoning about what the pattern does: - The pattern (\1)+ is equivalent to ((?:)) first repetition, a match for the first capture which is the empty string. The capturing submatch is the new value for the first capture. - The pattern (\1)+ is equivalent to (?:)((?:)) second repetition, a match for first repetition which is the empty string plus zero characters plus the first capture which is the empty string. The capturing submatch is the new value for the first capture. - The pattern (a\1)+ is equivalent to (a(?:)) first repetition, which matches 'a' plus the first capture which is the empty string. The capturing submatch is the new value for the first capture. - The pattern (a\1)+ is equivalent to (?:a)(a(?:a)) second repetition, which matches the first repetition which is 'a' plus 'a' plus the first capture which is 'a'. The capturing submatch is the new value for the first capture. - The pattern (a\1)+ is equivalent to (?:aaa)(a(?:aa)) third repetition, which matches the second repetition which is 'aaa' plus 'a' plus the first capture which is 'a'a. The capturing submatch is the new value for the first capture. I'm sure you recognise the pattern:) >From an implementor perspective, I hardly think not having to reset part of the capture array for each repetition is bad. How does it work with capturing submatches within the submatch that is repeated? /(?:(a)(b)?)+/.exec('aaba'); JavaScriptCore and JScript: ['aaba','a','b'] SpiderMonkey and futhark: ['aaba','a',''] /(?:(\2|a)(b)?)+/.exec('abba'); JavaScriptCore: ['abba','a','b'] JScript: ['ab','a','b'] SpiderMonkey and futhark: ['abba','a',''] ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.) I repeat: I think ES3 (and SpiderMonkey and futhark) has taken the wrong choice here. It doesn't make sense to throw away submatches after they have been captured, and it's not what regex users expect. And a reasonable handling of a backreference to a containing submatch is that it's nonsense since it logically can't match anything. But even if they're not deemed nonsense, they should refer to last repetition's capture, not be reset with each repetition. So I guess what I'm asking you is if you think the ES3 algorithm is appropriate, and if a change to a more sensible interpretation could be done in ES4. Since we have three totally different interpretations in browser hosted engines, I bet there's precious little code out there that relies on the ES3 behaviour, or for that matter the particular implementation in any of the browser hosted engines. -- David "liorean" Andersson _______________________________________________ Es4-discuss mailing list Es4-discuss@mozilla.org https://mail.mozilla.org/listinfo/es4-discuss