Re: Why this result with the re module

John Bond Tue, 02 Nov 2010 03:27:14 -0700

On 2/11/2010 8:53 AM, Yingjie Lan wrote:


BUT, but.

1. I expected findall to find matches of the whole
regex '(.a.)+', not just the subgroup (.a.) from

re.findall('(.a.)+', 'Mary has a lamb')

Thus it is probably a misunderstanding/bug??

Again, as soon as you put a capturing group in your expression, you change the nature of what findall returns, as described in the documentation. It then returns what gets assigned to each capturing group, not the chunk of text matched by the whole expression at each matching point in the string.

A capturing group returns what was matched by the regex fragment *inside it*. If you put repetition *outside it* (as you have: "(.a.)+"), that doesn't change. But if the repetition causes the group to be matched multiple times, only the last match is returned, because a capturing group is allowed only a single return value.

I find that strange, and limiting (why not return a list of all matches caused by the repetition?) but that's the way it is.
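To make the difference concrete, here is a minimal sketch using the same string. The variable names are mine, and the non-capturing form '(?:.a.)+' is only one way to get the whole match back:

```python
import re

text = 'Mary has a lamb'

# Capturing group with the repetition outside it: findall returns the
# group's contents, and the group keeps only its last repetition.
captured = re.findall('(.a.)+', text)

# Non-capturing group: there is no group to report, so findall returns
# the text matched by the whole expression.
whole = re.findall('(?:.a.)+', text)

print(captured)  # ['Mar', 'lam']
print(whole)     # ['Mar', 'has a lam']
```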

Have you read the "Regular Expression HOWTO" in the docs? It explains all this stuff.

2. Here is a statement from the documentation on
    non-capturing groups:
    see http://docs.python.org/dev/howto/regex.html

"Except for the fact that you can’t retrieve the
contents of what the group matched, a non-capturing
group behaves exactly the same as a capturing group; "

In terms of how the regular expression works when matching text, which is what the above is addressing, that's true. In terms of how the results are returned to API callers, it isn't true.
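A rough way to check both halves of that, assuming re.finditer is used to look at the match objects (the names below are only illustrative): the two patterns match exactly the same spans of text, but only the capturing version has any group contents to report.

```python
import re

text = 'Mary has a lamb'

# Where each pattern matches: identical for both forms of the group.
spans_capturing = [m.span() for m in re.finditer('(.a.)+', text)]
spans_noncapturing = [m.span() for m in re.finditer('(?:.a.)+', text)]

# What each pattern reports as groups: this is where they differ.
groups_capturing = [m.groups() for m in re.finditer('(.a.)+', text)]
groups_noncapturing = [m.groups() for m in re.finditer('(?:.a.)+', text)]

print(spans_capturing == spans_noncapturing)  # True
print(groups_capturing)     # [('Mar',), ('lam',)]
print(groups_noncapturing)  # [(), ()]
```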

    Thus, I'm again confused, despite your
    previous explanation. This might be a better
    explanation: when a subgroup is repeated, it
    only captures the last repetition.


That's true, but it's not related to the above.

3. It would be convenient to have '(*...)' for
    non-capturing groups -- but of course, that's
    only a remote suggestion.


Fair enough - each to their own preferences.

4. By reason of greediness of '*', and the concept
of non-overlapping, it should go like this for
    re.findall('((.a.)*)', 'Mary has a lamb')

step 1: Match 'Mar' + '' (greedy!)
step 2: skip 'y'
step 3: Match ''
step 4: skip ' '
step 5: Match ''+'has'+' a '+'lam'+'' (greedy!)
step 6: skip 'b'
step 7: Match ''

So there should be 4 matches in total:

'Mar', '', 'has a lam', ''

Also, if a repeated subgroup only captures
the last repetition, the repeated
subgroup (.a.)* should always be ''.

Yet the execution in Python results in 6 matches.

.....

All you have done is wrap one of your earlier regexes, '(.a.)*', in another, outer capturing group, to make '((.a.)*)'. This doesn't change what is actually matched, so there are still the same six matches found. However, it does change what is *returned*: you now have two capturing groups that findall has to return information about (at each match), so you will see that it returns six tuples (each with two items, one for each capturing group) instead of six strings, i.e.:


re.findall('(.a.)*', 'Mary has a lamb')

['Mar', '', '', 'lam', '', '']

becomes:

re.findall('((.a.)*)', 'Mary has a lamb')

[('Mar', 'Mar'), ('', ''), ('', ''), ('has a lam', 'lam'), ('', ''), ('', '')]

As you can see, the top set of results appears in the bottom set (as the second item in each tuple, because the original capturing group is the second one now; the new, outer one is the first).

If you look at the fourth tuple, ('has a lam', 'lam'), you can see the "capturing group with repetition only returns the last match" rule in action. The inner capturing group (which has repetition) returns 'lam' because that was the last occurrence of ".a." among the three ("has", " a ", "lam") that it matched that time. However, the outer capturing group, which doesn't have repetition, returns the whole thing ('has a lam').
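If it helps to see where the six matches actually fall, the following sketch walks the match objects with re.finditer and records each match's span alongside the two groups. The variable names are mine; the spans in the comments are what I would expect from the matching described above, with the four empty matches sitting at the positions where no ".a." can start:

```python
import re

text = 'Mary has a lamb'

# One entry per match: (span of the whole match, outer group, inner group).
matches = [(m.span(), m.group(1), m.group(2))
           for m in re.finditer('((.a.)*)', text)]

for span, outer, inner in matches:
    print(span, repr(outer), repr(inner))

# (0, 3) 'Mar' 'Mar'
# (3, 3) '' None
# (4, 4) '' None
# (5, 14) 'has a lam' 'lam'
# (14, 14) '' None
# (15, 15) '' None
```

Note that on a match object the inner group is None when it did not participate at all, whereas findall reports that case as an empty string.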

Finally, The name findall implies all matches
should be returned, whether there are subgroups in
the pattern or not. It might be best to return all
the match objects (like a re.match call) instead
of the matched strings. Then there is no need
to return tuples of subgroups. Even if tuples
of subgroups were to be returned, group(0) must
also be included in the returned tuple.

Regards,

Yingjie

All matches are returned by findall, so I don't understand that.

I really do suggest that you read the above-mentioned HOWTO, or any of the numerous tutorials on the net. Regexes are hard to get your head around at first, not helped by a few puzzling API design choices, but it's worth the effort, and those will be far more useful than lots of typed explanations here.

-- 
http://mail.python.org/mailman/listinfo/python-list
