Re: re.match -- not greedy?

2006-11-22 Thread Ant

EXI-Andrews, Jack wrote:

> the '*' and '+' don't seem to be greedy.. they will consume less in
> order to match a string:
>
> >>> import re;re.match('(a+)(ab)','aaab').groups()
> ('aa', 'ab')
>
> this is the sort of behaviour i'd expect from
>'(a+?)(ab)'
>
> a+ should greedily consume a's at the expense of the string not matching

They are greedy - they consume as much as possible *without*
sacrificing the match.

You are thinking I think of possessive quantifiers, such as found in
Java, which *will* match as much as possible even if doing so causes
the match to fail. This would be written:

re.match('(a++)(ab)','aaab')

Python doesn't support these possessive quantifiers.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: re.match -- not greedy?

2006-11-21 Thread Fredrik Lundh
EXI-Andrews, Jack wrote:

 > > that's a misunderstanding of what a regular expression is, though:
 > > conceptually, a RE describes a set of strings, and the RE engine is
 > > designed to answer the question "does this string belong to this
 > > set".

 > if that's so, what is the point of +? and *?   (?)

to influence which string to pick from the set, when a target string can 
match multiple members.

> seems to me it's a bit more pragmatic than pure set membership

your assertion was that a RE should *fail* if you used greedy matching. 
  that's not how RE's work.



-- 
http://mail.python.org/mailman/listinfo/python-list


re.match -- not greedy?

2006-11-21 Thread EXI-Andrews, Jack
>> I wrote:
>>
>> import re;re.match('(a+)(ab)','aaab').groups()
>>> ('aa', 'ab')
>>> 
>>> this is the sort of behaviour i'd expect from 
>>>'(a+?)(ab)'
>>> 
>>> a+ should greedily consume a's at the expense of the string
>>> not matching
>
> Fredrick wrote:
> that's a misunderstanding of what a regular expression is, though: 
> conceptually, a RE describes a set of strings, and the RE engine is 
> designed to answer the question "does this string belong to this
> set".

if that's so, what is the point of +? and *?   (?)

seems to me it's a bit more pragmatic than pure set membership



jack
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: re.match -- not greedy?

2006-11-19 Thread Noah Rawlins
Carl Banks wrote:
> EXI-Andrews, Jack wrote:
>> the '*' and '+' don't seem to be greedy.. they will consume less in
>> order to match a string:
>>
> import re;re.match('(a+)(ab)','aaab').groups()
>> ('aa', 'ab')
>>
>> this is the sort of behaviour i'd expect from
>>'(a+?)(ab)'
>>
>> a+ should greedily consume a's at the expense of the string not matching
> 
> It's called backtracking.  If a match fails, the regexp engine
> recursively unwinds some of the greedy matching and tries again.
> 
>> in real life, i am trying to match #defines:
>>
> re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
>> abc(d)').groups()
>> ('ab', 'c')
>>
>> i want this example to fail because the first character after a string
>> of letters is a '('
>> i want to match only #defines without parameters.
> 
> Probably the easiest thing to do is to attempt to match zero or one
> opening parentheses, and bail out if it did end up matching the
> parenthesis.
> 
> m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
> if m and m.groups(2) != "(":
> ...
> 
> (BTW, it's good practice to use a raw strings for the regular
> expressions as I did.)
> 

Another way that seems clearer to me is to use negative lookahead 
assertions.

 >>>defPatt = re.compile(r'#define\s+(\w+)\b(?!\()')
 >>> defPatt.match("#define abc(x)")
 >>> defPatt.match("#define foo_BarBaz")
<_sre.SRE_Match object at 0xb7dd5820>
 >>> defPatt.match("#define foo_BarBaz").groups()
('foo_BarBaz',)

In general '\w' is the same as [A-Za-z0-9_]

There are other considerations too... I don't know if

#define abc (x)

But the main thing here is the use of '\b' to require there to be a word 
boundary at the end your match and a (?!\() to require that the match is 
not followed by a '('

see http://docs.python.org/lib/re-syntax.html

noah


>> so what's the definition of greedy?
> 
> It means match as much you can *while getting a match*.
> 
> Just remember regexp parsing is not a one way street: if it hits a dead
> end, it'll go back and try another path.  In fact, greediness usually
> doesn't affect *whether* a regexp matches; it only affects the
> groupings.  I'm suddenly curious if there are any cases at all where
> greediness changes whether it finds a match.
> 
>> (pls copy me on responses)
> 
> Ah, too bad you won't see it then
> 
> 
> Carl Banks
> 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: re.match -- not greedy?

2006-11-19 Thread André Malo
* Carl Banks wrote:

> I'm suddenly curious if there are any cases at all where
> greediness changes whether it finds a match.

nope, per definitionem ;-)

greedy := longest leftmost match

nd
-- 
die (eval q-qq[Just Another Perl Hacker
]
;-)
# André Malo,  #
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: re.match -- not greedy?

2006-11-19 Thread Carl Banks
EXI-Andrews, Jack wrote:
> the '*' and '+' don't seem to be greedy.. they will consume less in
> order to match a string:
>
> >>> import re;re.match('(a+)(ab)','aaab').groups()
> ('aa', 'ab')
>
> this is the sort of behaviour i'd expect from
>'(a+?)(ab)'
>
> a+ should greedily consume a's at the expense of the string not matching

It's called backtracking.  If a match fails, the regexp engine
recursively unwinds some of the greedy matching and tries again.

> in real life, i am trying to match #defines:
>
> >>> re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
> abc(d)').groups()
> ('ab', 'c')
>
> i want this example to fail because the first character after a string
> of letters is a '('
> i want to match only #defines without parameters.

Probably the easiest thing to do is to attempt to match zero or one
opening parentheses, and bail out if it did end up matching the
parenthesis.

m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
if m and m.groups(2) != "(":
...

(BTW, it's good practice to use a raw strings for the regular
expressions as I did.)

> so what's the definition of greedy?

It means match as much you can *while getting a match*.

Just remember regexp parsing is not a one way street: if it hits a dead
end, it'll go back and try another path.  In fact, greediness usually
doesn't affect *whether* a regexp matches; it only affects the
groupings.  I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.

> (pls copy me on responses)

Ah, too bad you won't see it then


Carl Banks

-- 
http://mail.python.org/mailman/listinfo/python-list


re.match -- not greedy?

2006-11-19 Thread EXI-Andrews, Jack
the '*' and '+' don't seem to be greedy.. they will consume less in
order to match a string:

>>> import re;re.match('(a+)(ab)','aaab').groups()
('aa', 'ab')

this is the sort of behaviour i'd expect from 
   '(a+?)(ab)'

a+ should greedily consume a's at the expense of the string not matching

in real life, i am trying to match #defines:

>>> re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
abc(d)').groups()
('ab', 'c')

i want this example to fail because the first character after a string
of letters is a '('
i want to match only #defines without parameters.

so what's the definition of greedy?

(pls copy me on responses)

ta,


jack
-- 
http://mail.python.org/mailman/listinfo/python-list