On 22/01/13 10:11, Marcin Mleczko wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hello Hugo, hello Walter,

first thank you very much for the quick reply.

The functions used here i.e. re.match() are taken directly form the
example in the mentioned HowTo. I'd rather use re.findall() but I
think the general interpretetion of the given regexp sould be nearly
the same in both functions.

So I'd like to neglect the choise of a particular function for a
moment a concentrate on the pure theory.
What I got so far:
in theory form s = '<<html><head><title>Title</title>'
'<.*?>' would match '<html>' '<head>' '<title>' '</title>'
to achieve this the engine should:
1. walk forward along the text until it finds <
2. walk forward from that point until in finds >
3. walk backward form that point (the one of >) until it finds <
4. return the string between < from 3. and > from 2. as this gives the
least possible string between < and >

Did I get this right so far? Is this (=least possible string between <
and >), what non-greedy really translates to?

Nope, that's not how regex works. In particular, it does not do step 3, non-greedy regex do not walk backward to find the shortest string that matches.

What it does for the non-greedy pattern <.*?> is:

1. walk forward along the text until it finds <
2. if the next character matches . (any chars) then step one character forward, otherwise backtrack from where we left off in the previous step 3. if the next char is >, then we find a match, otherwise backtrack from where we left off in the previous step
4. return the string between < from 1 and > from 3

or alternatively

1. walk forward along the text until it finds <
2. walk forward from that point until in finds >
3. return the string between < from 1 and > from 2

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor

Reply via email to