Now I think I got it. Thanks a lot again. Marcin
Am 22.01.2013 12:00, schrieb tutor-requ...@python.org: > > Message: 1 > Date: Tue, 22 Jan 2013 11:31:01 +1100 > From: Steven D'Aprano <st...@pearwood.info> > To: tutor@python.org > Subject: Re: [Tutor] Question regular expressions - the non-greedy > pattern > Message-ID: <50fdddc5.6030...@pearwood.info> > Content-Type: text/plain; charset=UTF-8; format=flowed > > On 22/01/13 10:11, Marcin Mleczko wrote: >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA1 >> >> Hello Hugo, hello Walter, >> >> first thank you very much for the quick reply. >> >> The functions used here i.e. re.match() are taken directly form the >> example in the mentioned HowTo. I'd rather use re.findall() but I >> think the general interpretetion of the given regexp sould be nearly >> the same in both functions. > > Regular expressions are not just "nearly" the same, they are EXACTLY > the same, in whatever re function you call, with one exception: > > re.match only matches at the start of the string. > > >> So I'd like to neglect the choise of a particular function for a >> moment a concentrate on the pure theory. >> What I got so far: >> in theory form s = '<<html><head><title>Title</title>' >> '<.*?>' would match'<html>''<head>''<title>''</title>' > > Incorrect. It will match > > '<<html>' > '<head>' > '<title>' > '</title>' > > > Why don't you try it and see? > > py> s = '<<html><head><title>Title</title>' > py> import re > py> re.findall('<.*?>', s) > ['<<html>', '<head>', '<title>', '</title>'] > > > The re module is very stable. The above is what happens in every Python > version between *at least* 1.5 and 3.3. > > >> to achieve this the engine should: >> 1. walk forward along the text until it finds< > > Correct. That matches the first "<". > > >> 2. walk forward from that point until in finds> > > Correct. That matches the first ">". > > Since the regex has now found a match, it moves on to the next part > of the regex. Since this regex is now complete, it is done, and > returns what it has found. > > >> 3. walk backward form that point (the one of>) until it finds< > > Regexes only backtrack on *misses*, not once they successfully find > a match. Once a regex has found a match, it is done. > > >> 4. return the string between< from 3. and> from 2. as this gives the >> least possible string between< and> > > Incorrect. > > >> Did I get this right so far? Is this (=least possible string between< >> and>), what non-greedy really translates to? > > No. The ".*" regex searches forward as far as possible; the ".*?" searches > forward as little as possible. They do not backtrack. > > The only time a non-greedy regex will backtrack is if the greedy version > will backtrack. Since ".*" has no reason to backtrack, neither does ".*?". > > >> For some reason, I did not get so far the regexp engine in Python >> omits step 3. and returns the string between< from 1. and> from 2. >> resulting in '<<html>' >> >> Am I right? If so, is there an easily graspable reason for the engine >> designers to implement it this way? > > Because that's the way regexes work. You would need to learn about > regular expression theory, which is not easy material. But you can start > here: > > http://en.wikipedia.org/wiki/Regular_expression > > and for more theoretical approach: > > http://en.wikipedia.org/wiki/Chomsky_hierarchy > http://en.wikipedia.org/wiki/Regular_language > http://en.wikipedia.org/wiki/Regular_grammar > > If you don't understand all the theory, don't worry, neither do I. > > > _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor