"Nobody" <nob...@nowhere.com> wrote in message news:pan.2010.04.08.10.12.59.594...@nowhere.com... > On Wed, 07 Apr 2010 18:25:36 -0700, Patrick Maupin wrote: > >>> Regular expressions != Parsers >> >> True, but lots of parsers *use* regular expressions in their >> tokenizers. In fact, if you have a pure Python parser, you can often >> get huge performance gains by rearranging your code slightly so that >> you can use regular expressions in your tokenizer, because that >> effectively gives you access to a fast, specialized C library that is >> built into practically every Python interpreter on the planet. > > Unfortunately, a typical regexp library (including Python's) doesn't allow > you to match against a set of regexps, returning the index of which one > matched. Which is what you really want for a tokeniser. > [snip]

Really? I am only a Python newbie, but what about ...

    import re

    # (type name, pattern) pairs; each pattern is one capturing group
    rr = [
        ( "id",    r'([a-zA-Z][a-zA-Z0-9]*)' ),
        ( "int",   r'([+-]?[0-9]+)' ),
        ( "float", r'([+-]?[0-9]+\.[0-9]*)' ),
        ( "float", r'([+-]?[0-9]+\.[0-9]*[eE][+-]?[0-9]+)' )
    ]
    tlist = [ t[0] for t in rr ]
    # one combined pattern: optional spaces, any one alternative, optional spaces
    pat = '^ *(' + '|'.join([ t[1] for t in rr ]) + ') *$'
    p = re.compile(pat)

    ss = [ ' annc', '1234', 'abcd', ' 234sz ', '-1.24e3', '5.' ]
    for s in ss:
        m = p.match(s)
        if m:
            # group 1 is the outer wrapper; groups 2.. are the alternatives
            ix = [ i-2 for i in range(2, len(rr)+2) if m.group(i) is not None ]
            print("'%s' matches and has type %s" % (s, tlist[ix[0]]))
        else:
            print("'%s' does not match" % s)

output:

    ' annc' matches and has type id
    '1234' matches and has type int
    'abcd' matches and has type id
    ' 234sz ' does not match
    '-1.24e3' matches and has type float
    '5.' matches and has type float

This seems to me to match a (small) set of regular expressions and
indirectly return the index of the matched expression, without doing a
sequential loop over the regular expressions.

Of course there is a loop over the results of the match to determine which
sub-expression matched, but a good regexp library (which I presume Python
has) should match the sub-expressions without looping over them. The
techniques to do this were well known in the 1970s, when the first
versions of lex were written.

Not that I would recommend tricks like this. The regular expression would
quickly get out of hand for any non-trivial list of regular expressions to
match.

Charles
-- 
http://mail.python.org/mailman/listinfo/python-list
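
A possible variation on the same idea, offered only as a sketch: Python's match objects expose `lastgroup` and `lastindex`, so the loop over the match results can probably be avoided by giving each alternative a named group. The names `TOKEN_SPEC`, `MASTER` and `classify` are made up for illustration and are not from the post above; the two float patterns are merged into one because group names must be unique, and the code assumes Python 3 (for `fullmatch`).

```python
import re

# Hypothetical token table adapted from the post above. Alternation tries
# the alternatives left to right, so the more specific pattern is listed
# first; that matters if the pattern is ever used without an end anchor.
TOKEN_SPEC = [
    ("float", r"[+-]?[0-9]+\.[0-9]*(?:[eE][+-]?[0-9]+)?"),
    ("int", r"[+-]?[0-9]+"),
    ("id", r"[a-zA-Z][a-zA-Z0-9]*"),
]

# One combined pattern. The outer group is non-capturing so that
# lastgroup/lastindex refer to the alternative that actually matched.
MASTER = re.compile(
    r" *(?:"
    + "|".join("(?P<%s>%s)" % (name, pat) for name, pat in TOKEN_SPEC)
    + r") *"
)

def classify(text):
    """Return (type name, index into TOKEN_SPEC), or None if nothing matches."""
    m = MASTER.fullmatch(text)
    if m is None:
        return None
    return m.lastgroup, m.lastindex - 1

for s in [" annc", "1234", "abcd", " 234sz ", "-1.24e3", "5."]:
    print(repr(s), classify(s))
```

This still relies on the regexp engine to pick the alternative, so it says nothing about how the engine does that internally; it only removes the Python-level loop over the groups.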