Re: Speeding up multiple regex matches

2005-11-18 Thread Alex Martelli
Talin <[EMAIL PROTECTED]> wrote:
   ...
> 1) Combine all of the regular expressions into one massive regex, and
> let the regex state machine do all the discriminating. The problem with
> this is that it gives you no way to determine which regex was the
> matching one.

Place each regex into a parenthesized group, and check which groups have
matched on the resulting matchobject:

>>> x=re.compile('(aa)|(bb)')
>>> mo=x.search('zaap!')
>>> mo.groups()
('aa', None)

There's a limit of 99 groups, so if you have an unbounded number of
regexes to start with you'll have to split them up 99-or-fewer at a
time, but that shouldn't be impossibly hard.


Alex
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Speeding up multiple regex matches

2005-11-19 Thread Talin
OK that worked really well. In particular, the "lastindex" property of
the match object can be used to tell exactly which group matched,
without having to sequentially search the list of groups.
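For instance, a minimal illustration of the lastindex dispatch (my
example, not the actual reflex code):

```python
import re

# One capturing group per rule; mo.lastindex is 1-based, so pad index 0.
combined = re.compile(r"(\d+)|([a-zA-Z_]\w*)|(\s+)")
kinds = [None, "number", "identifier", "whitespace"]

mo = combined.match("count42")
kind = kinds[mo.lastindex]   # direct lookup, no linear search of groups
```

Here the second alternative matches, so `mo.lastindex` is 2 and `kind`
is "identifier".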

In fact, I was able to use your idea to cobble together a poor man's
lexer which I am calling "reflex" (Regular Expressions For Lexing).
Here's an example of how it's used:

# Define the states using an enumeration
State = Enum( 'Default', 'Comment', 'String' )

# Create a scanner
scanner = reflex.scanner( State.Default )
scanner.rule( r"\s+" )
scanner.rule( r"/\*", reflex.set_state( State.Comment ) )
scanner.rule( r"[a-zA-Z_][\w_]*", KeywordOrIdent )
scanner.rule( r"0x[\da-fA-F]+|\d+", token=TokenType.Integer )
scanner.rule(
    r"(?:\d+\.\d*|\.\d+)(?:[eE][+-]?\d+)?|\d+[eE][+-]?\d+",
    token=TokenType.Real )

# Multi-line comment state
scanner.state( State.Comment )
scanner.rule( r"\*/", reflex.set_state( State.Default ) )
scanner.rule( r"(?:[^*]|\*(?!/))+" )

# Now, create an instance of the scanner
token_stream = scanner( input_file_iter )
for token in token_stream:
    print token

Internally, it creates an array of patterns and actions for each state.
Then when you ask it to create a scanner instance, it combines all of
the patterns into a large regular expression. Input lines are matched
against this regex, and if a match succeeds, the match object's
'lastindex' property is used to look up the actions to perform in the
array.
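That internal machinery can be sketched roughly like this (my
reconstruction of the idea, not the actual reflex code; it assumes the
rule patterns contain no capturing groups of their own and that no rule
can match the empty string):

```python
import re

class Scanner:
    """Toy combine-and-dispatch scanner, single state only."""

    def __init__(self):
        self.rules = []  # list of (pattern, action) pairs

    def rule(self, pattern, action=None):
        self.rules.append((pattern, action))

    def scan(self, line):
        # One capturing group per rule; lastindex maps a match back
        # to the rule that produced it.
        combined = re.compile("|".join("(%s)" % p for p, _ in self.rules))
        pos = 0
        while pos < len(line):
            mo = combined.match(line, pos)
            if not mo:
                raise ValueError("no rule matches at %r" % line[pos:])
            action = self.rules[mo.lastindex - 1][1]
            if action is not None:
                yield action, mo.group()
            pos = mo.end()

scanner = Scanner()
scanner.rule(r"\s+")                  # whitespace: no action, skipped
scanner.rule(r"\d+", "INT")
scanner.rule(r"[a-zA-Z_]\w*", "IDENT")

tokens = list(scanner.scan("foo 42"))
```

Scanning "foo 42" this way yields ("IDENT", "foo") and ("INT", "42");
the whitespace rule matches but produces nothing because it has no
action. A real version would compile the combined regex once per
state rather than on every scan() call.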



Re: Speeding up multiple regex matches

2005-11-20 Thread Fredrik Lundh
"Talin" wrote:

> I've run in to this problem a couple of times. Say I have a piece of
> text that I want to test against a large number of regular expressions,
> where a different action is taken based on which regex successfully
> matched. The naive approach is to loop through each regex, and stop
> when one succeeds. However, I am finding this to be too slow for my
> application -- currently 30% of the run time is being taken up in the
> regex matching.
>
> I thought of a couple of approaches, but I am not sure how to make them
> work:
>
> 1) Combine all of the regular expressions into one massive regex, and
> let the regex state machine do all the discriminating. The problem with
> this is that it gives you no way to determine which regex was the
> matching one.

use a capturing group for each alternative, and use lastindex to quickly
find the match:

http://docs.python.org/lib/match-objects.html

lastindex

The integer index of the last matched capturing group, or None if
no group was matched at all.

also see:

http://effbot.org/zone/xml-scanner.htm
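As a footnote to the lastindex approach: if you give each alternative a
named group, the match object's lastgroup attribute hands you the name
of the matching rule directly (standard re behaviour; the pattern below
is just an illustration):

```python
import re

# Named group per alternative; lastgroup names the group that matched.
combined = re.compile(r"(?P<number>\d+)|(?P<word>[a-zA-Z]+)")

mo = combined.search("abc 123")
kind = mo.lastgroup   # the rule that matched, by name
```

Here the word alternative matches "abc" first, so `kind` is "word".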
