On Tue, Nov 10, 2009 at 6:08 AM, Shawn H Corey <shawnhco...@gmail.com>wrote:

> Matthew Sacks wrote:
> > I am unabashedly posting a quiz question I have about regular
> expressions:
> >
> > Looking for suggestions.
> >
> > I am thinking
> >
> > 1) make the set of regular expressions into one big expression?
> > 2) search the seach strings, somehow, for common substrings.  "acme.org"
> would be example. Each hit on acme.org would indicate a match on one of
> the original search strings?
> > comments invited.
> >
> >
> > You have 100,000 strings which are regex patterns
> > intended to match URLs (including hostname and possibly a URI). When you
> > receive a message, you need to see if it matches at least one of these
> > patterns.
> >
> > The naive approach would be to go through your list of
> > patterns linearly and attempt a regex match on each one.  Suggest an
> alternative that would be more
> > efficient.
>
> Trade memory for speed.
>
> Below is my solution:  create a hash of hashes of hashes, etc.  Each
> level has one letter in the pattern or a wildcard marker or a successful
> match marker.  For each string, break it into characters and do a
> breath-wide search down through the tree.  If at the end of the string,
> there is a match amrker, the string is matched.
>
> Known bugs:  foo***bar will fail if the number of characters between foo
> and bar is less than 3.
>
> To do:  record the pattern in the hash so if a match occurs, you know
> what pattern matched.
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> use Data::Dumper;
>
> # Make Data::Dumper pretty
> $Data::Dumper::Sortkeys = 1;
> $Data::Dumper::Indent   = 1;
>
> # Set maximum depth for Data::Dumper, zero means unlimited
> $Data::Dumper::Maxdepth = 0;
>
> binmode STDOUT, ':utf8';
>
> my %patterns = ();
>
> print "\nPatterns\n";
> while( <DATA> ){
>  last if m{ __DATA__ }msx;
>  print;
>
>  chomp;
>  my $p = \%patterns;
>  while( m{ ( \\? . ) }gmsx ){
>    my $c = $1;
>
>    $c = '-wildcard' if $c eq '*';
>    $c = $1 if $c =~ m{ \\ (.) }msx;
>
>    $p->{$c} = {} unless $p->{$c};
>    $p = $p->{$c};
>    # print "\x{ab}$c\x{bb}  ";
>  }
>  $p->{-matches} = 1;
>  # print "\n";
>
> }
> # print 'patterns = ', Dumper \%patterns;
>
> $Data::Dumper::Maxdepth = 3;
> print "\nStrings\n";
> while( <DATA> ){
>  chomp;
>
>  my @patterns = ( [ \%patterns, 0 ] );
>  for my $c ( split // ){
>    # print "\x{ab}$c\x{bb}  ", Dumper \...@patterns; <STDIN>;
>    my @next_patterns = ();
>    for my $set ( @patterns ){
>      push @next_patterns, $set if $set->[1];
>      my $p = $set->[0];
>      if( exists $p->{$c} ){
>        push @next_patterns, [ $p->{$c}, 0 ];
>      }
>      if( exists $p->{-wildcard} ){
>        push @next_patterns, [ $p->{-wildcard}, 1 ];
>      }
>    }
>    @patterns = @next_patterns;
>  }
>  # print "\n";
>
>  my $matches = 0;
>  for my $p ( @patterns ){
>    # print 'checking ', Dumper $p; <STDIN>;
>    if( exists $p->[0]{-matches} || exists $p->[0]{-wildcard}{-matches} ){
>      print "$_ matches\n";
>      $matches = 1;
>      last;
>    }
>  }
>  print "$_ no match\n" unless $matches;
>
> }
>
> __DATA__
> *amazon.com*
> *craigslist.*
> acme.org*sub=7*
> a1.vcl.com*
> ebay.com/sports*
> foo\*bar
> foo
> food
> foo\*blah
> __DATA__
> www.amazon.com?x=123
> books.amazon.com?y=123
> acme.org
> acme.org/stuff?sub=7
> acme.org?w=1&sub=7
> vcl.com/suba/subb?qs=abc
> a1.vcl.com?w=3
> a2.vcl.com/xyz?w=1
>
>
>
> --
> Just my 0.00000002 million dollars worth,
>  Shawn
>
> Programming is as much about organization and communication
> as it is about coding.
>
> I like Perl; it's the only language where you can bless your
> thingy.
>
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
>
>
>
Hi Matthew,

I have to agree with Shawn, as you follow the tree branch by branch you get
to the final match, the total number of checks required should on average be
lower then it is when checking every single regex, which you would have to
do for every string that does not match anything else.

However due to a URI now or very soon (lost track of that a bit) being able
to consist of any international character the list will be a lot longer then
just 'a' to 'Z' and '0 to '9' so checking one character at a time could
easily be way more work then one would initially think.
It might therefore be better to match the first two, three or how ever many
characters to reduce the initial number of regular expressions one has to go
over for every single string. Then on a next step again it might pay to
match a set of characters rather then just one at a time as this would
reduce the number of required checks to decide if a match is possible or
not, and to go down the next branch or toss the string back with a does not
match response.

I would say that depending on how often you are going to do the checking and
how dynamic your list of regular expressions is you might want to spend a
lot more time on finding the fastest programmatic way to order this hash of
hashes then you will have to care about the resulting time needed to do the
matching of any number of strings using this resulting hash.

Regards,

Rob

Reply via email to