Matthew Sacks wrote:
> I am unabashedly posting a quiz question I have about regular expressions:
> 
> Looking for suggestions.
> 
> I am thinking
> 
> 1) make the set of regular expressions into one big expression?
> 2) search the seach strings, somehow, for common substrings.  "acme.org" 
> would be example. Each hit on acme.org would indicate a match on one of the 
> original search strings?
> comments invited.
> 
> 
> You have 100,000 strings which are regex patterns
> intended to match URLs (including hostname and possibly a URI). When you
> receive a message, you need to see if it matches at least one of these
> patterns.
> 
> The naive approach would be to go through your list of
> patterns linearly and attempt a regex match on each one.  Suggest an 
> alternative that would be more
> efficient.

Trade memory for speed.

Below is my solution:  create a hash of hashes of hashes, etc.  Each
level has one letter in the pattern or a wildcard marker or a successful
match marker.  For each string, break it into characters and do a
breath-wide search down through the tree.  If at the end of the string,
there is a match amrker, the string is matched.

Known bugs:  foo***bar will fail if the number of characters between foo
and bar is less than 3.

To do:  record the pattern in the hash so if a match occurs, you know
what pattern matched.


#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth = 0;

binmode STDOUT, ':utf8';

my %patterns = ();

print "\nPatterns\n";
while( <DATA> ){
  last if m{ __DATA__ }msx;
  print;

  chomp;
  my $p = \%patterns;
  while( m{ ( \\? . ) }gmsx ){
    my $c = $1;

    $c = '-wildcard' if $c eq '*';
    $c = $1 if $c =~ m{ \\ (.) }msx;

    $p->{$c} = {} unless $p->{$c};
    $p = $p->{$c};
    # print "\x{ab}$c\x{bb}  ";
  }
  $p->{-matches} = 1;
  # print "\n";

}
# print 'patterns = ', Dumper \%patterns;

$Data::Dumper::Maxdepth = 3;
print "\nStrings\n";
while( <DATA> ){
  chomp;

  my @patterns = ( [ \%patterns, 0 ] );
  for my $c ( split // ){
    # print "\x{ab}$c\x{bb}  ", Dumper \...@patterns; <STDIN>;
    my @next_patterns = ();
    for my $set ( @patterns ){
      push @next_patterns, $set if $set->[1];
      my $p = $set->[0];
      if( exists $p->{$c} ){
        push @next_patterns, [ $p->{$c}, 0 ];
      }
      if( exists $p->{-wildcard} ){
        push @next_patterns, [ $p->{-wildcard}, 1 ];
      }
    }
    @patterns = @next_patterns;
  }
  # print "\n";

  my $matches = 0;
  for my $p ( @patterns ){
    # print 'checking ', Dumper $p; <STDIN>;
    if( exists $p->[0]{-matches} || exists $p->[0]{-wildcard}{-matches} ){
      print "$_ matches\n";
      $matches = 1;
      last;
    }
  }
  print "$_ no match\n" unless $matches;

}

__DATA__
*amazon.com*
*craigslist.*
acme.org*sub=7*
a1.vcl.com*
ebay.com/sports*
foo\*bar
foo
food
foo\*blah
__DATA__
www.amazon.com?x=123
books.amazon.com?y=123
acme.org
acme.org/stuff?sub=7
acme.org?w=1&sub=7
vcl.com/suba/subb?qs=abc
a1.vcl.com?w=3
a2.vcl.com/xyz?w=1



-- 
Just my 0.00000002 million dollars worth,
  Shawn

Programming is as much about organization and communication
as it is about coding.

I like Perl; it's the only language where you can bless your
thingy.

-- 
To unsubscribe, e-mail: beginners-unsubscr...@perl.org
For additional commands, e-mail: beginners-h...@perl.org
http://learn.perl.org/


Reply via email to