> From: Bakken, Luke [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, 26 February 2004 4:59 AM
> To: Henry Todd; [EMAIL PROTECTED]
> Subject: RE: Pattern matching problem
> 
> > I'm having trouble counting the number of specific substrings within a
> > string. I'm working on a bioinformatics coursework at the moment, so my
> > string looks like this:
> > 
> > $sequence = "caggaacttcccctcggaagaccatgta";
> > 
> > I want to count the number of occurrences of each pair of letters,
> > for example:
> > 
> > Number of occurrences of "aa"
> > Number of occurrences of "gt"
> > Number of occurrences of "cc"
> > 
> > This is how I'm counting the number of "cc" pairs at the moment
> > ($cc is my counter variable):
> > 
> > $cc++ while $sequence =~ /cc/gi;

I agree that the zero-width look-ahead solves the problem for the
two-character case with a single pattern.  But what about patterns of
three or more characters?  What about a search string fed in from the
command line?  What about handling a large number of search strings,
possibly not known at compile time?
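
For reference, here is a minimal sketch of the look-ahead approach for a
single pattern that is only known at run time.  The helper name
'count_overlapping' is hypothetical (it does not appear in this thread), and
the pattern is assumed to be literal text, so it is passed through
'quotemeta' before being interpolated into the regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count overlapping, case-insensitive occurrences
# of a literal pattern inside a string.
sub count_overlapping {
        my ($string, $pattern) = @_;
        my $quoted = quotemeta $pattern;   # treat the pattern as literal text
        # The look-ahead succeeds without consuming any characters, so the
        # next attempt starts one position later and overlaps are counted.
        my $count = () = $string =~ /(?=$quoted)/gi;
        return $count;
}

my $sequence = 'caggaacttcccctcggaagaccatgta';
my $pattern  = @ARGV ? $ARGV[0] : 'cc';    # pattern may come from the command line
printf "%s: %d\n", $pattern, count_overlapping($sequence, $pattern);
```

This still scans the string once per pattern, which is the cost the parse
tree below is meant to avoid when there are many patterns.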

The following code replaces the 'one line regex' with code that
breaks (compiles?) the search space into a parse tree and performs the
equivalent of a 'zero-width look-ahead', but lets you count a [very]
large number of [variable length] search items in parallel.  A number of
possible optimisations have been left out for readability and brevity.
Please avoid searching for equivalent substrings (i.e., if you are
searching for 'thump', searching for 'hump' as well is pointless, unless
you need to count the number of times 'hump' appears without a 't' in
front :)

If you want to know what '$ptree' looks like, just add 'use Data::Dumper;'
at the top and 'print Dumper($ptree).$/;' at the bottom.
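
As a rough illustration of that suggestion, the sketch below builds the tree
for a small set and dumps it.  'build_ptree' is a hypothetical wrapper around
the tree-building loop, and the 'count' slots are set to 0 here only to make
the dump easier to read (the script in this message leaves them undefined
until the first match).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in a stable order
$Data::Dumper::Indent   = 1;   # compact indentation

# Hypothetical helper: one hash level per character, with a 'count'
# slot at the node where each search string ends.
sub build_ptree {
        my @set = @_;
        my $ptree = {};
        for my $item (@set) {
                my $r = \$ptree;
                $r = \ $$r->{$_} for split //, $item;
                $$r->{count} = 0;   # assumption: 0 instead of undef, for display
        }
        return $ptree;
}

my $ptree = build_ptree(qw(cc ca gg));
print Dumper($ptree);
```

The dump should show 'c' and 'g' at the top level, with 'cc' and 'ca'
sharing the same 'c' node.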

------------------------------- snip -----------------------------------

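#!/usr/bin/perl
use strict;
use warnings;

my $cstr = 'caggaacttcccctcggaagaccatgta';
my @set = qw(cc ca gg aa ttc);

# Convert '@set' into a parse tree: one hash level per character,
# with a 'count' slot at the node where each search string ends.
my $ptree = {};
my @rset;       # references to the 'count' slots, in '@set' order
for( @set ) {
        my $r = \$ptree;
        $r = \ $$r->{$_} for split //;
        push @rset, \$$r->{count};
}

# Parse string: start a walk down the tree at every position
my @tok = split //, $cstr;
for( 0..$#tok )
{
        my $r = \$ptree;
        my $n = $_;
        # Descend while the next character continues a known search string
        while( $n <= $#tok && exists $$r->{$tok[$n]} ) {
                $r = \$$r->{$tok[$n++]};
        }
        # Only the deepest node reached is checked, so a search string
        # that is a prefix of another one can be missed.
        $$r->{count}++ if exists $$r->{count};
}

print "Matches Found:".$/;
for( 0..$#set ) {
        printf "%10s %d$/", $set[$_], ${$rset[$_]} || 0;
}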

% ./rs.pl
Matches Found:
        cc 4
        ca 2
        gg 2
        aa 2
       ttc 1

------------------------------- snip -----------------------------------
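
For search strings supplied on the command line, the same idea could be
wrapped in a subroutine.  The following is only a sketch under a few
assumptions: 'count_all' and the 'next'/'hit' node layout are hypothetical
names, not part of the script above, and unlike that script it checks every
node along the walk, so a search string that is a prefix of another one
(e.g. 'cc' and 'ccc') is still counted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of pattern => overlapping count.
sub count_all {
        my ($string, @patterns) = @_;

        # Each node is { next => { char => node }, hit => pattern }.
        my $root = { next => {} };
        for my $p (@patterns) {
                my $node = $root;
                $node = $node->{next}{$_} ||= { next => {} } for split //, $p;
                $node->{hit} = $p;   # a search string ends at this node
        }

        my %count = map { $_ => 0 } @patterns;
        my @tok = split //, $string;
        for my $start (0 .. $#tok) {
                my $node = $root;
                for my $n ($start .. $#tok) {
                        # Stop as soon as no search string continues this way
                        $node = $node->{next}{ $tok[$n] } or last;
                        $count{ $node->{hit} }++ if defined $node->{hit};
                }
        }
        return %count;
}

# Patterns from the command line, or a default set if none are given.
my @patterns = @ARGV ? @ARGV : qw(cc ccc ttc);
my %count = count_all('caggaacttcccctcggaagaccatgta', @patterns);
printf "%10s %d\n", $_, $count{$_} for sort keys %count;
```

The trade-off is a little more work per position, since the walk no longer
stops to count only at the deepest node.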


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
<http://learn.perl.org/> <http://learn.perl.org/first-response>