Re: extracting common substrings...

2006-12-12 Thread Tom Phoenix

On 12/12/06, Helliwell, Kim <[EMAIL PROTECTED]> wrote:


Is there a function (perhaps in a library module) that would take two
strings and return the common substring (if any) contained in the
arguments?


You want the longest possible common substring? Or all of the longest
ones, if there's more than one?

I did something like this once. I used Perl's ability to treat strings
as bitstrings to xor one string against the other, at various
positions. Then I checked each xor result for a string of zeroes (i.e.
character matches) longer than any yet known, recording the length and
location of each match. Once I had checked all possible offsets that
could give a long-enough match, it was a simple matter to extract the
matching substring.

Of course, that method assumed ASCII characters were bytes. If you
might have non-byte characters in your data, this algorithom could
still work, but you'll need to code with more care.

Hope this helps!

--Tom Phoenix
Stonehenge Perl Training

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: extracting common substrings...

2006-12-12 Thread John W. Krahn
Helliwell, Kim wrote:
> Is there a function (perhaps in a library module) that would take two
> strings and return the common substring (if any) contained in the
> arguments? I've been looking for such a beast on CPAN, but no luck so
> far.

Perhaps this is what you require:

http://search.cpan.org/~jfreeman/Algorithm-LCSS-0.01/LCSS.pm



John
-- 
Perl isn't a toolbox, but a small machine shop where you can special-order
certain sorts of tools at low cost and in short order.   -- Larry Wall

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: extracting common substrings...

2006-12-12 Thread D. Bolliger
Helliwell, Kim am Dienstag, 12. Dezember 2006 21:56:
> Is there a function (perhaps in a library module) that would take two
> strings and return the common substring (if any) contained in the
> arguments? I've been looking for such a beast on CPAN, but no luck so
> far.
>
>
>
> If not, I guess I have to write it myself...

While I was writing it myself ;-) because I did not find anything via google 
or CPAN - missing term LCSS... John posted Algorithm:LCCS.

I thought I post it anyway instead of copying it to /dev/null.

The script contains a testcase with "long" strings, it takes 1.2 secs on my 
old machine (the test case is certainly not a worst case scenario).

It is just a dirty hack, using a naive aproach, and not proved to work 
correctly.

Here it is, comments are welcome:

#!/usr/bin/perl
use strict;
use warnings;

sub lcss {
  my ($s1, $s2)[EMAIL PROTECTED];

  my $max1=length($s1)-1;
  my $max2=length($s2)-1;

  # make $s1 the shorter string
  #
  ($s1, $s2, $max1, $max2)=($s2, $s1, $max2, $max1)
if $max1 > $max2;

  my %found;
  my $longest=0;

  foreach my $i (0..$max1) {
foreach my $j ($i..$max1) {
  my $searchlen=$j-$i+1;

  next if $searchlen < $longest; # because longest css searched

  my $search=substr($s1, $i, $searchlen); # pattern to search

  $found{$1}++ for ($s2=~/($search)/g); # although count not used below

  # not optimal because no test if match succeeded above
  #
  $longest=$searchlen if defined $1;
}
  }

  # (should) select only one random longest string if several present:
  #
  print '(one) LCSS found: ',
(sort {length($b) <=> length($a)} keys %found)[0], "\n";
}

### Test case:

my $pat=join '', 'hello' x 100;
my $bar=join '', 'hello' x 99;
my $foo=join '', 'a' x 100;

lcss ($pat, $bar.$foo.$bar.$pat.$bar.$foo.$bar);
lcss ('donut', 'I just want to eat one donut please!');
lcss ('I just want to eat one donut please!', 'donut');
__END__

Dani

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: extracting common substrings...

2006-12-12 Thread D. Bolliger
Tom Phoenix am Mittwoch, 13. Dezember 2006 02:32:
> On 12/12/06, D. Bolliger <[EMAIL PROTECTED]> wrote:
> >   $found{$1}++ for ($s2=~/($search)/g); # although count not used
> > below
>
> Didn't $search just come from the data? It's a string, not a pattern.
> If it's got any metacharacters, it could break your pattern, or worse.

Ouch, you're right Tom! I completely omitted (speak: forgot) *any* security 
considerations at this development state :-(

=> $s2=~/(\Q$search\E)/g

Waiting for other comments... good night, sleep time here :-)

Dani

-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: extracting common substrings...

2006-12-12 Thread D. Bolliger
D. Bolliger am Mittwoch, 13. Dezember 2006 02:25:

Sorry for answering my own post...

[snipped]
> The script contains a testcase with "long" strings, it takes 1.2 secs on my
> old machine (the test case is certainly not a worst case scenario).
[snipped]
> ### Test case:
>
> my $pat=join '', 'hello' x 100;
> my $bar=join '', 'hello' x 99;
> my $foo=join '', 'a' x 100;
[snipped]

With the following worser case test, 

my $baz=join '', 'hiho'  x1;
my $pat=join '', $baz, 'hello' x 100;
my $bar=join '', $baz, 'hello' x 99;
my $foo=join '', 'a' x 100;

execution time increases to... wait a sec ;-)... still running...

"If there's a module, use the module" ?

Dani




-- 
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 




Re: Re: extracting common substrings...

2006-12-12 Thread Tom Phoenix

On 12/12/06, D. Bolliger <[EMAIL PROTECTED]> wrote:


  $found{$1}++ for ($s2=~/($search)/g); # although count not used below


Didn't $search just come from the data? It's a string, not a pattern.
If it's got any metacharacters, it could break your pattern, or worse.

Cheers!

--Tom Phoenix
Stonehenge Perl Training

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]