On Tue, 20 Jul 2004 17:28:25 -0400, Chris Santerre wrote:
> I bet the file could be half the size it is now. But I don't have
> the script experience to do this. So anyone who can improve the
> logic would be a great help.
How is it generated now?
Have you tried using "Regexp::List", "Regexp::Optimizer" or "Regex::PreSuf"
for this? I'm not sure how good or trustworthy they are, but maybe they are
worth a try.
I made a very small test just to see what they might do. I don't know what
can be changed by telling the modules to act differently, and they probably
work better on some input than on others. Some results below.
Regexp::Optimizer (optimize a regexp):
--8<--
/\bhomel(?:oanace\.com|andunited\.com|anddefensejournal\.com|anddefenseradio\.com|andsecurityresearch\.com|ead\.net|essprelates\.com|essteens\.com)\b/i
became
(?-xism:/\bhomel(?:and(?:defense(?:journal|radio)\.com|(?:united|securityresearch)\.com)|e(?:ss(?:prelate|teen)s\.com|ad\.net)|oanace\.com)\b/i)
/\bhomeg(?:ain\.com|ain\.biz|ain\.net|un\.com)\b/i
became
(?-xism:/\bhomeg(?:ain\.(?:com|biz|net)|un\.com)\b/i)
--8<--
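Since I can't vouch for the modules, a rough way to sanity-check a rewrite like the second one above is to run both versions against the same inputs and compare. This is only a sketch in plain Perl (no extra modules); the sample hostnames are made up for illustration, and a real check would use the actual rule file and real mail data.

```perl
use strict;
use warnings;

# Pattern sources copied from the Regexp::Optimizer result above,
# without the surrounding /.../i delimiters.
my $orig_src = '\bhomeg(?:ain\.com|ain\.biz|ain\.net|un\.com)\b';
my $opt_src  = '\bhomeg(?:ain\.(?:com|biz|net)|un\.com)\b';

my $orig = qr/$orig_src/i;
my $opt  = qr/$opt_src/i;

# Hypothetical test inputs: some should match, some should not.
my @samples = qw(
    homegain.com homegain.biz HomeGain.net homegun.com
    homegun.net homegain.org example.com
);

# Count inputs where the two patterns disagree.
my $mismatches = 0;
for my $s (@samples) {
    my $was = ($s =~ $orig) ? 1 : 0;
    my $now = ($s =~ $opt)  ? 1 : 0;
    if ($was != $now) {
        $mismatches++;
        print "DIFFERS: $s (original=$was, optimized=$now)\n";
    }
}

print "mismatches: $mismatches\n";
print "pattern length: ", length($orig_src), " -> ", length($opt_src), "\n";
```

Agreement on a handful of samples proves nothing in general, but any disagreement would be a clear reason not to trust the optimized rule.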
Regexp::List (create a regexp from a list):
--8<--
gnyrfalo.nils.com, gnyrfippa.hasse.net, jsfd.hej.com, jasaf.asf.se,
jfdsjsdf.hsf.com, gnyrffal.hej.net
became
(?-xism:(?=[gj])(?:gnyrf(?:alo\.nils\.com|(?:ippa\.hasse|fal\.hej)\.net)|j(?:(?:sfd\.hej|fdsjsdf\.hsf)\.com|asaf\.asf\.se)))
gnyrfalo.nils.com, gnyrffal.hej.net, gnyrfippa.hasse.net, jasaf.asf.se,
jfdsjsdf.hsf.com, jsfd.hej.com
became
(?-xism:(?=[gj])(?:gnyrf(?:alo\.nils\.com|(?:fal\.hej|ippa\.hasse)\.net)|j(?:asaf\.asf\.se|(?:fdsjsdf\.hsf|sfd\.hej)\.com)))
gnyrfalo.nils.com, jsfd.hej.com, jfdsjsdf.hsf.com, gnyrfippa.hasse.net,
gnyrffal.hej.net, jasaf.asf.se
became
(?-xism:(?=[gj])(?:gnyrf(?:alo\.nils\.com|(?:ippa\.hasse|fal\.hej)\.net)|j(?:(?:sfd\.hej|fdsjsdf\.hsf)\.com|asaf\.asf\.se)))
--8<--
Regex::PreSuf (also creates a regexp from a list):
--8<--
gnyrfalo.nils.com, gnyrfippa.hasse.net, jsfd.hej.com, jasaf.asf.se,
jfdsjsdf.hsf.com, gnyrffal.hej.net
became
(?:gnyrf(?:alo\.nils\.com|fal\.hej\.net|ippa\.hasse\.net)|j(?:asaf\.asf\.se|fdsjsdf\.hsf\.com|sfd\.hej\.com))
gnyrfalo.nils.com, gnyrffal.hej.net, gnyrfippa.hasse.net, jasaf.asf.se,
jfdsjsdf.hsf.com, jsfd.hej.com
became
(?:gnyrf(?:alo\.nils\.com|fal\.hej\.net|ippa\.hasse\.net)|j(?:asaf\.asf\.se|fdsjsdf\.hsf\.com|sfd\.hej\.com))
gnyrfalo.nils.com, jsfd.hej.com, jfdsjsdf.hsf.com, gnyrfippa.hasse.net,
gnyrffal.hej.net, jasaf.asf.se
became
(?:gnyrf(?:alo\.nils\.com|fal\.hej\.net|ippa\.hasse\.net)|j(?:asaf\.asf\.se|fdsjsdf\.hsf\.com|sfd\.hej\.com))
--8<--
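For the list-based results, one simple check is that the generated regexp matches every name it was built from and nothing that merely looks similar. The sketch below uses the Regex::PreSuf output shown above, anchored with \A and \z, which is my addition for the test; the near-miss names are invented for illustration.

```perl
use strict;
use warnings;

# Regex::PreSuf output from above, anchored so that only whole names match.
my $re = qr/\A(?:gnyrf(?:alo\.nils\.com|fal\.hej\.net|ippa\.hasse\.net)|j(?:asaf\.asf\.se|fdsjsdf\.hsf\.com|sfd\.hej\.com))\z/;

# The six names the regexp was generated from.
my @listed = qw(
    gnyrfalo.nils.com gnyrfippa.hasse.net jsfd.hej.com
    jasaf.asf.se jfdsjsdf.hsf.com gnyrffal.hej.net
);

# Hypothetical near misses that should NOT match.
my @near_misses = qw(gnyrfalo.nils.net jsfdxhej.com asaf.asf.se);

my @missed     = grep { $_ !~ $re } @listed;
my @false_hits = grep { $_ =~ $re } @near_misses;

print "listed names not matched: ", scalar(@missed), "\n";
print "near misses matched: ", scalar(@false_hits), "\n";
```

The same loop could be pointed at the Regexp::List output, since both modules claim to build an equivalent regexp from the same list.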
Just a thought... Might be stupid...
Anyway, here's the little test script:
--8<--
use strict;
use Regexp::List;
use Regexp::Optimizer;
use Regex::PreSuf;
my $l2r = Regexp::List->new;
my $ro = Regexp::Optimizer->new;
my $r1 =
'/\bhomel(?:oanace\.com|andunited\.com|anddefensejournal\.com|anddefenseradio\.com|andsecurityresearch\.com|ead\.net|essprelates\.com|essteens\.com)\b/i';
my $r2 = '/\bhomeg(?:ain\.com|ain\.biz|ain\.net|un\.com)\b/i';
my @l1 = (
'gnyrfalo.nils.com',
'gnyrfippa.hasse.net',
'jsfd.hej.com',
'jasaf.asf.se',
'jfdsjsdf.hsf.com',
'gnyrffal.hej.net',
);
my @l2 = sort @l1;
# Sort by top-level domain (the part after the last dot).
my @l3 = sort {
    my $ax = $a; my $bx = $b;
    $ax =~ s/.*\.([^\.]+)$/$1/;
    $bx =~ s/.*\.([^\.]+)$/$1/;
    $ax cmp $bx;
} @l1;
print "$r1 =>\n" . $ro->optimize($r1) . "\n\n";
print "$r2 =>\n" . $ro->optimize($r2) . "\n\n";
print '('.join(', ',@l1).") =>\n" . $l2r->set(modifiers => 'i')->list2re(@l1) .
"\n\n";
print '('.join(', ',@l2).") =>\n" . $l2r->set(modifiers => 'i')->list2re(@l2) .
"\n\n";
print '('.join(', ',@l3).") =>\n" . $l2r->set(modifiers => 'i')->list2re(@l3) .
"\n\n";
print '('.join(', ',@l1).") =>\n" . presuf(@l1) . "\n\n";
print '('.join(', ',@l2).") =>\n" . presuf(@l2) . "\n\n";
print '('.join(', ',@l3).") =>\n" . presuf(@l3) . "\n\n";
--8<--
Regards (and many thanks)
/Jonas
--
Jonas Eckerman, [EMAIL PROTECTED]
http://www.fsdb.org/