WWW::RobotRules attempts to trim the robot's User-Agent before comparing 
it with the User-agent field of a robots.txt file:

        # Strip it so that it's just the short name.
        # I.e., "FooBot"                                      => "FooBot"
        #       "FooBot/1.2"                                  => "FooBot"
        #       "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"

        delete $self->{'loc'};   # all old info is now stale
        $name = $1 if $name =~ m/(\S+)/; # get first word
        $name =~ s!/?\s*\d+.\d+\s*$!!;  # loose version

My robot's name is "WDG_SiteValidator/1.5.6".  Because the "." in the
pattern is unescaped, it matches the second dot of the version number, so
the above code changes the name to "WDG_SiteValidator/1.", which then
fails to match a robots.txt User-agent field of "WDG_SiteValidator".
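The mangling is easy to reproduce in isolation (a minimal sketch; the
variable name is illustrative):

```perl
my $name = "WDG_SiteValidator/1.5.6";

# The unescaped "." matches any character, so the pattern finds a
# match at "5.6" (digits, any char, digits, end of string) and
# strips it, leaving the dangling "/1." behind.
$name =~ s!/?\s*\d+.\d+\s*$!!;

print "$name\n";   # prints "WDG_SiteValidator/1."
```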

I've attached a patch against libwww-perl 5.76 (WWW::RobotRules 1.26) that
replaces the last line above with

        $name =~ s!/.*!!;  # lose version

which seems to cover the various cases correctly.
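As a quick sanity check (not an exhaustive test), here is how the
replacement line behaves on the examples from the comment block, with
the bracketed contact part abbreviated, plus my robot's name:

```perl
for my $ua ("FooBot",
            "FooBot/1.2",
            "FooBot/1.2 [http://foobot.int]",
            "WDG_SiteValidator/1.5.6") {
    my $name = $ua;
    $name = $1 if $name =~ m/(\S+)/; # get first word
    $name =~ s!/.*!!;                # lose version
    print "$ua => $name\n";
}
# All four reduce to the short name: "FooBot" for the first three,
# "WDG_SiteValidator" for the last.
```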

-- 
Liam Quinn


--- WWW/RobotRules.pm.orig      2003-10-23 15:11:33.000000000 -0400
+++ WWW/RobotRules.pm   2004-04-03 18:06:01.000000000 -0500
@@ -187,7 +187,7 @@
 
        delete $self->{'loc'};   # all old info is now stale
        $name = $1 if $name =~ m/(\S+)/; # get first word
-       $name =~ s!/?\s*\d+.\d+\s*$!!;  # loose version
+       $name =~ s!/.*!!;  # lose version
        $self->{'ua'}=$name;
     }
     $old;
