The problem: if I include a space in my robot's user agent, WWW::RobotRules fails to
recognize robots.txt records targeted at my robot.

My robot's user agent:
Hispanic Business Inc. Spider/1.0

Robots.txt file:
User-agent: Hispanic Business Inc. Spider
Disallow:

User-agent: *
Disallow: /

My robot will incorrectly refuse to spider anything, because 
WWW::RobotRules::agent shortens $self->{'ua'} to "Hispanic".
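To illustrate, here is a rough Python translation of the name-shortening logic in WWW::RobotRules::agent, before and after the proposed change (the function names are mine, purely for illustration):

```python
import re

def shorten_original(name):
    # libwww-perl 5.803 behavior: take the first whitespace-delimited
    # word, then strip everything from the first "/" (the version).
    m = re.search(r'\S+', name)   # Perl: $name = $1 if $name =~ m/(\S+)/;
    if m:
        name = m.group(0)
    name = re.sub(r'/.*', '', name)  # Perl: $name =~ s!/.*!!;
    return name

def shorten_patched(name):
    # proposed behavior: strip the version first, then any trailing space,
    # so multi-word names survive intact.
    name = re.sub(r'/.*', '', name)    # Perl: $name =~ s!/.*!!;
    name = re.sub(r'\s+$', '', name)   # Perl: $name =~ s/\s+$//;
    return name

print(shorten_original("Hispanic Business Inc. Spider/1.0"))  # "Hispanic"
print(shorten_patched("Hispanic Business Inc. Spider/1.0"))   # "Hispanic Business Inc. Spider"
print(shorten_patched("FooBot/1.2"))                          # "FooBot"
```

The original code truncates the name to "Hispanic", which matches neither record in the robots.txt above, so the `User-agent: *` record applies and everything is disallowed; the patched code preserves the full "Hispanic Business Inc. Spider" name.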

I propose the attached patch to the RobotRules.pm included in libwww-perl 5.803.

-- 
Matthew.van.Eerde (at) hbinc.com               805.964.4554 x902
Hispanic Business Inc./HireDiversity.com       Software Engineer
--- libwww-perl-5.803/lib/WWW/RobotRules.pm.original    2005-10-13 16:26:27.000000000 -0700
+++ libwww-perl-5.803/lib/WWW/RobotRules.pm     2005-10-13 16:27:27.000000000 -0700
@@ -185,8 +185,8 @@
         #       "FooBot/1.2"                                  => "FooBot"
         #       "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"
 
-       $name = $1 if $name =~ m/(\S+)/; # get first word
        $name =~ s!/.*!!;  # get rid of version
+       $name =~ s/\s+$//; # get rid of trailing space
        unless ($old && $old eq $name) {
            delete $self->{'loc'}; # all old info is now stale
            $self->{'ua'} = $name;
