The problem: if I include a space in my robot's user agent, WWW::RobotRules fails to recognize robots.txt records targeted at my robot.
My robot's user agent:

    Hispanic Business Inc. Spider/1.0

Robots.txt file:

    User-agent: Hispanic Business Inc. Spider
    Disallow:

    User-agent: *
    Disallow: /

My robot will incorrectly refuse to spider anything, because WWW::RobotRules::agent shortens $self->{'ua'} to "Hispanic". I propose the attached patch to the RobotRules.pm included in libwww-perl 5.803.

--
Matthew.van.Eerde (at) hbinc.com  805.964.4554 x902
Hispanic Business Inc./HireDiversity.com  Software Engineer
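For illustration, the current truncation and the patched behaviour can be reproduced outside the module. This is just a sketch; the $before/$after variable names are mine, not the module's:

```perl
use strict;
use warnings;

my $ua = "Hispanic Business Inc. Spider/1.0";

# Current behaviour: keep only the first whitespace-delimited word,
# then strip the version suffix.
my $before = $ua;
$before = $1 if $before =~ m/(\S+)/;   # "Hispanic"
$before =~ s!/.*!!;                    # still "Hispanic" (no slash left)

# Patched behaviour: strip the version suffix first, then any
# trailing whitespace that strip may leave behind.
my $after = $ua;
$after =~ s!/.*!!;                     # "Hispanic Business Inc. Spider"
$after =~ s/\s+$//;

print "$before\n";   # Hispanic
print "$after\n";    # Hispanic Business Inc. Spider
```

With the patch, the name compared against robots.txt User-agent lines is the full "Hispanic Business Inc. Spider", so the record above matches instead of falling through to the catch-all "User-agent: *" block.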
--- libwww-perl-5.803/lib/WWW/RobotRules.pm.original	2005-10-13 16:26:27.000000000 -0700
+++ libwww-perl-5.803/lib/WWW/RobotRules.pm	2005-10-13 16:27:27.000000000 -0700
@@ -185,8 +185,8 @@
     # "FooBot/1.2" => "FooBot"
     # "FooBot/1.2 [http://foobot.int; [EMAIL PROTECTED]" => "FooBot"
-    $name = $1 if $name =~ m/(\S+)/;	# get first word
     $name =~ s!/.*!!;	# get rid of version
+    $name =~ s/\s+$//;	# get rid of trailing space
     unless ($old && $old eq $name) {
 	delete $self->{'loc'};   # all old info is now stale
 	$self->{'ua'} = $name;