[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-15 Thread Sean M. Burke


I dug around more in Perl LWP's WWW::RobotRules module, and the short story 
is that the bug I found is real, but not as bad as I thought.
If you set up a user agent with the name "Foobar/1.23", a WWW::RobotRules 
object actually /does/ currently know to strip off the "/1.23" (this 
happens in the 'agent' method, not in the is_me method where I expected it).

The current bug surfaces only when your user-agent name is more than one 
word; if your user-agent name is "Foobar/1.23 [[EMAIL PROTECTED]]", the 
current 'agent' method's logic says "well, it doesn't end in 
'/number.number', so there's no version to strip off".

So I'm going to send Gisle Aas a patch so that the first word, minus any 
version suffix, is what's used for matching.  It's just a matter of adding 
a line saying:
 $name = $1 if $name =~ m/(\S+)/; # get first word
in the 'agent' method.
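The intended effect can be sketched standalone; the _short_name() helper 
and its regexes below are illustrative, not the actual LWP patch:

```perl
use strict;
use warnings;

# Illustrative sketch of the proposed fix -- not the actual
# WWW::RobotRules source.  Reduce a full agent string to its
# first word, then strip any "/version" suffix.
sub _short_name {
    my ($name) = @_;
    $name = $1 if $name =~ m/(\S+)/;   # keep only the first word
    $name =~ s{/.*$}{};                # drop "/1.23" and the like
    return $name;
}

print _short_name('Foobar/1.23 [EMAIL PROTECTED]'), "\n";   # prints "Foobar"
```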


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send "help" in the body of a message 
to "[EMAIL PROTECTED]".



[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


At 12:49 2002-03-14 -0800, Nick Arnett wrote:
>[...]That does seem to be a problem, since apparently
>version numbers were contemplated in User-Agent headers...  Sounds like
>something for the LWP author(s).

Yes, we are (hereby) thinking about it.
I thought I'd seek the wisdom of the list on this before bringing it up 
with the others, tho.


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Nick Arnett




> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
> Behalf Of Sean M. Burke

...

> E.g.,  http://www.robotstxt.org/wc/norobots.html says:
> <<The robot should be liberal in interpreting this field.
> A case insensitive substring match of the name without version
> information is recommended.>>
>
> ...note the "without version information".  Ditto the spec you
> cited, which says <<the User-Agent (HTTP) header consists of one or
> more words, and the very first word is taken to be the "name", which
> is referred to in the robot exclusion files.>>

Ah, now I see your point.  That does seem to be a problem, since apparently
version numbers were contemplated in User-Agent headers...  Sounds like
something for the LWP author(s).

Or, a convenient excuse for a badly behaved robot... !

Nick





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


At 12:47 2002-03-14 +0100, Martin Beet wrote:
>  On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
>SMB> I'm a bit perplexed over whether the current Perl library
>SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
>SMB> Standard correctly.  So forgive me if this seems a simple
>SMB> question, but my reading of the Robots Exclusion Standard hasn't
>SMB> really cleared it up in my mind yet.
>[...]
>When you look at the WWW::RobotRules implementation, you will see that
>the actual comparison is done in the is_me() method, and essentially
>looks like this: [...] where $ua is the user agent "name" in the robot
>exclusion file. I.e. it checks to see whether the user agent "name" is
>part of the whole UA identifier. Which is exactly what's required.

Well, the code in full looks like this:

# is_me()
#
# Returns TRUE if the given name matches the
# name of this robot
#
sub is_me {
 my($self, $ua) = @_;
 my $me = $self->agent;
 return index(lc($me), lc($ua)) >= 0;
}

But notice that it's asking whether the /whole/ agent name (like "Foo", 
"Foo/1.2", or "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)") 
is a substring of the content in "User-Agent: ...content..." (the content 
is what's passed to $thing->is_me($content)).

I think that what it /should/ do (given what the various specs say) is this:

sub is_me {
 my($self, $ua) = @_;
 my $me = $self->agent;
 $me = $1 if $me =~ m<(\S+)>; # first word
 $me =~ s</\d+(\.\d+)*$><> or $me =~ s</\S+$><>;
   # remove version string
 return index(lc($me), lc($ua)) >= 0;
}

where that regexp extracts the "Foo" in all of: "Foo", "Foo/1.2", and 
"Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)".
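As a quick standalone check (a sketch, using an illustrative matches() 
helper rather than the real method), the corrected logic behaves like this:

```perl
use strict;
use warnings;

# Sketch of the corrected comparison: shorten the robot's full
# agent string to its first word minus version, then do the
# case-insensitive substring test against a robots.txt token.
sub matches {
    my ($agent, $ua_token) = @_;
    my $me = $agent;
    $me = $1 if $me =~ m/(\S+)/;   # first word
    $me =~ s{/.*$}{};              # remove version string
    return index(lc($me), lc($ua_token)) >= 0;
}

# A "User-Agent: Foo" record in robots.txt now applies to all of
# "Foo", "Foo/1.2", and "Foo/1.2 (Stuff Blargle Gleiven; ...)":
print matches('Foo/1.2 (Stuff Blargle Gleiven)', 'Foo')
    ? "match\n" : "no match\n";    # prints "match"
```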


E.g.,  http://www.robotstxt.org/wc/norobots.html says:
<<The robot should be liberal in interpreting this field.  A case 
insensitive substring match of the name without version information is 
recommended.>>

...note the "without version information".  Ditto the spec you cited, which 
says <<the User-Agent (HTTP) header consists of one or more words, and the 
very first word is taken to be the "name", which is referred to in the 
robot exclusion files.>>


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Tim Bray


Sean M. Burke wrote:

> I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
> implements a certain part of the Robots Exclusion Standard correctly.  So 
> forgive me if this seems a simple question, but my reading of the Robots 
> Exclusion Standard hasn't really cleared it up in my mind yet.


Is this the REP stuff out of LWP?  My opinion, based on having used it
in a BG robot and not getting flamed, is that the LWP
implementation of Robot Exclusion is as close to 100% right as you're
going to get. -Tim





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Martin Beet


Hi

 On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
SMB> I'm a bit perplexed over whether the current Perl library
SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
SMB> Standard correctly.  So forgive me if this seems a simple
SMB> question, but my reading of the Robots Exclusion Standard hasn't
SMB> really cleared it up in my mind yet.
SMB> 
SMB> Basically the current WWW::RobotRules logic is this: As a
SMB> WWW::RobotRules object is parsing the lines in the robots.txt
SMB> file, if it sees a line that says "User-Agent: ...foo...", it
SMB> extracts the foo, and if the name of the current user-agent is a
SMB> substring of "...foo...", then it considers this line as applying
SMB> to it.
[...]
SMB> However, the substring matching currently goes only one way.  So
SMB> if the user-agent object is called "Banjo/1.1
SMB> [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the
SMB> robots.txt line being parsed says "User-Agent: Thing, Woozle,
SMB> Banjo, Stuff", then the library says "'Banjo/1.1
SMB> [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a
SMB> substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT
SMB> talking to me!"
SMB> 
[...]
SMB> But before I submit a patch, I'm tempted to ask... what /is/ the
SMB> proper behavior?
[...]

I'm sorry, but I think you're mistaken:

>From the HTTP spec:
()

"User-Agent:

  This line if present gives the software program used by the original
  client. This is for statistical purposes and the tracing of protocol
  violations. It should be included. The first white space delimited
  word must be the software product name, with an optional slash and
  version designator.

  Other products which form part of the user agent may be put as
  separate words.

   <User-Agent line> = "User-Agent:" <product>+
   <product> = <name>[/<version>]
   <version> = <word>
"

That is, the User-Agent (HTTP) header consists of one or more words,
and the very first word is taken to be the "name", which is referred
to in the robot exclusion files.
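That reading of the grammar can be sketched like so (product_name() is a 
hypothetical helper for illustration, not part of any module):

```perl
use strict;
use warnings;

# Hypothetical helper: per the grammar above, take the first
# whitespace-delimited word of a User-Agent header value as the
# product, and split off the optional "/version" designator.
sub product_name {
    my ($header) = @_;
    my ($first)  = split /\s+/, $header;    # first word = product
    my ($name)   = split m{/}, $first, 2;   # drop optional /version
    return $name;
}

print product_name('Banjo/1.1 [http://nowhere.int/banjo.html]'), "\n";
# prints "Banjo"
```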

When you look at the WWW::RobotRules implementation, you will see that
the actual comparison is done in the is_me() method, and essentially
looks like this:

 index(lc($self->agent), lc($ua)) >= 0;

where $ua is the user agent "name" in the robot exclusion file. I.e.
it checks to see whether the user agent "name" is part of the whole
UA identifier. Which is exactly what's required.

Regards, Martin








[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


Oops, I just noticed that my topic has "UserAgent:"  where I meant 
"User-Agent:"


--
Sean M. Burke    [EMAIL PROTECTED]    http://www.spinn.net/~sburke/

