RE: [Robots] robot in python?

2003-11-26 Thread Sean M. Burke
At 11:47 PM 2003-11-17, SsolSsinclair wrote:
>Open Source is a project which came into being through a collective
>effort. Intelligence matching Intelligence. This movement cannot be
>stopped or prevented, SHORT of ceasing communication of all [resulting in
>Deaf Silence, and the Elimination of Sound as a sensory perception,
>clearly not in the interest of any individual or body or civilization],
>if it were possible in the first place.
You talk funny!

This pleases me.

--
Sean M. Burke   http://search.cpan.org/~sburke/


[Robots] leading whitespace in robots.txt files

2002-03-25 Thread Sean M. Burke


Recently I watched LWP's WWW::RobotRules encounter a robots.txt file that looked
like this:

#
User-agent: *
 Disallow: /cgi-bin/
 Disallow: /~mojojojo/misc/

It complained about the Disallow lines being "unexpected".

The regexp it was using for these things is:
   /^Disallow:\s*(.*)/i

So I've changed it to this, and was about to submit it as a patch for the 
next LWP release:
   /^\s*Disallow:\s*(.*)/i
   # Silently forgive leading whitespace.
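
For instance, a quick way to see the difference (a throwaway test -- the
sample lines are invented, not from any real robots.txt):

   use strict;
   for my $line ("Disallow: /cgi-bin/", "  Disallow: /cgi-bin/") {
     if ($line =~ /^\s*Disallow:\s*(.*)/i) {
       print "disallowed path: $1\n";
     } else {
       print "unexpected line: $line\n";
     }
   }

With the old pattern only the first line matches; with the new one, both
print "disallowed path: /cgi-bin/".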

But first, I thought I'd ask the list here: does anyone think this'd break 
anything?  I sure hope no-one out there is using leading-whitespace lines 
as comments, or as RFC-822-style continuation lines!
Thoughts, anyone?

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Sean M. Burke


At 10:36 2002-03-14 -0800, Nick Arnett wrote:
>[with Python] I'm not seeing long, mysterious time-outs as I occasionally 
>did with LWP.

I have never run into this problem myself, but I have a dim memory that you 
may be alluding to a known bug not in LWP itself, but in old versions (long 
since fixed in modern Perls and/or CPAN releases) of the IO::* socket libraries.

>Following an LWP request through the debugger is a long and convoluted
>journey...

Are you referring to perl -d, or LWP::Debug?
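
(For reference, the two routes I mean: stepping through with "perl -d
myspider.pl", versus turning on LWP's own tracing.  A minimal sketch of the
latter -- the URL is just a placeholder:

   use LWP::Debug qw(+);    # turn on all of LWP's internal tracing
   use LWP::UserAgent;
   use HTTP::Request::Common qw(GET);
   my $ua = LWP::UserAgent->new;
   my $response = $ua->request(GET 'http://www.example.int/');

The "+" import enables all of LWP::Debug's message types -- trace, debug,
and conns -- which shows each step of the request without a long trip
through the debugger.)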

Maybe I should write an addendum to "lwpcook.pod" on figuring out what's 
going wrong, when something does go wrong.  The current lwpcook really 
needs an overhaul, and once my book /Perl & LWP/ is done (hopefully it'll 
be in press within a few weeks), I hope to send up some big doc patches to 
LWP, at the very least revamping lwpcook and then going into each class and 
noting in the docs whether a typical user needs to bother knowing about 
it.  (E.g., you need to know about HTTP::Response; you do /not/ need to 
know about LWP::Protocol.)

In short, if people want to see improvements to LWP, email me and say what 
you want done, and I'll either try my hand at implementing it, or I'll pass 
it on to someone more capable.  LWP is not the product of a massive 
bureaucracy, but of few enough people that you could fit all of us in a 
phone booth.  We're all manically busy, to varying degrees (companies to 
run, children to raise, books/articles/modules to write, etc.), but we do 
at times manage to do what needs doing, if it's pointed out clearly enough 
to stand out from the torrent of email messages (which I find incessantly 
discouraging) that manage no better than "halo I try to use LWP with hotmel 
but not work plz hlp k thx".


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-15 Thread Sean M. Burke


I dug around more in Perl LWP's WWW::RobotRules module, and the short story 
is that the bug I found exists, but it's not as bad as I thought.
If you set up a user agent with the name "Foobar/1.23", a WWW::RobotRules 
object actually /does/ currently know to strip off the "/1.23" (this 
happens in the 'agent' method, not in the is_me method where I expected it).

The current bug surfaces only when your user-agent name is more than one 
word; if your user-agent name is "Foobar/1.23 [[EMAIL PROTECTED]]", the 
current 'agent' method's logic says "well, it doesn't end in 
'/number.number', so there's no version to strip off".

So I'm going to send Gisle Aas a patch so that the first word, minus any 
version suffix, is what's used for matching.  It's just a matter of adding 
a line saying:
 $name = $1 if $name =~ m/(\S+)/; # get first word
in the 'agent' method.
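
With that in place, all three shapes of agent name (made-up examples)
should normalize to the same short name.  A sketch of the intended effect,
not the module's actual code:

   for my $name ("Foobar", "Foobar/1.23", "Foobar/1.23 [http://foobar.int]") {
     my $short = $name;
     $short = $1 if $short =~ m/(\S+)/;  # get first word
     $short =~ s{/.*$}{};                # then strip any version suffix
     print "$name => $short\n";          # each prints "... => Foobar"
   }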


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


At 12:49 2002-03-14 -0800, Nick Arnett wrote:
>[...]That does seem to be a problem, since apparently
>version numbers were contemplated in User-Agent headers...  Sounds like
>something for the LWP author(s).

Yes, we are (hereby) thinking about it.
I thought I'd seek the wisdom of the list on this before bringing it up 
with the others, tho.


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


At 12:47 2002-03-14 +0100, Martin Beet wrote:
>  On Thu, 14 Mar 2002 03:08:21 -0700, Sean M Burke (SMB) said
>SMB> I'm a bit perplexed over whether the current Perl library
>SMB> WWW::RobotRules implements a certain part of the Robots Exclusion
>SMB> Standard correctly.  So forgive me if this seems a simple
>SMB> question, but my reading of the Robots Exclusion Standard hasn't
>SMB> really cleared it up in my mind yet.
>[...]
>When you look at the WWW::RobotRules implementation, you will see that
>the actual comparison is done in the is_me() method, and essentially
>looks like this: [...] where $ua is the user agent "name" in the robot
>exclusion file. I.e. it checks to see whether the user agent "name" is
>part of the whole UA identifier. Which is exactly what's required.

Well, the code in full looks like this:

# is_me()
#
# Returns TRUE if the given name matches the
# name of this robot
#
sub is_me {
    my($self, $ua) = @_;
    my $me = $self->agent;
    return index(lc($ua), lc($me)) >= 0;
}

But notice that it's asking whether the /whole/ agent name (like "Foo", 
"Foo/1.2", or "Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)") 
is a substring of the content in "User-Agent: ...content..." (the content 
being what's passed to $thing->is_me($content)).

I think that what it /should/ do (given what the various specs say) is this:

sub is_me {
    my($self, $ua) = @_;
    my $me = $self->agent;
    $me = $1 if $me =~ m<(\S+)>;  # first word
    $me =~ s</.*$><>;             # remove version string
    return index(lc($ua), lc($me)) >= 0;
}

where those two added lines reduce each of "Foo", "Foo/1.2", and 
"Foo/1.2 (Stuff Blargle Gleiven; hoohah blingbling1231451)" to just "Foo".


E.g., http://www.robotstxt.org/wc/norobots.html says:

   The robot should be liberal in interpreting this field. A case
   insensitive substring match of the name without version information
   is recommended.

...note the "without version information".  Ditto the spec you cited, 
which says much the same.


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


Oops, I just noticed that my subject line says "UserAgent:" where I meant 
"User-Agent:".


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] matching and "UserAgent:" in robots.txt

2002-03-14 Thread Sean M. Burke


I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
implements a certain part of the Robots Exclusion Standard correctly.  So 
forgive me if this seems a simple question, but my reading of the Robots 
Exclusion Standard hasn't really cleared it up in my mind yet.


Basically the current WWW::RobotRules logic is this:
As a WWW::RobotRules object is parsing the lines in the robots.txt file, 
if it sees a line that says "User-Agent: ...foo...", it extracts the foo, 
and if the name of the current user-agent is a substring of "...foo...", 
then it considers this line as applying to it.

So if the agent being modeled is called "Banjo", and the robots.txt line 
being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the 
library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', 
so this rule is talking to me!"

However, the substring matching currently goes only one way.  So if the 
user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html 
[EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: 
Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1 
[http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 
'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"

I have the feeling that that's not right -- notably because that means that 
every robot ID string has to appear in toto on the "User-Agent" robots.txt 
line, which is clearly a bad thing.
But before I submit a patch, I'm tempted to ask... what /is/ the proper 
behavior?

Maybe shave the current user-agent's name at the first slash or space 
(getting just "Banjo"), and then see if /that/ is a substring of a given 
robots.txt "User-Agent:" line?

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: Perl and LWP robots

2002-03-07 Thread Sean M. Burke


The replies to my request for advice have been very helpful! I'll pick one 
and reply to it:

At 10:01 2002-03-07 -0800, Otis Gospodnetic wrote:
>[about my forthcoming book]
>(i.e. I'm a potential customer :))  When will it be published?

It's probably going into tech edit later this month.  So it'll probably be 
out this summer.  (Altho bear in mind that I live in New Mexico, where 
summer is just about everything between February and December.)


>I think lots of people do want to know about recursive spiders, and I
>bet one of the most frequent obstacles are issues like: queueing, depth
>vs. breadth first crawling, (memory) efficient storage of extracted and
>crawled links, etc.

I'm getting the feeling that I should see spiders as being of two kinds: 
kinds that spider everything under a given URL (like 
"http://www.speech.cs.cmu.edu/~sburke/pub/" or "http://www."), and kinds 
that go hog-wild across all of the Web.

The usefulness of the single-host spiders is pretty obvious to me.
But why do people want to write spiders that potentially span all/any hosts?
(Aside from people who are working for Google or similar.)

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Perl and LWP robots

2002-03-07 Thread Sean M. Burke


Hi all!
My name is Sean Burke, and I'm writing a book for O'Reilly, which is 
basically to replace Clinton Wong's now out-of-print /Web Client 
Programming with Perl/.  In my book draft so far, I haven't discussed 
actual recursive spiders (I've only discussed getting a given page, and 
then every page that it links to that is also on the same host), since I 
think that most readers who think they want a recursive spider really don't.
But it has been suggested that I cover recursive spiders, just for sake of 
completeness.

Aside from basic concepts (don't hammer the server; always obey the 
robots.txt; don't span hosts unless you are really sure that you want to), 
are there any particular bits of wisdom that list members would want me to 
pass on to my readers?
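
For example, the sort of skeleton I'd show for a polite spider (a minimal
sketch; the bot name, address, and URL are placeholders):

   use strict;
   use LWP::RobotUA;
   use HTTP::Request::Common qw(GET);
   my $ua = LWP::RobotUA->new('MyBot/1.0', 'mybot@example.int');
   $ua->delay(1);   # wait at least a minute between requests to any one host
   my $response = $ua->request(GET 'http://www.example.int/');
   print $response->status_line, "\n";

LWP::RobotUA fetches and honors each host's robots.txt by itself, and
enforces the delay, which covers the first two rules above.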

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/

