[Robots] matching and UserAgent: in robots.txt

2002-03-14 Thread Sean M. Burke


I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
implements a certain part of the Robots Exclusion Standard correctly.  So 
forgive me if this seems a simple question, but my reading of the Robots 
Exclusion Standard hasn't really cleared it up in my mind yet.


Basically the current WWW::RobotRules logic is this:
As a WWW::RobotRules object is parsing the lines in the robots.txt file, 
if it sees a line that says User-Agent: ...foo..., it extracts the foo, 
and if the name of the current user-agent is a substring of ...foo..., 
then it considers this line as applying to it.

So if the agent being modeled is called Banjo, and the robots.txt line 
being parsed says User-Agent: Thing, Woozle, Banjo, Stuff, then the 
library says OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', 
so this rule is talking to me!

However, the substring matching currently goes only one way.  So if the 
user-agent object is called Banjo/1.1 [http://nowhere.int/banjo.html 
[EMAIL PROTECTED]] and the robots.txt line being parsed says User-Agent: 
Thing, Woozle, Banjo, Stuff, then the library says 'Banjo/1.1 
[http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 
'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!
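
In other words, the test currently amounts to something like this (a 
sketch of the logic as just described, not the module's actual code):

    # Current one-way check (sketch): does my full agent name appear
    # anywhere in the User-Agent: line from robots.txt?
    if ( index($useragent_line, $my_full_name) >= 0 ) {
        # ...then this record applies to me
    }

A long name like 'Banjo/1.1 [...]' can never pass that test against a 
short line like 'Thing, Woozle, Banjo, Stuff'.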

I have the feeling that that's not right -- notably because that means that 
every robot ID string has to appear in toto on the User-Agent robots.txt 
line, which is clearly a bad thing.
But before I submit a patch, I'm tempted to ask... what /is/ the proper 
behavior?

Maybe shave the current user-agent's name at the first slash or space 
(getting just Banjo), and then see if /that/ is a substring of a given 
robots.txt User-Agent: line?
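
Something like this sketch, say (the lc() calls are my addition, since 
the spec recommends a case-insensitive match):

    # Proposed check (sketch): shave the agent name at the first
    # slash or whitespace, then do the substring test on that.
    (my $short_name = $my_full_name) =~ s{[/\s].*}{}s;
    if ( index(lc $useragent_line, lc $short_name) >= 0 ) {
        # ...then this record applies to me
    }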

--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] SV: matching and UserAgent: in robots.txt

2002-03-14 Thread Rasmus Mohr


Any idea how widespread the use of this library is? We've observed some
weird behaviors from some of the major search engines' spiders (basically
ignoring robots.txt sections) - maybe this is the explanation?

--
Rasmus T. Mohr        Direct  : +45 36 910 122
Application Developer Mobile  : +45 28 731 827
Netpointers Intl. ApS Phone   : +45 70 117 117
Vestergade 18 B       Fax     : +45 70 115 115
1456 Copenhagen K     Email   : mailto:[EMAIL PROTECTED]
Denmark               Website : http://www.netpointers.com

Remember that there are no bugs, only undocumented features.
--




[Robots] Re: matching and UserAgent: in robots.txt

2002-03-14 Thread Tim Bray


Sean M. Burke wrote:

 I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
 implements a certain part of the Robots Exclusion Standard correctly.  So 
 forgive me if this seems a simple question, but my reading of the Robots 
 Exclusion Standard hasn't really cleared it up in my mind yet.


Is this the REP stuff out of LWP?  My opinion, based on having used it
in a BG robot and not getting flamed, is that the LWP
implementation of Robot Exclusion is as close to 100% right as you're
going to get. -Tim





[Robots] Re: SV: matching and UserAgent: in robots.txt

2002-03-14 Thread Otis Gospodnetic


LWP?  Very popular in the big Perl community.

--- Rasmus Mohr [EMAIL PROTECTED] wrote:
 
 Any idea how widespread the use of this library is? We've observed
 some weird behaviors from some of the major search engines' spiders
 (basically ignoring robots.txt sections) - maybe this is the
 explanation?
 





[Robots] better language for writing a Spider ?

2002-03-14 Thread srinivas mohan


Hello,

I am working on robot development in Java. We are developing a search
engine; almost the complete engine is developed. We used Java for the
development, but the performance of the Java API in fetching web pages
is too low. Basically we developed our own URL connection, as some
features we need, like timeouts, are not supported by the
java.net.URLConnection API.

Though there are better spiders in Java, like Mercator, we could not
achieve better performance with our product.

Now, as the performance is low, we want to redevelop our spider in a
language like C or Perl and use it with our existing product.

I will be thankful if anyone can help me choose a better language,
where I can get better performance.

Thanks in advance
Mohan






[Robots] Re: SV: matching and User-Agent: in robots.txt

2002-03-14 Thread Nick Arnett


Certainly LWP is widely used, but I think it's an open question as to how
many LWP users use the robots.txt capabilities.  I have used LWP
extensively, but have never bothered with the latter.  My robots target a
handful of sites and really don't recurse, as such, so I just keep an eye on
those sites' policies.  And they tend to be very large, busy sites, so I'm a
mere blip in their stats, I assume... which is not to say that I would
lightly ignore anyone's wishes regarding robots.  But I'm not really doing
the usual search engine robot thing of sucking down every page.  I'm heavily
focused on tools that figure out which pages are most significant, so my
robots behave more like people would... which I hope leaves me a bit more
free.

Going back to the original question... I can't quite see why anyone would
give a robot a name like Banjo/1.1 [http://nowhere.int/banjo.html
[EMAIL PROTECTED]].  But if that's the name, then that's what robots.txt
should reference.  A robots.txt that contains a directive for a robot
named Banjo is either referring to some other robot or using the wrong
name.

I think the original poster has confused (conflated, actually) the HTTP
User-Agent and From headers.

 $ua = LWP::RobotUA->new($agent_name, $from, [$rules])

 Your robot's name and the mail address of the human responsible for
 the robot (i.e. you) is required by the constructor.

Create a user-agent object thus:

$ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html
[EMAIL PROTECTED]')

The string that gets compared with robots.txt is Banjo/1.1.  That's the
HTTP User-Agent header.  The second parameter is the HTTP From header,
which allows the target site's administrator to find you (easily) if your
robot misbehaves.  Of course, it isn't special to robots.  Any HTTP client
can send a From header (the default behavior of which in some clients led
to much controversy years ago, of course).

From the LWP docs: "The from attribute can be set to the e-mail address of
the person responsible for running the application. If this is set, then the
address will be sent to the servers with every request."
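
Putting it together, a minimal sketch (the names and the address
placeholder are just carried over from the examples above):

    use LWP::RobotUA;
    use HTTP::Request;

    # First argument becomes the User-Agent header; the second becomes
    # the From header that lets a site administrator reach you.
    my $ua = LWP::RobotUA->new('Banjo/1.1', '[EMAIL PROTECTED]');
    $ua->delay(1);    # minutes between requests to the same server

    my $response = $ua->request(
        HTTP::Request->new(GET => 'http://nowhere.int/banjo.html'));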

Hope that's reasonably clear.

Nick

 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
 Behalf Of Otis Gospodnetic
 Sent: Thursday, March 14, 2002 8:57 AM
 To: [EMAIL PROTECTED]
 Subject: [Robots] Re: SV: matching and UserAgent: in robots.txt



 LWP?  Very popular in the big Perl community.





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Nick Arnett


Having worked in Perl and Python, I'll recommend Python.  Although I haven't
been using it for long, I'm definitely more productive with it.  Performance
seems fine, though I haven't really pushed hard on it.  I'm not seeing long,
mysterious time-outs as I occasionally did with LWP.  And I hit some weird
bug in LWP a few weeks ago, which resulted in a strange error message that I
eventually discovered was coming out of the expat DLL for XML.  Instead of
retrieving the page I wanted, it was misinterpreting a server error.  I wish
I could be more specific, but I never did figure out what was really going
on.  Following an LWP request through the debugger is a long and convoluted
journey...

Nick






[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Otis Gospodnetic


 I am working on robot development in Java. We are developing a
 search engine; almost the complete engine is developed. We used Java
 for the development, but the performance of the Java API in fetching
 web pages is too low. Basically we developed our own URL connection,
 as some features we need, like timeouts, are not supported by the
 java.net.URLConnection API.

Look at Java 1.4; it addresses these issues (socket timeouts,
non-blocking I/O, etc.).

 Though there are better spiders in Java, like Mercator, we could not
 achieve better performance with our product.

I thought Mercator numbers were pretty good, no?

 Now, as the performance is low, we want to redevelop our spider in a
 language like C or Perl and use it with our existing product.

You could look at Python; Ultraseek was/is written in it, from what I
remember.
Also, obviously Perl has been used for writing big crawlers, so you can
use that, too.

 I will be thankful if anyone can help me choose a better language,
 where I can get better performance.

Of course, the choice of a language is not a performance panacea.

Otis





[Robots] Avoiding Bot-Bait pages and wspoison pages

2002-03-14 Thread Stephen Sutherland


Hello everyone,

How do you guys create your web crawler in such a way that it steps
over bot-bait pages like WSPoison?

Do you simply include them in a list of URLs to avoid?

Or do you keep track of web sites with unusually large numbers of
pages (say, a site with more than about 200 pages) before abandoning
them or sending an alert?
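
A sketch of that last approach (hypothetical names, with the 200-page
threshold from above):

    use URI;

    my %pages_fetched;      # per-host fetch counter
    my $MAX_PAGES = 200;    # abandon a host beyond this many pages

    sub should_fetch {
        my ($url) = @_;
        my $host = URI->new($url)->host;
        return ++$pages_fetched{$host} <= $MAX_PAGES;
    }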

stephen  




[Robots] Re: matching and UserAgent: in robots.txt

2002-03-14 Thread Nick Arnett




 -----Original Message-----
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On
 Behalf Of Sean M. Burke

...

 E.g., http://www.robotstxt.org/wc/norobots.html says: "User-agent
 [...] The robot should be liberal in interpreting this field.  A
 case insensitive substring match of the name without version
 information is recommended."

 ...note the "without version information".  Ditto the spec you
 cited, which says "That is, the User-Agent (HTTP) header consists of
 one or more words, and the very first word is taken to be the name,
 which is referred to in the robot exclusion files."

Ah, now I see your point.  That does seem to be a problem, since apparently
version numbers were contemplated in User-Agent headers...  Sounds like
something for the LWP author(s).

Or, a convenient excuse for a badly behaved robot... !

Nick





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Tim Bray


At 09:47 AM 14/03/02 -0800, srinivas mohan wrote:
 Now, as the performance is low, we want to redevelop our spider in a
 language like C or Perl and use it with our existing product.

 I will be thankful if anyone can help me choose a better language,
 where I can get better performance.

You'll never get better performance until you understand why you
had lousy performance before.  It's not obvious to me why Java
should get in the way.

I've written two very large robots and used perl both times.
There were two good reasons to choose perl:

- A robot fetches pages, analyzes them, and manages a database
  of been-processed and to-process.  The fetching involves no CPU.
  The database is probably the same in whatever language you use.
  Thus the leftover computation is picking apart pages looking
  for URLs and BASE values and so on... perl is hard to beat
  for that type of code (see the sketch below).
- Time-to-market was critical.  Using perl means you have to write
  much less code than in java or C or whatever, so you get done
  quicker.
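
For instance, the picking-apart step can be a couple of regexps (a
sketch only; a real robot would use a proper parser such as
HTML::LinkExtor):

    # Pull the BASE value and the HREF targets out of a fetched page.
    my ($base) = $html =~ m{<base\s+href\s*=\s*["']?([^"'\s>]+)}i;
    my @urls   = $html =~ m{<a\s[^>]*href\s*=\s*["']?([^"'\s>]+)}gi;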

It's not clear that you can write a robot to run faster than a
well-done perl one.  It is clear you can write one that's much
more maintainable; perl makes it too easy to write obfuscated code.
Another disadvantage of perl is the large memory footprint - since
a robot needs to be highly parallel, you probably can't afford to
have a perl process per execution thread.

Next time I might go with python.  Its regexp engine isn't quite 
as fast, but the maintainability is better.  -Tim





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Tim Bray


At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:

  I wish
I could be more specific, but I never did figure out what was really going
on.  Following an LWP request through the debugger is a long and convoluted
journey...

I totally agree with Nick that when LWP works, it's OK, but when
it doesn't, debugging is beyond the scope of mere mortals.  And
it just doesn't do timeouts or input throttling.  I tried to
get it to do timeouts, it didn't, and when I went and found the
appropriate discussion group, half the messages were about trouble
with timeouts... mind you, that was early 2000; maybe things have
improved? -Tim
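
For reference, the nominal interface is just this (a sketch; whether
it actually interrupts a stuck read is, as above, another matter):

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->timeout(30);    # seconds before a request is abandoned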





[Robots] Re: matching and UserAgent: in robots.txt

2002-03-14 Thread Sean M. Burke


At 12:49 2002-03-14 -0800, Nick Arnett wrote:
[...] That does seem to be a problem, since apparently
version numbers were contemplated in User-Agent headers...  Sounds like
something for the LWP author(s).

Yes, we are (hereby) thinking about it.
I thought I'd seek the wisdom of the list on this before bringing it up 
with the others, tho.


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/





[Robots] Re: SV: matching and User-Agent: in robots.txt

2002-03-14 Thread Thomas Huber



How can the UA be set in Java?

 Create a user-agent object thus:
 
 $ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html
 [EMAIL PROTECTED]')








[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread srinivas mohan


Hello
Thank you for the suggestions on selecting a language for writing a
spider.

I have decided to go with Python, but I still have a small idea of
testing my Java spider compiled to native code for the Windows
platform, and checking for any improvement.

Can you suggest any open source compilers to compile my Java code to
native code?

Thanks in advance,
Mohan

