[Robots] matching and UserAgent: in robots.txt
I'm a bit perplexed over whether the current Perl library WWW::RobotRules implements a certain part of the Robots Exclusion Standard correctly. So forgive me if this seems a simple question, but my reading of the Robots Exclusion Standard hasn't really cleared it up in my mind yet.

Basically, the current WWW::RobotRules logic is this: as a WWW::RobotRules object parses the lines in the robots.txt file, if it sees a line that says "User-Agent: ...foo...", it extracts the foo, and if the name of the current user-agent is a substring of ...foo..., then it considers that line as applying to it. So if the agent being modeled is called Banjo, and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says: OK, 'Banjo' is a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is talking to me!

However, the substring matching currently goes only one way. So if the user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says: 'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!

I have the feeling that that's not right -- notably because it means that every robot ID string would have to appear in toto on the User-Agent line of robots.txt, which is clearly a bad thing. But before I submit a patch, I'm tempted to ask: what /is/ the proper behavior? Maybe shave the current user-agent's name at the first slash or space (getting just "Banjo"), and then see whether /that/ is a substring of a given robots.txt User-Agent: line?

-- Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/

-- This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to [EMAIL PROTECTED].
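A minimal Python sketch of the behavior Sean describes (WWW::RobotRules itself is Perl, and these helper names are hypothetical), contrasting the current one-way substring test with the proposed fix of shaving the agent name at the first slash or space:

```python
import re

def applies_one_way(agent, ua_line):
    """Current behavior: the rule applies only if the FULL agent string
    is a substring of the robots.txt User-Agent: line."""
    return agent.lower() in ua_line.lower()

def applies_truncated(agent, ua_line):
    """Proposed fix: shave the agent name at the first slash or space
    (so 'Banjo/1.1 [...]' becomes 'Banjo') before the substring test."""
    name = re.split(r"[/\s]", agent, maxsplit=1)[0]
    return name.lower() in ua_line.lower()

ua_line = "Thing, Woozle, Banjo, Stuff"
# A bare name matches either way:
assert applies_one_way("Banjo", ua_line)
# A versioned agent string fails the current one-way test...
assert not applies_one_way("Banjo/1.1 [http://nowhere.int/banjo.html]", ua_line)
# ...but matches once truncated to its bare name:
assert applies_truncated("Banjo/1.1 [http://nowhere.int/banjo.html]", ua_line)
```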
[Robots] SV: matching and UserAgent: in robots.txt
Any idea how widespread the use of this library is? We've observed some weird behavior from some of the major search engines' spiders (basically ignoring robots.txt sections) -- maybe this is the explanation?

-- Rasmus T. Mohr, Application Developer, Netpointers Intl. ApS
Vestergade 18 B, 1456 Copenhagen K, Denmark
Direct: +45 36 910 122 / Mobile: +45 28 731 827 / Phone: +45 70 117 117 / Fax: +45 70 115 115
Email: mailto:[EMAIL PROTECTED] / Website: http://www.netpointers.com
Remember that there are no bugs, only undocumented features.

-----Original Message-----
From: Sean M. Burke
Sent: 14 March 2002 11:08
To: [EMAIL PROTECTED]
Subject: [Robots] matching and UserAgent: in robots.txt

> I'm a bit perplexed over whether the current Perl library WWW::RobotRules
> implements a certain part of the Robots Exclusion Standard correctly. [...]
[Robots] Re: matching and UserAgent: in robots.txt
Sean M. Burke wrote:
> I'm a bit perplexed over whether the current Perl library WWW::RobotRules
> implements a certain part of the Robots Exclusion Standard correctly. So
> forgive me if this seems a simple question, but my reading of the Robots
> Exclusion Standard hasn't really cleared it up in my mind yet.

Is this the REP stuff out of LWP? My opinion, based on having used it in a BG robot and not getting flamed, is that the LWP implementation of Robot Exclusion is as close to 100% right as you're going to get.

-Tim
[Robots] Re: SV: matching and UserAgent: in robots.txt
LWP? Very popular in a big Perl community.

--- Rasmus Mohr [EMAIL PROTECTED] wrote:
> Any idea how widespread the use of this library is? We've observed some
> weird behaviors from some of the major search engines' spiders (basically
> ignoring robots.txt sections) - maybe this is the explanation?
[Robots] better language for writing a Spider ?
Hello,

I am working on robot development, in Java. We are developing a search engine; almost the complete engine is developed, and we used Java for the development. But the performance of the Java API in fetching web pages is too low, so we basically developed our own URL connection, as some features we need, like timeouts, are not supported by the java.net.URLConnection API. Though there are better spiders in Java, like Mercator, we could not achieve better performance with our product.

Now, as the performance is low, we want to redevelop our spider in a language like C or Perl and use it with our existing product. I will be thankful if anyone can help me choose a better language, where I can get better performance.

Thanks in advance,
Mohan
[Robots] Re: SV: matching and User-Agent: in robots.txt
Certainly LWP is widely used, but I think it's an open question how many LWP users use the robots.txt capabilities. I have used LWP extensively, but have never bothered with the latter. My robots target a handful of sites and really don't recurse, as such, so I just keep an eye on those sites' policies. And they tend to be very large, busy sites, so I'm a mere blip in their stats, I assume... which is not to say that I would lightly ignore anyone's wishes regarding robots. But I'm not really doing the usual search engine robot thing of sucking down every page. I'm heavily focused on tools that figure out which pages are most significant, so my robots behave more like people would... which I hope leaves me a bit more free.

Going back to the original question... I can't quite see why anyone would give a robot a name like "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]". But if that's the name, then that's what robots.txt should reference. A robots.txt that contains a directive for a robot named Banjo should either be referring to another robot or it has the wrong name. I think the original poster has confused (conflated, actually) the HTTP User-Agent and From headers.

  $ua = LWP::RobotUA->new($agent_name, $from, [$rules])

Your robot's name and the mail address of the human responsible for the robot (i.e. you) are required by the constructor. Create a user-agent object thus:

  $ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html [EMAIL PROTECTED]')

The string that gets compared with robots.txt is "Banjo/1.1". That's the HTTP User-Agent header. The second parameter is the HTTP From header, which allows the target site's administrator to find you (easily) if your robot misbehaves. Of course, it isn't special to robots; any HTTP client can send a From header (the default behavior of which in some clients led to much controversy years ago, of course).

From the LWP docs: "The from attribute can be set to the e-mail address of the person responsible for running the application. If this is set, then the address will be sent to the servers with every request."

Hope that's reasonably clear.

Nick

-----Original Message-----
From: Otis Gospodnetic
Sent: Thursday, March 14, 2002 8:57 AM
To: [EMAIL PROTECTED]
Subject: [Robots] Re: SV: matching and UserAgent: in robots.txt

> LWP? Very popular in a big Perl community.
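The User-Agent/From distinction described above is plain HTTP, not anything LWP-specific. A hedged Python sketch (agent name and address are the hypothetical ones from the thread) showing the two separate headers a polite robot would send:

```python
from urllib.request import Request

# User-Agent is the robot's name -- the string robots.txt rules match against.
# From is the contact address of the human responsible for the robot.
req = Request("http://example.com/", headers={
    "User-Agent": "Banjo/1.1",
    "From": "operator@example.com",
})

# urllib stores header keys capitalized ("User-agent"), so look them up that way.
assert req.get_header("User-agent") == "Banjo/1.1"
assert req.get_header("From") == "operator@example.com"
```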
[Robots] Re: better language for writing a Spider ?
Having worked in Perl and Python, I'll recommend Python. Although I haven't been using it for long, I'm definitely more productive with it. Performance seems fine, though I haven't really pushed hard on it. I'm not seeing long, mysterious time-outs as I occasionally did with LWP. And I hit some weird bug in LWP a few weeks ago, which resulted in a strange error message that I eventually discovered was coming out of the expat DLL for XML. Instead of retrieving the page I wanted, it was misinterpreting a server error. I wish I could be more specific, but I never did figure out what was really going on. Following an LWP request through the debugger is a long and convoluted journey...

Nick

-----Original Message-----
From: srinivas mohan
Sent: Thursday, March 14, 2002 9:48 AM
To: [EMAIL PROTECTED]
Subject: [Robots] better language for writing a Spider ?

> Now, as the performance is low, we want to redevelop our spider in a
> language like C or Perl and use it with our existing product. [...]
[Robots] Re: better language for writing a Spider ?
> We used Java for the development, but the performance of the Java API in
> fetching web pages is too low; we basically developed our own URL
> connection, as some features we need, like timeouts, are not supported by
> the java.net.URLConnection API.

Look at Java 1.4; it addresses these issues (socket timeouts, non-blocking IO, etc.).

> Though there are better spiders in Java, like Mercator, we could not
> achieve better performance with our product.

I thought Mercator numbers were pretty good, no?

> Now, as the performance is low, we want to redevelop our spider in a
> language like C or Perl and use it with our existing product.

You could look at Python; Ultraseek was/is written in it, from what I remember. Also, obviously Perl has been used for writing big crawlers, so you can use that, too.

> I will be thankful if anyone can help me choose a better language, where
> I can get better performance.

Of course, the choice of a language is not a performance panacea.

Otis
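The timeout support being asked about is the same idea in any language. As a cross-language illustration (in Python, since the thread spans several languages; the five- and ten-second values are purely illustrative), a fetch socket can be given a timeout so a hung server raises an error instead of blocking a crawler thread forever:

```python
import socket

# Per-socket timeout: a stalled connect/read raises socket.timeout
# instead of hanging the crawler thread indefinitely.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(5.0)   # seconds; illustrative value
assert s.gettimeout() == 5.0
s.close()

# Process-wide default, covering libraries that open their own sockets.
socket.setdefaulttimeout(10.0)
assert socket.getdefaulttimeout() == 10.0
```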
[Robots] Avoiding Bot-Bait pages and wspoison pages
Hello everyone,

How do you create your web crawlers in such a way that they step over bot-bait pages like wspoison? Do you simply include them in a list of URLs to avoid? Or do you keep track of web sites with unusually large numbers of pages -- say, abandoning a site or sending an alert once it serves up about 200 pages?

stephen
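The second option in the question -- a per-host page budget with an alert -- can be sketched as follows. This is a hypothetical helper, not from the list discussion; the 200-page threshold is the one the question mentions:

```python
from collections import Counter
from urllib.parse import urlparse

PER_HOST_LIMIT = 200   # threshold from the question; tune per crawl
pages_seen = Counter()
flagged = set()        # hosts that blew their budget (possible bot traps)

def should_fetch(url):
    """Return True while a host is under its page budget;
    once the budget is exhausted, flag the host and stop fetching it."""
    host = urlparse(url).netloc
    if pages_seen[host] >= PER_HOST_LIMIT:
        flagged.add(host)   # candidate for human review / avoid-list
        return False
    pages_seen[host] += 1
    return True

# A trap generating endless pages gets cut off at the limit:
for i in range(250):
    should_fetch(f"http://trap.example.com/page{i}")
assert pages_seen["trap.example.com"] == PER_HOST_LIMIT
assert "trap.example.com" in flagged
```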
[Robots] Re: matching and UserAgent: in robots.txt
-----Original Message-----
From: Sean M. Burke

> E.g., http://www.robotstxt.org/wc/norobots.html says:
>
>   User-agent [...] The robot should be liberal in interpreting this
>   field. A case insensitive substring match of the name without version
>   information is recommended.
>
> ...note the "without version information". Ditto the spec you cited,
> which says that the User-Agent (HTTP) header consists of one or more
> words, and the very first word is taken to be the name, which is
> referred to in the robot exclusion files.

Ah, now I see your point. That does seem to be a problem, since apparently version numbers were contemplated in User-Agent headers... Sounds like something for the LWP author(s). Or a convenient excuse for a badly behaved robot...!

Nick
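The recommendation quoted above -- a case-insensitive substring match of the name without version information -- would look something like this sketch (a hypothetical helper, not the actual LWP code):

```python
def rule_applies(http_user_agent, robots_ua_field):
    """Per the quoted spec: the name is the first word of the HTTP
    User-Agent header, with any /version suffix stripped; match it
    case-insensitively as a substring of the robots.txt field."""
    name = http_user_agent.split("/")[0].split()[0]
    return name.lower() in robots_ua_field.lower()

# The versioned agent string from the thread now matches its rule:
assert rule_applies("Banjo/1.1 [http://nowhere.int/banjo.html]", "banjo")
assert rule_applies("Banjo/1.1", "Thing, Woozle, Banjo, Stuff")
# ...while an unrelated robot still does not:
assert not rule_applies("Woozle/2.0", "banjo")
```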
[Robots] Re: better language for writing a Spider ?
At 09:47 AM 14/03/02 -0800, srinivas mohan wrote:
> Now, as the performance is low, we want to redevelop our spider in a
> language like C or Perl and use it with our existing product. I will be
> thankful if anyone can help me choose a better language, where I can get
> better performance.

You'll never get better performance until you understand why you had lousy performance before. It's not obvious to me why Java should get in the way.

I've written two very large robots and used perl both times. There were two good reasons to choose perl:

- A robot fetches pages, analyzes them, and manages a database of been-processed and to-process. The fetching involves no CPU. The database is probably the same in whatever language you use. Thus the leftover computation is picking apart pages looking for URLs and BASE values and so on... perl is hard to beat for that type of code.

- Time-to-market was critical. Using perl means you have to write much less code than in java or C or whatever, so you get done quicker.

It's not clear that you can write a robot that runs faster than a well-done perl one. It is clear you can write one that's much more maintainable; perl makes it too easy to write obfuscated code. Another disadvantage of perl is the large memory footprint -- since a robot needs to be highly parallel, you probably can't afford a perl process per execution thread.

Next time I might go with python. Its regexp engine isn't quite as fast, but the maintainability is better.

-Tim
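Tim's description of the core loop -- fetch pages, pick them apart for URLs, and manage been-processed/to-process sets -- is language-neutral. A minimal sketch (in Python rather than Tim's perl; the regex and names are illustrative, and real crawlers must also resolve relative URLs and BASE values, which this omits):

```python
import re
from collections import deque

# Crude link extractor standing in for "picking apart pages looking for URLs".
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def crawl(fetch, seeds, limit=100):
    """Breadth-first crawl: 'seen' is the been-processed set,
    'frontier' the to-process queue."""
    seen, frontier, fetched = set(seeds), deque(seeds), []
    while frontier and len(fetched) < limit:
        url = frontier.popleft()
        html = fetch(url)        # network-bound; no CPU to speak of
        fetched.append(url)
        for link in HREF_RE.findall(html):   # the CPU-bound part
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched

# Tiny in-memory "web" standing in for real HTTP fetches.
pages = {
    "a": '<a href="b"><a href="c">',
    "b": '<a href="c">',
    "c": "",
}
assert crawl(pages.get, ["a"]) == ["a", "b", "c"]
```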
[Robots] Re: better language for writing a Spider ?
At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:
> I wish I could be more specific, but I never did figure out what was
> really going on. Following an LWP request through the debugger is a long
> and convoluted journey...

I totally agree with Nick that when LWP works, it's OK, but when it doesn't, debugging is beyond the scope of mere mortals. And it just doesn't do timeouts or input throttling. I tried to get it to do timeouts, it didn't, and when I went and found the appropriate discussion group, half the messages were about trouble with timeouts... mind you, that was early 2000; maybe things have improved?

-Tim
[Robots] Re: matching and UserAgent: in robots.txt
At 12:49 2002-03-14 -0800, Nick Arnett wrote:
> [...] That does seem to be a problem, since apparently version numbers
> were contemplated in User-Agent headers... Sounds like something for the
> LWP author(s).

Yes, we are (hereby) thinking about it. I thought I'd seek the wisdom of the list on this before bringing it up with the others, tho.

-- Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/
[Robots] Re: SV: matching and User-Agent: in robots.txt
How can the UA be set in Java?

> Create a user-agent object thus:
>
>   $ua = LWP::RobotUA->new('Banjo/1.1', 'http://nowhere.int/banjo.html [EMAIL PROTECTED]')
[Robots] Re: better language for writing a Spider ?
Hello,

Thank you for the suggestions on selecting a language for writing a spider. I have decided to go with Python, but I still have a small idea of testing my Java spider compiled to native code for the Windows platform, and checking for any improvement. Can you suggest any open source compilers to compile my Java code to native code?

Thanks in advance,
Mohan

--- Tim Bray [EMAIL PROTECTED] wrote:
> I totally agree with Nick that when LWP works, it's OK, but when it
> doesn't, debugging is beyond the scope of mere mortals. [...]