Any idea how widespread the use of this library is? We've observed some weird behaviors from some of the major search engines' spiders (basically ignoring robots.txt sections) - maybe this is the explanation?
--------------------------------------------------------------
Rasmus T. Mohr            Direct  : +45 36 910 122
Application Developer     Mobile  : +45 28 731 827
Netpointers Intl. ApS     Phone   : +45 70 117 117
Vestergade 18 B           Fax     : +45 70 115 115
1456 Copenhagen K         Email   : mailto:[EMAIL PROTECTED]
Denmark                   Website : http://www.netpointers.com

"Remember that there are no bugs, only undocumented features."
--------------------------------------------------------------

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Sean M. Burke
Sent: 14 March 2002 11:08
To: [EMAIL PROTECTED]
Subject: [Robots] matching and "UserAgent:" in robots.txt

I'm a bit perplexed over whether the current Perl library WWW::RobotRules implements a certain part of the Robots Exclusion Standard correctly. So forgive me if this seems a simple question, but my reading of the Robots Exclusion Standard hasn't really cleared it up in my mind yet.

Basically the current WWW::RobotRules logic is this: As a WWW::RobotRules object is parsing the lines in the robots.txt file, if it sees a line that says "User-Agent: ...foo...", it extracts the foo, and if the name of the current user-agent is a substring of "...foo...", then it considers this line as applying to it.

So if the agent being modeled is called "Banjo", and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says "OK, 'Banjo' is a substring in 'Thing, Woozle, Banjo, Stuff', so this rule is talking to me!"

However, the substring matching currently goes only one way. So if the user-agent object is called "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]" and the robots.txt line being parsed says "User-Agent: Thing, Woozle, Banjo, Stuff", then the library says "'Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]' is NOT a substring of 'Thing, Woozle, Banjo, Stuff', so this rule is NOT talking to me!"
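The one-way matching described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the actual WWW::RobotRules source; the function name `applies_to` is hypothetical, and a plain case-sensitive substring test is an assumption made for the example.

```python
def applies_to(agent_name: str, ua_value: str) -> bool:
    """Return True if agent_name occurs as a substring of the value
    taken from a robots.txt "User-Agent:" line (one-way test only)."""
    return agent_name in ua_value


# The value extracted from: User-Agent: Thing, Woozle, Banjo, Stuff
ua_value = "Thing, Woozle, Banjo, Stuff"

# Short agent name: it is a substring of the line, so the rule applies.
print(applies_to("Banjo", ua_value))  # True

# Full agent ID string, as quoted in the message: it is not a substring
# of the line, so the rule is treated as not applying.
full_id = "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]"
print(applies_to(full_id, ua_value))  # False
```
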
I have the feeling that that's not right -- notably because it means that every robot ID string has to appear in toto on the "User-Agent" robots.txt line, which is clearly a bad thing.

But before I submit a patch, I'm tempted to ask... what /is/ the proper behavior? Maybe shave the current user-agent's name at the first slash or space (getting just "Banjo"), and then see if /that/ is a substring of a given robots.txt "User-Agent:" line?

--
Sean M. Burke [EMAIL PROTECTED] http://www.spinn.net/~sburke/

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".
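The shaving idea floated in the message (cut the agent ID at the first slash or space, then run the substring test) might look roughly like the sketch below. It is in Python rather than Perl, it is not a patch to WWW::RobotRules, and the names `short_name` and `applies_to_shaved` are hypothetical. Treating any whitespace as "space" and rejecting an empty name are assumptions of the sketch, not something the message specifies.

```python
import re


def short_name(agent_id: str) -> str:
    """Cut the agent ID at the first slash or whitespace character."""
    return re.split(r"[/\s]", agent_id, maxsplit=1)[0]


def applies_to_shaved(agent_id: str, ua_value: str) -> bool:
    """Substring test using only the shaved agent name."""
    name = short_name(agent_id)
    # Assumption: an empty name matches nothing. Without this guard the
    # empty string would count as a substring of every line.
    if not name:
        return False
    return name in ua_value


ua_value = "Thing, Woozle, Banjo, Stuff"
full_id = "Banjo/1.1 [http://nowhere.int/banjo.html [EMAIL PROTECTED]]"

print(short_name(full_id))                   # Banjo
print(applies_to_shaved(full_id, ua_value))  # True
```
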