New submission from taskinoor hasan sajid <[EMAIL PROTECTED]>: Check the robots.txt file from mathworld.
--> http://mathworld.wolfram.com/robots.txt

It contains two "User-Agent: *" lines.

From http://www.robotstxt.org/norobots-rfc.txt:

"These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited."

But it seems that our robotparser is obeying the second one. The problem occurs because robotparser assumes that no robots.txt will contain two "*" user-agent records. A file should not have two such lines, but in reality many sites do. So I have changed robotparser.py as follows:

    def _add_entry(self, entry):
        if "*" in entry.useragents:
            # the default entry is considered last
            if self.default_entry is None:  # this check is added
                self.default_entry = entry
        else:
            self.entries.append(entry)

And at the end of the parse(self, lines) method:

    if state == 2:
        # self.entries.append(entry)
        # necessary if there is no newline at the end
        # and the last User-Agent is *
        self._add_entry(entry)

----------
components: Library (Lib)
messages: 74665
nosy: thsajid
severity: normal
status: open
title: robotparser.py fail when more than one User-Agent: * is present
versions: Python 2.5

_______________________________________
Python tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue4108>
_______________________________________
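
The first-record rule quoted in the report can be checked with a short script. This is only a sketch: it assumes Python 3's urllib.robotparser rather than the Python 2.5 robotparser module the report is filed against, and the robots.txt content, the bot name "examplebot", and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two "User-agent: *" records.
# Under the quoted rule, only the first record should apply.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: *
Disallow: /
"""


def check(path, agent="examplebot"):
    """Return whether `agent` may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    # Allowed only if the first "*" record is the one obeyed.
    print(check("/public/page.html"))
    # Refused by either record.
    print(check("/private/page.html"))
```

If the first record is honored, the public path is allowed and the private one is refused. A parser that lets the second record win refuses both, which is the behavior described in the report.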