New submission from Raymond Hettinger:

* The can_fetch() method does not check whether read() has been called, so it returns false positives when it is called before read().
* When read() is called, it fails to call modified(), so mtime() returns an incorrect result. The user has to call modified() manually to update the mtime().

>>> from urllib.robotparser import RobotFileParser
>>> rp = RobotFileParser('http://en.wikipedia.org/robots.txt')
>>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
True
>>> rp.read()
>>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
False
>>> rp.mtime()
0
>>> rp.modified()
>>> rp.mtime()
1399740268.628497

Suggested improvements:

1) Trigger internal calls to modified() every time the parser is modified using read() or add_entry(). That would ensure that mtime() actually reflects the modification time.

2) Raise an exception or return False whenever can_fetch() is called and the mtime() is zero (meaning that the parser has not been initialized with any rules).

----------
components: Library (Lib)
messages: 218226
nosy: rhettinger
priority: normal
severity: normal
status: open
title: Hazards in robots.txt parser
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue21469>
_______________________________________
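
For illustration, the two suggested improvements could be sketched as a subclass. This is only a rough sketch under stated assumptions, not a proposed patch: the class name StrictRobotFileParser is hypothetical, the hook is placed in parse() (which read() calls after a successful fetch) rather than in add_entry(), and the choice of RuntimeError over returning False is arbitrary. The rules are fed in through parse() so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


class StrictRobotFileParser(RobotFileParser):
    """Hypothetical subclass sketching the two suggestions; not the stdlib API."""

    def parse(self, lines):
        # Suggestion 1: record the modification time whenever rules are loaded.
        # read() hands the fetched lines to parse(), so a successful read()
        # is covered as well.
        super().parse(lines)
        self.modified()

    def can_fetch(self, useragent, url):
        # Suggestion 2: refuse to answer while mtime() is still zero,
        # i.e. before any rules have been loaded.
        if self.mtime() == 0:
            raise RuntimeError("can_fetch() called before any rules were loaded")
        return super().can_fetch(useragent, url)


rp = StrictRobotFileParser()
try:
    rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
except RuntimeError as exc:
    print('refused:', exc)

# Assumed rules for the demo: disallow everything for UbiCrawler.
rp.parse(['User-agent: UbiCrawler', 'Disallow: /'])
print(rp.mtime() > 0)
print(rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html'))
```

One limitation of this sketch: when read() hits an HTTP error it sets its allow-all or disallow-all flag without calling parse(), so mtime() stays at zero and can_fetch() would still raise in that case.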