[issue21469] Hazards in robots.txt parser

Raymond Hettinger Sat, 10 May 2014 09:55:38 -0700

New submission from Raymond Hettinger:

* The can_fetch() method does not check whether read() has been called, so 
it returns false positives (True) until the rules have been loaded.


* When read() is called, it fails to call modified(), so mtime() returns an 
incorrect result (0).  The user has to call modified() manually to update 
the mtime().

>>> from urllib.robotparser import RobotFileParser
>>> rp = RobotFileParser('http://en.wikipedia.org/robots.txt')
>>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
True
>>> rp.read()
>>> rp.can_fetch('UbiCrawler', 'http://en.wikipedia.org/index.html')
False
>>> rp.mtime()
0
>>> rp.modified()
>>> rp.mtime()
1399740268.628497
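
Until the parser is changed, a caller can guard against both hazards.  The 
following is only a sketch of such a workaround: load_rules() and 
can_fetch_checked() are hypothetical helper names, not part of 
urllib.robotparser, and the rules are fed in through parse() so that the 
example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def load_rules(rp, lines):
    """Hypothetical helper: parse rules and stamp the modification time."""
    rp.parse(lines)
    # Call modified() explicitly, since the session above shows that
    # loading rules does not update mtime() on its own.
    rp.modified()


def can_fetch_checked(rp, useragent, url):
    """Hypothetical helper: treat a parser with no rules as 'not allowed'."""
    if rp.mtime() == 0:
        return False
    return rp.can_fetch(useragent, url)


rp = RobotFileParser()
before = can_fetch_checked(rp, 'AnyBot', 'http://example.com/index.html')
load_rules(rp, ['User-agent: *', 'Disallow: /private/'])
after = can_fetch_checked(rp, 'AnyBot', 'http://example.com/index.html')
```

Here `before` is False because no rules have been loaded yet, and `after` is 
True because the loaded rules only disallow /private/.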

Suggested improvements:

1) Trigger internal calls to modified() every time the parser is modified using 
read() or add_entry().  That would ensure that mtime() actually reflects the 
modification time.

2) Raise an exception or return False whenever can_fetch() is called and the 
mtime() is zero (meaning that the parser has not been initialized with any rules).
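
The two suggestions could be prototyped in a subclass along the following 
lines.  This is a rough sketch rather than a proposed patch: 
GuardedRobotFileParser is a hypothetical name, the choice of RuntimeError is 
an assumption, and only read() and parse() are covered here.

```python
from urllib.robotparser import RobotFileParser


class GuardedRobotFileParser(RobotFileParser):
    """Hypothetical subclass illustrating the two suggestions above."""

    def read(self):
        # Suggestion 1: stamp the modification time after fetching rules.
        super().read()
        self.modified()

    def parse(self, lines):
        # Suggestion 1 again, for rules supplied directly as lines.
        super().parse(lines)
        self.modified()

    def can_fetch(self, useragent, url):
        # Suggestion 2: refuse to answer before any rules have been loaded.
        if not self.mtime():
            raise RuntimeError('robots.txt rules have not been loaded')
        return super().can_fetch(useragent, url)
```

With this sketch, calling can_fetch() on a fresh instance raises instead of 
silently returning True, and mtime() is nonzero as soon as parse() or read() 
has run.  Returning False instead of raising would be the other option named 
in suggestion 2.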

----------
components: Library (Lib)
messages: 218226
nosy: rhettinger
priority: normal
severity: normal
status: open
title: Hazards in robots.txt parser
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue21469>
_______________________________________