Several defects in robots exclusion protocol (robots.txt) implementation
------------------------------------------------------------------------
Key: DROIDS-109
URL: https://issues.apache.org/jira/browse/DROIDS-109
Project: Droids
Issue Type: Bug
Components: core, norobots
Reporter: Fuad Efendi
The spec everyone references (even Google!), http://www.robotstxt.org/wc/norobots-rfc.html, is
at least 12 years out of date.
1. Googlebot and many other crawlers apply rules to the query part of the URL; Droids
currently matches only URI.getPath(), so the query part is ignored (see the first sketch below)
2. %2F represents an encoded "/" (slash) character inside a path; it must not be decoded
before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes the path twice: baseURI.getPath() already
returns a decoded string, and then URLDecoder.decode(path, US_ASCII) is called on it again
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset: UTF-8 must be used (see the decoding sketch below)
5. The longest matching directive path (not including wildcard expansion) should be the one
applied to any page URL (see the longest-match sketch below)
6. Wildcard characters ("*" and the end-of-URL anchor "$") should be recognized (see the wildcard sketch below)
7. Sitemap directives are not supported
8. Crawl rate (the Crawl-delay directive) is not supported
9. The BOM sequence is not removed before processing robots.txt
(http://unicode.org/faq/utf_bom.html; UTF-8 BOM bytes 0xEF 0xBB 0xBF); see the BOM sketch below
and most probably many more defects (even Nutch & BIXO haven't implemented it in full yet).
I am working on it right now...
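Rough sketches of possible fixes follow; all class and method names below are hypothetical illustrations, not the existing Droids API.

For item 1, a minimal sketch of building the match target from the path plus the query part, so that rules such as "Disallow: /*?sessionid=" have something to match against:

{code:java}
import java.net.URI;

// Sketch only: the string that robots.txt rules are matched against should
// include the query part, not just URI.getPath().
public final class MatchTarget {

    public static String of(URI uri) {
        // getRawPath()/getRawQuery() keep percent-escapes intact (see item 2)
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        return (query == null) ? path : path + "?" + query;
    }
}
{code}

For example, MatchTarget.of(URI.create("http://example.com/search?sessionid=42")) yields "/search?sessionid=42".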
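For items 2-4, a sketch of decoding the raw path exactly once, in UTF-8, while leaving %2F encoded. The input is assumed to be the still-encoded path (URI.getRawPath()), so decoding happens only once; note that URLDecoder also turns "+" into a space, which may not be wanted for paths, so this only illustrates the charset and %2F points:

{code:java}
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Sketch only: decode percent-escapes once (on the raw path), in UTF-8,
// and keep %2F intact so an encoded slash is not confused with a separator.
public final class RobotsPathCodec {

    public static String normalize(String rawPath) {
        StringBuilder out = new StringBuilder(rawPath.length());
        int i = 0;
        while (i < rawPath.length()) {
            int j = indexOfEncodedSlash(rawPath, i);
            if (j < 0) {
                out.append(decodeUtf8(rawPath.substring(i)));
                break;
            }
            out.append(decodeUtf8(rawPath.substring(i, j)));
            out.append("%2F");                       // item 2: keep the encoded slash
            i = j + 3;
        }
        return out.toString();
    }

    private static int indexOfEncodedSlash(String s, int from) {
        int upper = s.indexOf("%2F", from);
        int lower = s.indexOf("%2f", from);
        if (upper < 0) return lower;
        if (lower < 0) return upper;
        return Math.min(upper, lower);
    }

    private static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");    // item 4: UTF-8, not US-ASCII
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);      // UTF-8 is always available
        }
    }
}
{code}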
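For item 5, a sketch of longest-match rule selection (Rule is a hypothetical holder for one Allow/Disallow line; ties and wildcard expansion are ignored here):

{code:java}
import java.util.List;

// Sketch only: apply the rule whose path is the longest prefix of the target,
// instead of the first rule that happens to match.
public final class LongestMatch {

    public static final class Rule {
        final String path;    // e.g. "/private/"
        final boolean allow;  // true for Allow:, false for Disallow:
        public Rule(String path, boolean allow) {
            this.path = path;
            this.allow = allow;
        }
    }

    public static boolean isAllowed(List<Rule> rules, String target) {
        Rule best = null;
        for (Rule r : rules) {
            if (target.startsWith(r.path)
                    && (best == null || r.path.length() > best.path.length())) {
                best = r;
            }
        }
        return best == null || best.allow;   // no matching rule: allowed
    }
}
{code}

With "Disallow: /private/" and "Allow: /private/public/", the URL /private/public/page.html is allowed because the Allow rule is the longer match.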
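For item 6, a sketch of translating the de-facto wildcard syntax ("*" matches any run of characters, a trailing "$" anchors the end of the URL) into a regular expression:

{code:java}
import java.util.regex.Pattern;

// Sketch only: compile a robots.txt rule path containing "*" and "$" into a
// java.util.regex.Pattern that can be tested against the match target.
public final class RobotsWildcards {

    public static Pattern toPattern(String rulePath) {
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < rulePath.length(); i++) {
            char c = rulePath.charAt(i);
            if (c == '*') {
                regex.append(".*");
            } else if (c == '$' && i == rulePath.length() - 1) {
                regex.append('$');                   // end-of-URL anchor
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }
}
{code}

A target is then tested with, e.g., toPattern("/*.php$").matcher("/index.php").lookingAt(), which returns true, while "/index.php5" does not match.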
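For item 9, a sketch of stripping a UTF-8 byte order mark from the fetched robots.txt content before it reaches the parser:

{code:java}
// Sketch only: drop a leading UTF-8 BOM (0xEF 0xBB 0xBF) from the fetched
// robots.txt bytes before parsing.
public final class BomStripper {

    public static byte[] stripUtf8Bom(byte[] content) {
        if (content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF) {
            byte[] stripped = new byte[content.length - 3];
            System.arraycopy(content, 3, stripped, 0, stripped.length);
            return stripped;
        }
        return content;
    }
}
{code}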
Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html