Andre Burgaud <andre.burg...@gmail.com> added the comment:

Hi Matele,

Thanks for looking into this issue.

I have indeed seen some implementations that were based on the Python 
implementation and had the same problems, the Crystal implementation in 
particular (as far as I remember; it was a while ago). As a reference, I 
used the Google implementation https://github.com/google/robotstxt, which 
follows the internet draft 
https://datatracker.ietf.org/doc/html/draft-koster-rep-00.

The 2 main points are described in section 
https://datatracker.ietf.org/doc/html/draft-koster-rep-00#section-2.2.2, 
especially in the following paragraph:

   "To evaluate if access to a URI is allowed, a robot MUST match the
   paths in allow and disallow rules against the URI.  The matching
   SHOULD be case sensitive.  The most specific match found MUST be
   used.  The most specific match is the match that has the most octets.
   If an allow and disallow rule is equivalent, the allow SHOULD be
   used."

1) The most specific match found MUST be used.  The most specific match is the 
match that has the most octets.
2) If an allow and disallow rule is equivalent, the allow SHOULD be used.
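
As a rough illustration of these two points, here is a minimal sketch in 
Python. It is not the urllib.robotparser code or the Google implementation: 
the function name is_allowed, the (is_allow, path) tuple layout, and the 
example paths are assumptions made for this example, and it only does plain 
prefix matching (no wildcards, no percent-decoding).

```python
from typing import Iterable, Tuple

# Hypothetical rule representation: (is_allow, path_prefix)
Rule = Tuple[bool, str]


def is_allowed(rules: Iterable[Rule], path: str) -> bool:
    """Return True if `path` may be fetched under the longest-match rule."""
    best_len = -1
    best_allow = True  # no matching rule at all: access is allowed
    for allow, prefix in rules:
        # Skip empty paths and rules that do not match the URI path.
        if not prefix or not path.startswith(prefix):
            continue
        n = len(prefix.encode("utf-8"))  # specificity is counted in octets
        # Point 1: the match with the most octets wins.
        # Point 2: on a tie, the allow rule is preferred.
        if n > best_len or (n == best_len and allow):
            best_len, best_allow = n, allow
    return best_allow


# Hypothetical rules, in file order (Disallow listed first).
rules = [(False, "/private/"), (True, "/private/public.html")]
print(is_allowed(rules, "/private/public.html"))
print(is_allowed(rules, "/private/secret.html"))
```

With this approach the order of the rules in the file does not matter, 
because every matching rule is compared before a decision is made.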

In the robots.txt example you provided, the longest matching rule is Allow: 
/wp-admin/admin-ajax.php. It should therefore take precedence over the 
shorter Disallow rule for the directory, so that admin-ajax.php is allowed. 
To achieve that, the rules should be sorted so that the Allow rule is listed 
first.
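
The sorting idea can be sketched as follows. This is only an illustration 
under assumptions: the helper names sort_rules and first_match are 
hypothetical, the Disallow path /wp-admin/ is assumed because the original 
robots.txt is not quoted in this message, and matching is a plain prefix 
comparison.

```python
from typing import List, Tuple

# Hypothetical rule representation: (is_allow, path_prefix)
Rule = Tuple[bool, str]


def sort_rules(rules: List[Rule]) -> List[Rule]:
    """Order rules so that a first-match scan yields the most specific rule."""
    # Longest path (in octets) first; for equal lengths, Allow before Disallow
    # (not True == False sorts ahead of not False == True).
    return sorted(rules, key=lambda r: (-len(r[1].encode("utf-8")), not r[0]))


def first_match(sorted_rules: List[Rule], path: str) -> bool:
    """Return the verdict of the first rule whose path is a prefix of `path`."""
    for allow, prefix in sorted_rules:
        if path.startswith(prefix):
            return allow
    return True  # no rule matched: access is allowed


# Assumed file order: the shorter Disallow rule comes before the Allow rule.
rules = [(False, "/wp-admin/"), (True, "/wp-admin/admin-ajax.php")]
ordered = sort_rules(rules)
print(ordered[0])
print(first_match(ordered, "/wp-admin/admin-ajax.php"))
print(first_match(ordered, "/wp-admin/options.php"))
```

Sorting once when the file is parsed keeps a simple first-match loop at 
lookup time while still honoring the longest-match requirement.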

I'm currently traveling. I'm sorry if my explanations sound a bit limited. If 
it helps, I can pick up this discussion when I'm back home, after mid-April. 
In particular, I can run new tests with Python 3.10, since I raised this 
potential problem a bit more than two years ago and may need to refresh my 
memory :-) 

In the meantime, let me know if there is anything that I could provide to give 
a clearer background. For example, are you referring to the 2 issues I 
highlighted above, or is it something else that you are thinking about? Also, 
could you point me to the other robots checkers that you looked at?

Thanks!

Andre

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39187>
_______________________________________