[issue39187] urllib.robotparser does not respect the longest match for the rule

2022-04-08 Thread matele secretaire
matele secretaire added the comment: Thank you

[issue39187] urllib.robotparser does not respect the longest match for the rule

2022-04-07 Thread Andre Burgaud
Andre Burgaud added the comment: Hi Matele, Thanks for looking into this issue. I have indeed seen some implementations that were based on the Python implementation and had the same problems, the Crystal implementation in particular (as far as I remember; it was a while ago). As a

[issue39187] urllib.robotparser does not respect the longest match for the rule

2022-04-06 Thread matele secretaire
matele secretaire added the comment: I can't find any documentation about it, but all of the robots.txt checkers I have found behave like this. You can test with this site: https://www.st-info.fr/robots.txt. I believe this is how it is now implemented in most parsers?

[issue39187] urllib.robotparser does not respect the longest match for the rule

2020-01-01 Thread Andre Burgaud
Andre Burgaud added the comment: During testing, I identified a related issue that is fixed by the same sort function implemented to address the longest-match rule. This related problem, also addressed by this change, concerns the situation when 2 equivalent rules (same path for allow
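The comment above is cut off, so the exact tie case is not visible here. As a rough sketch only, assuming it refers to an Allow and a Disallow rule with the same path (which the Robots Exclusion Protocol draft resolves in favor of allow), a sort key along these lines would handle both the longest match and the tie. The names `sort_rules` and `first_match` and the `(path, allow)` tuple layout are hypothetical and are not taken from the actual patch:

```python
def sort_rules(rules):
    """Order rules so that a simple first-match scan gives the desired result.

    rules is a list of (path, allow) tuples (hypothetical layout).
    Longer paths sort first; for equal paths, allow sorts before disallow.
    """
    return sorted(rules, key=lambda rule: (-len(rule[0]), not rule[1]))


def first_match(rules, path):
    """Return True if path may be fetched, using plain prefix matching."""
    for prefix, allow in sort_rules(rules):
        if path.startswith(prefix):
            return allow
    return True  # no rule matched, so nothing forbids the fetch


# Two equivalent rules (same path) plus a catch-all disallow.
rules = [("/page", False), ("/page", True), ("/", False)]
print(first_match(rules, "/page"))
print(first_match(rules, "/other"))
```

Sorting once and then scanning keeps the matching loop identical to an order-based parser; only the rule order changes.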

[issue39187] urllib.robotparser does not respect the longest match for the rule

2020-01-01 Thread Andre Burgaud
Change by Andre Burgaud: keywords: +patch; pull_requests: +17227; stage: -> patch review; pull_request: https://github.com/python/cpython/pull/17794

[issue39187] urllib.robotparser does not respect the longest match for the rule

2020-01-01 Thread Andre Burgaud
New submission from Andre Burgaud: As per the current Robots Exclusion Protocol internet draft (https://tools.ietf.org/html/draft-koster-rep-00#section-3.2), a robot should apply the rules respecting the longest match. urllib.robotparser relies on the order of the rules in the robots.txt
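To illustrate the difference described in the submission, here is a minimal sketch that feeds a two-rule robots.txt to urllib.robotparser and compares it with a hand-written longest-match check. The sample rules and the helper `allowed_by_longest_match` are assumptions made for illustration; the helper does plain prefix matching only and is not the code from the pull request:

```python
from urllib.robotparser import RobotFileParser

# Sample file (illustrative): the broader Disallow comes first,
# the more specific Allow second.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder
Allow: /folder/page
"""

# The same rules as (allow, prefix) tuples for the hand-written check.
RULES = [(False, "/folder"), (True, "/folder/page")]


def allowed_by_longest_match(rules, path):
    """Hypothetical helper: the matching rule with the longest prefix wins."""
    matches = [(allow, prefix) for allow, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # no rule applies to this path
    allow, _prefix = max(matches, key=lambda match: len(match[1]))
    return allow


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The stdlib result depends on how the installed version orders rules.
print("urllib.robotparser:", parser.can_fetch("*", "/folder/page"))
print("longest match:     ", allowed_by_longest_match(RULES, "/folder/page"))
```

If the parser applies rules in file order, as the report says, the first line follows the Disallow rule, while the longest-match check follows the more specific Allow rule.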