On 21/06/14 21:31, Darshit Shah wrote:
Hi,
I responded to your original question on Stack Overflow. However, for
completeness and to document the facts, I'll add a response here too.
The answer to your question is: no. Sadly, Wget does NOT check which
User-Agent string it is actually sending when parsing the robots file. It
simply reads the rules for `User-Agent: *` and `User-Agent: wget`, giving
preference to the rules specified for Wget alone.
This also has another major implication: Wget reads and adheres to
robots rules ONLY for * and wget. This means that not only does Wget
ignore the correct robots exclusion rules, it even follows the wrong
set of rules if it is using a different User-Agent and the website
provides a set of rules for Wget.
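To illustrate the matching logic under discussion (Wget's own robots
parsing is in C; this is only a sketch using Python's stdlib
robotparser, with a made-up robots.txt and agent token "mybot"), a
specific `User-agent:` block takes preference over the `*` block for a
crawler whose token matches it, while every other crawler falls back to
the `*` rules:

```python
from urllib import robotparser

# Hypothetical robots.txt: restrictive rules for everyone ("*"),
# but a permissive block specifically for "wget".
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: wget
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler identifying itself as "wget" matches the specific block,
# whose empty Disallow permits everything:
print(rp.can_fetch("wget", "/private/page"))   # True

# Any other token falls back to the rules for "*":
print(rp.can_fetch("mybot", "/private/page"))  # False
```

The complaint in the thread is that Wget always matches as "wget" here,
even when the HTTP User-Agent header it sends has been changed to
something else.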
I'm not convinced this is wrong. You *are* using wget after all.
I don't think you should compare against the User-Agent, as that's
different from the robots.txt identifier. For instance, Bing uses
“bingbot” in robots.txt but a User-Agent of
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
If we want to make it configurable, it should be a new setting
(preferably a wgetrc-only one).
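If such a setting were added, it might look like this in .wgetrc (the
option name is entirely hypothetical; no such option exists in Wget
today):

# Hypothetical wgetrc option -- NOT implemented in Wget.
# Token to match against User-agent lines in robots.txt,
# independent of the HTTP User-Agent header being sent.
robots_agent = mybot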
Best regards