On 21/06/14 21:31, Darshit Shah wrote:
Hi,
I responded to your original question on Stack Overflow. However, for
completeness and to document the facts, I'll add a response here too.
The answer to your question is: no. Sadly, Wget does NOT check which
User-Agent string it is actually sending when parsing the robots file. It
simply reads the rules for `User-Agent: *` and `User-Agent: wget`, giving
preference to the rules specified for Wget alone.
This also has another major implication: Wget reads and adheres to
robots rules ONLY for * and wget. This means that not only does Wget
ignore the correct robots exclusion rules, it even follows the wrong
set of rules if it is using a different User-Agent and the website
provides a set of rules for Wget.
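To illustrate the matching logic under discussion (Wget's own robots
parsing is in C; this is only a sketch using Python's stdlib
robotparser, with a made-up robots.txt and agent token "mybot"), a
specific `User-agent:` block takes preference over the `*` block for a
crawler whose token matches it, while every other crawler falls back to
the `*` rules:

```python
from urllib import robotparser

# Hypothetical robots.txt: restrictive rules for everyone ("*"),
# but a permissive block specifically for "wget".
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: wget
Disallow:
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler identifying itself as "wget" matches the specific block,
# whose empty Disallow permits everything:
print(rp.can_fetch("wget", "/private/page"))   # True

# Any other token falls back to the rules for "*":
print(rp.can_fetch("mybot", "/private/page"))  # False
```

The complaint in the thread is that Wget always matches as "wget" here,
even when the HTTP User-Agent header it sends has been changed to
something else.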
I'm not convinced this is wrong. You *are* using wget after all.
I don't think you should compare against the User-Agent, as that's
different from the robots.txt identifier. For instance, Bing uses
“bingbot” in robots.txt but a User-Agent of
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
If we want to make it configurable, it should be a new setting
(preferably a wgetrc-only one).
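If such a setting were added, it might look like this in .wgetrc (the
option name is entirely hypothetical; no such option exists in Wget
today):

# Hypothetical wgetrc option -- NOT implemented in Wget.
# Token to match against User-agent lines in robots.txt,
# independent of the HTTP User-Agent header being sent.
robots_agent = mybot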
Best regards