Hi!

> >>> Why not just put "robots=off" in your .wgetrc?
>
> hey hey
> the "robots.txt" didn't just appear in the website; someone's
> put it there and thought about it. what's in there has a good reason.

Weeeell, from my own experience, the #1 reason is that webmasters do not
want web grabbers of any kind to download the site, in order to force
visitors to browse it interactively and thus click advertisement banners.
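For anyone following the thread, the setting under discussion can go either into ~/.wgetrc permanently or onto a single command line via wget's `-e` switch (the URL below is just a placeholder):

```shell
# In ~/.wgetrc -- make wget ignore robots.txt (and nofollow) for all runs:
#   robots = off

# Or per invocation, without touching .wgetrc at all:
wget -e robots=off -r http://example.com/
```

The `-e` option simply executes a .wgetrc-style command before the run, so anything valid in the startup file works there too.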
> The only reason is
> you might be indexing old, doubled or invalid data,

That is cute. Someone who believes that all people on the Internet do
what they do to make life easier for everyone. If you had said "one
reason is", or even "one reason might be", I would not be that cynical.
Sorry.

> or your indexing mech might loop on it, or crash the server. who knows.

I have yet to find a site that forces wget into a "loop", as you put it.
Others on the list can probably estimate the theoretical likelihood of
such events better than I can.

> ask the webmaster or sysadmin before you 'hack' the site.

LOL! Hack! Please provide a serious definition of "to hack" that includes
"automatically downloading pages that could be downloaded with any
interactive web browser". If robots.txt said that no user agent may
access the page, you would be right. But then: how would anyone know of
the existence of that page in the first place?

[rant]
Then again, maybe the page has a high percentage of CGI, JavaScript and
iframes, and thus only allows IE 6.0.123b to access the site. Then wget
could perhaps slow down the server, especially as it is probably a
W-ows box :> But I ask: is that a bad thing? Whuahaha!
[/rant]

Ok, sorry for my sarcasm, but I think you overestimate the benefits of
robots.txt for mankind.

CU
Jens
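P.S. For the record, a robots.txt that really does bar every user agent
from the whole site is the standard wildcard form:

```
User-agent: *
Disallow: /
```

Anything less sweeping only lists specific agents or paths, which is
exactly why a page covered by it can still be public knowledge.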