Hi!

> >>> Why not just put "robots=off" in your .wgetrc?
>
> hey hey
> the "robots.txt" didn't just appear in the website; someone's
> put it there and thought about it. what's in there has a good reason.

Weeeell, from my own experience, the #1 reason is that webmasters do not
want web grabbers of any kind to download the site, in order to force
visitors to browse it interactively and thus click advertisement banners.
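For anyone following the thread, the setting under discussion can go either into ~/.wgetrc permanently or onto a single command line via wget's `-e` switch (the URL below is just a placeholder):

```shell
# In ~/.wgetrc -- make wget ignore robots.txt (and nofollow) for all runs:
#   robots = off

# Or per invocation, without touching .wgetrc at all:
wget -e robots=off -r http://example.com/
```

The `-e` option simply executes a .wgetrc-style command before the run, so anything valid in the startup file works there too.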
> The only reason is
> you might be indexing old, doubled or invalid data,

That is cute. Someone who believes that all people on the Internet do
what they do to make life easier for everyone. If you had said "one
reason is", or even "one reason might be", I would not be that cynical.
Sorry.

> or your indexing mech might loop on it, or crash the server. who knows.

I have yet to find a site that forces wget into a "loop", as you put it.
Others on the list can probably estimate the theoretical likelihood of
such events better than I can.

> ask the webmaster or sysadmin before you 'hack' the site.

LOL! Hack! Please provide a serious definition of "to hack" that includes
"automatically downloading pages that could be downloaded with any
interactive web browser". If robots.txt said that no user agent may
access the page, you would be right. But then: how would anyone know of
the existence of that page in the first place?

[rant]
Then again, maybe the page has a high percentage of CGI, JavaScript and
iframes, and thus only allows IE 6.0.123b to access the site. Then wget
could perhaps slow down the server, especially as it is probably a
W-ows box :> But I ask: is that a bad thing? Whuahaha!
[/rant]

Ok, sorry for my sarcasm, but I think you overestimate the benefits of
robots.txt for mankind.

CU
Jens
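P.S. For the record, a robots.txt that really does bar every user agent
from the whole site is the standard wildcard form:

```
User-agent: *
Disallow: /
```

Anything less sweeping only lists specific agents or paths, which is
exactly why a page covered by it can still be public knowledge.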