Re: [Robots] Yahoo evolving robots.txt, finally
Walter Underwood wrote:
> Nah, they would have e-mailed me directly by now. I used to work with
> them at Inktomi.

How about dropping them an e-mail to invite them here?

Yahoo limits crawler access to its own site. I haven't tried in the last 9 or 10 months, but the way it was back then, if you crawled the message boards, the crawler's IP address would be blocked for increasingly long time periods -- a day, two days, etc. I tried slowing down our gathering, but couldn't find a speed at which they wouldn't eventually block it. And of course they never responded to any questions about what they'd consider acceptable.

And yet, their own servers don't seem to have a robots.txt that defines any limitations. Sure would be nice if *they* would tell *us* what's acceptable when crawling Yahoo!

Nick
--
Nick Arnett
Director, Business Intelligence Services
LiveWorld Inc.
Phone/fax: (408) 551-0427
[EMAIL PROTECTED]

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
RE: [Robots] Yahoo evolving robots.txt, finally
I'm standing firm on my suggestions. Adding a delay for crawlers is a good idea in concept, and allowing fractional seconds is a way for webmasters to request reasonable constraints. Is it such a stretch to allow a robot that you use to promote your business unmitigated access to your site, but to require other robots to throttle down to a few pages per second?

As for preferred scanning windows, many organizations see a huge surge of traffic from customers during their normal operating hours, but are relatively calm otherwise. Requesting that robots scan only outside of peak hours is a nice compromise between keeping them out entirely and keeping them out when you're too busy serving pages to human readers.

I just read Walter's response to this thread, and he mentions bytes-per-day and pages-per-day limits. Those are fine in the abstract and may be helpful. But if a robot is limited to 100 MB a day and it decides to take it all in one burst during your peak traffic hours, then volume limits alone are not sufficient.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Saturday, March 13, 2004 4:31 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: RE: [Robots] Yahoo evolving robots.txt, finally

--- Matthew Meadows <[EMAIL PROTECTED]> wrote:
> I agree with Walter.

So do I, partially. :)

> There's a lot of variables that should have been considered for this
> new value. If nothing else the specification should have called for
> the time in milliseconds, or otherwise allow for fractional seconds.

I disagree that level of granularity is needed. See my earlier email.

> In addition, it seems a bit presumptuous for Yahoo to think that they
> can force a de facto standard just by implementing it first.

That's how things work in real life. Think web browsers 10 years ago and various Netscape, then IE extensions. Now lots of them are considered standard.

> With this line of thinking webmasters would eventually be required to
> update their robots.txt file for dozens of individual bots.

In theory, yes. In reality, I agree with Walter, this extension will prove to be as useless as "", and will therefore not be supported by any big crawlers.

> It's hard enough to get them to do it now for the general case, this
> additional fragmentation is not going to make anybody's job easier. Is
> Google going to implement their own extensions, then MSN, AltaVista,
> and AllTheWeb?

Not likely. In order for them to remain competitive, they have to keep fetching web pages at high rates. robots.txt only limits them. I can't think of an extension to robots.txt that would let them do a better job. Actually, I can. :)

> Finally, if we're going to start specifying the criteria for
> scheduling, let's consider some other alternatives, like preferred
> scanning windows.

Same as crawl-delay -- everyone would want crawlers to visit their sites at night, which would saturate crawlers' networks, so search engines won't push that extension. (Actually, big crawlers run from multiple points around the planet, so maybe my statement is flawed.)

Otis

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> On Behalf Of Walter Underwood
> Sent: Friday, March 12, 2004 3:37 PM
> To: Internet robots, spiders, web-walkers, etc.
> Subject: Re: [Robots] Yahoo evolving robots.txt, finally
>
> --On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:
> >
> > I am surprised that after all that talk about adding new semantic
> > elements to robots.txt several years ago, nobody commented that the
> > new Yahoo crawler (former Inktomi crawler) took a brave step in that
> > direction by adding "Crawl-delay:" syntax.
> >
> > http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> >
> > Time to update your robots.txt parsers!
>
> No, time to tell Yahoo to go back and do a better job.
>
> Does crawl-delay allow decimals? Negative numbers? Could this spec be
> a bit better quality? The words "positive integer" would improve
> things a lot.
>
> Sigh. It would have been nice if they'd discussed this on the list
> first. "crawl-delay" is a pretty dumb idea. Any value over one second
> means it takes forever to index a site. Ultraseek has had a "spider
> throttle" option to add this sort of delay, but it is almost never
> used, because Ultraseek reads 25 pages from one site, then moves to
> another. There are many kinds of rate control.
>
> wunder
> --
> Walter Underwood
> Principal Architect
> Verity Ultraseek
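[For concreteness, here is what a robots.txt combining Yahoo's documented Crawl-delay with a scanning-window directive of the kind Matthew proposes might look like. Crawl-delay is the syntax Yahoo's help page describes; Visit-time is a hypothetical directive shown only to illustrate the "preferred scanning windows" idea, not something any crawler in this thread is known to support:]

```
# Let our own promotional robot in unthrottled
User-agent: OurPromoBot
Crawl-delay: 0

# Ask Yahoo's crawler to slow down, with fractional seconds
User-agent: Slurp
Crawl-delay: 0.5

# Everyone else: 5 seconds between fetches, and (hypothetically)
# only crawl between 01:00 and 05:00 server time
User-agent: *
Crawl-delay: 5
Visit-time: 0100-0500
```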
Re: [Robots] Yahoo evolving robots.txt, finally
--On Saturday, March 13, 2004 2:22 AM -0800 [EMAIL PROTECTED] wrote:
>
>> Does crawl-delay allow decimals?
>
> You think people really want to be able to tell a crawler to fetch a
> page at most every 5.6 seconds, and not 5?

0.5s would be useful. Ultraseek has used a float for the delay for the past six years.

>> Could this spec be a bit better quality?
>
> It's not a spec, it's an implementation, ...
>
>> The words "positive integer" would improve things a lot.
>
> That's just common sense to me. :)

Well, different people's common sense leads to incompatible implementations. Which is why these things should be specified. I think negative delays would be goofy, too, but we all know that someone will try it.

> I am sure their people are on the list, they are just being quiet, and
> will probably remain silent now that their idea has been called dumb.

Nah, they would have e-mailed me directly by now. I used to work with them at Inktomi.

I called it a dumb idea because it has obvious problems. These could have been solved by trying to learn from the rest of the robot community. Crawl-delay isn't useful in our crawler, and there have been better rate-limit approaches proposed as far back as 1996. Most sites have a pages/day or bytes/day limit, not instantaneous rate limits, so crawl-delay is controlling the wrong thing.

Note that Google has implemented Allow lines with a limited wildcard syntax, so Yahoo isn't alone in being incompatible.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
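[The pages/day budget Walter describes is straightforward to enforce on the crawler side. A minimal sketch, with a made-up class name and assuming a per-site budget; note Matthew's caveat elsewhere in the thread that a daily budget alone still lets a crawler spend it all in one burst:]

```python
import time

class DailyBudget:
    """Track a per-site pages-per-day limit instead of an
    instantaneous rate limit (a sketch, not any real crawler's code)."""

    def __init__(self, pages_per_day):
        self.pages_per_day = pages_per_day
        self.window_start = time.time()
        self.fetched = 0

    def allow_fetch(self):
        now = time.time()
        # Roll over to a fresh budget once 24 hours have elapsed
        if now - self.window_start >= 86400:
            self.window_start = now
            self.fetched = 0
        if self.fetched < self.pages_per_day:
            self.fetched += 1
            return True
        return False
```

[A crawler would check `allow_fetch()` before each request to a site; combining it with a per-request delay would address the burst problem.]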
RE: [Robots] Yahoo evolving robots.txt, finally
--- Matthew Meadows <[EMAIL PROTECTED]> wrote:
> I agree with Walter.

So do I, partially. :)

> There's a lot of variables that should have been considered for this
> new value. If nothing else the specification should have called for
> the time in milliseconds, or otherwise allow for fractional seconds.

I disagree that level of granularity is needed. See my earlier email.

> In addition, it seems a bit presumptuous for Yahoo to think that they
> can force a de facto standard just by implementing it first.

That's how things work in real life. Think web browsers 10 years ago and various Netscape, then IE extensions. Now lots of them are considered standard.

> With this line of thinking webmasters would eventually be required to
> update their robots.txt file for dozens of individual bots.

In theory, yes. In reality, I agree with Walter, this extension will prove to be as useless as "", and will therefore not be supported by any big crawlers.

> It's hard enough to get them to do it now for the general case, this
> additional fragmentation is not going to make anybody's job easier. Is
> Google going to implement their own extensions, then MSN, AltaVista,
> and AllTheWeb?

Not likely. In order for them to remain competitive, they have to keep fetching web pages at high rates. robots.txt only limits them. I can't think of an extension to robots.txt that would let them do a better job. Actually, I can. :)

> Finally, if we're going to start specifying the criteria for
> scheduling, let's consider some other alternatives, like preferred
> scanning windows.

Same as crawl-delay -- everyone would want crawlers to visit their sites at night, which would saturate crawlers' networks, so search engines won't push that extension. (Actually, big crawlers run from multiple points around the planet, so maybe my statement is flawed.)

Otis

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> On Behalf Of Walter Underwood
> Sent: Friday, March 12, 2004 3:37 PM
> To: Internet robots, spiders, web-walkers, etc.
> Subject: Re: [Robots] Yahoo evolving robots.txt, finally
>
> --On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:
> >
> > I am surprised that after all that talk about adding new semantic
> > elements to robots.txt several years ago, nobody commented that the
> > new Yahoo crawler (former Inktomi crawler) took a brave step in that
> > direction by adding "Crawl-delay:" syntax.
> >
> > http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> >
> > Time to update your robots.txt parsers!
>
> No, time to tell Yahoo to go back and do a better job.
>
> Does crawl-delay allow decimals? Negative numbers? Could this spec be
> a bit better quality? The words "positive integer" would improve
> things a lot.
>
> Sigh. It would have been nice if they'd discussed this on the list
> first. "crawl-delay" is a pretty dumb idea. Any value over one second
> means it takes forever to index a site. Ultraseek has had a "spider
> throttle" option to add this sort of delay, but it is almost never
> used, because Ultraseek reads 25 pages from one site, then moves to
> another. There are many kinds of rate control.
>
> wunder
> --
> Walter Underwood
> Principal Architect
> Verity Ultraseek
Re: [Robots] Yahoo evolving robots.txt, finally
> > http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> >
> > Time to update your robots.txt parsers!
>
> No, time to tell Yahoo to go back and do a better job.

They made the first step. Think web browsers 10 years ago, standards, and non-standard extensions.

> Does crawl-delay allow decimals?

You think people really want to be able to tell a crawler to fetch a page at most every 5.6 seconds, and not 5?

> Negative numbers?

What would that do? Prevent crawling? Disallow?

> Could this spec be a bit better quality?

It's not a spec, it's an implementation, and they exposed it to the masses first, even if other web crawlers had the ability to do this all along.

> The words "positive integer" would improve things a lot.

That's just common sense to me. :)

> Sigh. It would have been nice if they'd discussed this on the
> list first. "crawl-delay" is a pretty dumb idea. Any value over
> one second means it takes forever to index a site.

I am sure their people are on the list, they are just being quiet, and will probably remain silent now that their idea has been called dumb. You have a good point with the second sentence.

> Ultraseek has had a "spider throttle" option to add this sort of
> delay, but it is almost never used, because Ultraseek reads 25 pages
> from one site, then moves to another. There are many kinds of rate
> control.

I believe the same will happen to Yahoo's crawler and their extension. Webmasters will see it and add it to their robots.txt with some unacceptable values. Yahoo will have to override the specified values if they want to compete with others. The syntax will stick in robots.txt, but will be useless, just as you describe in Ultraseek's case.

Otis
RE: [Robots] Yahoo evolving robots.txt, finally
I agree with Walter. There are a lot of variables that should have been considered for this new value. If nothing else, the specification should have called for the time in milliseconds, or otherwise allowed for fractional seconds.

In addition, it seems a bit presumptuous for Yahoo to think that they can force a de facto standard just by implementing it first. With this line of thinking, webmasters would eventually be required to update their robots.txt file for dozens of individual bots. It's hard enough to get them to do it now for the general case; this additional fragmentation is not going to make anybody's job easier. Is Google going to implement their own extensions, then MSN, AltaVista, and AllTheWeb?

Finally, if we're going to start specifying the criteria for scheduling, let's consider some other alternatives, like preferred scanning windows.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Walter Underwood
Sent: Friday, March 12, 2004 3:37 PM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Yahoo evolving robots.txt, finally

--On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:
>
> I am surprised that after all that talk about adding new semantic
> elements to robots.txt several years ago, nobody commented that the
> new Yahoo crawler (former Inktomi crawler) took a brave step in that
> direction by adding "Crawl-delay:" syntax.
>
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
>
> Time to update your robots.txt parsers!

No, time to tell Yahoo to go back and do a better job.

Does crawl-delay allow decimals? Negative numbers? Could this spec be a bit better quality? The words "positive integer" would improve things a lot.

Sigh. It would have been nice if they'd discussed this on the list first. "crawl-delay" is a pretty dumb idea. Any value over one second means it takes forever to index a site. Ultraseek has had a "spider throttle" option to add this sort of delay, but it is almost never used, because Ultraseek reads 25 pages from one site, then moves to another. There are many kinds of rate control.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
Re: [Robots] Yahoo evolving robots.txt, finally
--On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED] wrote:
>
> I am surprised that after all that talk about adding new semantic
> elements to robots.txt several years ago, nobody commented that the new
> Yahoo crawler (former Inktomi crawler) took a brave step in that
> direction by adding "Crawl-delay:" syntax.
>
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
>
> Time to update your robots.txt parsers!

No, time to tell Yahoo to go back and do a better job.

Does crawl-delay allow decimals? Negative numbers? Could this spec be a bit better quality? The words "positive integer" would improve things a lot.

Sigh. It would have been nice if they'd discussed this on the list first. "crawl-delay" is a pretty dumb idea. Any value over one second means it takes forever to index a site. Ultraseek has had a "spider throttle" option to add this sort of delay, but it is almost never used, because Ultraseek reads 25 pages from one site, then moves to another. There are many kinds of rate control.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
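[Walter's point about underspecified values is easy to illustrate: with no spec, every parser has to decide for itself what to do with decimals, negatives, and junk. A minimal defensive sketch (a hypothetical helper, not any real crawler's code) that treats the value as a non-negative float:]

```python
def parse_crawl_delay(value):
    """Parse a Crawl-delay value defensively.

    The Yahoo help page doesn't say whether decimals or negative
    numbers are allowed, so this sketch accepts any non-negative
    float and returns None for everything else.
    """
    try:
        delay = float(value.strip())
    except (ValueError, TypeError, AttributeError):
        return None  # handles "Crawl-delay: fast" and a missing value
    if delay < 0:
        return None  # a negative delay is meaningless
    return delay
```

[With this reading, `parse_crawl_delay("0.5")` gives 0.5 and `parse_crawl_delay("-3")` gives None; another implementer's "common sense" could just as easily round, truncate, or reject decimals, which is exactly the incompatibility being complained about.]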
Re: [Robots] Yahoo evolving robots.txt, finally
Hello,

In fact, it's not really relevant to add this, because only the crawler decides how many documents per second to fetch. A crawler is not a "vacuum cleaner"; it should not choke a website's traffic while crawling. If it does, it's not a crawler! Just imagine if every website decided to slow indexing down to about 1 URL per 5 seconds -- Google would take a year to revisit the whole web (sic!). Bad game, isn't it?

Verticrawl.com team

<[EMAIL PROTECTED]> wrote ..
> I am surprised that after all that talk about adding new semantic
> elements to robots.txt several years ago, nobody commented that the new
> Yahoo crawler (former Inktomi crawler) took a brave step in that
> direction by adding "Crawl-delay:" syntax.
>
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
>
> Time to update your robots.txt parsers!
>
> Otis Gospodnetic
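[The back-of-the-envelope arithmetic behind that worry is worth making explicit. Crawl-delay applies per site, so the real cost is how long a single large site takes to fetch serially; the page count below is illustrative, not a real index size:]

```python
def crawl_days(pages, delay_seconds):
    """Days needed to fetch `pages` URLs serially,
    one fetch every `delay_seconds` (86400 seconds per day)."""
    return pages * delay_seconds / 86400.0

# A single site with 500,000 pages at "Crawl-delay: 5"
# takes roughly 29 days to fetch serially -- which is
# Walter's point that any value over one second means
# it takes forever to index a site.
site_days = crawl_days(500_000, 5)
```

[This also shows why the delay hurts large sites far more than small ones: a 10,000-page site at the same setting finishes in well under a day.]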
[Robots] Yahoo evolving robots.txt, finally
I am surprised that after all that talk about adding new semantic elements to robots.txt several years ago, nobody commented that the new Yahoo crawler (former Inktomi crawler) took a brave step in that direction by adding "Crawl-delay:" syntax.

http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html

Time to update your robots.txt parsers!

Otis Gospodnetic