Re: [Robots] Hit Rate - testing is this mailing list alive?
--On Tuesday, November 4, 2003 10:05 AM Alan Perkins <[EMAIL PROTECTED]> wrote:

> What's the current accepted practice for hit rate?

Ultraseek makes one request at a time per server, with no extra pause in between. Each file is parsed before the next request is sent, so there is a bit of slack. The spider requests 25 URLs from a server, then moves on. This usually works out to one or two requests per second on a server. If there are network delays or large documents, it slows down a lot. For "if-modified-since" requests with a "not modified" response, it can go much faster.

The aggregate spidering rate is higher, because there can be many spider threads making requests.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek
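For comparison, a minimal Python sketch of the pacing Walter describes - sequential requests, parsing between them, 25 URLs per visit - might look like the following. The helper names and the 304 handling are illustrative assumptions, not Ultraseek's actual code.

import urllib.request
import urllib.error

URLS_PER_VISIT = 25  # Ultraseek's per-server batch size, per the post

def parse(body):
    # Placeholder for real HTML parsing / link extraction.
    return body.decode("utf-8", "replace")

def crawl_host(urls, last_modified=None):
    """Fetch up to URLS_PER_VISIT URLs from one host, one request at a time."""
    for url in urls[:URLS_PER_VISIT]:
        req = urllib.request.Request(url)
        if last_modified:
            # A 304 round trip carries no body, so these passes go much faster.
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                body = resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 304:
                continue  # not modified; nothing to parse
            raise
        parse(body)  # parsing before the next request adds natural slack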
RE: [Robots] Hit Rate - testing is this mailing list alive?
On Tue, 4 Nov 2003 [EMAIL PROTECTED] wrote:

> Hello Robots list
>
> Well maybe this list can finally put to rest a great deal of the "30 second wait" issue.
>
> Can we all collectively research an adaptive routine?

Interesting topic...

With one hat on, I operate one of those little servers with thousands of pages. I guess I'm lucky; I don't pay for bandwidth and the connection is naturally limited to a T-1.

With my other hat on, at TRIUMF we have started to have issues with bandwidth management. We now have a gigabit link to the research networks with no byte charges, so we don't care if someone sucks our site from ESnet (CERN, Fermilab, Los Alamos, etc.). However, we have a 100 Mbit link to the commercial backbone and can't afford to fill it - P2P is a problem. Our current "solution" is to limit outgoing traffic to 1 Mbit - except for our central webserver and mailserver. So we would be financially embarrassed if a lot of robots from the commercial side all decided to mirror our servers.

I guess what I'm trying to say is that instantaneous hit rate is not really a problem any more, but volume might be. However, the people running robots also have finite storage and have to pay for bandwidth, so perhaps this is a non-problem except where there is a serious asymmetry between source and destination.

--
Andrew Daviel, TRIUMF, Canada
Tel. +1 (604) 222-7376
[EMAIL PROTECTED]
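Andrew's concern is volume rather than instantaneous rate. One crawler-side answer is a token bucket that caps aggregate bytes per second drawn from any one network. A minimal sketch; the 125 KB/s rate and 512 KB burst are arbitrary illustrations, not anything TRIUMF enforces.

import time

class TokenBucket:
    """Cap aggregate download volume: 'capacity' bytes of burst,
    refilled at 'rate' bytes per second."""
    def __init__(self, rate_bytes_per_sec, capacity_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Block until nbytes of budget are available, then spend them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# e.g. stay well under a 1 Mbit/s share: ~125 KB/s, 512 KB burst (illustrative)
bucket = TokenBucket(rate_bytes_per_sec=125_000, capacity_bytes=512_000)
# after each download: bucket.consume(len(body))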
RE: [Robots] Hit Rate - testing is this mailing list alive?
Hello Robots list

Well maybe this list can finally put to rest a great deal of the "30 second wait" issue. Can we all collectively research an adaptive routine?

We all need a common code routine that all our spidering modules and connective programs can use - especially when we wish to get as close to the Ethernet optimum (about 80% of true max, I believe) without getting ourselves into the DoS zone (>80% of Ethernet max), where signal collisions start to cause failures and the repeated and competing signals effectively collapse the Ethernet communications medium.

Can we not, therefore, settle the issue of finding the balancing point that determines optimum throughput from networks and servers at any given time? Can we not determine the optimum mathematical formula, then program it into our libraries of code, so our spiders can all follow it?

So in this effort: has anyone found, started to build, or can recommend the building blocks of such an adaptive routine? Can this list supply us all with THE de facto real-time adaptive throttling routine? A routine that tracks and adapts to ever-changing conditions by taking real-time network measurements and feeding them through the formula, the result being the optimum wait time before connecting to the same server again. The wait time resets after each ACK packet from the target server. Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially set through user input, as some users will need to max out their dedicated network communication lines (such as adapter-card-to-adapter-card isolation work on very controlled networks). Suggest a "0" input for this work. The default setting, "1", would result in the optimal time determined by the formula. Any other integer would simply multiply the time delay between server connections. In this way the user could throttle it down to the needs of the local network and servers. (A sketch of this scheme follows Richard's quoted message below.)

-Thomas Kay

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: 2003-11-04 10:21 AM
To: [EMAIL PROTECTED]; Internet robots, spiders, web-walkers, etc.
Subject: [Robots] Hit Rate - testing is this mailing list alive?

Alan Perkins writes:

> What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time taken for the last response. E.g. if a site responds in 20 ms, you can hit it again the same second. If a site takes 4 seconds to respond, leave it at least 30 seconds before trying again.

> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style distribution strategy, this is a non-issue.

> D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

> E) None of the above (i.e. anything goes)
>
> It's clear from the log files I study that some of the big players are
> not sticking to 30 seconds. There are good reasons for this and I
> consider it a good thing (in moderation). E.g. retrieving one page from
> a site every 30 seconds only allows 2880 pages per day to be retrieved
> from a site and this has obvious "freshness" implications when indexing
> large sites.

Many large sites are split across several servers. Often these can be hit in parallel - if your robot is clever enough.
Richard
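No such routine was ever agreed on, but a minimal sketch of Thomas's scheme - combining his 0/1/N multiplier with Richard's several-times-the-response-time rule from the quoted message - might look like this. The constant k and the 30-second cap are assumptions, not a standard.

def next_wait(last_response_secs, multiplier=1, k=7.5, max_wait=30.0):
    """Seconds to wait before hitting the same server again.

    multiplier=0 -> no delay (dedicated/isolated network testing)
    multiplier=1 -> the formula's optimum (default)
    multiplier=N -> N times the formula's delay
    k is a guess at "several times the last response time" (Richard's rule).
    """
    if multiplier == 0:
        return 0.0
    return min(max_wait, k * last_response_secs) * multiplier

# e.g. a 20 ms response -> ~0.15 s wait (hit it again the same second);
# a 4 s response -> the full 30 s cap, matching Richard's examples.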
Re: [Robots] Hit Rate - testing is this mailing list alive?
I thought I would post some of my experience with download rates.

We have built a large-scale crawler that has crawled over 2.4 billion URLs and continues to crawl at upwards of 500 pages/second. In tuning the download policy we found that both the hit rate and the number of pages downloaded per day come into play when trying to tread lightly. An easy but delayed measure of whether you are treading lightly is to monitor sites such as www.webmasterworld.com and the like. A more direct measure is the volume and types of complaints that come over email.

From our experience, the bulk of the complaints come from the webmasters/businesses/etc. who purchased 1-5 GB of traffic per month but have a site consisting of thousands if not tens of thousands of pages. We were quick to find out that there are *many* of these folks out on the Internet. The problem is obvious. If the crawler downloads the whole site in one shot (even with a 30 second delay), the aggregate bandwidth usage sometimes puts that entity over their allotted limit, causing their ISP to charge them extra. Guess who's to blame in that circumstance?

Although we have always adhered to a 30 second policy, which I believe is very conservative in 2004, we still receive the you-are-hitting-our-site-too-hard type of complaints. Usually these arise when we touch too many 404s and the webmaster has decided to have their web server email them every time one is encountered.

Just thought I'd pass along some information from the trenches.

--
Christian Storm, Ph.D.
www.turnitin.com
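A crawler-side guard against exactly this failure mode is to budget bytes per site per month and stop once the budget is spent. A sketch under assumptions: the 1 GB default matches the low end of the hosting plans Christian mentions, and the month-rollover logic is deliberately naive.

import time
from collections import defaultdict

MONTHLY_BUDGET_BYTES = 1_000_000_000  # ~1 GB: the low end of the plans cited

class SiteBudget:
    """Stop crawling a host once we've pulled its monthly byte budget."""
    def __init__(self, budget=MONTHLY_BUDGET_BYTES):
        self.budget = budget
        self.used = defaultdict(int)   # host -> bytes downloaded this month
        self.month = time.gmtime().tm_mon

    def record(self, host, nbytes):
        self._maybe_reset()
        self.used[host] += nbytes

    def allowed(self, host):
        self._maybe_reset()
        return self.used[host] < self.budget

    def _maybe_reset(self):
        # Naive rollover: clears all counters when the month number changes.
        now = time.gmtime().tm_mon
        if now != self.month:
            self.month = now
            self.used.clear()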
[Robots] Hit Rate - testing is this mailing list alive?
Alan Perkins writes:

> What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time taken for the last response. E.g. if a site responds in 20 ms, you can hit it again the same second. If a site takes 4 seconds to respond, leave it at least 30 seconds before trying again.

> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style distribution strategy, this is a non-issue.

> D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

> E) None of the above (i.e. anything goes)
>
> It's clear from the log files I study that some of the big players are
> not sticking to 30 seconds. There are good reasons for this and I
> consider it a good thing (in moderation). E.g. retrieving one page from
> a site every 30 seconds only allows 2880 pages per day to be retrieved
> from a site and this has obvious "freshness" implications when indexing
> large sites.

Many large sites are split across several servers. Often these can be hit in parallel - if your robot is clever enough.

Richard
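For readers unfamiliar with the reference: the Mercator crawler assigned every host to exactly one crawler instance, typically by hashing the hostname, so per-host politeness never needs coordination across robots. A minimal sketch of that assignment; the instance count and hash choice are illustrative.

import hashlib
from urllib.parse import urlsplit

NUM_CRAWLERS = 8  # illustrative fleet size

def crawler_for(url, n=NUM_CRAWLERS):
    """Map every URL on the same host to the same crawler instance."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n

# All politeness state for a host then lives in one place:
assert crawler_for("http://example.com/a") == crawler_for("http://example.com/b")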
Re: [Robots] Hit Rate - testing is this mailing list alive?
Alan Perkins wrote:

>> Hit rate
>> This directive could indicate to a robot how long to wait between
>> requests to the server. Currently it is accepted practice to wait at
>> least 30 seconds between requests, but this is too fast for some sites,
>> too slow for others.
>
> What's the current accepted practice for hit rate? Does it vary
> according to

With the availability of persistent connections, a robot that drops the connection, or keeps it open for 30 seconds without requesting another resource, does the server no good. Large sites generally have good connectivity, and robots can request resources at a higher rate without any performance degradation, regardless of response code. If a robot does find that a site is responding slowly (latency or throughput), it should reduce the hit rate or even suspend crawling temporarily to avoid overloading the server.

> A) The HTTP response (e.g. no need to wait 30 seconds after a 304)

Rather than waiting with the connection open but inactive, the best choice is to send the next request while the connection is open, and to wait only after the server has closed the connection (i.e. not maintained the persistent connection).

> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Running multiple robots in parallel increases the number of open connections required at the server; a single persistent connection is more server-friendly (and usually easier to manage, too - I see some sites being crawled for the same resources by parallel robots that apparently do not share status information in real time).

> C) The number of active robots on the Web (e.g. 1000 robots isn't many,
> 10 million robots is - and if too many unrelated robots hit a site,
> that's another effective DDOS attack)

The number of other robots hitting a site is not a known factor, although performance metrics can give an indication of whether or not a site is under heavy load.

> D) Some other factor (e.g. server response time, etc.)
> E) None of the above (i.e. anything goes)

DoS monitors may raise alerts or block traffic if robots hit a site too hard, too frequently, with too many parallel processes, etc.

--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/
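A sketch of the pattern Klaus recommends - issuing requests back-to-back on one persistent connection, and closing it rather than idling when the server turns slow. It uses Python's standard http.client; the 2-second threshold and 30-second back-off are assumptions.

import http.client
import time

SLOW_THRESHOLD = 2.0  # secs; assumed cutoff for "responding slowly"

def fetch_paths(host, paths):
    """Fetch resources over one persistent connection, suspending
    (and dropping the idle connection) if the server slows down."""
    conn = http.client.HTTPConnection(host, timeout=30)
    try:
        for path in paths:
            start = time.monotonic()
            conn.request("GET", path, headers={"Connection": "keep-alive"})
            resp = conn.getresponse()
            body = resp.read()  # drain the body before reusing the connection
            yield path, resp.status, body
            if time.monotonic() - start > SLOW_THRESHOLD:
                # Don't hold an idle connection open while backing off.
                conn.close()
                time.sleep(30)  # assumed back-off period
                conn = http.client.HTTPConnection(host, timeout=30)
    finally:
        conn.close()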
Re: [Robots] Hit Rate - testing is this mailing list alive?
On Tue, 4 Nov 2003, Alan Perkins wrote:

> Here's a question to test whether the list is alive and active...

I have a feeling the bandwidth and other resources of web sites have gone up so much that robots really do not pose a DoS threat any more. "Hit me as hard as you like, as long as I am in your index." It is spam and viruses that steal the attention, and they are orders of magnitude worse problems for everybody.

So, apparently, all the problems of robots were solved, and discussion died away. But no need to rush closing the list, in case at some point something new appears.

Jaakko

--
Foreca Ltd [EMAIL PROTECTED]
Pursimiehenkatu 29-31 B, FIN-00150 Helsinki, Finland
http://www.foreca.com
[Robots] Hit Rate - testing is this mailing list alive?
Here's a question to test whether the list is alive and active...

Reading "Evaluation of the Standard for Robots Exclusion", Martijn Koster, 1996 (http://www.robotstxt.org/wc/eval.html), we find:

>> Hit rate
>> This directive could indicate to a robot how long to wait between
>> requests to the server. Currently it is accepted practice to wait at
>> least 30 seconds between requests, but this is too fast for some sites,
>> too slow for others.

What's the current accepted practice for hit rate? Does it vary according to:

A) The HTTP response (e.g. no need to wait 30 seconds after a 304)
B) The number of robots you are running (e.g. 30 seconds per site per robot, or 30 seconds per site across all your robots?)
C) The number of active robots on the Web (e.g. 1000 robots isn't many, 10 million robots is - and if too many unrelated robots hit a site, that's another effective DDoS attack)
D) Some other factor (e.g. server response time, etc.)
E) None of the above (i.e. anything goes)

It's clear from the log files I study that some of the big players are not sticking to 30 seconds. There are good reasons for this and I consider it a good thing (in moderation). E.g. retrieving one page from a site every 30 seconds only allows 2880 pages per day to be retrieved from a site, and this has obvious "freshness" implications when indexing large sites.

In summary, I'm just wondering what your thoughts are on an acceptable hit rate, and means of measuring it, in 2004?

Alan Perkins
e-Brand Management Limited
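The 2880 figure is just the seconds in a day divided by the delay; a few lines make the freshness trade-off concrete (the alternative delays are illustrative):

# pages/day retrievable from one site at a fixed per-request delay
for delay_secs in (30, 10, 5, 1):
    print(delay_secs, "s delay ->", 86_400 // delay_secs, "pages/day")
# 30 s -> 2880 pages/day, matching the figure above; at that rate a
# million-page site would take nearly a year to crawl once.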
Re: [Robots] Is this mailing list alive?
On Nov 3, 2003, at 11:16 PM, Nick Arnett wrote:

> [EMAIL PROTECTED] wrote:
>> I've created a robot, www.dead-links.com and I wonder if this list is alive.
>
> It is alive, but very, very quiet.

Yeah, this robots thing is just a fad, it'll never catch on.

-Tim
Re: [Robots] Is this mailing list alive?
[EMAIL PROTECTED] wrote:

> I've created a robot, www.dead-links.com and I wonder if this list is alive.

It is alive, but very, very quiet.

Nick