RE: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Martha Ballou
Will you take me off your list? Many thanks. [EMAIL PROTECTED]

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Walter Underwood
--On Tuesday, November 4, 2003 10:05 AM + Alan Perkins <[EMAIL PROTECTED]> wrote:
> 
> What's the current accepted practice for hit rate? 

Ultraseek uses one request at a time per server, with no
extra pause in between. Each file is parsed before sending
the next request, so there is a bit of slack. The spider
requests 25 URLs from a server, then moves on.

This usually works out to one or two requests per second
on a server. If there are network delays or large documents,
it will slow down a lot. For "if-modified-since" requests with
a "not modified" response, it can go much faster.

The aggregate spidering rate is higher, because there can
be many spider threads making requests.
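
In rough pseudocode terms, the per-server loop is something like the sketch
below. This is only an illustration of the policy described above, not
Ultraseek's actual code, and the helper names are invented for the example:

import urllib.request
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Stand-in for the real parsing step (link extraction, indexing, etc.)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

def crawl_server(url_queue, batch_size=25, timeout=30):
    """Fetch at most batch_size URLs from one server, strictly one at a time."""
    fetched = 0
    while url_queue and fetched < batch_size:
        url = url_queue.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                body = response.read()
        except OSError:
            continue                      # skip unreachable or failed documents
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        fetched += 1
        # No explicit sleep: parsing time is the "slack" between requests.
    return fetched

With one such loop per spider thread, the aggregate rate is roughly the
per-server rate times the number of threads.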

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Andrew Daviel
On Tue, 4 Nov 2003 [EMAIL PROTECTED] wrote:

> Hello Robots list
> 
> Well maybe this list can finally put to rest a great deal of the "30 second wait" 
> issue.
> 
> Can we all collectively research into an adaptive routine?

Interesting topic...

With one hat on, I operate one of those little servers with thousands of 
pages. I guess I'm lucky; I don't pay bandwidth and the connection is 
naturally limited to a T-1.

With my other hat on, at TRIUMF we have started to have issues with 
bandwidth management. We now have a gigabit link to the research networks 
with no byte charges, so we don't care if someone sucks our site from ESnet
(CERN, Fermilab, Los Alamos etc.).
However, we have a 100Mbit link to the commercial backbone and can't afford to 
fill it - P2P is a problem. Our current "solution" is to limit outgoing
traffic to 1Mbit - except our central webserver and mailserver.
So we would be financially embarrassed if a lot of robots from the 
commercial side all decided to mirror our servers.

I guess what I'm trying to say is that the issue of instantaneous 
hit rate is not really a problem any more, but that volume might be.
However, I guess that the people running robots also have finite storage 
and have to pay for bandwidth, so that perhaps this is a non-problem 
except where there is a serious asymmetry between source and destination.


-- 
Andrew Daviel, TRIUMF, Canada
Tel. +1 (604) 222-7376
[EMAIL PROTECTED]


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread thomas.kay
Hello Robots list

Well maybe this list can finally put to rest a great deal of the "30 second wait" 
issue.

Can we all collectively research into an adaptive routine?

We all need a common code routine that all our spidering modules and connective 
programs can use.  

Especially when we wish to get as close as possible to the Ethernet optimum (about 80% of 
the true maximum, I believe) without getting ourselves into the DoS zone (above 80% of the 
Ethernet maximum), where signal collisions start causing failures and the retransmitted and 
competing signals effectively collapse the Ethernet communications medium.  

Can we not, therefore, settle the issue of finding the balancing point in determining 
optimum throughput from networks and servers at any given time?   

Can we not determine the optimum mathematical formula, then program this into our 
libraries of code; so our spiders can all follow this formula?

So, in this effort: has anyone found, started to build, or can anyone recommend the building 
blocks of such an adaptive routine?

Can this list supply us all with THE de facto real-time adaptive throttling routine?  

A routine that tracks and adapts to ever-changing conditions by taking real-time network 
measurements and feeding them through the formula; the result is the optimum wait time 
before connecting to the same server again.  The wait time resets after each ACK packet 
from the target server. 

Any formula suggestions?

One of the variables in the formula should come from our spider configs, initially set 
through user input, since some users will need to max out their dedicated network 
communication lines (such as adapter-card-to-adapter-card isolation work on very 
controlled networks). I suggest an input of "0" for that work.  The default setting, "1", 
would result in the optimal time determined by the formula.  Any other integer would 
simply multiply the time delay between server connections.  In this way the user could 
throttle it down to the needs of the local network and servers.  
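
As a starting point, here is a minimal sketch of the kind of routine I mean.
The formula itself (wait some multiple of the last measured response time,
within fixed bounds) is only a placeholder guess, not a proposal for THE
formula, and the names are made up:

import time

class AdaptiveThrottle:
    """Per-server wait-time calculator; multiplier 0 = no delay,
    1 = formula optimum, any other integer = that many times the delay."""

    def __init__(self, multiplier=1, factor=5.0, min_wait=1.0, max_wait=60.0):
        self.multiplier = multiplier      # user config value described above
        self.factor = factor              # how many response-times to wait
        self.min_wait = min_wait
        self.max_wait = max_wait
        self._next_ok = 0.0               # earliest time of the next request

    def before_request(self):
        """Sleep until the wait computed for this server has elapsed."""
        delay = self._next_ok - time.monotonic()
        if delay > 0:
            time.sleep(delay)

    def after_response(self, response_seconds):
        """Reset the wait after each completed response from the target server."""
        if self.multiplier == 0:
            wait = 0.0                    # dedicated-link / isolation testing
        else:
            wait = self.factor * response_seconds
            wait = max(self.min_wait, min(self.max_wait, wait))
            wait *= self.multiplier
        self._next_ok = time.monotonic() + wait

A spider would call before_request() ahead of each fetch, time the fetch, and
pass the elapsed seconds to after_response(). Other real-time network
measurements (loss, RTT, server load hints) could feed into the same calculation.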

-Thomas Kay



___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Christian Storm
I thought I would post some of my experience with download rates

We have built a large-scale crawler that has crawled over 2.4 billion URLs and continues 
to crawl at upwards of 500 pages/second.  In tuning the download policy we
found that both the hit rate and the number of pages downloaded per day come into play
when trying to tread lightly.  An easy but delayed measure of whether you are treading 
lightly or not is to monitor sites such as www.webmasterworld.com. A more 
direct measure is the volume and types of complaints that come in over email.

From our experience, the bulk of the complaints come from the webmasters/businesses/etc. 
who purchased 1-5 GB of traffic per month but have a site
consisting of thousands if not tens of thousands of pages.  We were quick to find out
that there are *many* of these folks out on the Internet.  The problem is obvious.  If
the crawler downloads the whole site in one shot (even with a 30 second delay) the 
aggregate bandwidth usage sometimes puts that entity over their allotted limit, causing
their ISP to charge them extra.  Guess who's to blame in that circumstance?  Although we 
have always adhered to a 30 second policy, which I believe is very conservative in 2004,
we still receive the you-are-hitting-our-site-too-hard type of complaints.  Usually these
arise when we touch too many 404s and the webmaster has decided to have their web
server email them every time one is encountered.
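
We don't have a silver bullet for this, but one way a crawler could guard
against blowing a site's transfer allowance is a per-host volume budget. The
sketch below is only an illustration of the idea; the 1 GB monthly limit is
an arbitrary number, not anything we actually use:

from collections import defaultdict

class HostByteBudget:
    """Stop crawling a host once a self-imposed monthly byte budget is spent."""

    def __init__(self, monthly_limit_bytes=1 * 1024**3):
        self.limit = monthly_limit_bytes
        self.used = defaultdict(int)      # host -> bytes fetched this month

    def allow(self, host):
        return self.used[host] < self.limit

    def record(self, host, num_bytes):
        self.used[host] += num_bytes

    def reset_month(self):
        self.used.clear()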

Just thought I'd pass along some information from the trenches

--
Christian Storm, Ph.D.
www.turnitin.com
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread richard
Alan Perkins writes:
 > What's the current accepted practice for hit rate?

In general, leave an interval several times longer than the time
taken for the last response. E.g. if a site responds in 20 ms,
you can hit it again the same second. If a site takes 4 seconds
to respond, leave it at least 30 seconds before trying again.
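
In code, that rule of thumb might look something like this; the factor and
floor are only illustrative values chosen to match the two examples above:

def next_delay(last_response_seconds, factor=8.0, floor=0.5):
    """Seconds to wait before the next request to the same server."""
    return max(floor, factor * last_response_seconds)

# 20 ms response -> next_delay(0.02) == 0.5   (hit it again the same second)
# 4 s  response  -> next_delay(4.0)  == 32.0  (at least 30 seconds)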

 > B) The number of robots you are running (e.g. 30 seconds per site per
 > robot, or 30 seconds per site across all your robots?)

Generally, take into account all your robots. If you use a Mercator-style
distribution strategy, this is a non-issue.

 > D) Some other factor (e.g. server response time, etc.)

Server response time is the biggest factor.

 > E) None of the above (i.e. anything goes)
 > 
 > It's clear from the log files I study that some of the big players are
 > not sticking to 30 seconds.  There are good reasons for this and I
 > consider it a good thing (in moderation).  E.g. retrieving one page from
 > a site every 30 seconds only allows 2880 pages per day to be retrieved
 > from a site and this has obvious "freshness" implications when indexing
 > large sites.

Many large sites are split across several servers. Often these can be
hit in parallel - if your robot is clever enough.

Richard
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Klaus Johannes Rusch
Alan Perkins wrote:

> >>
> Hit rate
> This directive could indicate to a robot how long to wait between
> requests to the server. Currently it is accepted practice to wait at
> least 30 seconds between requests, but this is too fast for some sites,
> too slow for others.
> >>
>
> What's the current accepted practice for hit rate?  Does it vary
> according to

With the availability of persistent connections, a robot that drops the
connection or keeps the connection open for 30 seconds without requesting
another resource would not do the server any good. Large sites generally
have good connectivity and robots can request resources at a higher rate
without any performance degradation, regardless of response code. If a robot
does find that a site is responding slowly (latency or throughput) it should
reduce the hit rate or even suspend crawling temporarily to avoid
overloading a server.

> A) The HTTP response (e.g. no need to wait 30 seconds after a 304)

I would recommend waiting only after the server has closed the connection (i.e.
has not maintained the persistent connection). As long as the connection is open,
sending another request, rather than waiting and keeping the connection open
but inactive, is the better choice.
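
As a rough sketch of that policy (not any particular robot's code), using
Python's standard http.client; the 30-second pause after the server closes
the connection is an assumed value:

import http.client
import time

def fetch_batch(host, paths, pause_after_close=30.0):
    """Request paths over one persistent connection; pause only when the
    server declines to keep the connection open."""
    conn = http.client.HTTPConnection(host, timeout=30)
    for path in paths:
        conn.request("GET", path, headers={"User-Agent": "example-robot"})
        response = conn.getresponse()
        response.read()                   # drain the body before reusing the connection
        if response.getheader("Connection", "").lower() == "close":
            conn.close()                  # server chose not to keep it open
            time.sleep(pause_after_close) # back off before reconnecting
            conn = http.client.HTTPConnection(host, timeout=30)
    conn.close()

A production robot would also catch the server dropping the connection
mid-batch (http.client.RemoteDisconnected) and slow down on slow responses,
as described above.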

>
> B) The number of robots you are running (e.g. 30 seconds per site per
> robot, or 30 seconds per site across all your robots?)

Running multiple robots in parallel increases the number of open connections
required at the server; a single persistent connection is more
server-friendly (and usually easier to manage too). I see some sites crawl
the same resources with parallel robots, which apparently do not communicate
status information in real time.

> C) The number of active robots on the Web (e.g. 1000 robots isn't many,
> 10 million robots is - and if too many unrelated robots hit a site,
> that's another effective DDOS attack)

The number of other robots hitting a site is not a known factor, although
performance metrics can give an indication of whether or not a site is under
heavy load.

> D) Some other factor (e.g. server response time, etc.)

> E) None of the above (i.e. anything goes)

DoS monitors may raise alerts or block traffic if robots hit a site too
hard, too frequently, with too many parallel processes, etc.


--
Klaus Johannes Rusch
[EMAIL PROTECTED]
http://www.atmedia.net/KlausRusch/


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Jaakko Hyvätti
On Tue, 4 Nov 2003, Alan Perkins wrote:
> Here's a question to test whether the list is alive and active...

  I have a feeling the bandwidth and other resources of web sites have
gone up so much that robots really do not pose a DoS threat any more. "Hit
me as hard as you like as long as I am in your index."  It is spam and
viruses that steal the attention and are orders-of-magnitude worse problems
for everybody.

  So, apparently, all problems of robots were solved, and discussion died
away.  But no need to rush closing the list, in case at some point
something new appears.

Jaakko

-- 
Foreca Ltd   [EMAIL PROTECTED]
Pursimiehenkatu 29-31 B, FIN-00150 Helsinki, Finland http://www.foreca.com
___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Hit Rate - testing is this mailing list alive?

2003-11-04 Thread Alan Perkins
Here's a question to test whether the list is alive and active...

Reading "Evaluation of the Standard for Robots Exclusion", Martijn
Koster, 1996 (http://www.robotstxt.org/wc/eval.html), we find:

>>
Hit rate
This directive could indicate to a robot how long to wait between
requests to the server. Currently it is accepted practice to wait at
least 30 seconds between requests, but this is too fast for some sites,
too slow for others. 
>>

What's the current accepted practice for hit rate?  Does it vary
according to

A) The HTTP response (e.g. no need to wait 30 seconds after a 304)
B) The number of robots you are running (e.g. 30 seconds per site per
robot, or 30 seconds per site across all your robots?)
C) The number of active robots on the Web (e.g. 1000 robots isn't many,
10 million robots is - and if too many unrelated robots hit a site,
that's another effective DDOS attack)
D) Some other factor (e.g. server response time, etc.)
E) None of the above (i.e. anything goes)

It's clear from the log files I study that some of the big players are
not sticking to 30 seconds.  There are good reasons for this and I
consider it a good thing (in moderation).  E.g. retrieving one page from
a site every 30 seconds only allows 2880 pages per day to be retrieved,
and this has obvious "freshness" implications when indexing
large sites.

In summary, I'm just wondering what your thoughts are for an acceptable
hit rate and means of measuring it in 2004?

Alan Perkins
e-Brand Management Limited


___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Is this mailing list alive?

2003-11-04 Thread Tim Bray
On Nov 3, 2003, at 11:16 PM, Nick Arnett wrote:

> [EMAIL PROTECTED] wrote:
>> I've created a robot, www.dead-links.com, and I wonder if this list is
>> alive.
> It is alive, but very, very quiet.

Yeah, this robots thing is just a fad, it'll never catch on. -Tim

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


Re: [Robots] Is this mailing list alive?

2003-11-04 Thread Nick Arnett
[EMAIL PROTECTED] wrote:

> I've created a robot, www.dead-links.com, and I wonder if this list is
> alive.

It is alive, but very, very quiet.

Nick

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots