According to Klaus Mueller:
> is it possible to set a small wait time between digging two documents from
> the same server, to prevent server overload?

The retriever code seems to try to spread the load among the servers it
accesses, but there doesn't seem to be anything to prevent rapid-fire
requests against a single server if you're indexing only one or two servers.

A quick fix would be to add a sleep() call just before c.connect()
in Document::RetrieveHTTP() (file htdig/Document.cc).  That would slow
the whole dig down, whether it's accessing the same server repeatedly,
or interleaving its requests.
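
For illustration, that quick fix would amount to something like this
(the surrounding code in RetrieveHTTP() is only sketched, and the
5-second value is arbitrary):

    #include <unistd.h>     /* for sleep() */
    ...
    sleep(5);               /* crude fixed pause before every request */
    c.connect();            /* existing connect call */
    ...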

A proper fix would involve keeping track of the last time each host was
accessed.  Before any access to a host, if the last access was more
recent than the number of seconds given by some new config parameter,
the retriever would sleep for the remaining difference.  By recording
the time at each c.close(), and checking it before the next c.connect(),
it would ensure a minimum idle time between connections to the same host.
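
Here's a rough sketch of what that bookkeeping could look like.  None of
this is actual htdig code; the class, its method names, and the config
parameter name are made up for illustration.  The two calls would go
right around the existing c.connect()/c.close() pair:

    #include <map>
    #include <string>
    #include <ctime>
    #include <unistd.h>

    // Hypothetical per-host throttle.  "seconds" would come from a new
    // config parameter (say, "server_wait_time").
    class HostThrottle
    {
    public:
        HostThrottle(int seconds) : min_idle(seconds) { }

        // Call just before c.connect() for this host.
        void before_connect(const std::string &host)
        {
            std::map<std::string, time_t>::iterator i = last_access.find(host);
            if (i != last_access.end())
            {
                time_t elapsed = time(0) - i->second;
                if (elapsed < min_idle)
                    sleep(min_idle - elapsed);  // wait out the difference
            }
        }

        // Call right after c.close() for this host.
        void after_close(const std::string &host)
        {
            last_access[host] = time(0);
        }

    private:
        std::map<std::string, time_t> last_access;  // host -> last close time
        int min_idle;                               // minimum idle seconds
    };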

If you want to get really fancy, you could make the amount of delay
depend on the URL you're accessing.  E.g. you may want a bigger delay
for URLs containing .cgi or /cgi-bin/ (if you're indexing these) than
for plain .html files.
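
Again, just as an illustration (the function and its two delay
parameters are hypothetical, not htdig config attributes):

    #include <string>

    // Pick a longer pause for CGI-ish URLs than for static pages.
    int delay_for(const std::string &url, int html_delay, int cgi_delay)
    {
        if (url.find("/cgi-bin/") != std::string::npos ||
            url.find(".cgi") != std::string::npos)
            return cgi_delay;   // dynamic content: be gentler
        return html_delay;      // plain .html and the like
    }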

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
