Hi all, 

> The volunteer who is hosting http://openean.kaufkauf.net/id/, a huge set of 
> GoodRelations product model data, is experiencing a problematic amount of 
> traffic from unidentified crawlers located in Ireland (DERI?), the 
> Netherlands (VUA?), and the USA.
> 


Another crawler used at DERI is LDSpider [1], which we use to crawl data 
for the SWSE search engine and, more recently, for the BTC 2010 dataset. 
Along these lines, we admittedly have been doing an unusually large amount of 
crawling over the past month or two.

> The crawling has been so intense that he had to temporarily block all traffic 
> to this dataset.
> 
> In case you are operating any kind of Semantic Web crawlers that tried to 
> access this dataset, please
> 
> 1. check your crawler for bugs that create excessive traffic (e.g. by 
> redundant requests),

> 2. identify your crawler agent properly in the HTTP header, indicating a 
> contact person, and

User-Agent of LDSpider:
  * ldspider (http://code.google.com/p/ldspider/wiki/Robots)
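
For illustration, here is a minimal Java sketch of how a crawler can identify 
itself with a descriptive User-Agent that points to a page with contact 
details, in the spirit of the ldspider string above. The crawler name, URL 
and address are made-up placeholders, not LDSpider's:

import java.net.HttpURLConnection;
import java.net.URL;

public class IdentifiedLookup {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://openean.kaufkauf.net/id/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Name the crawler and point to a page with contact information,
        // so server operators know whom to reach about traffic problems.
        conn.setRequestProperty("User-Agent",
            "examplecrawler/0.1 (+http://example.org/crawler; admin@example.org)");
        System.out.println("Response code: " + conn.getResponseCode());
        conn.disconnect();
    }
}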

> 3. implement some bandwidth throttling technique that limits the bandwidth 
> consumption on a single host to a moderate amount.


LDSpider uses a delay policy similar to the one proposed in the IRLBot 
system. 
We use the following delay times per pay-level domain (PLD); in the case of 
http://openean.kaufkauf.net/id the PLD is kaufkauf.net. A small sketch of 
this policy follows the list.
 * 500 ms for lookups which return content (200 response code)
 * 250 ms for lookups which return no content (e.g. 30x, 40x, 50x).
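To make the policy concrete, here is a minimal, single-threaded Java sketch 
of such a per-PLD delay. The class and method names are ours for 
illustration; this is not LDSpider's actual code:

import java.util.HashMap;
import java.util.Map;

// Single-threaded sketch: remembers, for each pay-level domain, the
// earliest time at which the next lookup may be issued.
public class PldDelay {
    private static final long CONTENT_DELAY_MS = 500;    // after a 200 response
    private static final long NO_CONTENT_DELAY_MS = 250; // after 30x/40x/50x

    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

    // Blocks until a lookup against the given PLD is permitted.
    public void awaitTurn(String pld) throws InterruptedException {
        long now = System.currentTimeMillis();
        long allowed = nextAllowed.containsKey(pld) ? nextAllowed.get(pld) : now;
        if (allowed > now) {
            Thread.sleep(allowed - now);
        }
    }

    // Records the response and schedules the next allowed lookup time.
    public void recordResponse(String pld, int statusCode) {
        long delay = (statusCode == 200) ? CONTENT_DELAY_MS : NO_CONTENT_DELAY_MS;
        nextAllowed.put(pld, System.currentTimeMillis() + delay);
    }
}

A crawler would call awaitTurn("kaufkauf.net") before each lookup against 
that PLD and recordResponse("kaufkauf.net", code) once the response arrives.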

There are also solutions for server-side bandwidth throttling (e.g. see [2]).

Please also see the reply by Andreas Harth on the semantic-web mailing list [3].

Best
   Juergen

[1] http://code.google.com/p/ldspider/
[2] http://code.google.com/p/ldspider/wiki/ServerConfig
[3] http://lists.w3.org/Archives/Public/semantic-web/2010Jun/0048.html
