Hi Daniel,
Thanks for the link! I will relay this to the relevant site-owners.

However, I still challenge Andreas' statement that the site-owners are to blame 
for publishing large amounts of data on small servers.

One can publish 10,000 PDF documents on a tiny server without being hit by 
crawlers that behave like a DoS attack. Why should the same not hold if I 
publish RDF instead?

But for sure, it is necessary to advise all publishers of large RDF datasets to 
protect themselves against hungry crawlers and actual DoS attacks.
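
Your suggestion below (Squid as a reverse proxy with delay pools) looks like a 
sensible default recommendation for them. A minimal, untested sketch of such a 
configuration, assuming Squid was built with delay-pool support (the ~50 KB/s 
per-client limit and the file-extension pattern are only example values of mine):

  # Throttle bulk RDF dump downloads per client; other traffic is untouched.
  acl rdf_dumps urlpath_regex \.(rdf|nt|ttl|owl|gz)$
  # One pool of class 2: an aggregate bucket plus one bucket per client IP.
  delay_pools 1
  delay_class 1 2
  # No aggregate cap (-1/-1); roughly 50 KB/s per client for matching URLs.
  delay_parameters 1 -1/-1 50000/50000
  delay_access 1 allow rdf_dumps
  delay_access 1 deny all

Because only the heavy dump files are matched, ordinary page views would stay 
unaffected.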

Imagine a large site being brought down by a botnet exploiting Semantic 
Sitemap information for a DoS attack, focusing on the large dump files. 
That could end the LOD experiments for that site.
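
And on the crawler side, the minimum of "friendly crawling" I asked for in my 
original mail below is not hard to implement. A rough, untested Python 3 sketch 
(the User-Agent string, the contact address, and the 2-second fallback delay 
are only placeholders of mine):

  import time
  import urllib.robotparser
  import urllib.request

  # Placeholder identity; a real crawler should state its own name and contact.
  USER_AGENT = "ExampleResearchCrawler/0.1 (mailto:crawler-admin@example.org)"

  def polite_fetch(base_url, paths, fallback_delay=2.0):
      # Read robots.txt once and honor both exclusions and Crawl-delay.
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(base_url + "/robots.txt")
      rp.read()
      delay = rp.crawl_delay(USER_AGENT) or fallback_delay
      for path in paths:
          url = base_url + path
          if not rp.can_fetch(USER_AGENT, url):
              continue  # excluded by robots.txt
          req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
          with urllib.request.urlopen(req) as resp:
              resp.read()
          time.sleep(delay)  # one request at a time, never in parallel

Even that little bit would avoid most of the problems listed below.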


Best

Martin
 

On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:

> 
> Hi Martin,
> 
> Have you tried putting Squid [1] as a reverse proxy in front of your servers 
> and using delay pools [2] to throttle hungry crawlers?
> 
> Cheers,
> Daniel
> 
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
> 
> On 21.06.2011, at 09:49, Martin Hepp wrote:
> 
>> Hi all:
>> 
>> For the third time in a few weeks, we had massive complaints from 
>> site-owners that Semantic Web crawlers from universities visited their sites 
>> in a way close to a denial-of-service attack, i.e., crawling data at 
>> maximum bandwidth with many parallel requests.
>> 
>> It's clear that a single, stupidly written crawler script, run from a 
>> powerful university network, can quickly create a terrible traffic load. 
>> 
>> Many of the scripts we saw:
>> 
>> - ignored robots.txt,
>> - ignored clear crawling speed limitations in robots.txt,
>> - did not identify themselves properly in the HTTP request header or lacked 
>> contact information therein, 
>> - used no mechanism at all for limiting their crawling speed or for 
>> honoring re-crawling delays.
>> 
>> This irresponsible behavior can be the final reason for site-owners to say 
>> farewell to academic/W3C-sponsored semantic technology.
>> 
>> So please, please: advise all of your colleagues and students NOT to write 
>> simple crawler scripts for the Billion Triples Challenge or similar tasks 
>> without familiarizing themselves with the state of the art in "friendly 
>> crawling".
>> 
>> Best wishes
>> 
>> Martin Hepp
>> 
> 

