Yes, exactly. I think that the problem is at least partly (and I say this as an ex-academic) that few people in academia have the slightest idea how much it costs to run a farm of servers in the Real World™.
From the point of view of the crawler, they're trying to get as much data as possible in as short a time as possible, but don't realise that the poor guy at the other end just got his 95th percentile shot through the roof, and now has a several-thousand-dollar bandwidth bill heading his way. You can cap bandwidth, but that then might annoy paying customers, which is clearly not good.

- Steve

On 2011-06-22, at 12:54, Hugh Glaser wrote:

> Hi Chris.
> One way to do the caching really efficiently:
> http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
> Which is what rkb has always done.
> But of course caching does not solve the problem of one bad crawler.
> It actually makes it worse.
> You add a cache write cost to the query, without a significant probability of a future cache hit. And increase disk usage.
>
> Hugh
>
> ----- Reply message -----
> From: "Christopher Gutteridge" <c...@ecs.soton.ac.uk>
> To: "Martin Hepp" <martin.h...@ebusiness-unibw.org>
> Cc: "Daniel Herzig" <her...@kit.edu>, "semantic-...@w3.org" <semantic-...@w3.org>, "public-lod@w3.org" <public-lod@w3.org>
> Subject: Think before you write Semantic Web crawlers
> Date: Wed, Jun 22, 2011 9:18 am
>
> The difference between these two scenarios is that there's almost no CPU involvement in serving the PDF file, but naive RDF sites use lots of cycles to generate the response to a query for an RDF document.
>
> Right now queries to data.southampton.ac.uk (e.g. http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live, but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL endpoint which caches results, which we can turn on if the load gets too high; this should at least mitigate the problem. Very few datasets change in a 24-hour period.
>
> Martin Hepp wrote:
>
> Hi Daniel,
>
> Thanks for the link! I will relay this to relevant site-owners.
>
> However, I still challenge Andreas' statement that the site-owners are to blame for publishing large amounts of data on small servers.
>
> One can publish 10,000 PDF documents on a tiny server without being hit by DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>
> But for sure, it is necessary to advise all publishers of large RDF datasets to protect themselves against hungry crawlers and actual DoS attacks.
>
> Imagine if a large site was brought down by a botnet that is exploiting Semantic Sitemap information for DoS attacks, focussing on the large dump files. This could end LOD experiments for that site.
>
> Best
>
> Martin
>
> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>
> Hi Martin,
>
> Have you tried to put a Squid [1] as reverse proxy in front of your servers and use delay pools [2] to catch hungry crawlers?
>
> Cheers,
> Daniel
>
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
>
> On 21.06.2011, at 09:49, Martin Hepp wrote:
>
> Hi all:
>
> For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from Universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a parallelized approach.
>
> It's clear that a single, stupidly written crawler script, run from a powerful University network, can quickly create terrible traffic load.
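(As a concrete illustration of the delay-pools idea Daniel suggests above, a minimal sketch of the relevant squid.conf directives might look like the following. The User-Agent pattern and byte rates are illustrative assumptions rather than tuned values, and the reverse-proxy/accelerator setup itself is omitted.)

# Match clients whose User-Agent looks like a crawler (pattern is an example only).
acl crawlers browser -i (crawler|spider|bot)

# One class-2 delay pool: no aggregate limit, per-client throttling.
delay_pools 1
delay_class 1 2

# -1/-1 = unlimited aggregate; each matched client refills at 16 KB/s
# with a 64 KB burst bucket (example numbers, not a recommendation).
delay_parameters 1 -1/-1 16000/64000

delay_access 1 allow crawlers
delay_access 1 deny all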
> Many of the scripts we saw
>
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
>
> This irresponsible behavior can be the final reason for site-owners to say farewell to academic/W3C-sponsored semantic technology.
>
> So please, please - advise all of your colleagues and students to NOT write simple crawler scripts for the billion triples challenge or whatsoever without familiarizing themselves with the state of the art in "friendly crawling".
>
> Best wishes
>
> Martin Hepp
>
> --
> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>
> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/

--
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203   http://www.garlik.com/
Registered in England and Wales 535 7233
VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
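(On the crawler side, the "friendly crawling" Martin asks for largely comes down to honouring robots.txt and its Crawl-delay, identifying yourself with contact details, and pacing requests. A minimal single-threaded sketch in Python is below; the user-agent string and contact address are hypothetical placeholders, not anything from this thread.)

# Minimal "friendly crawler" sketch: honours robots.txt and Crawl-delay,
# identifies itself, and spaces out requests. Names below are placeholders.
import time
from urllib.parse import urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleUniCrawler/0.1 (mailto:crawler-admin@example.org)"
DEFAULT_DELAY = 5.0  # seconds between requests if robots.txt sets no Crawl-delay

def fetch_politely(urls):
    robots = {}  # one cached RobotFileParser per host
    for url in urls:
        parts = urlparse(url)
        host = parts.scheme + "://" + parts.netloc
        rp = robots.get(host)
        if rp is None:
            rp = RobotFileParser(host + "/robots.txt")
            rp.read()
            robots[host] = rp
        if not rp.can_fetch(USER_AGENT, url):
            continue  # robots.txt forbids this URL; skip it
        delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            data = resp.read()
        # ... hand `data` to whatever does the actual processing ...
        time.sleep(delay)  # one request at a time, spaced out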