Yes, exactly. I think that the problem is at least partly (and I say this as an ex-academic) that few people in academia have the slightest idea how much it costs to run a farm of servers in the Real World™.
From the point of view of the crawler, they're trying to get as much data as possible in as short a time as possible, but don't realise that the poor guy at the other end just got his 95th percentile shot through the roof, and now has a several-thousand-dollar bandwidth bill heading his way. You can cap bandwidth, but that then might annoy paying customers, which is clearly not good.

- Steve

On 2011-06-22, at 12:54, Hugh Glaser wrote:

> Hi Chris.
> One way to do the caching really efficiently:
> http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
> Which is what rkb has always done.
> But of course caching does not solve the problem of one bad crawler.
> It actually makes it worse.
> You add a cache write cost to the query, without a significant probability of a future cache hit. And increase disk usage.
>
> Hugh
>
> ----- Reply message -----
> From: "Christopher Gutteridge" <c...@ecs.soton.ac.uk>
> To: "Martin Hepp" <martin.h...@ebusiness-unibw.org>
> Cc: "Daniel Herzig" <her...@kit.edu>, "semantic-...@w3.org" <semantic-...@w3.org>, "public-lod@w3.org" <public-lod@w3.org>
> Subject: Think before you write Semantic Web crawlers
> Date: Wed, Jun 22, 2011 9:18 am
>
> The difference between these two scenarios is that there's almost no CPU involvement in serving the PDF file, but naive RDF sites use lots of cycles to generate the response to a query for an RDF document.
>
> Right now queries to data.southampton.ac.uk (e.g. http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live, but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL endpoint which caches results, which we can turn on if the load gets too high; this should at least mitigate the problem. Very few datasets change in a 24-hour period.
>
> Martin Hepp wrote:
>
> Hi Daniel,
>
> Thanks for the link! I will relay this to relevant site-owners.
>
> However, I still challenge Andreas' statement that the site-owners are to blame for publishing large amounts of data on small servers.
>
> One can publish 10,000 PDF documents on a tiny server without being hit by DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>
> But for sure, it is necessary to advise all publishers of large RDF datasets to protect themselves against hungry crawlers and actual DoS attacks.
>
> Imagine if a large site was brought down by a botnet that is exploiting Semantic Sitemap information for DoS attacks, focussing on the large dump files. This could end LOD experiments for that site.
>
> Best
>
> Martin
>
> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>
> Hi Martin,
>
> Have you tried to put a Squid [1] as reverse proxy in front of your servers and use delay pools [2] to catch hungry crawlers?
>
> Cheers,
> Daniel
>
> [1] http://www.squid-cache.org/
> [2] http://wiki.squid-cache.org/Features/DelayPools
>
> On 21.06.2011, at 09:49, Martin Hepp wrote:
>
> Hi all:
>
> For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from Universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data with maximum bandwidth in a parallelized approach.
>
> It's clear that a single, stupidly written crawler script, run from a powerful University network, can quickly create terrible traffic load.
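(As a concrete illustration of the delay-pools idea Daniel suggests above, a minimal sketch of the relevant squid.conf directives might look like the following. The User-Agent pattern and byte rates are illustrative assumptions rather than tuned values, and the reverse-proxy/accelerator setup itself is omitted.)

# Match clients whose User-Agent looks like a crawler (pattern is an example only).
acl crawlers browser -i (crawler|spider|bot)

# One class-2 delay pool: no aggregate limit, per-client throttling.
delay_pools 1
delay_class 1 2

# -1/-1 = unlimited aggregate; each matched client refills at 16 KB/s
# with a 64 KB burst bucket (example numbers, not a recommendation).
delay_parameters 1 -1/-1 16000/64000

delay_access 1 allow crawlers
delay_access 1 deny all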
> Many of the scripts we saw
>
> - ignored robots.txt,
> - ignored clear crawling speed limitations in robots.txt,
> - did not identify themselves properly in the HTTP request header or lacked contact information therein,
> - used no mechanisms at all for limiting the default crawling speed and re-crawling delays.
>
> This irresponsible behavior can be the final reason for site-owners to say farewell to academic/W3C-sponsored semantic technology.
>
> So please, please - advise all of your colleagues and students to NOT write simple crawler scripts for the billion triples challenge or whatsoever without familiarizing themselves with the state of the art in "friendly crawling".
>
> Best wishes
>
> Martin Hepp
>
> --
> Christopher Gutteridge -- http://id.ecs.soton.ac.uk/person/1248
>
> You should read the ECS Web Team blog: http://blogs.ecs.soton.ac.uk/webteam/

--
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203   http://www.garlik.com/
Registered in England and Wales 535 7233
VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
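(On the crawler side, the "friendly crawling" Martin asks for largely comes down to honouring robots.txt and its Crawl-delay, identifying yourself with contact details, and pacing requests. A minimal single-threaded sketch in Python is below; the user-agent string and contact address are hypothetical placeholders, not anything from this thread.)

# Minimal "friendly crawler" sketch: honours robots.txt and Crawl-delay,
# identifies itself, and spaces out requests. Names below are placeholders.
import time
from urllib.parse import urlparse
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleUniCrawler/0.1 (mailto:crawler-admin@example.org)"
DEFAULT_DELAY = 5.0  # seconds between requests if robots.txt sets no Crawl-delay

def fetch_politely(urls):
    robots = {}  # one cached RobotFileParser per host
    for url in urls:
        parts = urlparse(url)
        host = parts.scheme + "://" + parts.netloc
        rp = robots.get(host)
        if rp is None:
            rp = RobotFileParser(host + "/robots.txt")
            rp.read()
            robots[host] = rp
        if not rp.can_fetch(USER_AGENT, url):
            continue  # robots.txt forbids this URL; skip it
        delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY
        req = Request(url, headers={"User-Agent": USER_AGENT})
        with urlopen(req) as resp:
            data = resp.read()
        # ... hand `data` to whatever does the actual processing ...
        time.sleep(delay)  # one request at a time, spaced out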