I wonder, are there ways to link RDF data so that conventional crawlers do not crawl it, but only the semantic-web-aware ones do? I am not sure how the current practice of linking via a link tag in the HTML headers could achieve this, and it may well be that those heavy loads come from crawlers that have nothing to do with the Semantic Web... Maybe we should start linking to our RDF/XML, Turtle and N-Triples files and publishing sitemap information in RDFa...
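
To make that concrete, what I mean by the current practice is roughly the following; the file names, paths and numbers are only meant as an illustration, not a recommendation.

In the HTML head, advertise the RDF serializations as alternates:

  <link rel="alternate" type="application/rdf+xml" href="/data/products.rdf" />
  <link rel="alternate" type="text/turtle" href="/data/products.ttl" />

and in robots.txt, at least hint at an acceptable crawl rate and keep everyone away from the expensive dump files:

  User-agent: *
  Crawl-delay: 10
  Disallow: /dumps/

A well-behaved crawler of either kind should honour the robots.txt part; the open question is whether conventional crawlers leave the alternate links alone.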
Best,
Jiri

On 06/22/2011 09:00 AM, Steve Harris wrote:
> While I don't agree with Andreas exactly that it's the site owners fault,
> this is something that publishers of non-semantic data have to deal with.
>
> If you publish a large collection of interlinked data which looks interesting
> to conventional crawlers and is expensive to generate, conventional web
> crawlers will be all over it. The main difference is that a greater
> percentage of those are written properly, to follow robots.txt and the
> guidelines about hit frequency (maximum 1 request per second per domain, no
> parallel crawling).
>
> Has someone published similar guidelines for semantic web crawlers?
>
> The ones that don't behave themselves get banned, either in robots.txt, or
> explicitly by the server.
>
> - Steve
>
> On 2011-06-22, at 06:07, Martin Hepp wrote:
>
>> Hi Daniel,
>> Thanks for the link! I will relay this to relevant site-owners.
>>
>> However, I still challenge Andreas' statement that the site-owners are to
>> blame for publishing large amounts of data on small servers.
>>
>> One can publish 10,000 PDF documents on a tiny server without being hit by
>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>>
>> But for sure, it is necessary to advise all publishers of large RDF datasets
>> to protect themselves against hungry crawlers and actual DoS attacks.
>>
>> Imagine if a large site was brought down by a botnet that is exploiting
>> Semantic Sitemap information for DoS attacks, focussing on the large dump
>> files. This could end LOD experiments for that site.
>>
>> Best
>>
>> Martin
>>
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>
>>> Hi Martin,
>>>
>>> Have you tried to put a Squid [1] as reverse proxy in front of your
>>> servers and use delay pools [2] to catch hungry crawlers?
>>>
>>> Cheers,
>>> Daniel
>>>
>>> [1] http://www.squid-cache.org/
>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>
>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>
>>>> Hi all:
>>>>
>>>> For the third time in a few weeks, we had massive complaints from
>>>> site-owners that Semantic Web crawlers from Universities visited their
>>>> sites in a way close to a denial-of-service attack, i.e., crawling data
>>>> with maximum bandwidth in a parallelized approach.
>>>>
>>>> It's clear that a single, stupidly written crawler script, run from a
>>>> powerful University network, can quickly create terrible traffic load.
>>>>
>>>> Many of the scripts we saw
>>>>
>>>> - ignored robots.txt,
>>>> - ignored clear crawling speed limitations in robots.txt,
>>>> - did not identify themselves properly in the HTTP request header or
>>>>   lacked contact information therein,
>>>> - used no mechanisms at all for limiting the default crawling speed and
>>>>   re-crawling delays.
>>>>
>>>> This irresponsible behavior can be the final reason for site-owners to say
>>>> farewell to academic/W3C-sponsored semantic technology.
>>>>
>>>> So please, please - advise all of your colleagues and students to NOT
>>>> write simple crawler scripts for the billion triples challenge or
>>>> whatsoever without familiarizing themselves with the state of the art in
>>>> "friendly crawling".
>>>>
>>>> Best wishes
>>>>
>>>> Martin Hepp
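
P.S. For what it's worth, a crawler that follows the guidelines quoted above does not have to be complicated. Here is a rough Python sketch; the user-agent string, contact address and delay values are made up for illustration, and the standard library's robot parser does the robots.txt handling:

import time
import urllib.parse
import urllib.request
import urllib.robotparser

# Hypothetical identity -- whatever you use, include a working contact address.
USER_AGENT = "ExampleSemWebBot/0.1 (mailto:crawler-admin@example.org)"

def polite_fetch(urls, default_delay=1.0):
    robots = {}     # cached robots.txt parser per host
    last_hit = {}   # time of the last request per host
    for url in urls:
        host = urllib.parse.urlsplit(url).netloc
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            robots[host] = rp
        rp = robots[host]
        if not rp.can_fetch(USER_AGENT, url):
            continue  # skip anything the site has asked us not to crawl
        # Respect an explicit Crawl-delay, otherwise stay at <= 1 request/second.
        delay = rp.crawl_delay(USER_AGENT) or default_delay
        wait = last_hit.get(host, 0.0) + delay - time.time()
        if wait > 0:
            time.sleep(wait)
        req = urllib.request.Request(url, headers={
            "User-Agent": USER_AGENT,
            "Accept": "text/turtle, application/rdf+xml;q=0.8",
        })
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        last_hit[host] = time.time()
        yield url, data

The point is simply to keep per-host state, honour Disallow and Crawl-delay, identify yourself, and never hit the same site in parallel.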