I wonder, are there ways to link RDF data so that conventional crawlers do not crawl it, but only the semantic-web-aware ones do? I am not sure how the current practice of linking via a link tag in the HTML headers could achieve this, and it may well be that those heavy loads come from crawlers that have nothing to do with the Semantic Web... Maybe we should start linking to our RDF/XML, Turtle and N-Triples files and publishing sitemap information in RDFa...
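
To make that concrete, what I mean by the current practice is roughly the following; the file names, paths and numbers are only meant as an illustration, not a recommendation.

In the HTML head, advertise the RDF serializations as alternates:

  <link rel="alternate" type="application/rdf+xml" href="/data/products.rdf" />
  <link rel="alternate" type="text/turtle" href="/data/products.ttl" />

and in robots.txt, at least hint at an acceptable crawl rate and keep everyone away from the expensive dump files:

  User-agent: *
  Crawl-delay: 10
  Disallow: /dumps/

A well-behaved crawler of either kind should honour the robots.txt part; the open question is whether conventional crawlers leave the alternate links alone.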
Best,
Jiri

On 06/22/2011 09:00 AM, Steve Harris wrote:
> While I don't agree with Andreas exactly that it's the site owners fault,
> this is something that publishers of non-semantic data have to deal with.
>
> If you publish a large collection of interlinked data which looks interesting
> to conventional crawlers and is expensive to generate, conventional web
> crawlers will be all over it. The main difference is that a greater
> percentage of those are written properly, to follow robots.txt and the
> guidelines about hit frequency (maximum 1 request per second per domain, no
> parallel crawling).
>
> Has someone published similar guidelines for semantic web crawlers?
>
> The ones that don't behave themselves get banned, either in robots.txt, or
> explicitly by the server.
>
> - Steve
>
> On 2011-06-22, at 06:07, Martin Hepp wrote:
>
>> Hi Daniel,
>> Thanks for the link! I will relay this to relevant site-owners.
>>
>> However, I still challenge Andreas' statement that the site-owners are to
>> blame for publishing large amounts of data on small servers.
>>
>> One can publish 10,000 PDF documents on a tiny server without being hit by
>> DoS-style crazy crawlers. Why should the same not hold if I publish RDF?
>>
>> But for sure, it is necessary to advise all publishers of large RDF datasets
>> to protect themselves against hungry crawlers and actual DoS attacks.
>>
>> Imagine if a large site was brought down by a botnet that is exploiting
>> Semantic Sitemap information for DoS attacks, focussing on the large dump
>> files. This could end LOD experiments for that site.
>>
>> Best
>>
>> Martin
>>
>> On Jun 21, 2011, at 10:24 AM, Daniel Herzig wrote:
>>
>>> Hi Martin,
>>>
>>> Have you tried to put a Squid [1] as reverse proxy in front of your
>>> servers and use delay pools [2] to catch hungry crawlers?
>>>
>>> Cheers,
>>> Daniel
>>>
>>> [1] http://www.squid-cache.org/
>>> [2] http://wiki.squid-cache.org/Features/DelayPools
>>>
>>> On 21.06.2011, at 09:49, Martin Hepp wrote:
>>>
>>>> Hi all:
>>>>
>>>> For the third time in a few weeks, we had massive complaints from
>>>> site-owners that Semantic Web crawlers from Universities visited their
>>>> sites in a way close to a denial-of-service attack, i.e., crawling data
>>>> with maximum bandwidth in a parallelized approach.
>>>>
>>>> It's clear that a single, stupidly written crawler script, run from a
>>>> powerful University network, can quickly create terrible traffic load.
>>>>
>>>> Many of the scripts we saw
>>>>
>>>> - ignored robots.txt,
>>>> - ignored clear crawling speed limitations in robots.txt,
>>>> - did not identify themselves properly in the HTTP request header or
>>>>   lacked contact information therein,
>>>> - used no mechanisms at all for limiting the default crawling speed and
>>>>   re-crawling delays.
>>>>
>>>> This irresponsible behavior can be the final reason for site-owners to say
>>>> farewell to academic/W3C-sponsored semantic technology.
>>>>
>>>> So please, please - advise all of your colleagues and students to NOT
>>>> write simple crawler scripts for the billion triples challenge or
>>>> whatsoever without familiarizing themselves with the state of the art in
>>>> "friendly crawling".
>>>>
>>>> Best wishes
>>>>
>>>> Martin Hepp
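
P.S. For what it's worth, a crawler that follows the guidelines quoted above does not have to be complicated. Here is a rough Python sketch; the user-agent string, contact address and delay values are made up for illustration, and the standard library's robot parser does the robots.txt handling:

import time
import urllib.parse
import urllib.request
import urllib.robotparser

# Hypothetical identity -- whatever you use, include a working contact address.
USER_AGENT = "ExampleSemWebBot/0.1 (mailto:crawler-admin@example.org)"

def polite_fetch(urls, default_delay=1.0):
    robots = {}     # cached robots.txt parser per host
    last_hit = {}   # time of the last request per host
    for url in urls:
        host = urllib.parse.urlsplit(url).netloc
        if host not in robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            rp.read()
            robots[host] = rp
        rp = robots[host]
        if not rp.can_fetch(USER_AGENT, url):
            continue  # skip anything the site has asked us not to crawl
        # Respect an explicit Crawl-delay, otherwise stay at <= 1 request/second.
        delay = rp.crawl_delay(USER_AGENT) or default_delay
        wait = last_hit.get(host, 0.0) + delay - time.time()
        if wait > 0:
            time.sleep(wait)
        req = urllib.request.Request(url, headers={
            "User-Agent": USER_AGENT,
            "Accept": "text/turtle, application/rdf+xml;q=0.8",
        })
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        last_hit[host] = time.time()
        yield url, data

The point is simply to keep per-host state, honour Disallow and Crawl-delay, identify yourself, and never hit the same site in parallel.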