Think before you write Semantic Web crawlers

2011-06-21 Thread Martin Hepp
Hi all: For the third time in a few weeks, we had massive complaints from site-owners that Semantic Web crawlers from universities visited their sites in a way close to a denial-of-service attack, i.e., crawling data at maximum bandwidth in a parallelized approach. It's clear that a single,
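
The politeness Martin is asking for is easy to state in code: identify yourself, honor robots.txt, fetch serially, and pause between requests. A minimal sketch in Python 3; the user-agent string, target host, and 2-second delay are illustrative values, not anything agreed in the thread:

    import time
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "example-research-crawler/0.1 (mailto:admin@example.org)"  # identify yourself
    CRAWL_DELAY = 2.0  # seconds between requests to the same host; illustrative

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.org/robots.txt")
    rp.read()

    def polite_fetch(url):
        """Fetch one URL, honoring robots.txt and pausing between requests."""
        if not rp.can_fetch(USER_AGENT, url):
            return None  # the site owner has opted out; respect that
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
        time.sleep(CRAWL_DELAY)  # throttle: one request at a time, never in parallel
        return body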

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Henry Story
A solution to stupid crawlers would be to put the linked data behind https endpoints, and use WebID for authentication. You could still allow everyone access, but at least you would force the crawler to identify itself, and use these WebIDs to learn who was making the crawler. This could
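
What Henry describes is WebID(-TLS): the client presents a certificate whose subjectAltName holds a URI, and the server dereferences that URI to check whether the published profile vouches for the certificate's key. A minimal sketch of just that last check, assuming the TLS layer has already extracted the WebID URI and the RSA key, using rdflib and the W3C cert ontology; function and variable names are illustrative, not a standard API:

    from rdflib import Graph, Namespace, URIRef

    CERT = Namespace("http://www.w3.org/ns/auth/cert#")

    def webid_matches(webid_uri, cert_modulus, cert_exponent):
        """Dereference the WebID profile and check it lists the client's key."""
        g = Graph()
        g.parse(webid_uri)  # fetch the profile document the WebID points at
        for key in g.objects(URIRef(webid_uri), CERT.key):
            modulus = g.value(key, CERT.modulus)    # xsd:hexBinary in the cert ontology
            exponent = g.value(key, CERT.exponent)  # xsd:integer
            if modulus is not None and exponent is not None \
                    and int(modulus, 16) == cert_modulus \
                    and int(exponent) == cert_exponent:
                return True   # the profile vouches for this certificate
        return False          # unknown key: treat the crawler as unidentified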

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Christopher Gutteridge
Would some kind of caching crawler mitigate this issue? Have someone write a well-behaved crawler which allows you to download a recent .ttl.tgz of various sites. Of course, that assumes the student is able to find such a cache. Asking people nicely will only work in a very small community.
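
The appeal of the dump approach is that it replaces thousands of requests with one. A sketch of the consumer side in Python 3, assuming a site publishes such an archive at a known URL; the URL is hypothetical, no convention was agreed in the thread:

    import tarfile
    import urllib.request

    DUMP_URL = "http://example.org/dumps/site.ttl.tgz"  # assumed location

    # One HTTP request for the whole site's data...
    urllib.request.urlretrieve(DUMP_URL, "site.ttl.tgz")
    # ...then work entirely from the local copy.
    with tarfile.open("site.ttl.tgz", "r:gz") as tar:
        tar.extractall("site-dump")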

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Martin Hepp
Hi Christopher, Henry, all: The main problem is, imho: 1. the basic attitude of Semantic Web research that the works done in the past or in other communities were irrelevant historical relics (databases, middleware, EDI) and that the old fellows were simply too stupid to understand the power

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Henry Story
On 21 Jun 2011, at 10:48, Christopher Gutteridge wrote: Would some kind of caching crawler mitigate this issue? Have someone write a well-behaved crawler which allows you to download a recent .ttl.tgz of various sites. Of course, that assumes the student is able to find such a cache.

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Henry Story
On 21 Jun 2011, at 11:44, Martin Hepp wrote: Hi Christopher, Henry, all: The main problem is, imho: 1. the basic attitude of Semantic Web research that the works done in the past or in other communities were irrelevant historical relics (databases, middleware, EDI) and that the old

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Kingsley Idehen
On 6/21/11 9:41 AM, Henry Story wrote: A solution to stupid crawlers would be to put the linked data behind https endpoints, and use WebID for authentication. You could still allow everyone access, but at least you would force the crawler to identify himself, and use these WebIDs to learn who

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Kingsley Idehen
On 6/21/11 10:44 AM, Martin Hepp wrote: Hi Christopher, Henry, all: The main problem is, imho: 1. the basic attitude of Semantic Web research that the works done in the past or in other communities were irrelevant historical relics (databases, middleware, EDI) and that the old fellows were

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Kingsley Idehen
On 6/21/11 10:54 AM, Henry Story wrote: Then you could just redirect him straight to the n3 dump of graphs of your site (I say graphs because, your site not necessarily being consistent, the crawler may be interested in keeping information about which pages said what). Redirect may be a bit
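
Henry's redirect idea amounts to a few lines of server logic: anything that identifies itself as a bot gets a 303 to the site-wide dump instead of the HTML. A toy WSGI version in Python 3; the User-Agent test and dump path are placeholders, not part of any proposal in the thread:

    from wsgiref.simple_server import make_server

    DUMP_PATH = "/dumps/site.n3"  # assumed location of the graph dump

    def app(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if "crawler" in ua or "bot" in ua:
            # Send identified crawlers straight to the dump.
            start_response("303 See Other", [("Location", DUMP_PATH)])
            return [b""]
        start_response("200 OK", [("Content-Type", "text/html")])
        return [b"<p>Regular page for humans.</p>"]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()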

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Kingsley Idehen
On 6/21/11 11:23 AM, Kingsley Idehen wrote: A looong time ago, very early LOD days, we (LOD community) talked about the importance of dumps with the heuristic you describe in mind (no WebID then, but it was clear something would emerge). Unfortunately, SPARQL endpoints have become the first

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Martin Hepp
Thanks for the hint, but I am not talking about my servers. I am talking about a site-owner somewhere in Kentucky running a small shop on www.godaddy.com who adds RDF to his site, informs PingTheSemanticWeb, and what he gets in turn are wild-west crawlers that bring down his tiny server by

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Martin Hepp
Yes, RDF data dumps without traffic control mechanisms are an invitation to denial-of-service attacks. On Jun 21, 2011, at 12:28 PM, Kingsley Idehen wrote: On 6/21/11 11:23 AM, Kingsley Idehen wrote: A looong time ago, very early LOD days, we (LOD community) talked about the importance of

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Henry Story
On 21 Jun 2011, at 12:23, Kingsley Idehen wrote: On 6/21/11 10:54 AM, Henry Story wrote: Then you could just redirect him straight to the n3 dump of graphs of your site (I say graphs because, your site not necessarily being consistent, the crawler may be interested in keeping information

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Kingsley Idehen
On 6/21/11 12:06 PM, Henry Story wrote: On 21 Jun 2011, at 12:23, Kingsley Idehen wrote: On 6/21/11 10:54 AM, Henry Story wrote: Then you could just redirect him straight to the n3 dump of graphs of your site (I say graphs because, your site not necessarily being consistent, the crawler may

OWL ontology database

2011-06-21 Thread Aliabbas Petiwala
Is neo4j (http://neo4j.org) a good option to consider for this? Does it provide seamless integration with DBpedia, Freebase, etc.? On 6/19/11, Marco Brandizi brand...@ebi.ac.uk wrote: Hi Aliabbas, It all depends on what you want to represent and which tasks you want to perform. OWL-based

Re: Help needed: *brief* online poll about blank-nodes

2011-06-21 Thread Alejandro Mallea
Hi all, We would like to say thanks to all of you who have replied to the poll on blank nodes. We have got interesting answers and feedback, and we will be making the results available online early next week (with the exception of the data sets). If you still want to participate, the poll will be

Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

2011-06-21 Thread Martin Hepp
Hi Andreas: I do not publish large datasets, and the complaint was not about someone using them. The complaint was about stupid crawlers bombarding sites with unlimited crawling throughput close to a Denial-of-Service attack. You may want to ask the Sindice guys re implementing polite yet

Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

2011-06-21 Thread Dieter Fensel
-1. Obviously it is not useful to kill the web server of small shops due to academic experiments. At 02:29 PM 6/21/2011, Andreas Harth wrote: Dear Martin, I agree with you that software accessing large portions of the web should adhere to basic principles (such as robots.txt). However, I

Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

2011-06-21 Thread David Wood
Concur. Small companies, too, are sometimes surprised by large EC2 invoices. If people are *using* your data, that's good. If poorly behaved bots are simply costing you money because their creators can't be bothered to support the robot exclusion protocol, that's bad. Regards, Dave On

Re: Think before you publish large datasets (was: Re: Think before you write Semantic Web crawlers)

2011-06-21 Thread Alan Ruttenberg
In the words of the great Al Franken: It's easier to put on slippers than to carpet the world. http://www.quotationspage.com/quotes/Al_Franken/ While I don't support poorly written software, it's probably a good idea to publish on your web site some recipes for defenses against poor spidering.
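
The cheapest recipe the thread keeps returning to is the robot exclusion protocol itself. An illustrative robots.txt; Crawl-delay is a non-standard extension that only some crawlers honor, and the path and delay value are placeholders:

    # robots.txt: the first line of defense against impolite crawlers
    User-agent: *
    Crawl-delay: 10
    Disallow: /sparql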

Re: Squaring the HTTP-range-14 circle [was Re: Schema.org in RDF ...]

2011-06-21 Thread David Booth
On Sat, 2011-06-18 at 23:05 -0500, Pat Hayes wrote: Really (sorry to keep raining on the parade, but) it is not as simple as this. Look, it is indeed easy to not bother distinguishing male from female dogs. One simply talks of dogs without mentioning gender, and there is a lot that can be said

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Daniel Herzig
Hi Martin, Have you tried to put a Squid [1] as reverse proxy in front of your servers and use delay pools [2] to catch hungry crawlers? Maybe that helps. Cheers, Daniel [1] http://www.squid-cache.org/ [2] http://wiki.squid-cache.org/Features/DelayPools On 21.06.2011, at 09:49, Martin Hepp
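
For reference, the delay-pool setup Daniel points at takes only a few lines of squid.conf. An illustrative fragment with a single class-1 pool that caps aggregate throughput for matching clients; the ACL and the 32 KB/s figure are placeholders, not recommendations:

    # squid.conf fragment: throttle hungry crawlers with a class-1 delay pool
    acl hungry_crawlers src 0.0.0.0/0      # in practice, match by subnet or User-Agent
    delay_pools 1
    delay_class 1 1
    delay_parameters 1 32000/32000         # restore rate / max bytes per second
    delay_access 1 allow hungry_crawlers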

Re: Think before you write Semantic Web crawlers

2011-06-21 Thread Martin Hepp
Hi Daniel, Thanks for the link! I will relay this to relevant site-owners. However, I still challenge Andreas' statement that the site-owners are to blame for publishing large amounts of data on small servers. One can publish 10,000 PDF documents on a tiny server without being hit by DoS-style