Hi all:
For the third time in a few weeks, we have received massive complaints from site-owners
that Semantic Web crawlers from universities visited their sites in a way close
to a denial-of-service attack, i.e., crawling data at maximum bandwidth in a
parallelized fashion.
It's clear that a single,
A solution to stupid crawlers would be to put the linked data behind https
endpoints and use WebID for authentication. You could still allow everyone access,
but at least you would force the crawler to identify itself, and use these WebIDs
to learn who is behind the crawler. This could
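[As an illustration of the identification step only: a minimal sketch, assuming the TLS layer hands the application the client certificate in DER form (e.g. Python's getpeercert(binary_form=True)) and assuming the "cryptography" package is available. It merely extracts the claimed WebID URI from the subjectAltName; verifying that key against the WebID profile document is a separate step.]

    # Sketch only: extract the claimed WebID URI from a client certificate.
    from cryptography import x509

    def webid_from_client_cert(der_bytes):
        cert = x509.load_der_x509_certificate(der_bytes)
        try:
            san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
        except x509.ExtensionNotFound:
            return None  # no subjectAltName: the client presented no WebID
        uris = san.value.get_values_for_type(x509.UniformResourceIdentifier)
        return uris[0] if uris else None  # the WebID is carried as a URI entry

A server could then log or rate-limit requests per WebID rather than per IP address.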
Would some kind of caching crawler mitigate this issue? Have someone
write a well-behaved crawler which allows you to download a recent
.ttl.tgz of various sites. Of course, that assumes the student is able
to find such a cache.
Asking people nicely will only work in a very small community.
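[A minimal sketch of the "download a dump instead of crawling" behaviour suggested above. The /data/dump.ttl.tgz location is purely hypothetical; in practice the dump would need to be advertised, e.g. via VoID's void:dataDump.]

    # Sketch only: prefer a published dump over crawling the whole site.
    import urllib.request, urllib.error

    def fetch_dump_or_crawl(site, crawl_fallback):
        dump_url = site.rstrip("/") + "/data/dump.ttl.tgz"   # hypothetical location
        try:
            with urllib.request.urlopen(dump_url, timeout=30) as resp:
                return resp.read()                            # use the dump, skip crawling
        except (urllib.error.HTTPError, urllib.error.URLError):
            return crawl_fallback(site)                       # no dump found: crawl politely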
Hi Christopher, Henry, all:
The main problem is, imho:
1. the basic attitude of Semantic Web research that the work done in the past
or in other communities was an irrelevant historical relic (databases,
middleware, EDI) and that the old fellows were simply too stupid to understand
the power
On 6/21/11 10:54 AM, Henry Story wrote:
Then you could just redirect him straight to the n3 dump of graphs of
your site (I say graphs because, your site not necessarily being
consistent, the crawler may be interested in keeping information about
which pages said what).
Redirect may be a bit
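[For illustration, the redirect idea as a tiny WSGI middleware. The user-agent substrings and the dump URL are hypothetical; a quad dump is used here so that "which pages said what" is preserved.]

    # Sketch only: send known crawlers a 303 to one dump instead of letting
    # them walk every resource on the site.
    CRAWLER_UAS = ("ldspider", "semweb-crawler")       # hypothetical identifiers
    DUMP_URL = "http://example.org/dumps/site.nq.gz"   # hypothetical location

    def redirect_crawlers(app):
        def middleware(environ, start_response):
            ua = environ.get("HTTP_USER_AGENT", "").lower()
            if any(name in ua for name in CRAWLER_UAS):
                start_response("303 See Other", [("Location", DUMP_URL)])
                return [b""]
            return app(environ, start_response)
        return middleware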
On 6/21/11 11:23 AM, Kingsley Idehen wrote:
A looong time ago, very early LOD days, we (LOD community) talked
about the importance of dumps with the heuristic you describe in mind
(no WebID then, but it was clear something would emerge).
Unfortunately, SPARQL endpoints have become the first
Thanks for the hint, but I am not talking about my servers.
I am talking about a site-owner somewhere in Kentucky running a small shop on
www.godaddy.com who adds RDF to his site, informs PingTheSemanticWeb, and what
he gets in turn are wild-west crawlers that bring down his tiny server by
Yes, RDF data dumps without traffic control mechanisms are an invitation to
denial-of-service attacks.
Is neo4j http://neo4j.org a good option to consider for this? Does it
provide seamless integration with DBpedia, Freebase, etc.?
On 6/19/11, Marco Brandizi brand...@ebi.ac.uk wrote:
Hi Aliabbas,
It all depends on what you want to represent and which tasks you want to
perform.
OWL-based
Hi all,
We would like to say thanks to all of you who have replied to the poll on
blank nodes. We have got interesting answers and feedback, and we will be
making the results available online early next week (with the exception of
the data sets). If you still want to participate, the poll will be
Hi Andreas:
I do not publish large datasets, and the complaint was not about someone using
them. The complaint was about stupid crawlers bombarding sites with unlimited
crawling throughput, close to a denial-of-service attack.
You may want to ask the Sindice guys re implementing polite yet
-1.
Obviously, it is not acceptable to kill the web servers of small shops for the
sake of academic experiments.
At 02:29 PM 6/21/2011, Andreas Harth wrote:
Dear Martin,
I agree with you that software accessing large portions of the web
should adhere to basic principles (such as robots.txt).
However, I
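[A rough sketch of the basic politeness mentioned above: honour robots.txt, including Crawl-delay, and never hit the same host faster than that. The agent name and the 2-second default are illustrative.]

    # Sketch only: per-host politeness using Python's urllib.robotparser.
    import time
    import urllib.robotparser
    from urllib.parse import urlparse
    from urllib.request import urlopen

    class PoliteFetcher:
        def __init__(self, agent="example-research-crawler"):
            self.agent = agent
            self.robots = {}     # host -> RobotFileParser
            self.last_hit = {}   # host -> time of last request

        def _rules(self, host):
            if host not in self.robots:
                rp = urllib.robotparser.RobotFileParser("http://%s/robots.txt" % host)
                rp.read()
                self.robots[host] = rp
            return self.robots[host]

        def fetch(self, url):
            host = urlparse(url).netloc
            rules = self._rules(host)
            if not rules.can_fetch(self.agent, url):
                return None                               # disallowed: skip it
            delay = rules.crawl_delay(self.agent) or 2    # fall back to 2s between hits
            wait = self.last_hit.get(host, 0) + delay - time.time()
            if wait > 0:
                time.sleep(wait)
            self.last_hit[host] = time.time()
            return urlopen(url, timeout=30).read()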
Concur. Small companies, too, are sometimes surprised by large EC2 invoices.
If people are *using* your data, that's good. If poorly behaved bots are
simply costing you money because their creators can't be bothered to support
the robot exclusion protocol, that's bad.
Regards,
Dave
In the words of the great Al Franken: It's easier to put on slippers
than to carpet the world.
http://www.quotationspage.com/quotes/Al_Franken/
While I don't support poorly written software, it's probably a good
idea to publish on your web site some recipes for defending against
poor spidering.
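[One possible "recipe" of that kind, as a sketch only: a small WSGI middleware that answers with 429 once a client exceeds a request budget. The threshold is illustrative, and a real deployment would rather throttle in the web server or a reverse proxy.]

    # Sketch only: per-IP request budget, enforced in the application.
    import time
    from collections import defaultdict

    def throttle(app, max_per_minute=60):
        hits = defaultdict(list)   # client IP -> timestamps of recent requests
        def middleware(environ, start_response):
            ip = environ.get("REMOTE_ADDR", "unknown")
            now = time.time()
            hits[ip] = [t for t in hits[ip] if now - t < 60] + [now]
            if len(hits[ip]) > max_per_minute:
                start_response("429 Too Many Requests", [("Retry-After", "60")])
                return [b"Too many requests - please slow down.\n"]
            return app(environ, start_response)
        return middleware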
On Sat, 2011-06-18 at 23:05 -0500, Pat Hayes wrote:
Really (sorry to keep raining on the parade, but) it is not as simple
as this. Look, it is indeed easy to not bother distinguishing male
from female dogs. One simply talks of dogs without mentioning gender,
and there is a lot that can be said
Hi Martin,
Have you tried to put a Squid [1] as reverse proxy in front of your servers
and use delay pools [2] to catch hungry crawlers?
May be that helps.
Cheers,
Daniel
[1] http://www.squid-cache.org/
[2] http://wiki.squid-cache.org/Features/DelayPools
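[For reference, the delay-pool idea from [2] might look roughly like this in squid.conf; an untested sketch, with an illustrative User-Agent pattern and bandwidth figure.]

    # Sketch only: throttle clients whose User-Agent looks like a crawler
    # to an aggregate of ~64 KB/s, leaving interactive visitors untouched.
    acl crawlers browser -i (crawler|spider|bot)
    delay_pools 1
    delay_class 1 1
    delay_parameters 1 64000/64000
    delay_access 1 allow crawlers
    delay_access 1 deny all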
On 21.06.2011, at 09:49, Martin Hepp wrote:
Hi all:
Hi Daniel,
Thanks for the link! I will relay this to relevant site-owners.
However, I still challenge Andreas' statement that the site-owners are to blame
for publishing large amounts of data on small servers.
One can publish 10,000 PDF documents on a tiny server without being hit by
DoS-style