Dan Brickley wrote:
> (trimming cc: list to LOD and DBpedia)

> On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen <kide...@openlinksw.com> wrote:

>> My comment wasn't a "what is DBpedia?" lecture. It was about clarifying
>> the crux of the matter, i.e., bandwidth consumption and its effects on
>> other DBpedia users (as well as our own non-DBpedia related Web properties).
>>
>>> (Leigh) I was just curious about usage volumes. We all talk about how
>>> central DBpedia is in the LOD cloud picture, and wondered if there were
>>> any publicly accessible metrics to help add some detail to that.

>> Well, here is the critical detail: people typically crawl DBpedia. They
>> crawl it more than any other Data Space in the LOD cloud. They do so
>> because DBpedia is still quite central to the burgeoning Web of
>> Linked Data.

> Have you considered blocking DBpedia crawlers more aggressively, and
> nudging them to alternative ways of accessing the data?

Yes.

Some have cleaned up their act, for sure.

The problem is, there are others still doing the same thing, who then complain about the instance in a very generic fashion.

> While it is a shame to say 'no' to people trying to use linked data,
> this would be more saying 'yes, but not like that...'.

We have an outstanding blog post / technical note about the DBpedia instance that hasn't been published yet (possibly due to the 3.5 and DBpedia-Live work we are doing); that note will cover how to work with the instance, etc.
>> When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
>> via SPARQL, which is still ultimately an "export from DBpedia and import
>> into my data space" mindset.

> That's useful to know, thanks. Do you have the impression that these
> folk are typically trying to copy the entire thing, or to make some
> filtered subset (by geographical view, topic, property, etc.)?
Many (and to some degree this is quite natural) attempt to export the whole thing. Even when they're nudged to use OFFSET and LIMIT, the end result is multiple hits en route to a complete export.
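
For illustration, the access pattern in question looks roughly like this. A minimal Python sketch; the page size and the serialization requested are assumptions, not what any particular client actually uses:

    # Sketch of the "export the whole thing" pattern: walk every triple
    # with LIMIT/OFFSET, one HTTP round trip per page.
    import urllib.parse
    import urllib.request

    ENDPOINT = "http://dbpedia.org/sparql"
    PAGE_SIZE = 10000  # arbitrary; real clients pick similar values

    def fetch_page(offset):
        query = (f"CONSTRUCT {{ ?s ?p ?o }} WHERE {{ ?s ?p ?o }} "
                 f"LIMIT {PAGE_SIZE} OFFSET {offset}")
        url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
        req = urllib.request.Request(url, headers={"Accept": "text/plain"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    offset = 0
    while True:
        page = fetch_page(offset)
        if not page.strip():
            break              # endpoint exhausted: export complete
        # ... append page to the local copy ...
        offset += PAGE_SIZE    # and hit the server again

Each iteration is another full query against the store, which is why even a "polite" paged export still adds up to a heavy, sustained load.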
> Can studying these logs help provide different downloadable dumps that
> would discourage crawlers?

We do have a solution in mind: basically, we are going to have a separate place for the descriptor resources and redirect crawlers there via 303s, etc.
That's as simple and precise as this matter is.
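
To make that concrete, here is a minimal sketch of the idea, assuming a hypothetical dumps host and a deliberately naive User-Agent check; the production rules would live in server configuration rather than application code:

    # Sketch: 303-redirect identified crawlers to a separate
    # descriptor-resource host. Host name and agent markers are
    # hypothetical placeholders.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    DUMP_HOST = "http://dumps.example.org"          # hypothetical mirror
    CRAWLER_MARKERS = ("bot", "crawler", "spider")  # naive heuristic

    class RedirectingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            agent = self.headers.get("User-Agent", "").lower()
            if any(m in agent for m in CRAWLER_MARKERS):
                # 303 See Other: go fetch the bulk copy instead
                self.send_response(303)
                self.send_header("Location", DUMP_HOST + self.path)
                self.end_headers()
            else:
                self.send_response(200)
                self.send_header("Content-Type", "text/plain")
                self.end_headers()
                self.wfile.write(b"...normal Linked Data response...")

    if __name__ == "__main__":
        HTTPServer(("", 8000), RedirectingHandler).serve_forever()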

>> From a SPARQL perspective, DBpedia is quite microscopic; it's when you
>> factor in crawler mentality and network bandwidth that issues arise, and
>> we deliberately have protection in place for crawlers.

> Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
> anything discouraging crawlers. Where is the 'best practice' or
> 'acceptable use' advice we should all be following, to avoid putting
> needless burden on your servers and bandwidth?

We'll get the guide out.
> As you mention, DBpedia is an important and central resource, thanks
> both to the work of the Wikipedia community, and those in the DBpedia
> project who enrich and make available all that information. It's
> therefore important that the SemWeb / Linked Data community takes care
> to remember that these things don't come for free, that bills need
> paying, and that de-referencing is a privilege, not a right.

"Bills" the major operative word in a world where the "Bill Payer" and "Database Maintainer" is a footnote (at best) re. perception of what constitutes the DBpedia Project.

Our own ISPs even had to get in contact with us (last quarter of 2009) re. the amount of bandwidth being consumed by DBpedia, etc.

> If there are things we can do as a technology community to lower the
> cost of hosting / distributing such data, or to nudge consumers of it
> in the direction of more sustainable habits, we should do so. If
> there's not so much the rest of us can do but say 'thanks!', ...
> then, ...er, 'thanks!'. Much appreciated!

For us, the most important thing is perspective. DBpedia is just another space on a public network, so it can't magically rewrite the underlying physics of wide-area networking, where access is open to the world. What we can do is publish a note about proper behavior and explain how we protect the instance so that everyone has a chance of using it (rather than a select few resource guzzlers).
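
For the curious, that protection has the general shape of per-client throttling. A token-bucket sketch, with invented numbers rather than the instance's actual limits:

    # Sketch: per-client token-bucket throttling. RATE and BURST are
    # invented; the real instance enforces its own limits server-side.
    import time
    from collections import defaultdict

    RATE = 2.0    # tokens refilled per second (assumed)
    BURST = 10.0  # maximum bucket size (assumed)

    buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

    def allow(client_ip):
        """Return True if this client may make a request right now."""
        b = buckets[client_ip]
        now = time.monotonic()
        b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
        b["last"] = now
        if b["tokens"] >= 1.0:
            b["tokens"] -= 1.0
            return True
        return False  # caller would answer 429/503 with Retry-After

Well-behaved clients never notice the limit; a crawler hammering the endpoint quickly drains its bucket and gets told to back off.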
> Are there any scenarios around e.g. BitTorrent that could be explored?
> What if each of the static files in http://dbpedia.org/sitemap.xml
> were available as torrents (or magnet: URIs)?
When we set up the Descriptor Resource host, these would certainly be considered.
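
The per-file mechanics would be simple enough. A sketch that builds magnet links from already-computed BitTorrent info-hashes; the hash, tracker, and file name below are placeholders, not real values:

    # Sketch: magnet: URIs for dump files, given each file's BitTorrent
    # info-hash (BTIH). Hash, tracker, and file name are placeholders.
    import urllib.parse

    TRACKER = "udp://tracker.example.org:6969"  # hypothetical tracker

    def magnet_uri(info_hash, display_name):
        params = urllib.parse.urlencode({"dn": display_name, "tr": TRACKER})
        return f"magnet:?xt=urn:btih:{info_hash}&{params}"

    print(magnet_uri("0123456789abcdef0123456789abcdef01234567",
                     "dbpedia_instance_types_en.nt.bz2"))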
> I realise that would only address part of the problem/cost, but it's
> a widely used technology for distributing large files; can we bend it
> to our needs?
Also, we encourage the use of gzip over HTTP :-)
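
Concretely, that just means asking for compressed transfer and decompressing on receipt. A small sketch; the resource URL is only an example:

    # Sketch: request a description with gzip transfer encoding, which
    # cuts bandwidth on both sides. URL is an example resource.
    import gzip
    import urllib.request

    req = urllib.request.Request(
        "http://dbpedia.org/data/Berlin.ntriples",
        headers={"Accept-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)  # server honoured the request
    print(body[:200])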

Kingsley
> cheers,
>
> Dan



--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen




