Re: DBpedia hosting burden

2010-04-18 Thread Daniel Koller
Dan,

I just set up some torrent files containing the current English and German
dbpedia content (as a test/proof of concept; I was just curious to see how
fast a network effect builds via p2p networks).

To try, go to http://dakoller.net/dbpedia_torrents/dbpedia_torrents.html.

I presume that to get it working you just need the first people to download it
(and keep spreading it around with their torrent clients)... as long as the
*.torrent files stay consistent. (Layout of the link page courtesy of the
dbpedia people.)
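
For anyone who wants to reproduce this, here is a minimal sketch of how a
single-file .torrent can be built with nothing but the Python standard
library (the dump file name and the tracker URL below are placeholders, not
the ones actually used on the page above):

import hashlib, os

def bencode(obj):
    # Minimal bencoder: ints, bytes/str, lists, dicts (keys sorted), as the
    # metainfo format requires.
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, str):
        obj = obj.encode("utf-8")
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        return b"d" + b"".join(bencode(k) + bencode(v)
                               for k, v in sorted(obj.items())) + b"e"
    raise TypeError(obj)

def make_torrent(path, tracker, piece_len=2**20):
    # SHA-1 of each fixed-size piece, concatenated, forms the "pieces" string.
    pieces = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(piece_len)
            if not chunk:
                break
            pieces += hashlib.sha1(chunk).digest()
    info = {"name": os.path.basename(path),
            "length": os.path.getsize(path),
            "piece length": piece_len,
            "pieces": pieces}
    return bencode({"announce": tracker, "info": info})

# Placeholder names, for illustration only.
with open("infobox_properties_en.nt.bz2.torrent", "wb") as out:
    out.write(make_torrent("infobox_properties_en.nt.bz2",
                           "http://tracker.example.org/announce"))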

Kind regards,

Daniel

On Wed, Apr 14, 2010 at 9:04 PM, Dan Brickley dan...@danbri.org wrote:

 On Wed, Apr 14, 2010 at 8:11 PM, Kingsley Idehen kide...@openlinksw.com
 wrote:


  Some have cleaned up their act for sure.
 
  Problem is, there are others doing the same thing, who then complain
 about
  the instance in very generic fashion.

 They're lucky it exists at all. I'd refer them to this Louis CK sketch
 -
 http://videosift.com/video/Louie-CK-on-Conan-Oct-1st-2008?fromdupe=We-live-in-an-amazing-amazing-world-and-we-complain
 (if it stays online...).

  While it is a
  shame to say 'no' to people trying to use linked data, this would be
  more saying 'yes, but not like that...'.
 
 
  I think we have an outstanding blog post / technical note about the
 DBpedia
  instance that hasn't been published (possibly due to the 3.5 and
  DBpedia-Live work we are doing), said note will cover how to work with
 the
  instance etc..
 [..]
  We do have a solution in mind, basically, we are going to have a
 different
  place for the descriptor resources and redirect crawlers there  via 303's
  etc..
 [...]
  We'll get the guide out.


 That sounds useful

  As you mention, DBpedia is an important and central resource, thanks
  both to the work of the Wikipedia community, and those in the DBpedia
  project who enrich and make available all that information. It's
  therefore important that the SemWeb / Linked Data community takes care
  to remember that these things don't come for free, that bills need
  paying and that de-referencing is a privilege not a right.
 
  "Bills" is the major operative word in a world where the Bill Payer and
  Database Maintainer is a footnote (at best) re. perception of what
  constitutes the DBpedia Project.

 Yes, I'm sure some are thoughtless and take it for granted; but also
 that others are well aware of the burdens.

 (For that matter, I'm not myself so sure how Wikipedia cover their
 costs or what their longer-term plan is...).


  For us, the most important thing is perspective. DBpedia is another space
 on
  a public network, thus it can't magically rewrite the underlying physics
 of
  wide area networking where access is open to the world.  Thus, we can
 make a
  note about proper behavior and explain how we protect the instance such
 that
  everyone has a chance of using it (rather than a select few resource
  guzzlers).

 This I think is something others can help with, when presenting LOD
 and related concepts: to encourage good habits that spread the cost of
 keeping this great dataset globally available. So all those making
 slides, tutorials, blog posts or software tools have a role to play
 here.

  Are there any scenarios around eg. BitTorrent that could be explored?
  What if each of the static files in http://dbpedia.org/sitemap.xml
  were available as torrents (or magnet: URIs)?
 
  When we set up the Descriptor Resource host, these would certainly be
  considered.

 Ok, let's take care to explore that then; it would probably help
 others too. There must be dozens of companies and research
 organizations who could put some bandwidth resources into this, if
 only there was a short guide to setting up a GUI-less bittorrent tool
 and configuring it appropriately. Are there any bittorrent experts on
 these mailing lists who could suggest next practical steps here (not
 necessarily dbpedia-specific)?

 (ah I see a reply from Ivan; copying it in here...)

  If I were The Emperor of LOD I'd ask all grand dukes of datasources to
  put fresh dumps at some torrent with control of UL/DL ratio :) For
  reasons I can't understand, this idea is proposed a few times per year but
  never tried.

 I suspect BitTorrent is in some ways somehow 'taboo' technology, since
 it is most famous for being used to distribute materials that
 copyright-owners often don't want distributed. I have no detailed idea
 how torrent files are made, how trackers work, etc. I started poking
 around magnet: a bit recently but haven't got a sense for how solid
 that work is yet. Could a simple Wiki page be used for sharing
 torrents? (plus published hash of files elsewhere for integrity
 checks). What would it take to get started?
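
(On the magnet: side, the link is little more than the SHA-1 "info-hash" of
the torrent's info dictionary. A rough sketch, reusing the make_torrent()
helper from the sketch earlier in this thread; everything here is
illustrative rather than a recommendation:)

import hashlib
import urllib.parse

def magnet_from_torrent(torrent_bytes, display_name):
    # The info-hash is the SHA-1 of the bencoded "info" dictionary. We cheat
    # and assume the metainfo came from the make_torrent() sketch above, where
    # "info" is the last key of the outer dictionary, so its value runs from
    # just after the b"4:info" key to just before the final 'e'.
    start = torrent_bytes.index(b"4:info") + len(b"4:info")
    btih = hashlib.sha1(torrent_bytes[start:-1]).hexdigest()
    return "magnet:?xt=urn:btih:%s&dn=%s" % (btih,
                                             urllib.parse.quote(display_name))

# e.g. print(magnet_from_torrent(make_torrent("dump.nt.bz2", tracker), "dump.nt.bz2"))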

 Perhaps if http://wiki.dbpedia.org/Downloads35 had the sha1 for each
 download published (rdfa?), then others could experiment with torrents
 and downloaders could cross-check against an authoritative description
 of the file from dbpedia?
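
(The downloader-side check is trivial; a sketch, with the file name and the
published digest as placeholders:)

import hashlib

def sha1_of(path, bufsize=1 << 20):
    # Stream the (potentially multi-GB) dump so it never has to fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

published = "0123456789abcdef0123456789abcdef01234567"   # placeholder value
if sha1_of("infobox_properties_en.nt.bz2") != published:
    raise SystemExit("checksum mismatch: this copy differs from dbpedia.org's description")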

   I realise that would
  only address part 

Re: DBpedia hosting burden

2010-04-16 Thread Hugh Glaser
Since I haven't seen it mentioned yet, I thought I would.

I use dbpedia all the time, but never access it, so there is zero load on
the servers.
And for dbpedia to be at the heart of the LOD cloud, does not mean that
there needs to even be much of a server there.

OK, I do access it very occasionally when my system stumbles across (via
sameas, etc) a new dbpedia URI.

What I mean is that I do use a lot of dbpedia URIs, but that does not mean
that I need to resolve them, or SPARQL the dbpedia server with them.
When someone uses the name Barack Obama it doesn't mean they have to
overload the White House press office by asking it for all sorts of personal
details; in fact they might not want to know what the White House thinks
about him - they might be using his name to ask what Al-Jazeera says about
him.
In the same way, when I get a dbpedia URI, that enables me to look up, on some
site I care about, what that site says about the NIR.

And in terms of finding dbpedia URIs: if I want to find a dbpedia URI, I look
up whatever I want in wikipedia, and then use the implied dbpedia URI.

OK, I accept the problems about people who spider it, or want to do complex
queries over it, but that is actually not my view of the LOD world.
My view is that for many applications, I am looking at some small bit of
stuff (say LOD researchers), and so I need to do a few URI resolutions of
the Things that I am interested in, usually in response to some demand.
Possibly I do this transparently using something like the SWCL.

In the general scheme of things, I think that the role of dbpedia
will/should be the provision of URIs with the ability to resolve them when
necessary (and with a reasonable expectation that the client will have a
decent caching policy). SPARQL is a whole different ball-game, and should be
separated out, looking at doing caching, downloads etc..
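
(A minimal sketch of that occasional resolution with a "decent caching
policy": plain dereferencing with a local cache. The Accept header and the
cache layout are my own assumptions, not a DBpedia requirement; urllib
follows the 303 redirect for us.)

import os
import urllib.parse
import urllib.request

CACHE_DIR = "uri-cache"   # crude client-side cache so each URI is fetched once

def resolve(uri):
    key = os.path.join(CACHE_DIR, urllib.parse.quote(uri, safe=""))
    if os.path.exists(key):
        with open(key, "rb") as f:
            return f.read()
    req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(req) as resp:    # 303 is handled transparently
        data = resp.read()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(key, "wb") as f:
        f.write(data)
    return data

rdf_xml = resolve("http://dbpedia.org/resource/Berlin")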

But the role of dbpedia is to provide URIs and occasional URI resolution to
RDF or equivalent - anything that interferes with that should be challenged.

Best
Hugh

PS.
The situation always reminds me of my mobile.
I use it all the time, but never make or receive calls.
The existence of the mobile in my pocket changes everything about whether my
wife and I need to speak. Because we could if we wanted to, and we know the
other could, we don't need to make the call to say I am on the train.




Re: DBpedia hosting burden

2010-04-16 Thread Nathan
Hugh Glaser wrote:
 decent caching policy). SPARQL is a whole different ball-game, and should be
 separated out

+1 in every way




Re: DBpedia hosting burden

2010-04-16 Thread Kingsley Idehen

Hugh Glaser wrote:

Since I haven't seen it mentioned yet, I thought I would.

I use dbpedia all the time, but never access it, so there is zero load on
the servers.
And for dbpedia to be at the heart of the LOD cloud, does not mean that
there needs to even be much of a server there.

OK, I do access it very occasionally when my system stumbles across (via
sameas, etc) a new dbpedia URI.

What I mean is that I do use a lot of dbpedia URIs, but that does not mean
that I need to resolve them, or SPARQL the dbpedia server with them.
When someone uses the name Barack Obama it doesn't mean they have to
overload the White House press office by asking it for all sorts of personal
details; in fact they might not want to know what the White House thinks
about him - they might be using his name to ask what Al-Jazeera says about
him.
In the same way, when I get a dbpedia URI, that enables me to look up, on some
site I care about, what that site says about the NIR.

And in terms of finding dbpedia URIs: if I want to find a dbpedia URI, I look
up whatever I want in wikipedia, and then use the implied dbpedia URI.

OK, I accept the problems about people who spider it, or want to do complex
queries over it, but that is actually not my view of the LOD world.
  


Spidering is what we constrain so as to preserve bandwidth.  Even when 
you spider via SPARQL we force you down the OFFSET and LIMIT route.  Key 
point is that these are features (self protection and preservation) as 
opposed to bugs or shortcomings  (as these issues are sometimes framed).


Complex queries are absolutely not a problem; remember, this is what the 
Anytime Query feature is all about, it's why we can host faceted 
navigation inside the Quad Store etc.. Complex queries don't chew up 
network bandwidth.
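
(For what "the OFFSET and LIMIT route" looks like from the client side, a
rough sketch of a polite paged export via plain HTTP GET; the endpoint URL is
the public one, but the exact parameters, page size, result format and the
query itself are illustrative assumptions:)

import json
import time
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"
PAGE = 1000

def fetch_page(offset):
    # A stable ORDER BY is needed for OFFSET paging to make sense.
    q = ("PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
         "SELECT ?s ?label WHERE { ?s rdfs:label ?label } "
         "ORDER BY ?s LIMIT %d OFFSET %d" % (PAGE, offset))
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": q})
    req = urllib.request.Request(url,
            headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

offset = 0
while True:
    rows = fetch_page(offset)
    if not rows:
        break
    # ... process rows locally ...
    offset += PAGE
    time.sleep(1)   # be polite: spread requests out rather than hammering the endpoint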



My view is that for many applications, I am looking at some small bit of
stuff (say LOD researchers), and so I need to do a few URI resolutions of
the Things that I am interested in, usually in response to some demand.
Possibly I do this transparently using something like the SWCL.

In the general scheme of things, I think that the role of dbpedia
will/should be the provision of URIs with the ability to resolve them when
necessary (and with a reasonable expectation that the client will have a
decent caching policy). SPARQL is a whole different ball-game, and should be
separated out, looking at doing caching, downloads etc..
  

The DBpedia SPARQL endpoint is an endpoint for handling SPARQL Queries.

The Descriptor Resources that are the product of URI de-referencing are 
the typical targets of crawlers, at least as a first call before CONSTRUCT 
and DESCRIBE etc.. We already have solutions for these resources (which 
includes a reverse proxy setup and cache directives etc.). In addition, 
we may also 303 to other locations (URLs)  as part of URI de-referencing 
fulfillment etc..

But the role of dbpedia is to provide URIs and occasional URI resolution to
RDF or equivalent - anything that interferes with that should be challenged.
  
The DBpedia instance is about providing a SPARQL endpoint and access to 
Descriptor Resources (née Information Resources) via Data Object URI 
de-referencing; the instance can do both, and it enforces what it 
seeks to offer.


We will make a guide so that everyone is clear :-)


Kingsley

Best
Hugh

PS.
The situation always reminds me of my mobile.
I use it all the time, but never make or receive calls.
The existence of the mobile in my pocket changes everything about whether my
wife and I need to speak. Because we could if we wanted to, and we know the
other could, we don't need to make the call to say I am on the train.


  



--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-16 Thread Karl Dubost
Kingsley,

On 15 Apr 2010, at 02:58, Dan Brickley wrote:
 On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com 
 wrote:
 Well here is the critical detail: people typically crawl DBpedia.
 
 Have you considered blocking DBpedia crawlers more aggressively, and
 nudging them to alternative ways of accessing the data? 


Would it be possible to have hard data about http resources for DBpedia?

* Volume of data, hourly, daily and weekly bandwidth?
* type of http resources
* traffic peaks
* use of a cdn?
* etc. 


-- 
Karl Dubost
Montréal, QC, Canada
http://www.la-grange.net/karl/




Re: DBpedia hosting burden

2010-04-16 Thread Kingsley Idehen

Karl Dubost wrote:

Kingsley,

On 15 Apr 2010, at 02:58, Dan Brickley wrote:
  

On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote:


Well here is the critical detail: people typically crawl DBpedia.
  

Have you considered blocking DBpedia crawlers more aggressively, and
nudging them to alternative ways of accessing the data? 




Would it be possible to have hard data about http resources for DBpedia?

* Volume of data, hourly, daily and weekly bandwidth?
* type of http resources
* traffic peaks
* use of a cdn?
* etc. 



  

Karl,

Yes, but that means an HTTP log analysis report etc..

Once the guide is out, we might make time for something like that. There have 
been enough requests for the HTTP logs over the months etc..
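
(If it helps, the rough shape of such a report is not much code; a sketch
that sums per-day request counts and bytes from a standard "combined" access
log. The log path and the log format are assumptions about the setup, not a
description of what OpenLink actually runs:)

import collections
import re

# Matches a "combined"-style line, e.g.:
# 1.2.3.4 - - [15/Apr/2010:09:54:01 +0000] "GET /data/Berlin.json HTTP/1.1" 200 51234 "-" "agent"
LINE = re.compile(r'\[(\d+/\w+/\d+):.*?\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)')

daily_hits = collections.Counter()
daily_bytes = collections.Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:   # hypothetical path
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        day, method, path, status, size = m.groups()
        daily_hits[day] += 1
        if size != "-":
            daily_bytes[day] += int(size)

for day in sorted(daily_hits):   # lexical sort; good enough for a quick look
    print(day, daily_hits[day], "requests,", round(daily_bytes[day] / 2**30, 2), "GiB")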



--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-15 Thread Dan Brickley
On Wed, Apr 14, 2010 at 11:50 PM, Daniel Koller dakol...@googlemail.com wrote:
 Dan,
 ...I just set up some torrent files containing the current English and German
 dbpedia content (as a test/proof of concept; I was just curious to see how
 fast a network effect builds via p2p networks).
 To try, go to http://dakoller.net/dbpedia_torrents/dbpedia_torrents.html.
 I presume to get it working you need just the first people downloading (and
 keep spreading it around w/ their Torrent-Clients)... as long as the
 *.torrent-files are consistent. (layout of the link page courtesy of the
 dbpedia-people)

Thanks! OK, let's see if my laptop has enough disk space left ;)
could you post an 'ls -l' too, so we have an idea of the file sizes?

Transmission.app on OSX says Downloading from 1 of 1 peers now (for
a few of them), and from 0 of 0 peers for others. Perhaps you have
some limits/queue in place?

Now this is where my grip on the protocol is weak --- I'm behind NAT
currently, and I forget how this works - can other peers find my
machine via your public seeder?

I'll try this on an ubuntu box too. Would be nice if someone could
join with a single simple script...

cheers,

Dan
I was working my way down the list in
http://dakoller.net/dbpedia_torrents/dbpedia_torrents.html
although when I got to Raw Infobox Property Definitions the first two
links 404'd...



Re: DBpedia hosting burden

2010-04-15 Thread Malte Kiesel

Ivan Mikhailov wrote:


If I were The Emperor of LOD I'd ask all grand dukes of datasources to
put fresh dumps at some torrent with control of UL/DL ratio :)


Last time I checked (which was quite a while ago though), loading 
DBpedia in a normal triple store such as Jena TDB didn't work very well 
due to many issues with the DBpedia RDF (e.g., problems with the URIs of 
external links scraped from Wikipedia).


I don't know whether this is a bug in TDB or DBpedia but I guess this is 
one of the problems causing people to use DBpedia online only - even if, 
due to performance reasons, running it locally would be far better.


Regards
Malte



Re: DBpedia hosting burden

2010-04-15 Thread Ivan Mikhailov
 Last time I checked (which was quite a while ago though), loading 
 DBpedia in a normal triple store such as Jena TDB didn't work very well 
 due to many issues with the DBpedia RDF (e.g., problems with the URIs of 
 external links scraped from Wikipedia).

Agree. Common errors in LOD are:

-- single quoted and double quoted strings with newlines;
-- bnode predicates (but SPARQL processor may ignore them!);
-- variables, but triples with variables are ignored;
-- literal subjects, but triples with them are ignored;
-- '/', '#', '%' and '+' in local part of QName (Qname with path);
-- invalid symbols between '<' and '>', i.e. in relative IRIs.

That's why my own TURTLE parser is configurable to selectively report or
ignore these errors. In addition I can relax TURTLE syntax to include
popular violations like redundant delimiters and/or try to recover from
lexical errors as much as possible, even if that means losing some ill-formed
triples together with a limited number of proper triples around them
(GIGO mode, for Garbage In Garbage Out).
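
(A crude sketch of the GIGO idea for N-Triples dumps: drop lines that fail a
simple well-formedness test instead of aborting the whole load. The regex is
deliberately rough and strict, and is my own approximation, not Virtuoso's
actual parser; the file names are illustrative.)

import re

# Very rough shape of one N-Triples statement: IRI/bnode subject, IRI
# predicate, IRI/bnode/literal object, closing dot. Stricter than real parsers.
NT_LINE = re.compile(
    r'^\s*(<[^<>"\s]*>|_:\S+)\s+'        # subject (so literal subjects fail)
    r'<[^<>"\s]*>\s+'                    # predicate (so bnode predicates fail)
    r'(<[^<>"\s]*>|_:\S+|"[^"\\]*(?:\\.[^"\\]*)*"(?:\^\^<[^<>"\s]*>|@[\w-]+)?)'
    r'\s*\.\s*$')

kept = dropped = 0
with open("external_links_en.nt", encoding="utf-8") as src, \
     open("external_links_en.clean.nt", "w", encoding="utf-8") as dst:
    for line in src:
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if NT_LINE.match(line):
            dst.write(line)
            kept += 1
        else:
            dropped += 1     # garbage in, garbage out: the ill-formed triple is lost
print(kept, "kept,", dropped, "dropped")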

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com





Re: DBpedia hosting burden

2010-04-15 Thread Andy Seaborne
I ran the DBpedia 3.5 dump files through an N-Triples parser with 
checking:


The report is here (it's 25K lines long):

http://www.openjena.org/~afs/DBPedia35-parse-log-2010-04-15.txt

It covers both strict errors and warnings of ill-advised forms.

A few examples:

Bad IRI: =?(''[[Nepenthes
Bad IRI: http://www.european-athletics.org‎

Bad lexical forms for the value space:
"1967-02-31"^^<http://www.w3.org/2001/XMLSchema#date>
(there is no February the 31st)


Warning of well-known ports of other protocols:
http://stream1.securenetsystems.net:443

Warning about explicit port 80:

http://bibliotecadigitalhispanica.bne.es:80/

and use of '.' and '..' in absolute URIs, which are all from the standard 
list of IRI warnings.


Bad IRI: http://dbpedia.org/resource/.. Code: 
8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment /../ not 
at the beginning of a relative reference, or it contains a /./; these 
should be removed.
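
(Not the Jena IRI checker, but a stdlib approximation of two of the checks
above, handy for spot-checking one's own dumps; the warning wording is mine:)

import datetime
import urllib.parse

def valid_xsd_date(lexical):
    # Rejects forms like 1967-02-31 that match the date pattern but name no real day.
    try:
        datetime.date.fromisoformat(lexical)
        return True
    except ValueError:
        return False

def iri_warnings(iri):
    warnings = []
    parts = urllib.parse.urlsplit(iri)
    if parts.scheme == "http" and parts.port == 80:
        warnings.append("explicit default port 80")
    elif parts.port in (21, 25, 443):
        warnings.append("well-known port of another protocol")
    if any(seg in (".", "..") for seg in parts.path.split("/")):
        warnings.append("'.' or '..' segment in an absolute IRI")
    if any(ord(c) < 0x21 or (ord(c) > 0x7e and not c.isalnum()) for c in iri):
        warnings.append("control or invisible character in IRI")
    return warnings

print(valid_xsd_date("1967-02-31"))                                # False
print(iri_warnings("http://bibliotecadigitalhispanica.bne.es:80/"))
print(iri_warnings("http://dbpedia.org/resource/.."))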


Andy

Software used:

The IRI checker, by Jeremy Carroll, is available from
http://www.openjena.org/iri/ and Maven.

The lexical form checking is done by Apache Xerces.

The N-triples parser is the one from TDB v0.8.5 which bundles the above 
two together.



On 15/04/2010 9:54 AM, Malte Kiesel wrote:

Ivan Mikhailov wrote:


If I were The Emperor of LOD I'd ask all grand dukes of datasources to
put fresh dumps at some torrent with control of UL/DL ratio :)


Last time I checked (which was quite a while ago though), loading
DBpedia in a normal triple store such as Jena TDB didn't work very well
due to many issues with the DBpedia RDF (e.g., problems with the URIs of
external links scraped from Wikipedia).

I don't know whether this is a bug in TDB or DBpedia but I guess this is
one of the problems causing people to use DBpedia online only - even if,
due to performance reasons, running it locally would be far better.

Regards
Malte





Re: DBpedia hosting burden

2010-04-15 Thread Kingsley Idehen

Andy Seaborne wrote:



On 15/04/2010 2:44 PM, Kingsley Idehen wrote:

Andy,

Great stuff, this is also why we are going to leave the current DBpedia
3.5 instance to stew for a while (until end of this week or a little
later).

DBpedia users:
Now is the time to identify problems with the DBpedia 3.5 dataset dumps.
We don't want to continue reloading DBpedia (Static Edition and then
recalibrating DBpedia-Live) due to faulty-dataset related matters; we
do have other operational priorities etc..


'Faulty' is a bit strong.


Imperfect then, however subjective that might be :-)


Many of the warnings are legal RDF, but bad lexical forms for the 
datatype, or IRIs that trigger some of the standard warnings (but they 
are still legal IRIs).  Should they be included or not? Seems to me 
you can argue both for and against.


external_links_en.nt.bz2  is the largest source of broken IRIs.

DBpedia is a wonderful and important dataset, and being derived from 
elsewhere is unlikely to ever be perfect (for some definition of 
perfect).  Better to have the data than to wait for perfection.

That's been the approach thus far.

Anyway, as I said, we have a window of opportunity to identify current 
issues prior to performing a 3.5.1 reload. I just don't want to reduce 
the reload cycles due to other items on our todo etc..




Andy




--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-15 Thread Kingsley Idehen

Kingsley Idehen wrote:

Andy Seaborne wrote:



On 15/04/2010 2:44 PM, Kingsley Idehen wrote:

Andy,

Great stuff, this is also why we are going to leave the current DBpedia
3.5 instance to stew for a while (until end of this week or a little
later).

DBpedia users:
Now is the time to identify problems with the DBpedia 3.5 dataset 
dumps.

We don't want to continue reloading DBpedia (Static Edition and then
recalibrating DBpedia-Live) due to faulty-dataset related matters; we
do have other operational priorities etc..


'Faulty' is a bit strong.


Imperfect then, however subjective that might be :-)


Many of the warnings are legal RDF, but bad lexical forms for the 
datatype, or IRIs that trigger some of the standard warnings (but 
they are still legal IRIs).  Should they be included or not? Seems to 
me you can argue both for and against.


external_links_en.nt.bz2  is the largest source of broken IRIs.

DBpedia is a wonderful and important dataset, and being derived from 
elsewhere is unlikely to ever be perfect (for some definition of 
perfect).  Better to have the data than to wait for perfection.

That's been the approach thus far.




Actually meant to say:


Anyway, as I said, we have a window of opportunity to identify current 
issues prior to performing a 3.5.1 reload. ** I just want to reduce the 
reload cycles due to other items on our todo etc. **


:-)

--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-15 Thread Ian Davis
On Wed, Apr 14, 2010 at 8:04 PM, Dan Brickley dan...@danbri.org wrote:

 "Bills" is the major operative word in a world where the Bill Payer and
 Database Maintainer is a footnote (at best) re. perception of what
 constitutes the DBpedia Project.


If dbpedia.org linked to the sparql endpoints of mirrors then that
would be a way of sharing the burden.

Ian



Re: DBpedia hosting burden

2010-04-15 Thread Kingsley Idehen

Ian Davis wrote:

On Wed, Apr 14, 2010 at 8:04 PM, Dan Brickley dan...@danbri.org wrote:
  

"Bills" is the major operative word in a world where the Bill Payer and
Database Maintainer is a footnote (at best) re. perception of what
constitutes the DBpedia Project.
  


If dbpedia.org linked to the sparql endpoints of mirrors then that
would be a way of sharing the burden.

Ian


  

Ian,

When you use the term SPARQL Mirror (note Leigh's comments yesterday 
re. not orienting towards this), you open up a different set of issues. 
I don't want to revisit the SPARQL and SPARQL extensions debate etc.. Esp. 
as Virtuoso's SPARQL extensions are an integral part of what makes the 
DBpedia SPARQL endpoint viable, amongst other things.


The burden issue is basically veering away from the key points, which are:

1. Use the DBpedia instance properly
2. When the instance enforces restrictions, understand that this is a 
Virtuoso *feature* not a bug or server shortcoming.


Beyond the dbpedia.org instance, there are other locations for:

1. Data Sets
2. SPARQL endpoints (like yours and a few others, where functionality 
mirroring isn't an expectation).


Descriptor Resource handling via mirrors, BitTorrents, Reverse Proxies, 
Cache directives, and some 303 heuristics etc. are the real issues of 
interest.


Note: I can send wild SPARQL CONSTRUCTs, DESCRIBEs, and HTTP GETs for 
Resource Descriptors to a zillion mirrors (maybe next year's April 
Fool's joke re. beauty of Linked Data crawling) and it will only 
broaden the scope of my dysfunctional behavior. The behavior itself has 
to be handled (one or a zillion mirrors).


Anyway, we will publish our guide for working with DBpedia very soon. I 
believe this will add immense clarity to this matter.


--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-15 Thread Dan Brickley
On Thu, Apr 15, 2010 at 9:57 PM, Kingsley Idehen kide...@openlinksw.com wrote:
 Ian Davis wrote:

 When you use the term: SPARQL Mirror (note: Leigh's comments yesterday re.
 not orienting towards this), you open up a different set of issues. I don't
 want to revisit SPARQL and SPARQL extensions debate etc.. Esp. as Virtuoso's
 SPARQL extensions are integral part of what makes the DBpedia SPARQL
 endpoint viable, amongst other things.

Having the same dataset available via different implementations of
SPARQL can only be healthy. If certain extensions are necessary, this
will only highlight their importance. If there are public services
offering SPARQL-based access to the DBpedia datasets (or subsets) out
there on the Web, it would be rather useful if we could have them
linked from a single easy to find page, along with information about
any restrictions, quirks, subsetting, or value-adding features special
to that service. I suggest using a section in
http://en.wikipedia.org/wiki/DBpedia for this, unless someone cares to
handle that on dbpedia.org.

 The burden issue is basically veering away from the key points, which are:

 1. Use the DBpedia instance properly
 2. When the instance enforces restrictions, understand that this is a
 Virtuoso *feature* not a bug or server shortcoming.

Yes, the showcase implementation needs to be used properly if it is
going to survive the increasing developer attention LOD is getting. It
is perfectly reasonable of you to make clear, when there are limits, that
they are for everyone's benefit.

 Beyond the dbpedia.org instance, there are other locations for:

 1. Data Sets
 2. SPARQL endpoints (like yours and a few others, where functionality
 mirroring isn't an expectation).

Is there a list somewhere of related SPARQL endpoints? (also other
Wikipedia-derived datasets in RDF)

 Descriptor Resource handling via mirrors, BitTorrents, Reverse Proxies,
 Cache directives, and some 303 heuristics etc. are the real issues of
 interest.

(am chatting with Daniel Koller in Skype now re the BitTorrent experiments...)

 Note: I can send wild SPARQL CONSTRUCTs, DESCRIBEs, and HTTP GETs for
 Resource Descriptors to a zillion mirrors (maybe next year's April Fool's
 joke re. beauty of Linked Data crawling) and it will only broaden the
 scope of my dysfunctional behavior. The behavior itself has to be handled
 (one or a zillion mirrors).

Sure. But on balance, more mirrors rather than fewer should benefit
everyone, particularly if 'good behaviour' is documented and
enforced...

 Anyway, we will publish our guide for working with DBpedia very soon. I
 believe this will add immense clarity to this matter.

Great!

cheers,

Dan



Re: DBpedia hosting burden

2010-04-15 Thread Kingsley Idehen

Dan Brickley wrote:

On Thu, Apr 15, 2010 at 9:57 PM, Kingsley Idehen kide...@openlinksw.com wrote:
  

Ian Davis wrote:

When you use the term: SPARQL Mirror (note: Leigh's comments yesterday re.
not orienting towards this), you open up a different set of issues. I don't
want to revisit SPARQL and SPARQL extensions debate etc.. Esp. as Virtuoso's
SPARQL extensions are integral part of what makes the DBpedia SPARQL
endpoint viable, amongst other things.



Having the same dataset available via different implementations of
SPARQL can only be healthy. If certain extensions are necessary, this
will only highlight their importance. If there are public services
offering SPARQL-based access to the DBpedia datasets (or subsets) out
there on the Web, it would be rather useful if we could have them
linked from a single easy to find page, along with information about
any restrictions, quirks, subsetting, or value-adding features special
to that service. I suggest using a section in
http://en.wikipedia.org/wiki/DBpedia for this, unless someone cares to
handle that on dbpedia.org.
  

+1

  

The burden issue is basically veering away from the key points, which are:

1. Use the DBpedia instance properly
2. When the instance enforces restrictions, understand that this is a
Virtuoso *feature* not a bug or server shortcoming.



Yes, the showcase implementation needs to be used properly if it is
going to survive the increasing developer attention LOD is getting. It
is perfectly reasonable of you to make clear, when there are limits, that
they are for everyone's benefit.
  


Yep, and as promised we will publish a document, this is certainly a 
missing piece of the puzzle right now.
  

Beyond the dbpedia.org instance, there are other locations for:

1. Data Sets
2. SPARQL endpoints (like yours and a few others, where functionality
mirroring isn't an expectation).



Is there a list somewhere of related SPARQL endpoints? (also other
Wikipedia-derived datasets in RDF)

  


See http://delicious.com/kidehen/sparql_endpoint; that's how I track 
SPARQL endpoints at the current time.



Descriptor Resource handling via mirrors, BitTorrents, Reverse Proxies,
Cache directives, and some 303 heuristics etc. are the real issues of
interest.



(am chatting with Daniel Koller in Skype now re the BitTorrent experiments...)
  


Yes, seeing progress.
  

Note: I can send wild SPARQL CONSTRUCTs, DESCRIBEs, and HTTP GETs for
Resource Descriptors to a zillion mirrors (maybe next year's April Fool's
joke re. beauty of Linked Data crawling) and it will only broaden the
scope of my dysfunctional behavior. The behavior itself has to be handled
(one or a zillion mirrors).



Sure. But on balance, more mirrors rather than fewer should benefit
everyone, particularly if 'good behaviour' is documented and
enforced...
  


Yes, LinkedData DNS remains a personal aspiration of mine, but no matter 
what we build, enforcement needs to be understood as a *feature* rather 
than a bug or deficiency etc..
  

Anyway, we will publish our guide for working with DBpedia very soon. I
believe this will add immense clarity to this matter.



Great!

cheers,

Dan

  



--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-14 Thread Kingsley Idehen

Dan Brickley wrote:

(trimming cc: list to LOD and DBPedia)

On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote:

  

My comment wasn't a 'what is DBpedia?' lecture. It was about clarifying
the crux of the matter i.e., bandwidth consumption and its effects on
other DBpedia users (as well as our own non-DBpedia related Web properties).


(Leigh)
  

I was just curious about usage volumes. We all talk about how central
dbpedia is in the LOD cloud picture, and wondered if there was any
publicly accessible metrics to help add some detail to that.

  

Well here is the critical detail: people typically crawl DBpedia. They
crawl it more than any other Data Space in the LOD cloud. They do so
because DBpedia is still quite central to the burgeoning Web of
Linked Data.



Have you considered blocking DBpedia crawlers more aggressively, and
nudging them to alternative ways of accessing the data? 


Yes.

Some have cleaned up their act for sure.

Problem is, there are others doing the same thing, who then complain 
about the instance in very generic fashion.



While it is a
shame to say 'no' to people trying to use linked data, this would be
more saying 'yes, but not like that...'.
  


I think we have an outstanding blog post / technical note about the 
DBpedia instance that hasn't been published (possibly due to the 3.5 and 
DBpedia-Live work we are doing), said note will cover how to work with 
the instance etc..
  

When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
via SPARQL, which is still ultimately Export from DBpedia and Import to
my data space mindset.



That's useful to know, thanks. Do you have the impression that these
folk are typically trying to copy the entire thing, or to make some
filtered subset (by geographical view, topic, property etc).
Many (and to some degree quite naturally) attempt to export the whole 
thing. Even when they're nudged to use OFFSET and LIMIT, the end result is 
multiple hits en route to a complete export.

 Can
studying these logs help provide different downloadable dumps that
would discourage crawlers?
  


We do have a solution in mind, basically, we are going to have a 
different place for the descriptor resources and redirect crawlers 
there  via 303's etc..
  

That's as simple and precise as this matter is.

 From a SPARQL perspective, DBpedia is quite microscopic; it's when you
factor in Crawler mentality and network bandwidth that issues arise, and
we deliberately have protection in place for Crawlers.



Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
anything discouraging crawlers. Where is the 'best practice' or
'acceptable use' advice we should all be following, to avoid putting
needless burden on your servers and bandwidth?
  


We'll get the guide out.

As you mention, DBpedia is an important and central resource, thanks
both to the work of the Wikipedia community, and those in the DBpedia
project who enrich and make available all that information. It's
therefore important that the SemWeb / Linked Data community takes care
to remember that these things don't come for free, that bills need
paying and that de-referencing is a privilege not a right.


"Bills" is the major operative word in a world where the Bill Payer and 
Database Maintainer is a footnote (at best) re. perception of what 
constitutes the DBpedia Project.


Our own ISPs even had to get in contact with us (last quarter of 2009) 
re. the amount of bandwidth being consumed by DBpedia etc..



 If there
are things we can do as a technology community to lower the cost of
hosting / distributing such data, or to nudge consumers of it in the
direction of more sustainable habits, we should do so. If there's not
so much the rest of us can do but say 'thanks!', ... then, ...er,
'thanks!'. Much appreciated!
  


For us, the most important thing is perspective. DBpedia is another 
space on a public network, thus it can't magically rewrite the 
underlying physics of wide area networking where access is open to the 
world.  Thus, we can make a note about proper behavior and explain how 
we protect the instance such that everyone has a chance of using it 
(rather than a select few resource guzzlers).

Are there any scenarios around eg. BitTorrent that could be explored?
What if each of the static files in http://dbpedia.org/sitemap.xml
were available as torrents (or magnet: URIs)?
When we set up the Descriptor Resource host, these would certainly be 
considered.

 I realise that would
only address part of the problem/cost, but it's a widely used
technology for distributing large files; can we bend it to our needs?
  

Also, we encourage use of gzip over HTTP  :-)

Kingsley

cheers,

Dan

  



--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-14 Thread Kingsley Idehen

Nathan wrote:

Dan Brickley wrote:
  

(trimming cc: list to LOD and DBPedia)

On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote:



My comment wasn't a 'what is DBpedia?' lecture. It was about clarifying
the crux of the matter i.e., bandwidth consumption and its effects on
other DBpedia users (as well as our own non-DBpedia related Web properties).
  

(Leigh)


I was just curious about usage volumes. We all talk about how central
dbpedia is in the LOD cloud picture, and wondered if there was any
publicly accessible metrics to help add some detail to that.



Well here is the critical detail: people typically crawl DBpedia. They
crawl it more than any other Data Space in the LOD cloud. They do so
because DBpedia is still quite central to the burgeoning Web of
Linked Data.
  

Have you considered blocking DBpedia crawlers more aggressively, and
nudging them to alternative ways of accessing the data? While it is a
shame to say 'no' to people trying to use linked data, this would be
more saying 'yes, but not like that...'.



When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
via SPARQL, which is still ultimately Export from DBpedia and Import to
my data space mindset.
  

That's useful to know, thanks. Do you have the impression that these
folk are typically trying to copy the entire thing, or to make some
filtered subset (by geographical view, topic, property etc). Can
studying these logs help provide different downloadable dumps that
would discourage crawlers?



That's as simple and precise as this matter is.

 From a SPARQL perspective, DBpedia is quite microscopic; it's when you
factor in Crawler mentality and network bandwidth that issues arise, and
we deliberately have protection in place for Crawlers.
  

Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see
anything discouraging crawlers. Where is the 'best practice' or
'acceptable use' advice we should all be following, to avoid putting
needless burden on your servers and bandwidth?

As you mention, DBpedia is an important and central resource, thanks
both to the work of the Wikipedia community, and those in the DBpedia
project who enrich and make available all that information. It's
therefore important that the SemWeb / Linked Data community takes care
to remember that these things don't come for free, that bills need
paying and that de-referencing is a privilege not a right. If there
are things we can do as a technology community to lower the cost of
hosting / distributing such data, or to nudge consumers of it in the
direction of more sustainable habits, we should do so. If there's not
so much the rest of us can do but say 'thanks!', ... then, ...er,
'thanks!'. Much appreciated!

Are there any scenarios around eg. BitTorrent that could be explored?
What if each of the static files in http://dbpedia.org/sitemap.xml
were available as torrents (or magnet: URIs)? I realise that would
only address part of the problem/cost, but it's a widely used
technology for distributing large files; can we bend it to our needs?




I'd like to add: could the /data/* and /page/* resources all be made
static files? (if they are not already) + make use of http caching etc.
  


Yes.

perhaps even the non-sparql dependent parts could be hosted on another
machine purely for static content? perhaps an interim proxy which
caches said resources permanently (then cache rebuild on request when a
new dataset is upgraded)
  

Yes.

Kingsley

regards!

  



--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-14 Thread Ross Singer
On Wed, Apr 14, 2010 at 1:58 PM, Dan Brickley dan...@danbri.org wrote:
 (trimming cc: list to LOD and DBPedia)

Using Dan's trimmed list to continue...

 On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com 
 wrote:
 When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
 via SPARQL, which is still ultimately Export from DBpedia and Import to
 my data space mindset.

Is this necessarily true?  Couldn't the CONSTRUCT and/or DESCRIBE
queries be used to find resources and view the whole graph (or
specialized subsets) to determine if it's actually what is being
sought?

Is it better for DBpedia to do SELECTs and then retrieve the resource
URIs individually?

I suppose rather than assume that the data is all being exported into
another space (which, I would think, is definitely happening -- having
data locally aids tremendously in indexing, for example) it could be a
case of people just using SPARQL the way it seems that SPARQL should
work?

-Ross.



Re: DBpedia hosting burden

2010-04-14 Thread Kingsley Idehen

Ross Singer wrote:

On Wed, Apr 14, 2010 at 1:58 PM, Dan Brickley dan...@danbri.org wrote:
  

(trimming cc: list to LOD and DBPedia)



Using Dan's trimmed list to continue...
  

On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote:


When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs
via SPARQL, which is still ultimately Export from DBpedia and Import to
my data space mindset.
  


Is this necessarily true?  Couldn't the CONSTRUCT and/or DESCRIBE
queries be used to find resources and view the whole graph (or
specialized subsets) to determine if it's actually what is being
sought?
  


I meant: they are sending a series of these query patterns with the same 
goal in mind: an export from DBpedia for import into their own Data Spaces.

Is it better for DBpedia to do SELECTs and then retrieve the resource
URIs individually?
  
You can, and should, use the full gamut of SPARQL queries; the issue is 
how they are used.


On our side, we've always had the ability to protect the server. In 
recent times, we simply up the ante re. protection against problematic 
behavior.


My only concern is that the tightening of control is sometimes 
misconstrued as a problem with the instance etc..



I suppose rather than assume that the data is all being exported into
another space (which, I would think, is definitely happening -- having
data locally aids tremendously in indexing, for example) it could be a
case of people just using SPARQL the way it seems that SPARQL should
work?
  
Hence the onus is on us to make a smart server, which we've had since 
day one. Again, the issue is: when the server protects itself, the 
behavior is being misconstrued as an instance problem.


If you make a local instance of Virtuoso + DBpedia, you will see what I 
mean, and basically it would come down to what Nathan explained in this 
recent post [1]. Key excerpt:


...The public lod and dbpedia endpoints really do no justice as to just 
how powerful and fast Virtuoso is, queries which take a few seconds on 
the public endpoint return in hundredths of a second on my local (low 
spec) server...


Links:

1. 
http://webr3.org/blog/experiments/linked-data-extractor-prototype-details/


Kingsley

-Ross.

  



--

Regards,

Kingsley Idehen	  
President & CEO 
OpenLink Software 
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen 









Re: DBpedia hosting burden

2010-04-14 Thread Ivan Mikhailov
Dan,

 Are there any scenarios around eg. BitTorrent that could be explored?
 What if each of the static files in http://dbpedia.org/sitemap.xml
 were available as torrents (or magnet: URIs)? I realise that would
 only address part of the problem/cost, but it's a widely used
 technology for distributing large files; can we bend it to our needs?

If I were The Emperor of LOD I'd ask all grand dukes of datasources to
put fresh dumps at some torrent with control of UL/DL ratio :) For
reasons I can't understand, this idea is proposed a few times per year but
never tried.

Another approach is to implement scalable and safe patch/diff on RDF
graphs plus subscription on them. That's what I'm writing ATM. Using
this toolkit, it would be quite cheap to place a local copy of LOD on
any appropriate box in any workgroup. A local copy will not require any
hi-end equipment for two reasons: the database can be much smaller than
the public one (one may install only a subset of LOD) and it will
usually be less sensitive to RAM/disk ratio (a small number of clients will
result in better locality because any given individual tends to browse
interrelated data whereas a crowd produces a chaotic sequence of
requests). Crawlers and mobile apps will not migrate to local copies,
but some complicated queries will go away from the bottleneck server and
that would be good enough.
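
(Not Ivan's toolkit, but the naive baseline of the patch/diff idea is easy to
picture; a sketch with hypothetical dump file names, treating each N-Triples
line as an opaque statement:)

def triples(path):
    # One statement per non-blank, non-comment line; fine for a naive diff.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith("#")}

old = triples("dbpedia_3.4_labels_en.nt")    # hypothetical file names
new = triples("dbpedia_3.5_labels_en.nt")

with open("labels_en.patch", "w", encoding="utf-8") as patch:
    for t in sorted(new - old):
        patch.write("+ " + t + "\n")    # statements a subscriber must add
    for t in sorted(old - new):
        patch.write("- " + t + "\n")    # statements a subscriber must retract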

Best Regards,

Ivan Mikhailov
OpenLink Software
http://virtuoso.openlinksw.com




Re: DBpedia hosting burden

2010-04-14 Thread Dan Brickley
On Wed, Apr 14, 2010 at 8:11 PM, Kingsley Idehen kide...@openlinksw.com wrote:


 Some have cleaned up their act for sure.

 Problem is, there are others doing the same thing, who then complain about
 the instance in very generic fashion.

They're lucky it exists at all. I'd refer them to this Louis CK sketch
- 
http://videosift.com/video/Louie-CK-on-Conan-Oct-1st-2008?fromdupe=We-live-in-an-amazing-amazing-world-and-we-complain
(if it stays online...).

 While it is a
 shame to say 'no' to people trying to use linked data, this would be
 more saying 'yes, but not like that...'.


 I think we have an outstanding blog post / technical note about the DBpedia
 instance that hasn't been published (possibly due to the 3.5 and
 DBpedia-Live work we are doing), said note will cover how to work with the
 instance etc..
[..]
 We do have a solution in mind, basically, we are going to have a different
 place for the descriptor resources and redirect crawlers there  via 303's
 etc..
[...]
 We'll get the guide out.


That sounds useful

 As you mention, DBpedia is an important and central resource, thanks
 both to the work of the Wikipedia community, and those in the DBpedia
 project who enrich and make available all that information. It's
 therefore important that the SemWeb / Linked Data community takes care
 to remember that these things don't come for free, that bills need
 paying and that de-referencing is a privilege not a right.

 "Bills" is the major operative word in a world where the Bill Payer and
 Database Maintainer is a footnote (at best) re. perception of what
 constitutes the DBpedia Project.

Yes, I'm sure some are thoughtless and take it for granted; but also
that others are well aware of the burdens.

(For that matter, I'm not myself so sure how Wikipedia cover their
costs or what their longer-term plan is...).


 For us, the most important thing is perspective. DBpedia is another space on
 a public network, thus it can't magically rewrite the underlying physics of
 wide area networking where access is open to the world.  Thus, we can make a
 note about proper behavior and explain how we protect the instance such that
 everyone has a chance of using it (rather than a select few resource
 guzzlers).

This I think is something others can help with, when presenting LOD
and related concepts: to encourage good habits that spread the cost of
keeping this great dataset globally available. So all those making
slides, tutorials, blog posts or software tools have a role to play
here.

 Are there any scenarios around eg. BitTorrent that could be explored?
 What if each of the static files in http://dbpedia.org/sitemap.xml
 were available as torrents (or magnet: URIs)?

 When we set up the Descriptor Resource host, these would certainly be
 considered.

Ok, let's take care to explore that then; it would probably help
others too. There must be dozens of companies and research
organizations who could put some bandwidth resources into this, if
only there was a short guide to setting up a GUI-less bittorrent tool
and configuring it appropriately. Are there any bittorrent experts on
these mailing lists who could suggest next practical steps here (not
necessarily dbpedia-specific)?

(ah I see a reply from Ivan; copying it in here...)

 If I were The Emperor of LOD I'd ask all grand dukes of datasources to
 put fresh dumps at some torrent with control of UL/DL ratio :) For
 reasons I can't understand, this idea is proposed a few times per year but
 never tried.

I suspect BitTorrent is in some ways somehow 'taboo' technology, since
it is most famous for being used to distribute materials that
copyright-owners often don't want distributed. I have no detailed idea
how torrent files are made, how trackers work, etc. I started poking
around magnet: a bit recently but haven't got a sense for how solid
that work is yet. Could a simple Wiki page be used for sharing
torrents? (plus published hash of files elsewhere for integrity
checks). What would it take to get started?

Perhaps if http://wiki.dbpedia.org/Downloads35 had the sha1 for each
download published (rdfa?), then others could experiment with torrents
and downloaders could cross-check against an authoritative description
of the file from dbpedia?

  I realise that would
 only address part of the problem/cost, but it's a widely used
 technology for distributing large files; can we bend it to our needs?


 Also, we encourage use of gzip over HTTP  :-)

Are there any RDF toolkits in need of a patch to their default setup
in this regard? Tutorials that need fixing, etc?
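
(On the client side this is one header plus one decompress call; a sketch,
with the resource URL purely illustrative and no claim about what any
particular toolkit does by default:)

import gzip
import urllib.request

req = urllib.request.Request(
    "http://dbpedia.org/data/Berlin.rdf",               # illustrative URL
    headers={"Accept-Encoding": "gzip",
             "Accept": "application/rdf+xml"})
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    if resp.headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)    # undo the transfer compression ourselves
print(len(body), "bytes of RDF/XML")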

cheers,

Dan


ps. re big datasets, Library of Congress apparently are going to have
complete twitter archive - see
http://twitter.com/librarycongress/status/12172217971  -
http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/