Re: DBpedia hosting burden
Dan, I just set up some torrent files containing the current English and German DBpedia content (as a test/proof of concept; I was just curious to see how quickly a network effect builds via p2p networks). To try, go to http://dakoller.net/dbpedia_torrents/dbpedia_torrents.html. I presume that to get it working you just need the first people downloading (and keeping it spreading around with their torrent clients)... as long as the *.torrent files are consistent. (Layout of the link page courtesy of the dbpedia people.) Kind regards, Daniel On Wed, Apr 14, 2010 at 9:04 PM, Dan Brickley dan...@danbri.org wrote: On Wed, Apr 14, 2010 at 8:11 PM, Kingsley Idehen kide...@openlinksw.com wrote: Some have cleaned up their act for sure. Problem is, there are others doing the same thing, who then complain about the instance in very generic fashion. They're lucky it exists at all. I'd refer them to this Louis CK sketch - http://videosift.com/video/Louie-CK-on-Conan-Oct-1st-2008?fromdupe=We-live-in-an-amazing-amazing-world-and-we-complain (if it stays online...). While it is a shame to say 'no' to people trying to use linked data, this would be more saying 'yes, but not like that...'. I think we have an outstanding blog post / technical note about the DBpedia instance that hasn't been published (possibly due to the 3.5 and DBpedia-Live work we are doing); said note will cover how to work with the instance etc.. [..] We do have a solution in mind: basically, we are going to have a different place for the descriptor resources and redirect crawlers there via 303s etc.. [...] We'll get the guide out. That sounds useful. As you mention, DBpedia is an important and central resource, thanks both to the work of the Wikipedia community, and those in the DBpedia project who enrich and make available all that information. It's therefore important that the SemWeb / Linked Data community takes care to remember that these things don't come for free, that bills need paying and that de-referencing is a privilege, not a right. 'Bills' is the major operative word in a world where the Bill Payer and Database Maintainer is a footnote (at best) re. perception of what constitutes the DBpedia Project. Yes, I'm sure some are thoughtless and take it for granted; but also that others are well aware of the burdens. (For that matter, I'm not myself so sure how Wikipedia cover their costs or what their longer-term plan is...). For us, the most important thing is perspective. DBpedia is another space on a public network, thus it can't magically rewrite the underlying physics of wide area networking where access is open to the world. Thus, we can make a note about proper behavior and explain how we protect the instance such that everyone has a chance of using it (rather than a select few resource guzzlers). This I think is something others can help with, when presenting LOD and related concepts: to encourage good habits that spread the cost of keeping this great dataset globally available. So all those making slides, tutorials, blog posts or software tools have a role to play here. Are there any scenarios around e.g. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? When we set up the Descriptor Resource host, these would certainly be considered. Ok, let's take care to explore that then; it would probably help others too.
There must be dozens of companies and research organizations who could put some bandwidth resources into this, if only there was a short guide to setting up a GUI-less bittorrent tool and configuring it appropriately. Are there any bittorrent experts on these mailing lists who could suggest next practical steps here (not necessarily dbpedia-specific)? (ah, I see a reply from Ivan; copying it in here...) If I were The Emperor of LOD I'd ask all grand dukes of datasources to put fresh dumps at some torrent with control of UL/DL ratio :) For reasons I can't understand, this idea is proposed a few times per year but never tried. I suspect BitTorrent is in some ways somehow 'taboo' technology, since it is most famous for being used to distribute materials that copyright-owners often don't want distributed. I have no detailed idea how torrent files are made, how trackers work, etc. I started poking around magnet: a bit recently but haven't got a sense for how solid that work is yet. Could a simple Wiki page be used for sharing torrents? (plus published hashes of files elsewhere for integrity checks). What would it take to get started? Perhaps if http://wiki.dbpedia.org/Downloads35 had the sha1 for each download published (rdfa?), then others could experiment with torrents and downloaders could cross-check against an authoritative description of the file from dbpedia? I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs?
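For illustration, computing such a checksum is a few lines of Python with the standard hashlib module (a minimal sketch; external_links_en.nt.bz2 is one of the real dump file names mentioned later in this thread):

    import hashlib

    def sha1_of(path, chunk_size=1 << 20):
        # Hash the file in 1 MiB chunks so multi-GB dumps need not fit in RAM.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Compare against the hash published on the authoritative download page.
    print(sha1_of("external_links_en.nt.bz2"))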
Re: DBpedia hosting burden
Since I haven't seen it mentioned yet, I thought I would. I use dbpedia all the time, but never access it, so there is zero load on the servers. And for dbpedia to be at the heart of the LOD cloud does not mean that there needs to even be much of a server there. OK, I do access it very occasionally, when my system stumbles across (via sameas, etc.) a new dbpedia URI. What I mean is that I do use a lot of dbpedia URIs, but that does not mean that I need to resolve them, or SPARQL the dbpedia server with them. When someone uses the name Barack Obama it doesn't mean they have to overload the White House press office by asking it for all sorts of personal details; in fact they might not want to know what the White House thinks about him - they might be using his name to ask what Al-Jazeera says about him. In the same way, when I get a dbpedia URI, that enables me to look up, on some site I care about, what that site says about the NIR. And in terms of finding dbpedia URIs: if I want to find a dbpedia URI, I look whatever I want up in wikipedia, and then use the implied dbpedia URI. OK, I accept the problems about people who spider it, or want to do complex queries over it, but that is actually not my view of the LOD world. My view is that for many applications, I am looking at some small bit of stuff (say LOD researchers), and so I need to do a few URI resolutions of the Things that I am interested in, usually in response to some demand. Possibly I do this transparently using something like the SWCL. In the general scheme of things, I think that the role of dbpedia will/should be the provision of URIs, with the ability to resolve them when necessary (and with a reasonable expectation that the client will have a decent caching policy). SPARQL is a whole different ball-game, and should be separated out, looking at doing caching, downloads etc.. But the role of dbpedia is to provide URIs and occasional URI resolution to RDF or equivalent - anything that interferes with that should be challenged. Best Hugh PS. The situation always reminds me of my mobile. I use it all the time, but never make or receive calls. The existence of the mobile in my pocket changes everything about whether my wife and I need to speak. Because we could if we wanted to, and we know the other could, we don't need to make the call to say I am on the train.
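Hugh's 'implied dbpedia URI' is purely a naming convention, so it can be computed offline with no load on anyone's server; a sketch in Python (the exact escaping of titles with punctuation may differ slightly from what DBpedia mints):

    from urllib.parse import quote

    def dbpedia_uri(wikipedia_title):
        # DBpedia resource URIs mirror Wikipedia page titles,
        # with spaces replaced by underscores.
        return "http://dbpedia.org/resource/" + quote(wikipedia_title.replace(" ", "_"))

    print(dbpedia_uri("Barack Obama"))  # http://dbpedia.org/resource/Barack_Obama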
Re: DBpedia hosting burden
Hugh Glaser wrote: decent caching policy). SPARQL is a whole different ball-game, and should be separated out +1 in every way
Re: DBpedia hosting burden
Hugh Glaser wrote: Since I haven't seen it mentioned yet, I thought I would. I use dbpedia all the time, but never access it, so there is zero load on the servers. And for dbpedia to be at the heart of the LOD cloud does not mean that there needs to even be much of a server there. OK, I do access it very occasionally, when my system stumbles across (via sameas, etc.) a new dbpedia URI. What I mean is that I do use a lot of dbpedia URIs, but that does not mean that I need to resolve them, or SPARQL the dbpedia server with them. When someone uses the name Barack Obama it doesn't mean they have to overload the White House press office by asking it for all sorts of personal details; in fact they might not want to know what the White House thinks about him - they might be using his name to ask what Al-Jazeera says about him. In the same way, when I get a dbpedia URI, that enables me to look up, on some site I care about, what that site says about the NIR. And in terms of finding dbpedia URIs: if I want to find a dbpedia URI, I look whatever I want up in wikipedia, and then use the implied dbpedia URI. OK, I accept the problems about people who spider it, or want to do complex queries over it, but that is actually not my view of the LOD world. Spidering is what we constrain so as to preserve bandwidth. Even when you spider via SPARQL, we force you down the OFFSET and LIMIT route. Key point is that these are features (self-protection and preservation) as opposed to bugs or shortcomings (as these issues are sometimes framed). Complex queries, absolutely not a problem; remember, this is what the Anytime Query feature is all about, it's why we can host faceted navigation inside the Quad Store etc.. Complex queries don't chew up network bandwidth. My view is that for many applications, I am looking at some small bit of stuff (say LOD researchers), and so I need to do a few URI resolutions of the Things that I am interested in, usually in response to some demand. Possibly I do this transparently using something like the SWCL. In the general scheme of things, I think that the role of dbpedia will/should be the provision of URIs, with the ability to resolve them when necessary (and with a reasonable expectation that the client will have a decent caching policy). SPARQL is a whole different ball-game, and should be separated out, looking at doing caching, downloads etc.. The DBpedia SPARQL endpoint is an endpoint for handling SPARQL queries. The Descriptor Resources that are the product of URI de-referencing are the typical targets of crawlers, at least first call, before CONSTRUCT and DESCRIBE etc.. We already have solutions for these resources (which includes a reverse proxy setup and cache directives etc.). In addition, we may also 303 to other locations (URLs) as part of URI de-referencing fulfillment etc.. But the role of dbpedia is to provide URIs and occasional URI resolution to RDF or equivalent - anything that interferes with that should be challenged. The DBpedia instance is about providing a SPARQL endpoint and access to Descriptor Resources (née Information Resources) via Data Object URI de-referencing; the instance can do both, and it enforces what it seeks to offer. We will make a guide so that everyone is clear :-) Kingsley Best Hugh PS. The situation always reminds me of my mobile. I use it all the time, but never make or receive calls. The existence of the mobile in my pocket changes everything about whether my wife and I need to speak. 
Because we could if we wanted to, and we know the other could, we don't need to make the call to say I am on the train. -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
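The OFFSET/LIMIT route Kingsley mentions looks like this in plain SPARQL; the query itself is just an example, and the endpoint will still cap individual result sizes (that is the point):

    SELECT ?s ?label
    WHERE { ?s rdfs:label ?label }
    ORDER BY ?s            # a stable sort order keeps successive pages disjoint
    LIMIT 1000 OFFSET 0    # then OFFSET 1000, 2000, ... until no rows come back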
Re: DBpedia hosting burden
Kingsley, On 15 Apr 2010, at 02:58, Dan Brickley wrote: On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote: Well here is the critical detail: people typically crawl DBpedia. Have you considered blocking DBpedia crawlers more aggressively, and nudging them to alternative ways of accessing the data? Would it be possible to have hard data about HTTP resources for DBpedia? * volume of data: hourly, daily and weekly bandwidth? * type of HTTP resources * traffic peaks * use of a CDN? * etc. -- Karl Dubost Montréal, QC, Canada http://www.la-grange.net/karl/
Re: DBpedia hosting burden
Karl Dubost wrote: Kingsley, On 15 Apr 2010, at 02:58, Dan Brickley wrote: On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote: Well here is the critical detail: people typically crawl DBpedia. Have you considered blocking DBpedia crawlers more aggressively, and nudging them to alternative ways of accessing the data? Would it be possible to have hard data about HTTP resources for DBpedia? * volume of data: hourly, daily and weekly bandwidth? * type of HTTP resources * traffic peaks * use of a CDN? * etc. Karl, Yes, but that means an HTTP log analysis report etc.. Post-guide, we might make time for something like that. There have been enough HTTP log requests over the months etc.. -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
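As a rough illustration of the kind of log analysis involved, daily bandwidth can be pulled out of a standard Apache combined-format access log with a few lines of Python (log path is an assumption; days print in string order, not chronologically):

    import re
    from collections import defaultdict

    # combined log format: ... [10/Oct/2000:13:55:36 -0700] "GET /x HTTP/1.0" 200 2326 ...
    line_re = re.compile(r'\[(\d+/\w+/\d+):[^\]]+\] "[^"]*" \d{3} (\d+)')
    bytes_per_day = defaultdict(int)
    with open("access.log") as log:
        for line in log:
            m = line_re.search(line)
            if m:  # lines with "-" (no bytes sent) simply don't match
                bytes_per_day[m.group(1)] += int(m.group(2))
    for day, n in sorted(bytes_per_day.items()):
        print(day, round(n / (1 << 30), 2), "GiB")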
Re: DBpedia hosting burden
On Wed, Apr 14, 2010 at 11:50 PM, Daniel Koller dakol...@googlemail.com wrote: Dan, ...I just set up some torrent files containing the current English and German DBpedia content (as a test/proof of concept; I was just curious to see how quickly a network effect builds via p2p networks). To try, go to http://dakoller.net/dbpedia_torrents/dbpedia_torrents.html. I presume that to get it working you just need the first people downloading (and keeping it spreading around with their torrent clients)... as long as the *.torrent files are consistent. (Layout of the link page courtesy of the dbpedia people.) Thanks! OK, let's see if my laptop has enough disk space left ;) could you post an 'ls -l' too, so we have an idea of the file sizes? Transmission.app on OSX says Downloading from 1 of 1 peers now (for a few of them), and from 0 of 0 peers for others. Perhaps you have some limits/queue in place? Now this is where my grip on the protocol is weak --- I'm behind NAT currently, and I forget how this works - can other peers find my machine via your public seeder? I'll try this on an ubuntu box too. Would be nice if someone could join with a single simple script... cheers, Dan I was working my way down the list in http://dakoller.net/dbpedia_torrents/dbpedia_torrents.html although when I got to Raw Infobox Property Definitions the first two links 404'd...
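For anyone wanting to help seed from a spare server, one possible GUI-less setup uses transmission-cli (a sketch only: check the flags against your version's man page, and FILE.torrent stands in for one of the torrents on Daniel's page):

    wget http://dakoller.net/dbpedia_torrents/FILE.torrent   # FILE.torrent: pick one from the page
    transmission-cli -w /srv/dbpedia FILE.torrent            # -w sets the download dir; keeps seeding until killed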
Re: DBpedia hosting burden
Ivan Mikhailov wrote: If I were The Emperor of LOD I'd ask all grand dukes of datasources to put fresh dumps at some torrent with control of UL/DL ratio :) Last time I checked (which was quite a while ago though), loading DBpedia into a normal triple store such as Jena TDB didn't work very well due to many issues with the DBpedia RDF (e.g., problems with the URIs of external links scraped from Wikipedia). I don't know whether this is a bug in TDB or DBpedia, but I guess this is one of the problems causing people to use DBpedia online only - even if, for performance reasons, running it locally would be far better. Regards Malte
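(For context, the bulk load Malte refers to is normally a one-liner with TDB's command-line tools; the trouble he describes is the dump tripping the parser, not the loading procedure itself. Paths and file name here are examples:)

    tdbloader --loc=/data/tdb-dbpedia external_links_en.nt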
Re: DBpedia hosting burden
Last time I checked (which was quite a while ago though), loading DBpedia into a normal triple store such as Jena TDB didn't work very well due to many issues with the DBpedia RDF (e.g., problems with the URIs of external links scraped from Wikipedia). Agree. Common errors in LOD are:
-- single-quoted and double-quoted strings with newlines;
-- bnode predicates (but the SPARQL processor may ignore them!);
-- variables, but triples with variables are ignored;
-- literal subjects, but triples with them are ignored;
-- '/', '#', '%' and '+' in the local part of a QName (QName with path);
-- invalid symbols between '<' and '>', i.e. in relative IRIs.
That's why my own TURTLE parser is configurable to selectively report or ignore these errors. In addition I can relax TURTLE syntax to include popular violations like redundant delimiters and/or try to recover from lexical errors as much as possible, even if I should lose some ill triples together with some limited number of proper triples around them (GIGO mode, for Garbage In Garbage Out). Best Regards, Ivan Mikhailov OpenLink Software http://virtuoso.openlinksw.com
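Ivan's GIGO mode is a parser feature, but the same salvage idea can be sketched in a few lines of Python with rdflib, since N-Triples is line-oriented and bad lines can be dropped individually (an illustration of the idea, not Ivan's implementation; the file name is an example):

    from rdflib import Graph

    kept, dropped = Graph(), 0
    with open("external_links_en.nt") as f:
        for line in f:
            if not line.strip() or line.lstrip().startswith("#"):
                continue  # skip blank and comment lines
            try:
                kept.parse(data=line, format="nt")
            except Exception:
                dropped += 1  # ill-formed triple: Garbage In, Garbage Out
    print(len(kept), "triples kept,", dropped, "lines dropped")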
Re: DBpedia hosting burden
I ran the DBpedia 3.5 dump files through an N-Triples parser with checking. The report is here (it's 25K lines long): http://www.openjena.org/~afs/DBPedia35-parse-log-2010-04-15.txt It covers both strict errors and warnings of ill-advised forms. A few examples:
Bad IRI: =?(''[[Nepenthes
Bad IRI: http://www.european-athletics.org‎ (note the trailing invisible character)
Bad lexical forms for the value space: 1967-02-31^^http://www.w3.org/2001/XMLSchema#date (there is no February the 31st)
Warning about well-known ports of other protocols: http://stream1.securenetsystems.net:443
Warning about an explicit port 80: http://bibliotecadigitalhispanica.bne.es:80/
And use of . and .. in absolute URIs, which are all from the standard list of IRI warnings, e.g. Bad IRI: http://dbpedia.org/resource/.. Code: 8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment /../ not at the beginning of a relative reference, or it contains a /./ These should be removed.
Andy
Software used: The IRI checker, by Jeremy Carroll, is available from http://www.openjena.org/iri/ and Maven. The lexical form checking is done by Apache Xerces. The N-Triples parser is the one from TDB v0.8.5, which bundles the above two together. On 15/04/2010 9:54 AM, Malte Kiesel wrote: Ivan Mikhailov wrote: If I were The Emperor of LOD I'd ask all grand dukes of datasources to put fresh dumps at some torrent with control of UL/DL ratio :) Last time I checked (which was quite a while ago though), loading DBpedia into a normal triple store such as Jena TDB didn't work very well due to many issues with the DBpedia RDF (e.g., problems with the URIs of external links scraped from Wikipedia). I don't know whether this is a bug in TDB or DBpedia, but I guess this is one of the problems causing people to use DBpedia online only - even if, for performance reasons, running it locally would be far better. Regards Malte
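The date example illustrates the 'bad lexical form' class nicely: the literal is perfectly legal RDF, but the value lies outside the datatype's value space, which a simple calendar check catches immediately (a simplified sketch that ignores timezones and negative years):

    import datetime

    def valid_xsd_date(lexical):
        try:
            datetime.date(*map(int, lexical.split("-")))
            return True
        except ValueError:
            return False

    print(valid_xsd_date("1967-02-31"))  # False: there is no February the 31st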
Re: DBpedia hosting burden
Andy Seaborne wrote: On 15/04/2010 2:44 PM, Kingsley Idehen wrote: Andy, Great stuff, this is also why we are going to leave the current DBpedia 3.5 instance to stew for a while (until the end of this week or a little later). DBpedia users: Now is the time to identify problems with the DBpedia 3.5 dataset dumps. We don't want to continue reloading DBpedia (Static Edition and then recalibrating DBpedia-Live) based on faulty-dataset-related matters; we do have other operational priorities etc.. Faulty is a bit strong. Imperfect then, however subjective that might be :-) Many of the warnings are legal RDF, but bad lexical forms for the datatype, or IRIs that trigger some of the standard warnings (but they are still legal IRIs). Should they be included or not? Seems to me you can argue both for and against. external_links_en.nt.bz2 is the largest source of broken IRIs. DBpedia is a wonderful and important dataset, and being derived from elsewhere is unlikely to ever be perfect (for some definition of perfect). Better to have the data than to wait for perfection. That's been the approach thus far. Anyway, as I said, we have a window of opportunity to identify current issues prior to performing a 3.5.1 reload. I just don't want to reduce the reload cycles due to other items on our todo etc.. Andy -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: DBpedia hosting burden
Kingsley Idehen wrote: Andy Seaborne wrote: On 15/04/2010 2:44 PM, Kingsley Idehen wrote: Andy, Great stuff, this is also why we are going to leave the current DBpedia 3.5 instance to stew for a while (until the end of this week or a little later). DBpedia users: Now is the time to identify problems with the DBpedia 3.5 dataset dumps. We don't want to continue reloading DBpedia (Static Edition and then recalibrating DBpedia-Live) based on faulty-dataset-related matters; we do have other operational priorities etc.. Faulty is a bit strong. Imperfect then, however subjective that might be :-) Many of the warnings are legal RDF, but bad lexical forms for the datatype, or IRIs that trigger some of the standard warnings (but they are still legal IRIs). Should they be included or not? Seems to me you can argue both for and against. external_links_en.nt.bz2 is the largest source of broken IRIs. DBpedia is a wonderful and important dataset, and being derived from elsewhere is unlikely to ever be perfect (for some definition of perfect). Better to have the data than to wait for perfection. That's been the approach thus far. Actually meant to say: Anyway, as I said, we have a window of opportunity to identify current issues prior to performing a 3.5.1 reload. *** I just want to reduce the reload cycles due to other items on our todo etc.. *** :-) -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: DBpedia hosting burden
On Wed, Apr 14, 2010 at 8:04 PM, Dan Brickley dan...@danbri.org wrote: 'Bills' is the major operative word in a world where the Bill Payer and Database Maintainer is a footnote (at best) re. perception of what constitutes the DBpedia Project. If dbpedia.org linked to the sparql endpoints of mirrors then that would be a way of sharing the burden. Ian
Re: DBpedia hosting burden
Ian Davis wrote: On Wed, Apr 14, 2010 at 8:04 PM, Dan Brickley dan...@danbri.org wrote: 'Bills' is the major operative word in a world where the Bill Payer and Database Maintainer is a footnote (at best) re. perception of what constitutes the DBpedia Project. If dbpedia.org linked to the sparql endpoints of mirrors then that would be a way of sharing the burden. Ian Ian, When you use the term: SPARQL Mirror (note: Leigh's comments yesterday re. not orienting towards this), you open up a different set of issues. I don't want to revisit the SPARQL and SPARQL extensions debate etc.. Esp. as Virtuoso's SPARQL extensions are an integral part of what makes the DBpedia SPARQL endpoint viable, amongst other things. The burden issue is basically veering away from the key points, which are: 1. Use the DBpedia instance properly 2. When the instance enforces restrictions, understand that this is a Virtuoso *feature*, not a bug or server shortcoming. Beyond the dbpedia.org instance, there are other locations for: 1. Data Sets 2. SPARQL endpoints (like yours and a few others, where functionality mirroring isn't an expectation). Descriptor Resource handling via mirrors, BitTorrents, reverse proxies, cache directives, and some 303 heuristics etc.. are the real issues of interest. Note: I can send wild SPARQL CONSTRUCTs, DESCRIBEs, and HTTP GETs for Resource Descriptors to a zillion mirrors (maybe next year's April Fool's joke re. beauty of Linked Data crawling) and it will only broaden the scope of my dysfunctional behavior. The behavior itself has to be handled (one or a zillion mirrors). Anyway, we will publish our guide for working with DBpedia very soon. I believe this will add immense clarity to this matter. -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
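To make the 303 idea concrete: in, say, Apache mod_rewrite, redirecting dereferencing clients to a separate static descriptor host is a couple of lines (a sketch only; this is not OpenLink's actual setup, and the hostname and paths are hypothetical):

    RewriteEngine On
    # RDF-requesting clients get a 303 See Other to the static descriptor host
    RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
    RewriteRule ^/resource/(.*)$ http://static.example.org/data/$1.rdf [R=303,L]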
Re: DBpedia hosting burden
On Thu, Apr 15, 2010 at 9:57 PM, Kingsley Idehen kide...@openlinksw.com wrote: Ian Davis wrote: When you use the term: SPARQL Mirror (note: Leigh's comments yesterday re. not orienting towards this), you open up a different set of issues. I don't want to revisit the SPARQL and SPARQL extensions debate etc.. Esp. as Virtuoso's SPARQL extensions are an integral part of what makes the DBpedia SPARQL endpoint viable, amongst other things. Having the same dataset available via different implementations of SPARQL can only be healthy. If certain extensions are necessary, this will only highlight their importance. If there are public services offering SPARQL-based access to the DBpedia datasets (or subsets) out there on the Web, it would be rather useful if we could have them linked from a single easy-to-find page, along with information about any restrictions, quirks, subsetting, or value-adding features special to that service. I suggest using a section in http://en.wikipedia.org/wiki/DBpedia for this, unless someone cares to handle that on dbpedia.org. The burden issue is basically veering away from the key points, which are: 1. Use the DBpedia instance properly 2. When the instance enforces restrictions, understand that this is a Virtuoso *feature*, not a bug or server shortcoming. Yes, the showcase implementation needs to be used properly if it is going to survive the increasing developer attention LOD is getting. It is perfectly reasonable of you to make clear that, where there are limits, they are for everyone's benefit. Beyond the dbpedia.org instance, there are other locations for: 1. Data Sets 2. SPARQL endpoints (like yours and a few others, where functionality mirroring isn't an expectation). Is there a list somewhere of related SPARQL endpoints? (also other Wikipedia-derived datasets in RDF) Descriptor Resource handling via mirrors, BitTorrents, reverse proxies, cache directives, and some 303 heuristics etc.. are the real issues of interest. (am chatting with Daniel Koller in Skype now re the BitTorrent experiments...) Note: I can send wild SPARQL CONSTRUCTs, DESCRIBEs, and HTTP GETs for Resource Descriptors to a zillion mirrors (maybe next year's April Fool's joke re. beauty of Linked Data crawling) and it will only broaden the scope of my dysfunctional behavior. The behavior itself has to be handled (one or a zillion mirrors). Sure. But on balance, more mirrors rather than fewer should benefit everyone, particularly if 'good behaviour' is documented and enforced... Anyway, we will publish our guide for working with DBpedia very soon. I believe this will add immense clarity to this matter. Great! cheers, Dan
Re: DBpedia hosting burden
Dan Brickley wrote: On Thu, Apr 15, 2010 at 9:57 PM, Kingsley Idehen kide...@openlinksw.com wrote: Ian Davis wrote: When you use the term: SPARQL Mirror (note: Leigh's comments yesterday re. not orienting towards this), you open up a different set of issues. I don't want to revisit the SPARQL and SPARQL extensions debate etc.. Esp. as Virtuoso's SPARQL extensions are an integral part of what makes the DBpedia SPARQL endpoint viable, amongst other things. Having the same dataset available via different implementations of SPARQL can only be healthy. If certain extensions are necessary, this will only highlight their importance. If there are public services offering SPARQL-based access to the DBpedia datasets (or subsets) out there on the Web, it would be rather useful if we could have them linked from a single easy-to-find page, along with information about any restrictions, quirks, subsetting, or value-adding features special to that service. I suggest using a section in http://en.wikipedia.org/wiki/DBpedia for this, unless someone cares to handle that on dbpedia.org. +1 The burden issue is basically veering away from the key points, which are: 1. Use the DBpedia instance properly 2. When the instance enforces restrictions, understand that this is a Virtuoso *feature*, not a bug or server shortcoming. Yes, the showcase implementation needs to be used properly if it is going to survive the increasing developer attention LOD is getting. It is perfectly reasonable of you to make clear that, where there are limits, they are for everyone's benefit. Yep, and as promised we will publish a document; this is certainly a missing piece of the puzzle right now. Beyond the dbpedia.org instance, there are other locations for: 1. Data Sets 2. SPARQL endpoints (like yours and a few others, where functionality mirroring isn't an expectation). Is there a list somewhere of related SPARQL endpoints? (also other Wikipedia-derived datasets in RDF) See: http://delicious.com/kidehen/sparql_endpoint, that's how I track SPARQL endpoints, at the current time. Descriptor Resource handling via mirrors, BitTorrents, reverse proxies, cache directives, and some 303 heuristics etc.. are the real issues of interest. (am chatting with Daniel Koller in Skype now re the BitTorrent experiments...) Yes, seeing progress. Note: I can send wild SPARQL CONSTRUCTs, DESCRIBEs, and HTTP GETs for Resource Descriptors to a zillion mirrors (maybe next year's April Fool's joke re. beauty of Linked Data crawling) and it will only broaden the scope of my dysfunctional behavior. The behavior itself has to be handled (one or a zillion mirrors). Sure. But on balance, more mirrors rather than fewer should benefit everyone, particularly if 'good behaviour' is documented and enforced... Yes, LinkedData DNS remains a personal aspiration of mine, but no matter what we build, enforcement needs to be understood as a *feature* rather than a bug or deficiency etc.. Anyway, we will publish our guide for working with DBpedia very soon. I believe this will add immense clarity to this matter. Great! cheers, Dan -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
Re: DBpedia hosting burden
Dan Brickley wrote: (trimming cc: list to LOD and DBPedia) On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote: My comment wasn't a 'what is DBpedia?' lecture. It was about clarifying the crux of the matter, i.e., bandwidth consumption and its effects on other DBpedia users (as well as our own non-DBpedia related Web properties). (Leigh) I was just curious about usage volumes. We all talk about how central dbpedia is in the LOD cloud picture, and wondered if there were any publicly accessible metrics to help add some detail to that. Well here is the critical detail: people typically crawl DBpedia. They crawl it more than any other Data Space in the LOD cloud. They do so because DBpedia is still quite central to the burgeoning Web of Linked Data. Have you considered blocking DBpedia crawlers more aggressively, and nudging them to alternative ways of accessing the data? Yes. Some have cleaned up their act for sure. Problem is, there are others doing the same thing, who then complain about the instance in very generic fashion. While it is a shame to say 'no' to people trying to use linked data, this would be more saying 'yes, but not like that...'. I think we have an outstanding blog post / technical note about the DBpedia instance that hasn't been published (possibly due to the 3.5 and DBpedia-Live work we are doing); said note will cover how to work with the instance etc.. When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs via SPARQL, which is still ultimately the Export from DBpedia and Import to my data space mindset. That's useful to know, thanks. Do you have the impression that these folk are typically trying to copy the entire thing, or to make some filtered subset (by geographical view, topic, property etc.)? Many (and to some degree quite naturally) attempt to export the whole thing. Even when they're nudged to use OFFSET and LIMIT, the end result is multiple hits en route to complete export. Can studying these logs help provide different downloadable dumps that would discourage crawlers? We do have a solution in mind: basically, we are going to have a different place for the descriptor resources and redirect crawlers there via 303's etc.. That's as simple and precise as this matter is. From a SPARQL perspective, DBpedia is quite microscopic; it's when you factor in crawler mentality and network bandwidth that issues arise, and we deliberately have protection in place for crawlers. Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see anything discouraging crawlers. Where is the 'best practice' or 'acceptable use' advice we should all be following, to avoid putting needless burden on your servers and bandwidth? We'll get the guide out. As you mention, DBpedia is an important and central resource, thanks both to the work of the Wikipedia community, and those in the DBpedia project who enrich and make available all that information. It's therefore important that the SemWeb / Linked Data community takes care to remember that these things don't come for free, that bills need paying and that de-referencing is a privilege, not a right. 'Bills' is the major operative word in a world where the Bill Payer and Database Maintainer is a footnote (at best) re. perception of what constitutes the DBpedia Project. Our own ISPs even had to get in contact with us (last quarter of 2009) re. the amount of bandwidth being consumed by DBpedia etc.. 
If there are things we can do as a technology community to lower the cost of hosting / distributing such data, or to nudge consumers of it in the direction of more sustainable habits, we should do so. If there's not so much the rest of us can do but say 'thanks!', ... then, ...er, 'thanks!'. Much appreciated! For us, the most important thing is perspective. DBpedia is another space on a public network, thus it can't magically rewrite the underlying physics of wide area networking where access is open to the world. Thus, we can make a note about proper behavior and explain how we protect the instance such that everyone has a chance of using it (rather than a select few resource guzzlers). Are there any scenarios around eg. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? When we set up the Descriptor Resource host, these would certainly be considered. I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs? Also, we encourage use of gzip over HTTP :-) Kingsley cheers, Dan -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
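(The gzip encouragement is easy to follow from any HTTP client; with curl, for example, --compressed asks for gzip and transparently decompresses the response, and -L follows the 303 redirect. The URI is just an example:)

    curl --compressed -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Berlin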
Re: DBpedia hosting burden
Nathan wrote: Dan Brickley wrote: (trimming cc: list to LOD and DBPedia) On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote: My comment wasn't a 'what is DBpedia?' lecture. It was about clarifying the crux of the matter, i.e., bandwidth consumption and its effects on other DBpedia users (as well as our own non-DBpedia related Web properties). (Leigh) I was just curious about usage volumes. We all talk about how central dbpedia is in the LOD cloud picture, and wondered if there were any publicly accessible metrics to help add some detail to that. Well here is the critical detail: people typically crawl DBpedia. They crawl it more than any other Data Space in the LOD cloud. They do so because DBpedia is still quite central to the burgeoning Web of Linked Data. Have you considered blocking DBpedia crawlers more aggressively, and nudging them to alternative ways of accessing the data? While it is a shame to say 'no' to people trying to use linked data, this would be more saying 'yes, but not like that...'. When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs via SPARQL, which is still ultimately the Export from DBpedia and Import to my data space mindset. That's useful to know, thanks. Do you have the impression that these folk are typically trying to copy the entire thing, or to make some filtered subset (by geographical view, topic, property etc.)? Can studying these logs help provide different downloadable dumps that would discourage crawlers? That's as simple and precise as this matter is. From a SPARQL perspective, DBpedia is quite microscopic; it's when you factor in crawler mentality and network bandwidth that issues arise, and we deliberately have protection in place for crawlers. Looking at http://wiki.dbpedia.org/OnlineAccess#h28-14 I don't see anything discouraging crawlers. Where is the 'best practice' or 'acceptable use' advice we should all be following, to avoid putting needless burden on your servers and bandwidth? As you mention, DBpedia is an important and central resource, thanks both to the work of the Wikipedia community, and those in the DBpedia project who enrich and make available all that information. It's therefore important that the SemWeb / Linked Data community takes care to remember that these things don't come for free, that bills need paying and that de-referencing is a privilege, not a right. If there are things we can do as a technology community to lower the cost of hosting / distributing such data, or to nudge consumers of it in the direction of more sustainable habits, we should do so. If there's not so much the rest of us can do but say 'thanks!', ... then, ...er, 'thanks!'. Much appreciated! Are there any scenarios around eg. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs? I'd like to add: could the /data/* and /page/* resources all be made static files? (if they are not already) + make use of HTTP caching etc. Yes. Perhaps even the non-SPARQL-dependent parts could be hosted on another machine purely for static content? Perhaps an interim proxy which caches said resources permanently (then a cache rebuild on request when a new dataset is upgraded). Yes. Kingsley regards! 
-- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
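The caching Nathan suggests amounts to a few directives on the static descriptor host; a sketch using Apache's mod_expires and mod_headers (paths and lifetime are illustrative, and a reverse proxy such as Squid would then honour these):

    <Location /data/>
        ExpiresActive On
        ExpiresDefault "access plus 7 days"   # clients re-fetch at most weekly
        Header set Cache-Control "public"     # shared caches may store the responses
    </Location>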
Re: DBpedia hosting burden
On Wed, Apr 14, 2010 at 1:58 PM, Dan Brickley dan...@danbri.org wrote: (trimming cc: list to LOD and DBPedia) Using Dan's trimmed list to continue... On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote: When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs via SPARQL, which is still ultimately the Export from DBpedia and Import to my data space mindset. Is this necessarily true? Couldn't the CONSTRUCT and/or DESCRIBE queries be used to find resources and view the whole graph (or specialized subsets) to determine if it's actually what is being sought? Is it better for DBpedia to do SELECTs and then retrieve the resource URIs individually? I suppose rather than assume that the data is all being exported into another space (which, I would think, is definitely happening -- having data locally aids tremendously in indexing, for example) it could be a case of people just using SPARQL the way it seems that SPARQL should work? -Ross.
Re: DBpedia hosting burden
Ross Singer wrote: On Wed, Apr 14, 2010 at 1:58 PM, Dan Brickley dan...@danbri.org wrote: (trimming cc: list to LOD and DBPedia) Using Dan's trimmed list to continue... On Wed, Apr 14, 2010 at 7:09 PM, Kingsley Idehen kide...@openlinksw.com wrote: When people aren't crawling, they are executing CONSTRUCTs or DESCRIBEs via SPARQL, which is still ultimately the Export from DBpedia and Import to my data space mindset. Is this necessarily true? Couldn't the CONSTRUCT and/or DESCRIBE queries be used to find resources and view the whole graph (or specialized subsets) to determine if it's actually what is being sought? I meant: they are sending a series of these query patterns with the same goal in mind: an export from DBpedia for import into their own Data Spaces. Is it better for DBpedia to do SELECTs and then retrieve the resource URIs individually? You can, and should, use the full gamut of SPARQL queries; the issue is how they are used. On our side, we've always had the ability to protect the server. In recent times, we simply up the ante re. protection against problematic behavior. My only concern is that the tightening of control is sometimes misconstrued as a problem with the instance etc.. I suppose rather than assume that the data is all being exported into another space (which, I would think, is definitely happening -- having data locally aids tremendously in indexing, for example) it could be a case of people just using SPARQL the way it seems that SPARQL should work? Hence the onus is on us to make a smart server, which we've had since day one. Again, the issue is: when the server protects itself, the behavior is being misconstrued as an instance problem. If you make a local instance of Virtuoso + DBpedia, you will see what I mean, and basically it would come down to what Nathan explained in this recent post [1]. Key excerpt: ...The public lod and dbpedia endpoints really do no justice as to just how powerful and fast Virtuoso is, queries which take a few seconds on the public endpoint return in hundredths of a second on my local (low spec) server... Links: 1. http://webr3.org/blog/experiments/linked-data-extractor-prototype-details/ Kingsley -Ross. -- Regards, Kingsley Idehen President CEO OpenLink Software Web: http://www.openlinksw.com Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca: kidehen
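For reference, the two access styles Ross contrasts look like this in plain SPARQL (the resource and class URIs are just examples):

    DESCRIBE <http://dbpedia.org/resource/Berlin>
    # one round trip returning the whole graph around the resource

    SELECT ?city WHERE { ?city a <http://dbpedia.org/ontology/City> } LIMIT 100
    # ...followed by a GET on each ?city URI to fetch its description individually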
Re: DBpedia hosting burden
Dan, Are there any scenarios around eg. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs? If I were The Emperor of LOD I'd ask all grand dukes of datasources to put fresh dumps at some torrent with control of UL/DL ratio :) For reasons I can't understand, this idea is proposed a few times per year but never tried. The other approach is to implement scalable and safe patch/diff on RDF graphs plus subscription to them. That's what I'm writing ATM. Using this toolkit, it would be quite cheap to place a local copy of LOD on any appropriate box in any workgroup. A local copy will not require any hi-end equipment for two reasons: the database can be much smaller than the public one (one may install only a subset of LOD) and it will usually be less sensitive to the RAM/disk ratio (a small number of clients will result in better locality, because any given individual tends to browse interrelated data whereas a crowd produces a chaotic sequence of requests). Crawlers and mobile apps will not migrate to local copies, but some complicated queries will go away from the bottleneck server, and that would be good enough. Best Regards, Ivan Mikhailov OpenLink Software http://virtuoso.openlinksw.com
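The naive version of such a diff fits in a few lines of Python with rdflib, which shows the shape of the idea even though Ivan's point is precisely that a scalable, blank-node-safe version is harder (file names are assumptions; blank nodes defeat this simple set comparison):

    from rdflib import Graph

    old, new = Graph(), Graph()
    old.parse("dbpedia_3.4.nt", format="nt")
    new.parse("dbpedia_3.5.nt", format="nt")

    # a patch is just (triples to delete, triples to insert)
    removed = set(old) - set(new)
    added = set(new) - set(old)
    print(len(removed), "triples to delete,", len(added), "to insert")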
Re: DBpedia hosting burden
On Wed, Apr 14, 2010 at 8:11 PM, Kingsley Idehen kide...@openlinksw.com wrote: Some have cleaned up their act for sure. Problem is, there are others doing the same thing, who then complain about the instance in very generic fashion. They're lucky it exists at all. I'd refer them to this Louis CK sketch - http://videosift.com/video/Louie-CK-on-Conan-Oct-1st-2008?fromdupe=We-live-in-an-amazing-amazing-world-and-we-complain (if it stays online...). While it is a shame to say 'no' to people trying to use linked data, this would be more saying 'yes, but not like that...'. I think we have an outstanding blog post / technical note about the DBpedia instance that hasn't been published (possibly due to the 3.5 and DBpedia-Live work we are doing); said note will cover how to work with the instance etc.. [..] We do have a solution in mind: basically, we are going to have a different place for the descriptor resources and redirect crawlers there via 303s etc.. [...] We'll get the guide out. That sounds useful. As you mention, DBpedia is an important and central resource, thanks both to the work of the Wikipedia community, and those in the DBpedia project who enrich and make available all that information. It's therefore important that the SemWeb / Linked Data community takes care to remember that these things don't come for free, that bills need paying and that de-referencing is a privilege, not a right. 'Bills' is the major operative word in a world where the Bill Payer and Database Maintainer is a footnote (at best) re. perception of what constitutes the DBpedia Project. Yes, I'm sure some are thoughtless and take it for granted; but also that others are well aware of the burdens. (For that matter, I'm not myself so sure how Wikipedia cover their costs or what their longer-term plan is...). For us, the most important thing is perspective. DBpedia is another space on a public network, thus it can't magically rewrite the underlying physics of wide area networking where access is open to the world. Thus, we can make a note about proper behavior and explain how we protect the instance such that everyone has a chance of using it (rather than a select few resource guzzlers). This I think is something others can help with, when presenting LOD and related concepts: to encourage good habits that spread the cost of keeping this great dataset globally available. So all those making slides, tutorials, blog posts or software tools have a role to play here. Are there any scenarios around eg. BitTorrent that could be explored? What if each of the static files in http://dbpedia.org/sitemap.xml were available as torrents (or magnet: URIs)? When we set up the Descriptor Resource host, these would certainly be considered. Ok, let's take care to explore that then; it would probably help others too. There must be dozens of companies and research organizations who could put some bandwidth resources into this, if only there was a short guide to setting up a GUI-less bittorrent tool and configuring it appropriately. Are there any bittorrent experts on these mailing lists who could suggest next practical steps here (not necessarily dbpedia-specific)? (ah, I see a reply from Ivan; copying it in here...) If I were The Emperor of LOD I'd ask all grand dukes of datasources to put fresh dumps at some torrent with control of UL/DL ratio :) For reasons I can't understand, this idea is proposed a few times per year but never tried. 
I suspect BitTorrent is in some ways somehow 'taboo' technology, since it is most famous for being used to distribute materials that copyright-owners often don't want distributed. I have no detailed idea how torrent files are made, how trackers work, etc. I started poking around magnet: a bit recently but haven't got a sense for how solid that work is yet. Could a simple Wiki page be used for sharing torrents? (plus published hashes of files elsewhere for integrity checks). What would it take to get started? Perhaps if http://wiki.dbpedia.org/Downloads35 had the sha1 for each download published (rdfa?), then others could experiment with torrents and downloaders could cross-check against an authoritative description of the file from dbpedia? I realise that would only address part of the problem/cost, but it's a widely used technology for distributing large files; can we bend it to our needs? Also, we encourage use of gzip over HTTP :-) Are there any RDF toolkits in need of a patch to their default setup in this regard? Tutorials that need fixing, etc.? cheers, Dan ps. re big datasets, Library of Congress apparently are going to have the complete twitter archive - see http://twitter.com/librarycongress/status/12172217971 - http://blogs.loc.gov/loc/2010/04/how-tweet-it-is-library-acquires-entire-twitter-archive/