implied datasets
This is the RDF version of the question I just sent to the CKAN list [1]. It is somewhat a policy question, and I believe that in RDF terms the open world means the answer is basically: yes, you can say what you want.

Consider the diagram here,

    http://semantic.ckan.net/group/?group=http://ckan.net/group/lld

which shows the interconnections between library datasets. You'll notice there is a partition. This partition is not really there. Here's why.

In the library world, perhaps more than elsewhere, it is common to do things like this,

    <http://example.org/issn/1234-5678> a bibo:Journal;
        blah blah blah some descriptions;
        owl:sameAs <urn:issn:1234-5678>.

This is because there are standard identifiers for lots of things that are found in libraries, and they even have a urn namespace. So when publishing this data it is a lot easier to do this than to go out and use something like Silk to try to find links. They're already implied by the identifiers we have in hand.

So given two such datasets, they are indeed connected in the way we think of RDF datasets as being connected. The connection does not necessarily have semantics as strict as owl:sameAs: we would probably not choose to actually materialise its productions here, especially since the entities might be modelled in different, incompatible ways, and owl:sameAs is really not the right predicate to be using. But they are at least connected with semantics along the lines of rdfs:seeAlso.

The point is, the two datasets are transitively connected. But because we have no extant dataset that contains all the ISSNs, particularly all ISSNs where the identifier is expressed as a urn: URI, we have nothing to put in our voiD linkset, which is how the relationships between these datasets are represented at a high level. So we have an apparent partition.

What I propose to do here is invent an implied dataset, the one that contains in principle the entire list of ISSNs. Something like,

    urn:issn:- a rdf:Resource.
    urn:issn:-0001 a rdf:Resource.
    ...

but which actually should contain X a rdf:Resource for everything in the valid lexical space of urn:issn, which may be (countably) infinite for all I know.

Then for each dataset that I have that uses the links to this space, I count them up and make a linkset pointing at this imaginary dataset.

Obviously the same strategy applies anywhere there exists some kind of standard identifier that is not an HTTP URI.

Does this make sense?

Can we sensibly talk about and even assert the existence of a dataset of infinite size? (whatever existence means).

Is this an abuse of DCat/voiD?

Is this class of datasets a subset of sameAs.org (assuming sameAs.org to be complete in principle)?

Cheers,
-w

[1] http://lists.okfn.org/pipermail/ckan-discuss/2011-May/001269.html

--
William Waites <mailto:w...@styx.org>
http://river.styx.org/ww/ <sip:w...@styx.org>
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
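A minimal sketch of the counting step described above, assuming a dataset is available as plain (subject, predicate, object) tuples. The URI chosen for the implied dataset and the dataset names are hypothetical; the void: terms are from the voiD vocabulary, but the output is only Turtle-like text, not a validated description.

```python
from collections import Counter

# Hypothetical identifier for the implied dataset; nothing in the thread fixes one.
IMPLIED_ISSN_DATASET = "http://example.org/dataset/issn"


def count_urn_links(triples, prefix="urn:issn:"):
    """Count triples whose object falls in the URN namespace, per predicate."""
    counts = Counter()
    for _subject, predicate, obj in triples:
        if obj.startswith(prefix):
            counts[predicate] += 1
    return counts


def linkset_description(dataset_uri, triples):
    """Render one void:Linkset per link predicate as Turtle-like text."""
    blocks = []
    for predicate, n in sorted(count_urn_links(triples).items()):
        blocks.append(
            "[] a void:Linkset ;\n"
            f"    void:subjectsTarget <{dataset_uri}> ;\n"
            f"    void:objectsTarget <{IMPLIED_ISSN_DATASET}> ;\n"
            f"    void:linkPredicate {predicate} ;\n"
            f"    void:triples {n} ."
        )
    return "\n".join(blocks)


# Toy dataset: two journals that point at urn:issn identifiers.
ds1 = [
    ("http://example.org/issn/1234-5678", "rdf:type", "bibo:Journal"),
    ("http://example.org/issn/1234-5678", "owl:sameAs", "urn:issn:1234-5678"),
    ("http://example.org/issn/0378-5955", "owl:sameAs", "urn:issn:0378-5955"),
]

print(linkset_description("http://example.org/dataset/ds1", ds1))
```

The implied dataset itself is never enumerated here; only the links into its identifier space are counted, which is all the linkset needs.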
Re: implied datasets
Hi William,

On 23 May 2011 14:01, William Waites <w...@styx.org> wrote:
> ...
> Then for each dataset that I have that uses the links to this space,
> I count them up and make a linkset pointing at this imaginary dataset.
>
> Obviously the same strategy for anywhere there exist some kind of
> standard identifiers that are not URIs in HTTP.
>
> Does this make sense?

I'm not sure that the dataset is imaginary, but what you're doing seems eminently sensible to me. I've been working on a little project that I hope to release shortly that aims to facilitate this kind of linking, especially where those non-URI identifiers, or Literal Keys [1], are used to build patterned URIs.

> Can we sensibly talk about and even assert the existence of a dataset
> of infinite size? (whatever existence means).

I think so: we can assert what kinds of things it contains and describe it in general terms, even if we can't enumerate all of its elements.

It may be more natural to think of these as services rather than datasets, i.e. a service that accepts some keys as input and returns a set of assertions. In this case the assertions would be links to other datasets.

> Is this an abuse of DCat/voiD?

Not in my view; I think the notion of dataset is already pretty broad.

> Are this class of datasets subsets of sameAs.org (assuming sameAs.org
> to be complete in principle?)

Subsets if they only asserted sameAs links, but I think you're suggesting that this may be too strict. I think there's potentially a whole set of related predicate based services [2] that provide useful indexes of existing datasets, or expose additional annotations of extra sources.

The project I've been working on facilitates not just sameAs links, but any form of links that can be derived from shared URI patterns. This would include topic/subject based linking. ISBN was one of the use cases I had in mind, but there are others.

Cheers,

L.

[1]. http://patterns.dataincubator.org/book/literal-keys.html
[2]. http://www.ldodds.com/blog/2010/03/predicate-based-services/

--
Leigh Dodds
Programme Manager, Talis Platform
Mobile: 07850 928381
http://kasabi.com
http://talis.com

Talis Systems Ltd
43 Temple Row
Birmingham
B2 5LS
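One way to picture the "service rather than dataset" reading is a function that takes a literal key and returns assertions built from URI patterns. This is only a sketch, not the project mentioned above: the URI patterns and the choice of rdfs:seeAlso are assumptions, and the check-character rule is the standard ISSN one (weights 8 down to 2, modulo 11).

```python
def issn_check_digit(first_seven):
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    remainder = (11 - total % 11) % 11
    return "X" if remainder == 10 else str(remainder)


def is_valid_issn(key):
    """True if key has the form NNNN-NNNC and the check character matches."""
    digits = key.replace("-", "").upper()
    if len(key) != 9 or key[4] != "-" or len(digits) != 8:
        return False
    if not digits[:7].isdigit():
        return False
    return issn_check_digit(digits[:7]) == digits[7]


# Hypothetical URI patterns; real publishers would choose their own.
PATTERNS = ["urn:issn:{key}", "http://example.org/issn/{key}"]


def links_for(key):
    """Return (subject, predicate, object) assertions derived from one key."""
    if not is_valid_issn(key):
        return []
    uris = [pattern.format(key=key) for pattern in PATTERNS]
    return [(uris[0], "rdfs:seeAlso", other) for other in uris[1:]]


print(links_for("0378-5955"))
print(links_for("1234-5678"))  # placeholder from the thread; fails the check
```

Membership in the implied dataset is decided by `is_valid_issn` rather than by enumeration, so nothing has to be materialised. Since an ISSN is seven digits plus a check character, this particular key space is large but finite; the same shape would work for other key spaces with a different validity test.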
Re: implied datasets
> Here's why. In the library world, perhaps more than elsewhere, it is
> common to do things like this,
>
>     <http://example.org/issn/1234-5678> a bibo:Journal;
>         blah blah blah some descriptions;
>         owl:sameAs <urn:issn:1234-5678>.
>
> This is because there are standard identifiers for lots of things that
> are found in libraries and they even have a urn namespace. So it is a
> lot easier when publishing this data than to go out and use something
> like silk to try to find links. They're already implied by the
> identifiers we have in hand.

It seems to me that this is another demonstration of confusion that wouldn't happen if we all understood RDF IDs to be pure identifiers that belong to the graph representation of a dataset and nothing else. ISSN numbers are not graph-node IDs; they are real-world conceptual identifiers like social security numbers or SKUs or country codes. Many different data structures might reference them in very different ways, so it should be fairly clear that they cannot uniquely identify anything but themselves, and thus they should themselves be represented in RDF as nodes. So the above should be more like:

    ex:1 a ex:Journal;
        rdfs:label "International Digest of Periodicity";
        ex:issn ex:2;
        ex:blah ex:3.

    ex:2 a ex:ISSN;
        rdfs:label "1234-5678";
        ex:journal ex:1.

glenn
Re: implied datasets
* [2011-05-23 11:34:56 -0400] glenn mcdonald <gl...@furia.com> wrote:

] It seems to me that this is another demonstration of confusion that wouldn't
] happen if we all understood RDF IDs to be pure identifiers that belong to
] the graph representation of a dataset and nothing else. ISSN numbers are not
] graph-node IDs, they are real-world conceptual identifiers like social
] security numbers or SKUs or country codes. Many different data structures
] might reference them in very different ways, so it should be fairly clear
] that they cannot uniquely identify anything but themselves, and thus they
] should themselves be represented in RDF as nodes. So the above should be
] more like:

Hi Glenn,

That may be so, but it misses the point. The point is that there is a field, be it a URI or a literal, however modelled, that can be used to join between two datasets. This join field is hidden, in that there exists no (known) dataset that contains all possible values it can take on.

So when you are trying to describe datasets, you have a situation where you can say that DS1 and DS2 are indirectly linked, and you want to make that link explicit so that you can put it on diagrams and such.

Saying

    DS1 indirectlyLinkedTo DS2

is no good, because then you get O(n^2) such statements, which makes your visualisation messy. Furthermore, you don't know without examining the datasets that they have any common values on the join field, so they may not actually be linked except in a degenerate sense.

Inventing a dataset that contains only the join field lets you say something useful and coherent about the relationship between DS1 and DS2.

There is nothing in this that requires the datasets themselves to be RDF. See my other post to ckan-discuss on the same topic, expressed in terms of the relationships between CSV files.

Cheers,
-w

--
William Waites <mailto:w...@styx.org>
http://river.styx.org/ww/ <sip:w...@styx.org>
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
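The O(n^2) remark can be checked with a small count. This sketch compares pairwise "indirectly linked" statements with one link per dataset to a shared hub; the predicate names and the hub identifier are hypothetical, used only for illustration.

```python
from itertools import combinations


def pairwise_statements(datasets):
    """One statement per unordered pair of datasets: n*(n-1)/2 in total."""
    return [(a, "ex:indirectlyLinkedTo", b) for a, b in combinations(datasets, 2)]


def hub_statements(datasets, hub="ex:issn-dataset"):
    """One linkset per dataset, each pointing at the implied dataset: n in total."""
    return [(d, "ex:linksTo", hub) for d in datasets]


datasets = [f"ex:DS{i}" for i in range(1, 11)]
print(len(pairwise_statements(datasets)))  # 45
print(len(hub_statements(datasets)))       # 10
```

With ten datasets the pairwise description needs 45 statements against 10 for the hub, and the gap widens quadratically as datasets are added.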
Re: implied datasets
> That may be so but it misses the point. The point is there is a field,
> be it a URI or a literal however modelled, that can be used to join
> between two datasets. This join field is hidden in that there exists
> no (known) dataset that contains all possible values it can take on.

Hmm. I'm still not getting why this is a problem. It seems like as long as the ISSNs in both datasets are represented by nodes with type-assignments, all you have to assert is that the two types are equivalent (e.g. same URIs, or owl:equivalentClass...), and that their rdfs:labels uniquely define them (e.g. owl:InverseFunctionalProperty...). I don't (yet) see why you need an imaginary extra dataset in between.
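The join suggested here can be sketched as follows, assuming both datasets already model ISSNs as nodes of the same type and that the label is treated as an inverse-functional key. The node names and the toy data are made up for illustration; no reasoner is involved, only a lookup on the label.

```python
def index_by_label(triples, issn_type="ex:ISSN"):
    """Map each ISSN label to the node carrying it, for nodes of issn_type."""
    typed = {s for s, p, o in triples if p == "rdf:type" and o == issn_type}
    return {o: s for s, p, o in triples if p == "rdfs:label" and s in typed}


def join_on_label(triples_a, triples_b):
    """Pair up nodes from two datasets that share an ISSN label."""
    a, b = index_by_label(triples_a), index_by_label(triples_b)
    return sorted((a[key], b[key]) for key in a.keys() & b.keys())


ds1 = [
    ("ds1:2", "rdf:type", "ex:ISSN"),
    ("ds1:2", "rdfs:label", "1234-5678"),
]
ds2 = [
    ("ds2:7", "rdf:type", "ex:ISSN"),
    ("ds2:7", "rdfs:label", "1234-5678"),
    ("ds2:8", "rdf:type", "ex:ISSN"),
    ("ds2:8", "rdfs:label", "0378-5955"),
]

print(join_on_label(ds1, ds2))  # [('ds1:2', 'ds2:7')]
```

Note that this needs both datasets in hand to find the overlap, which is the step the implied-dataset proposal tries to avoid when only describing datasets at the catalogue level.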
Re: implied datasets
> If one has one dataset (say) and wants to find other datasets that
> might be usefully combined with it to do some analysis, it would (I
> think) be useful to have something like this to help with the discovery.

OK, but what I'm not seeing is how this extra imaginary dataset helps with discovery, either. Isn't a type-assertion pretty much exactly what you're saying here: a statement that this entity belongs to a set (which may or may not be completely enumerated in any one place, or even at all)? So what is this implied dataset doing that the type assertions are not?
Re: implied datasets
I think that this area of useful bridging sets of instance URIs is ripe for exploring and exploiting.

I won't go into whether the April Fool's joke of the integers might actually be useful (note that dbpedia has quite a lot of URIs for numbers), but there will be many other standard URIs for things that we take for granted. The recent colour ones might seem like a joke as well, but perhaps not?

My favourite at the moment is

    http://data.totl.net/chess/state/rnbqkbnr__8_8_8_8__RNBQKBNR_w_KQkq_-_0_1

A very large number of URIs that describe chess positions. And it tells you things like the next legal move in RDF.

So if I had loads of games in RDF, I could reliably do some fun queries about games with move sequences, etc.

Seems to me it is very similar to William's requirements. However, it does it slightly differently, by having resolvable URIs for the positions, which can easily go to the more conventional representations.

Is that not a better way of doing what you want, William? Bring up a simple site that actually has http://example.org/issn/1234-5678 or perhaps more appropriately something like http://totl.net/issn/1234-5678 which actually resolves to some (generated) RDF snippet that is sensible. I keep meaning to build something to do it easily, but keep hoping that Leigh will do it first :-)

In fact, when generating RDF from any dataset, in some sense, if you accept that some of the strings uniquely identify NIRs (non-information resources), and then generate URIs in more than one context based on the string, you are doing exactly the same thing locally. I guess not very clean, but as you describe it, very practical.

Best
Hugh

--
Hugh Glaser,
Intelligence, Agents, Multimedia
School of Electronics and Computer Science,
University of Southampton,
Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/
Re: implied datasets
* [2011-05-23 14:46:47 +0100] Leigh Dodds <leigh.do...@talis.com> wrote:

] I'm not sure that the dataset is imaginary, but what you're doing
] seems eminently sensible to me. I've been working on a little project
] that I hope to release shortly that aims to facilitate this kind of
] linking, especially where those non-URI identifiers, or Literal Keys
] [1] are used to build patterned URIs.

The thing is, as with Hugh's suggestion, as a curator of datasets I have little control or influence over how the dataset authors choose to do this. I have noticed a common pattern though (urn:issn for example), and encouraging patterns like this is helpful, I think.

] It may be more natural to think of these more as services though than
] datasets. i.e. a service that accepts some keys as input and returns a
] set of assertions. In this case the assertions would be links to other
] datasets.

This is a bit different. I was thinking of an implied dataset that would have no links outwards at all.

] Subsets if they only asserted sameAs links, but I think you're
] suggesting that this may be too strict. I think there's potentially a
] whole set of related predicate based services [2] that provide
] useful indexes of existing datasets, or expose additional annotations
] of extra sources.

So this would be a separation of edge-labelled graphs into a bunch of perhaps more manageable basic (V,E) graphs. An interesting way of chopping things up.

The reason I think sameAs is too strict, aside from people putting sameAs when they really mean similarTo, can be shown by another library example. Broadly there seem to be two strategies for representing things like books: the flat BIBO style and the more elaborate FRBR/WEMI style. So if I have two datasets, one in each, I might have something like,

    ds1:flc a bibo:Book;
        dc:title "The Feynman Lectures on Computation";
        dc:creator [ foaf:name "Richard Feynman" ];
        dc:language "eng";
        owl:sameAs <urn:isbn:0738202967>.

    ds2:flc a frbr:Manifestation;
        frbr:manifestationOf [ a frbr:Expression;
            dc:language "en";
            frbr:expressionOf [ a frbr:Work;
                dc:title "The Feynman Lectures on Computation";
                dc:creator [ foaf:name "Richard Feynman" ] ] ];
        owl:sameAs <urn:isbn:0738202967>.

Both the authors have done something prima facie reasonable with the sameAs, but if you actually run it transitively you get into trouble.

This also goes to what Glenn was saying. These datasets are obviously related in a meaningful way, and there may well be useful ways for someone who studies them to draw links between them, but it isn't as simple as saying they both have things of the same type. In fact, working out what type assertions are appropriate to clarify the relationship between these datasets is the kind of analysis that I would want to facilitate, not try to do up front.

What I can say is that they both have references (that may or may not be strictly believable) to this funny non-dereferenceable URI (or equivalently, string literal of a certain kind).

Cheers,
-w

--
William Waites <mailto:w...@styx.org>
http://river.styx.org/ww/ <sip:w...@styx.org>
F4B3 39BF E775 CF42 0BAB 3DF0 BE40 A6DF B06F FD45
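To see what "running it transitively" does to the two records above, here is a small sketch that merges resources connected by owl:sameAs and collects the types that land on each merged node. It is a plain union-find over tuples, not an OWL reasoner, and only the top-level triples from the example are included.

```python
from collections import defaultdict


def same_as_classes(triples):
    """Build a find() function over owl:sameAs equivalence classes (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            parent[find(s)] = find(o)
    return find


def merged_types(triples):
    """Collect the rdf:type values that end up on each merged resource."""
    find = same_as_classes(triples)
    types = defaultdict(set)
    for s, p, o in triples:
        if p == "rdf:type":
            types[find(s)].add(o)
    return {root: sorted(t) for root, t in types.items()}


triples = [
    ("ds1:flc", "rdf:type", "bibo:Book"),
    ("ds1:flc", "owl:sameAs", "urn:isbn:0738202967"),
    ("ds2:flc", "rdf:type", "frbr:Manifestation"),
    ("ds2:flc", "owl:sameAs", "urn:isbn:0738202967"),
]

print(merged_types(triples))
```

After merging, a single resource is typed both bibo:Book and frbr:Manifestation, so the flat record and one level of the FRBR hierarchy are collapsed together. That is the kind of conflation the message warns about, and a weaker link such as rdfs:seeAlso would not trigger it.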
Re: data vs. information (Was Re: implied datasets)
Thanks for all your thoughts William - food for pondering.
A few comments, which I find hard to interleave - sorry.

The totl.net site doesn't have to be hit for the URIs to have value. dbpedia doesn't even have to exist for the URIs to have value (ducks /) - well, at least not very often; it may have been down for the last month, for all I know, but I have been using the URIs. They are just there as useful identifiers.

Ah yes, crossref.org; I realise all this is quite controversial in the publishing sphere - I think there is a site that does Linked Data to DOI; I don't think crossref.org does it, but I can't remember which one it is.

It is rather strange that we worry about having authoritative, or at least agreed, URIs for hard things like people, but don't manage to have them for less complex things (at least in terms of enumeration) such as pantone colours or chemical elements and, yes, ISSNs. dbpedia can sort of fit this role, if wikipedia had pages on them, but somehow it feels like clear datasets such as this should be sort of taken out.

And of course, when we get to datasets of essentially arbitrary size (IPv6 URIs anyone? Or even v4), we are in a different world of representation and service.

Best
Hugh

On 23 May 2011, at 23:02, William Waites wrote:

> * [2011-05-23 18:19:49 +] Hugh Glaser <h...@ecs.soton.ac.uk> wrote:
>
> ] I won't go into whether the April Fool's joke of the integers might
> ] actually be useful (note that dbpedia has quite a lot of URIs for numbers),
> ] but there will be many other standard URIs for things that we take for granted.
> ] The recent colour ones might seem like a joke as well, but perhaps not?
>
> I had this a little bit in mind when I wrote the original mail, and
> this goes nicely to some related thoughts about quality. The thing
> with the linked open numbers is that it makes the point pretty neatly,
> I think, that it is silly to try to materialise everything that can be
> stated in RDF.
>
> A small computer program that describes numbers might have the same
> information content as all of those numbers made manifest. And it
> would take up a lot less disk space and be much faster to query. But
> you could still use it to refer to the numbers when you needed to.
>
> Is this always the case? It seems to me the tradeoff is speed vs.
> space. For some aspects of numbers this makes sense (e.g. their
> representation in roman numerals), but what about computationally
> expensive things like their prime factors? This quickly becomes too
> expensive to calculate on the fly, but actually a lookup service could
> make a certain amount of sense...
>
> ] My favourite at the moment is
> ] http://data.totl.net/chess/state/rnbqkbnr__8_8_8_8__RNBQKBNR_w_KQkq_-_0_1
> ] A very large number of URIs that describe chess positions.
> ] And tells you things like the next legal move in RDF.
> ]
> ] So if I had loads of games in RDF, I could reliably do some fun queries
> ] about games with move sequences, etc.
> ]
> ] Seems to me it is very similar to William's requirements.
>
> Oh, that's beautiful.
>
> ] However, it does it slightly differently, by having resolvable URIs for the
> ] positions, which can easily go to the more conventional representations.
>
> And this works because the service has a compact representation of the
> space of all possible positions and moves: a small computer
> program. You can then materialise the small subspace that you're
> interested in and run some analysis on it.
>
> It's nice that totl wants to run that program for me, but I guess they
> could just as easily give me the program and let me run it myself. Bit
> for bit they would have given me far more information and far less
> data. But then it might be more convenient for me to use their service
> if I only have a relatively small number of positions/moves to
> consider.
>
> Might the service be useful for a program to help study how to play
> chess? Quite possibly. Would it make sense to build a chess-playing
> computer on top of their service? It would be interesting to see, but I
> suspect network traffic and delays would be prohibitive.
>
> It's the same story with the trend of taking CSV files, a pretty
> compact and easy to work with representation of tabular data, and
> expanding them into giant RDF datasets that take up a lot of disk
> space and are cumbersome to query. A service to refer to a cell in a
> spreadsheet, to give it a URI and return some small amount of data,
> would be useful. Proactively materialising the whole thing (not
> infinite but in some cases still very large) is probably not.
>
> ] Is that not a better way of doing what you want, William?
> ] Bring up a simple site that actually has http://example.org/issn/1234-5678
> ] or perhaps more appropriately something like http://totl.net/issn/1234-5678
> ] which actually resolves to some (generated) RDF snippet that is
> ] sensible.
>
> So quite reasonable, and I believe but am not certain that
> crossref.org has already done exactly this for ISSNs (but that