implied datasets

2011-05-23 Thread William Waites
This is the RDF version of the question I just sent to the CKAN list
[1]. It is somewhat a policy question and I believe that in RDF terms
the open world means the answer is basically, yes you can say what
you want.

Consider the diagram here,

  http://semantic.ckan.net/group/?group=http://ckan.net/group/lld

which shows the interconnections between library datasets. You'll notice
there is a partition. This partition is not really there.

Here's why. In library world, perhaps more than elsewhere, it is
common to do things like this,

<http://example.org/issn/1234-5678> a bibo:Journal;
    # ... some descriptions ...
    owl:sameAs <urn:issn:1234-5678>.

This is because there are standard identifiers for lots of things that
are found in libraries, and they even have a URN namespace. So when
publishing this data it is a lot easier to use them than to go out and
use something like Silk to try to find links. The links are already
implied by the identifiers we have in hand.

So given two such datasets, they are indeed connected in the way we
think of RDF datasets as being connected, though not necessarily with
semantics as strict as owl:sameAs. We would probably not choose to
actually materialise its productions here, especially since the
entities might be modelled in different, incompatible ways, so
owl:sameAs is really not the right predicate to be using. But they are
at least connected with semantics along the lines of rdfs:seeAlso. The
point is, the two datasets are transitively connected.

But because we have no extant dataset that contains all the ISSNs,
particularly all ISSNs where the identifier is expressed as a urn:
URI, we have nothing to put in our voiD linkset -- which is how the
relationships between these datasets are represented at a high
level. So we have an apparent partition.

What I propose to do here is invent an implied dataset, the one that
contains in principle the entire list of ISSNs. Something like,

<urn:issn:0000-0000> a rdfs:Resource.
<urn:issn:0000-0001> a rdfs:Resource.
...

but which actually should contain X a rdfs:Resource for everything in
the valid lexical space of urn:issn, which may be (countably) infinite
for all I know.

Then for each dataset that I have that uses the links to this space, I
count them up and make a linkset pointing at this imaginary dataset.
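
A minimal sketch of that counting step, assuming each dataset is
available as a list of (subject, predicate, object) triples. The sample
triples, the dataset URIs and the URI chosen for the implied dataset are
all hypothetical; the void: terms are the usual voiD linkset properties.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) triples per dataset.
DATASETS = {
    "http://example.org/ds1": [
        ("http://example.org/issn/1234-5678", "owl:sameAs", "urn:issn:1234-5678"),
        ("http://example.org/issn/0378-5955", "owl:sameAs", "urn:issn:0378-5955"),
        ("http://example.org/issn/1234-5678", "rdf:type", "bibo:Journal"),
    ],
    "http://example.org/ds2": [
        ("http://example.net/j/42", "rdfs:seeAlso", "urn:issn:1234-5678"),
    ],
}

# Assumed identifier for the implied dataset; not an established URI.
IMPLIED = "http://example.org/implied/issn"
PREFIX = "urn:issn:"


def count_links(triples, prefix=PREFIX):
    """Count triples per predicate whose object lies in the identifier space."""
    return Counter(p for _, p, o in triples if o.startswith(prefix))


def linkset_turtle(dataset, predicate, n, target=IMPLIED):
    """Render one void:Linkset description as Turtle text."""
    return (
        "[] a void:Linkset ;\n"
        f"   void:subjectsTarget <{dataset}> ;\n"
        f"   void:objectsTarget <{target}> ;\n"
        f"   void:linkPredicate {predicate} ;\n"
        f"   void:triples {n} ."
    )


# One linkset per (dataset, predicate) pair that points into the space.
for ds, triples in DATASETS.items():
    for predicate, n in count_links(triples).items():
        print(linkset_turtle(ds, predicate, n))
```

Nothing here needs the implied dataset to be enumerated: only the
prefix test decides whether a triple counts as a link into it.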

Obviously the same strategy applies anywhere there exists some kind of
standard identifier that is not an HTTP URI.

Does this make sense?

Can we sensibly talk about and even assert the existence of a dataset
of infinite size? (whatever existence means).

Is this an abuse of DCat/voiD?

Is this class of datasets a subset of sameAs.org (assuming sameAs.org
to be complete in principle)?

Cheers,
-w

[1] http://lists.okfn.org/pipermail/ckan-discuss/2011-May/001269.html
-- 
William Waites                <mailto:w...@styx.org>
http://river.styx.org/ww/        <sip:w...@styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45



Re: implied datasets

2011-05-23 Thread Leigh Dodds
Hi William,

On 23 May 2011 14:01, William Waites <w...@styx.org> wrote:
> ...
> Then for each dataset that I have that uses the links to this space, I
> count them up and make a linkset pointing at this imaginary dataset.
>
> Obviously the same strategy applies anywhere there exists some kind of
> standard identifier that is not an HTTP URI.
>
> Does this make sense?

I'm not sure that the dataset is imaginary, but what you're doing
seems eminently sensible to me. I've been working on a little project
that I hope to release shortly that aims to facilitate this kind of
linking, especially where those non-URI identifiers, or Literal Keys
[1], are used to build patterned URIs.

> Can we sensibly talk about and even assert the existence of a dataset
> of infinite size? (whatever existence means).

I think so, we can assert what kinds of things it contains and
describe it in general terms, even if we can't enumerate all of its
elements.

It may be more natural to think of these as services rather than
datasets, i.e. a service that accepts some keys as input and returns a
set of assertions. In this case the assertions would be links to other
datasets.

> Is this an abuse of DCat/voiD?

Not in my view, I think the notion of dataset is already pretty broad.

> Is this class of datasets a subset of sameAs.org (assuming sameAs.org
> to be complete in principle)?

Subsets if they only asserted sameAs links, but I think you're
suggesting that this may be too strict. I think there's potentially a
whole set of related predicate based services [2] that provide
useful indexes of existing datasets, or expose additional annotations
of extra sources.

The project I've been working on facilitates not just sameAs links,
but any form of links that can be derived from shared URI patterns.
This would include topic/subject based linking. ISBN was one of the use
cases I had in mind, but there are others.

Cheers,

L.

[1]. http://patterns.dataincubator.org/book/literal-keys.html
[2]. http://www.ldodds.com/blog/2010/03/predicate-based-services/

-- 
Leigh Dodds
Programme Manager, Talis Platform
Mobile: 07850 928381
http://kasabi.com
http://talis.com

Talis Systems Ltd
43 Temple Row
Birmingham
B2 5LS



Re: implied datasets

2011-05-23 Thread glenn mcdonald

> Here's why. In library world, perhaps more than elsewhere, it is
> common to do things like this,
>
> <http://example.org/issn/1234-5678> a bibo:Journal;
>     # ... some descriptions ...
>     owl:sameAs <urn:issn:1234-5678>.
>
> This is because there are standard identifiers for lots of things that
> are found in libraries, and they even have a URN namespace. So when
> publishing this data it is a lot easier to use them than to go out and
> use something like Silk to try to find links. The links are already
> implied by the identifiers we have in hand.


It seems to me that this is another demonstration of confusion that wouldn't
happen if we all understood RDF IDs to be pure identifiers that belong to
the graph representation of a dataset and nothing else. ISSN numbers are not
graph-node IDs, they are real-world conceptual identifiers like social
security numbers or SKUs or country codes. Many different data structures
might reference them in very different ways, so it should be fairly clear
that they cannot uniquely identify anything but themselves, and thus they
should themselves be represented in RDF as nodes. So the above should be
more like:

ex:1 a ex:Journal;
  rdfs:label "International Digest of Periodicity";
  ex:issn ex:2;
  ex:blah ex:3.

ex:2 a ex:ISSN;
  rdfs:label "1234-5678";
  ex:journal ex:1.

glenn


Re: implied datasets

2011-05-23 Thread William Waites
* [2011-05-23 11:34:56 -0400] glenn mcdonald <gl...@furia.com> wrote:

] It seems to me that this is another demonstration of confusion that wouldn't
] happen if we all understood RDF IDs to be pure identifiers that belong to
] the graph representation of a dataset and nothing else. ISSN numbers are not
] graph-node IDs, they are real-world conceptual identifiers like social
] security numbers or SKUs or country codes. Many different data structures
] might reference them in very different ways, so it should be fairly clear
] that they cannot uniquely identify anything but themselves, and thus they
] should themselves be represented in RDF as nodes. So the above should be
] more like:

Hi Glenn,

That may be so but it misses the point. The point is there is a field,
be it a URI or a literal however modelled, that can be used to join
between two datasets. This join field is hidden in that there exists
no (known) dataset that contains all possible values it can take on.

So you have a situation when you are trying to describe datasets where
you can say that DS1 and DS2 are indirectly linked and you want to
make that link explicit so that you can put it on diagrams and such.

Saying,

  DS1 indirectlyLinkedTo DS2

is no good because then you get O(n^2) such statements, which makes
your visualisation messy. Furthermore, you don't know without
examining them that they have any common values on the join field, so
they may not actually be linked except in a degenerate sense.

Inventing a dataset that contains only the join field lets you say
something useful and coherent about the relationship between DS1 and
DS2.
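
A small sketch of the difference in statement counts, assuming five
hypothetical datasets and made-up sets of ISSNs for each. It compares
pairwise indirectlyLinkedTo statements with one link per dataset to an
invented join-field dataset.

```python
from itertools import combinations

# Hypothetical datasets and the ISSNs each one mentions.
ISSNS = {
    "DS1": {"1234-5678", "0378-5955"},
    "DS2": {"1234-5678"},
    "DS3": {"2049-3630"},
    "DS4": {"0378-5955", "2049-3630"},
    "DS5": set(),
}


def pairwise_links(datasets):
    """Every unordered pair of datasets: n*(n-1)/2 statements, i.e. O(n^2)."""
    return list(combinations(sorted(datasets), 2))


def hub_links(datasets, hub="ISSN"):
    """One statement per dataset, pointing at the implied dataset: n statements."""
    return [(name, hub) for name in sorted(datasets)]


def share_values(datasets, a, b):
    """A pairwise link only means something if the join field overlaps."""
    return bool(datasets[a] & datasets[b])


print(len(pairwise_links(ISSNS)), "pairwise vs", len(hub_links(ISSNS)), "hub links")
```

The pairwise form also says nothing about overlap: in this made-up data
share_values is true for DS1 and DS2 but false for DS1 and DS3, even
though both pairs would be asserted as indirectly linked.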

There is nothing in this that requires the datasets themselves to be
RDF. See my other post to ckan-discuss on the same topic expressed in
terms of the relationships between CSV files.

Cheers,
-w



Re: implied datasets

2011-05-23 Thread glenn mcdonald

> That may be so but it misses the point. The point is there is a field,
> be it a URI or a literal however modelled, that can be used to join
> between two datasets. This join field is hidden in that there exists
> no (known) dataset that contains all possible values it can take on.


Hmm. I'm still not getting why this is a problem. It seems like as long as
the ISSNs in both datasets are represented by nodes with type-assignments,
all you have to assert is that the two types are equivalent (e.g. same URIs,
or owl:equivalentClass...), and that their rdfs:labels uniquely define them
(e.g. owl:InverseFunctionalProperty...). I don't (yet) see why you need an
imaginary extra dataset in between.


Re: implied datasets

2011-05-23 Thread glenn mcdonald

> If one has one dataset (say) and wants to find other datasets that
> might be usefully combined with it to do some analysis, it would (I
> think) be useful to have something like this to help with the discovery.


OK, but what I'm not seeing is how this extra imaginary dataset helps with
discovery, either. Isn't a type-assertion pretty much exactly what you're
saying here: a statement that this entity belongs to a set (which may or may
not be completely enumerated in any one place, or even at all)? So what is
this implied dataset doing that the type assertions are not?


Re: implied datasets

2011-05-23 Thread Hugh Glaser
I think that this area of useful bridging sets of instance URIs is ripe for 
exploring and exploiting.
I won't go into whether the April Fool's joke of the integers might actually be 
useful (note that dbpedia has quite a lot of URIs for numbers), but there will 
be many other standard URIs for things that we take for granted.
The recent colour ones might seem like a joke as well, but perhaps not?

My favourite at the moment is
http://data.totl.net/chess/state/rnbqkbnr__8_8_8_8__RNBQKBNR_w_KQkq_-_0_1
A very large number of URIs that describe chess positions.
And tells you things like the next legal move in RDF.

So if I had loads of games in RDF, I could reliably do some fun queries about 
games with move sequences, etc.

Seems to me it is very similar to William's requirements.
However, it does it slightly differently, by having resolvable URIs for the 
positions, which can easily go to the more conventional representations.

Is that not a better way of doing what you want, William?
Bring up a simple site that actually has http://example.org/issn/1234-5678 or 
perhaps more appropriately something like
http://totl.net/issn/1234-5678 which actually resolves to some (generated) RDF 
snippet that is sensible.

I keep meaning to build something to do it easily, but keep hoping that Leigh 
will do it first :-)

In fact, when generating RDF from any dataset, in some sense, if you accept
that some of the strings uniquely identify NIRs (non-information resources),
and then generate URIs in more than one context, based on the string, you
are doing exactly the same thing locally.
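
A minimal sketch of generating URIs from such a string, using ISSN as
the key. The http://example.org/issn/ base is the placeholder from this
thread, and the function names are hypothetical; the check-digit rule
(weights 8 down to 2, modulus 11, X for ten) is the standard ISSN one.

```python
import re

# Lexical form of an ISSN: eight characters, optional hyphen, last may be X.
ISSN_RE = re.compile(r"^(\d{4})-?(\d{3})([\dXx])$")


def issn_check_digit(first_seven):
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def normalise_issn(text):
    """Return the canonical NNNN-NNNC form, or None if not lexically an ISSN."""
    m = ISSN_RE.match(text.strip())
    if not m:
        return None
    return f"{m.group(1)}-{m.group(2)}{m.group(3).upper()}"


def is_valid_issn(issn):
    """Check the final character of a normalised ISSN against the check rule."""
    digits = issn.replace("-", "")
    return issn_check_digit(digits[:7]) == digits[7]


def mint_uris(text, base="http://example.org/issn/"):
    """Derive both a patterned HTTP URI and the urn:issn form from one key."""
    issn = normalise_issn(text)
    if issn is None:
        return None
    return {"http": base + issn, "urn": "urn:issn:" + issn}


print(mint_uris("12345678"))
print(is_valid_issn("0378-5955"))
```

Two publishers who apply the same pattern independently arrive at the
same identifiers without ever consulting each other. Note that the
placeholder 1234-5678 is lexically well formed but does not pass the
check-digit test, so whether to mint URIs for such values is a separate
policy decision.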

I guess not very clean, but as you describe it, very practical.
Best
Hugh


-- 
Hugh Glaser,  
  Intelligence, Agents, Multimedia
  School of Electronics and Computer Science,
  University of Southampton,
  Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/





Re: implied datasets

2011-05-23 Thread William Waites
* [2011-05-23 14:46:47 +0100] Leigh Dodds <leigh.do...@talis.com> wrote:

] I'm not sure that the dataset is imaginary, but what you're doing
] seems eminently sensible to me. I've been working on a little project
] that I hope to release shortly that aims to facilitate this kind of
] linking, especially where those non-URI identifiers, or Literal Keys
] [1] are used to build patterned URIs.

The thing is, as with Hugh's suggestion, as a curator of datasets I
have little control or influence over how the dataset authors choose
to do this. I have noticed a common pattern though (urn:issn for
example) and encouraging patterns like this is helpful I think.

] It may be more natural to think of these more as services though than
] datasets. i.e. a service that accepts some keys as input and returns a
] set of assertions. In this case the assertions would be links to other
] datasets.

This is a bit different. I was thinking of an implied dataset that 
would have no links outwards at all. 

] Subsets if they only asserted sameAs links, but I think you're
] suggesting that this may be too strict. I think there's potentially a
] whole set of related predicate based services [2] that provide
] useful indexes of existing datasets, or expose additional annotations
] of extra sources.

So this would be a separation of edge-labelled graphs into a bunch
of perhaps more manageable basic (V,E) graphs. An interesting way
of chopping things up.

The reason I think sameAs is too strict, aside from people putting
sameAs when they really mean similarTo, can be shown by another
library example. Broadly there seem to be two strategies for
representing things like books, the flat BIBO style and the more
elaborate FRBR/WEMI style. So if I have two datasets, one in each,
I might have something like,

ds1:flc a bibo:Book;
  dc:title "The Feynman Lectures on Computation";
  dc:creator [ foaf:name "Richard Feynman" ];
  dc:language "eng";
  owl:sameAs <urn:isbn:0738202967>.

ds2:flc a frbr:Manifestation;
  frbr:manifestationOf [
    a frbr:Expression;
    dc:language "en";
    frbr:expressionOf [
      a frbr:Work;
      dc:title "The Feynman Lectures on Computation";
      dc:creator [ foaf:name "Richard Feynman" ]
    ]
  ];
  owl:sameAs <urn:isbn:0738202967>.

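Both authors have done something prima facie reasonable with the
sameAs, but if you actually run it transitively you get into trouble:
ds1:flc and ds2:flc become the same resource, which is then both a
bibo:Book and a frbr:Manifestation.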
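
A rough sketch of what running it transitively does, using a small
union-find over the two owl:sameAs statements. Only the type assertions
are tracked here, and the function names are hypothetical; a real
reasoner would merge every property of the two resources, not just
their types.

```python
# Type assertions and owl:sameAs links taken from the two descriptions.
TYPES = {
    "ds1:flc": {"bibo:Book"},
    "ds2:flc": {"frbr:Manifestation"},
}
SAME_AS = [
    ("ds1:flc", "urn:isbn:0738202967"),
    ("ds2:flc", "urn:isbn:0738202967"),
]


def same_as_classes(pairs):
    """Group nodes into equivalence classes under symmetric, transitive sameAs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)
    return list(classes.values())


def merged_types(pairs, types):
    """Union the rdf:type assertions of every member of each class."""
    return [
        set().union(*(types.get(node, set()) for node in cls))
        for cls in same_as_classes(pairs)
    ]


print(same_as_classes(SAME_AS))
print(merged_types(SAME_AS, TYPES))
```

With these inputs there is a single equivalence class containing both
ds1:flc and ds2:flc, and its merged type set holds both bibo:Book and
frbr:Manifestation.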

This also goes to what Glenn was saying. These datasets are obviously
related in a meaningful way, there may well be useful ways for someone
who studies them to draw links between them but it isn't as simple as
saying they both have things of the same type. In fact what type
assertions are appropriate to clarify the relationship between these
datasets is the type of analysis that I would want to facilitate, not
try to do up front. What I can say is they both have references (that
may or may not be strictly believable) to this funny
non-dereferenceable URI (or equivalently, string literal of a certain
kind).

Cheers,
-w




Re: data vs. information (Was Re: implied datasets)

2011-05-23 Thread Hugh Glaser
Thanks for all your thoughts William - food for pondering.
A few comments, which I find hard to interleave - sorry.
The totl.net site doesn't have to be hit for the URIs to have value.
dbpedia doesn't even have to exist for the URIs to have value (ducks /) - 
well at least not very often; it may have been down for the last month, for all 
I know, but I have been using the URIs.
They are just there as useful identifiers.
Ah yes, crossref.org; I realise all this is quite controversial in the 
publishing sphere - I think there is a site that does Linked Data to DOI, I 
don't think crossref.org does it, but can't remember which it is.

It is rather strange that we worry about having authoritative or at least 
agreed URIs for hard things like people, but don't manage to have them for less 
complex things (at least in terms of enumeration) such as pantone colours or 
chemical elements, and yes, ISSN.

dbpedia can sort of fit this role, if wikipedia had pages on them, but somehow 
it feels like clear datasets such as this should be sort of taken out.
And of course, when we get to datasets of essentially arbitrary size (IPv6
URIs, anyone? Or even IPv4), we are in a different world of representation
and service.
Best
Hugh

On 23 May 2011, at 23:02, William Waites wrote:

> * [2011-05-23 18:19:49 +0000] Hugh Glaser <h...@ecs.soton.ac.uk> wrote:
>
> ] I won't go into whether the April Fool's joke of the integers might
> ] actually be useful (note that dbpedia has quite a lot of URIs for numbers),
> ] but there will be many other standard URIs for things that we take for
> ] granted.
> ] The recent colour ones might seem like a joke as well, but perhaps not?
>
> I had this a little bit in mind when I wrote the original mail, and this
> goes nicely to some related thoughts about quality.
>
> The thing with the linked open numbers is that it makes the point
> pretty neatly I think that it is silly to try to materialise
> everything that can be stated in RDF. A small computer program that
> describes numbers might have the same information content as all of
> those numbers made manifest. And it would take up a lot less disk
> space and be much faster to query. But you could still use it to refer
> to the numbers when you needed to.
>
> Is this always the case? It seems to be the tradeoff is speed
> vs. space. For some aspects of numbers this makes sense (e.g. their
> representation in roman numerals) but what about computationally
> expensive things like their prime factors? This quickly becomes too
> expensive to calculate on the fly but actually a lookup service could
> make a certain amount of sense...
>
> ] My favourite at the moment is
> ] http://data.totl.net/chess/state/rnbqkbnr__8_8_8_8__RNBQKBNR_w_KQkq_-_0_1
> ] A very large number of URIs that describe chess positions.
> ] And tells you things like the next legal move in RDF.
> ]
> ] So if I had loads of games in RDF, I could reliably do some fun queries
> ] about games with move sequences, etc.
> ]
> ] Seems to me it is very similar to William's requirements.
>
> Oh, that's beautiful.
>
> ] However, it does it slightly differently, by having resolvable URIs for the
> ] positions, which can easily go to the more conventional representations.
>
> And this works because the service has a compact representation of the
> space of all possible positions and moves, a small computer
> program. You can then materialise the small subspace that you're
> interested in and run some analysis on it.
>
> It's nice that totl wants to run that program for me but I guess they
> could just as easily give me the program and let me run it myself. Bit
> for bit they would have given me far more information and far less
> data. But then it might be more convenient for me to use their service
> if I only have a relatively small number of positions/moves to
> consider. Might the service be useful for a program to help study how
> to play chess? Quite possibly. Would it make sense to build a
> chess-playing computer on top of their service? It would be
> interesting to see but I suspect network traffic and delays would be
> prohibitive.
>
> It's the same story with the trend of taking CSV files, a pretty
> compact and easy to work with representation of tabular data, and
> expanding them into giant RDF datasets that take up a lot of disk
> space and are cumbersome to query. A service to refer to a cell in a
> spreadsheet, to give it a URI and return some small amount of data
> would be useful. Proactively materialising the whole thing (not
> infinite but in some cases still very large) is probably not.
>
> ] Is that not a better way of doing what you want, William?
> ] Bring up a simple site that actually has http://example.org/issn/1234-5678
> ] or perhaps more appropriately something like http://totl.net/issn/1234-5678
> ] which actually resolves to some (generated) RDF snippet that is
> ] sensible.
>
> So quite reasonable, and I believe but am not certain that
> crossref.org has already done exactly this for ISSNs (but that