Re: Propagation of bad sameAs statements

2010-09-09 Thread Hugh Glaser
Hi,
Thank you for your interest.
Here are some sort of answers to this and other questions.
In fact, this has become something of a dialogue with myself :-)

sameas.org does not itself do any interesting inference, other than
A sameas B & B sameas C => A sameas C when asked about A.
It aims to gather equivalence information from existing sources and service
the results in a convenient (single) place.
(It also aims to address the problem of owl:sameAs being a pairwise
statement, which gives an unpleasant explosion (n**2) of statements for
groups of equivalences, which can be quite hard to handle.)
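
As a rough illustration of the two points above (the transitive grouping and the n**2 growth), here is a minimal sketch in Python. It is not the sameas.org implementation, which is not described in this thread; the union-find approach and the names bundles and pairwise_count are assumptions made for the example.

```python
from collections import defaultdict


def bundles(pairs):
    """Group URIs into equivalence bundles from pairwise sameAs statements.

    Uses a simple union-find; this is an illustrative technique, not
    necessarily what sameas.org does internally.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for uri in list(parent):
        groups[find(uri)].add(uri)
    return list(groups.values())


def pairwise_count(n):
    """Directed sameAs statements needed to spell out a bundle of n URIs."""
    return n * (n - 1)


# A sameas B and B sameas C put A, B and C in one bundle.
pairs = [("A", "B"), ("B", "C"), ("D", "E")]
result = bundles(pairs)
```

Storing one bundle per group keeps the data linear in the number of URIs, whereas writing every pair out explicitly grows as pairwise_count(n).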

Who chooses what data is acceptable?
Er, me.
I look at it and decide.

Is it a spider (people sometimes ask this)?
No - when I am bored with the other things I am doing I add more to it, by
downloading dumps or querying SPARQL endpoints, often as a result of
messages on this and other lists.

Is owl:sameAs the only predicate recognised?
As you have worked out, no.
It is a service giving equivalent URIs, and one of the formats you can get
back is owl:sameAs. But you can get other formats if you want. So the inputs
include things like skos:exactMatch and skos:closeMatch (as I recall).
And we could output other formats such as these if asked.
At the moment we only do rdf+xml, text/n3, application/json, text/plain, see
http://www.sameas.org/about.php.
What has now been noticed is that I decided that dbpedia redirects should be
treated as equivalent.
The reason I did this is that it meant that a lot of expected URIs now
worked.
Eg http://dbpedia.org/resource/UN/LOCODE:GBLON and
even http://dbpedia.org/resource/Capital_of_the_UK get to
http://data.ordnancesurvey.co.uk/id/70041428 and
http://statistics.data.gov.uk/id/eer/07.
The downside is that there is quite a lot of cruft in the redirects, and so
some strange things happen (as has been observed).

Do I know about errors in sameas.org?
Yes.
I like the Iron Maiden one to opencyc, for example.
But I don't aim to correct these, any more than Google aims to correct
things it links to.

Why such a liberal attitude to equivalence?
I eventually worked out that sameas.org was a discovery service.
We have other sameas services, called crs services, on our systems (eg
http://opencyc.rkbexplorer.com/crs/ is an external one) which are
definitional (I hesitate to use a word like authoritative, with all its
other connotations).
And so in that vein, I have cast the net wider for sameas.org.
This was the case early in its life, as the wordnet equivalence to dbpedia
is in fact the equivalence of the word to the thing, which is wrong at
some/any level.
But I have taken the view that people/agents that come to sameas.org are
looking for things, and might not care about such subtleties, not least
because they may not have understood them when they constructed their RDF.

If I had the time/funding, I would provide other services that took
different views of equivalence, in terms of discovery/definitional or
liberal/conservative (precision/recall is another way of saying that).

Mind you it is probably the case that the sameas.org data is no worse than a
lot of the data in the LOD diagram, in terms of reliably identifying
resources, as I have rejected a bunch of them as being substandard.

On 08/09/2010 15:42, joel sachs jsa...@csee.umbc.edu wrote:

 
...
 So, a request for the sameas.org folks: Would it be possible to include a
 provenance column for all sameAs assertions you keep track of?  In cases
 where the sameAs assertion isn't actually asserted on the web, you could
 indicate the provenance as inferred in the provenance column. Also, have
 you published the heuristics you use (if any) to infer sameAs relations?
 
...
 
 Thanks!
 Joel.
 
 
 
So finally getting round to your specific question (although hopefully the
other stuff has also helped).
It would be hard to provide the extra column for quite a few reasons.
We do know where we got the data from, but it may be a SPARQL endpoint, a
dump downloaded, or an email sent to me, for example. So it would not be
very easy to interpret.
But only a small number of the pairs would be so identified, as all the rest
are inferred from the other pairwise assertions.
We do actually have our own visualisation tools for bundles, with
assertions and dates, etc, but the tool is hard to read if you don't know
what is happening, and...
1) Finding the resources to make it more accessible would be hard.
sameas.org has effectively never been funded - it is my hobby with Ian
Millard, and we would love to have the resources to do this sort of stuff.
I actually have plans for a more sophisticated architecture behind
sameas.org which would facilitate this and a lot of other stuff, but again
it is a question of resources.

2) What is the Ontology?
A big question with giving more information is, what is the ontology?
We live in the Linked Data world (for sameas.org), and machine-interpretable
structures.
So sameas.org is designed to be used by services, and 

Re: Propagation of bad sameAs statements

2010-09-09 Thread Juan Sequeda
Hugh,

Great to understand how this all works. I'm now expecting somebody to take
all these sameAs links and run some type of page rank algorithm and rank
what actually is sameAs.

Cheers

Juan Sequeda
+1-575-SEQ-UEDA
www.juansequeda.com



Propagation of bad sameAs statements

2010-09-08 Thread joel sachs
I'd like to catalog sources of biodiversity information and misinformation 
on the semantic web, and am trying to determine the genesis of some 
unfortunate owl:sameAs statements.


According to sameas.org:

http://dbpedia.org/resource/Invasive_species
   owl:sameAs
  http://dbpedia.org/resource/Invasive_plant
  http://dbpedia.org/resource/Invasive_animal
  http://dbpedia.org/resource/Invasive_organism
  http://rdf.freebase.com/ns/guid.9202a8c04000641f8007de24
 (many other concepts)

Checking out the dbpedia resources that are the objects of the sameAs 
assertions, we see that each redirects to
http://dbpedia.org/resource/Invasive_species. But other than 
dbpedia:Invasive_species including a sameAs link to 
freebase:Invasive_species, no dbpedia page, afaict, makes the sameAs 
assertions listed above.


However, http://rdf.freebase.com/rdf/guid.9202a8c04000641f8007de24 
does assert:


http://rdf.freebase.com/ns/guid.9202a8c04000641f8007de24
   owl:sameAs
  http://dbpedia.org/resource/Invasive_species
  http://dbpedia.org/resource/Invasive_plant
  http://dbpedia.org/resource/Invasive_organism
  http://dbpedia.org/resource/Invasive_animal
  etc.


The direction of propagation is not explicit. One possibility is that 
sameas.org is inferring that A sameAs B based on A redirects to B, and 
that these assertions are making their way into freebase. Another is that 
a freebase contributor is making the sameas inferences, and that they are 
being picked up by sameas.org. (Similar cycles of sameAs can be found for 
habitat, introduced_species, and many other concepts.)


So, a request for the sameas.org folks: Would it be possible to include a 
provenance column for all sameAs assertions you keep track of?  In cases 
where the sameAs assertion isn't actually asserted on the web, you could 
indicate the provenance as inferred in the provenance column. Also, have 
you published the heuristics you use (if any) to infer sameAs relations?
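
To make the request concrete, here is a minimal sketch in Python of what such a provenance column might look like. Everything in it is hypothetical: the prefixed names, the source labels ("dbpedia page", "dbpedia redirect") and the function provenance_table are invented for illustration and do not describe how sameas.org stores its data.

```python
from itertools import combinations

# Hypothetical asserted pairs: (uri_a, uri_b, source).
# The source labels are placeholders, not real sameas.org metadata.
asserted = [
    ("dbpedia:Invasive_species",
     "freebase:guid.9202a8c04000641f8007de24",
     "dbpedia page"),
    ("dbpedia:Invasive_plant",
     "dbpedia:Invasive_species",
     "dbpedia redirect"),
]


def provenance_table(asserted, bundle):
    """Return (a, b, provenance) rows for every unordered pair in a bundle.

    Pairs that were asserted somewhere carry their source; all other
    pairs are marked "inferred".
    """
    known = {frozenset((a, b)): source for a, b, source in asserted}
    rows = []
    for a, b in combinations(sorted(bundle), 2):
        rows.append((a, b, known.get(frozenset((a, b)), "inferred")))
    return rows


bundle = {
    "dbpedia:Invasive_species",
    "dbpedia:Invasive_plant",
    "freebase:guid.9202a8c04000641f8007de24",
}
rows = provenance_table(asserted, bundle)
```

With a column like this, a reader could tell which pairs were actually asserted on the web and which only follow from other pairs.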


And questions for freebase contributors: Are any of you running a script 
that either a) loads in assertions from sameas.org, or b) deduces sameAs 
relations from DBpedia redirection behaviour?


Thanks!
Joel.





Re: [Freebase-discuss] Propagation of bad sameAs statements

2010-09-08 Thread Philip Kendall
[ Crossposting. Apologies for the duplicate. ]

- Forwarded message from Philip Kendall 
philip-freeb...@shadowmagic.org.uk -

From: Philip Kendall philip-freeb...@shadowmagic.org.uk
To: freebase-disc...@freebase.com
Subject: Re: [Freebase-discuss] Propagation of bad sameAs statements
Date: Wed, 8 Sep 2010 15:55:31 +0100

On Wed, Sep 08, 2010 at 10:42:45AM -0400, joel sachs wrote:
 
 And questions for freebase contributors: Are any of you running a script 
 that either a) loads in assertions from sameas.org, or b) deduces sameAs 
 relations from DBpedia redirection behaviour?

Essentially, (b) - they're deduced from Wikipedia rather than DBpedia,
but it comes down to the same thing.

I agree with you that it's the wrong thing to do - hopefully one of the
Freebase Data Team will be along to explain why they do it.

Cheers,

Phil

-- 
  Philip Kendall phi...@shadowmagic.org.uk
  http://www.shadowmagic.org.uk/

- End forwarded message -





Re: [Freebase-discuss] Propagation of bad sameAs statements

2010-09-08 Thread Tom Morris
On Wed, Sep 8, 2010 at 10:59 AM, Philip Kendall
philip-freeb...@shadowmagic.org.uk wrote:
 [ Crossposting. Apologies for the duplicate. ]

 - Forwarded message from Philip Kendall 
 philip-freeb...@shadowmagic.org.uk -

 From: Philip Kendall philip-freeb...@shadowmagic.org.uk
 To: freebase-disc...@freebase.com
 Subject: Re: [Freebase-discuss] Propagation of bad sameAs statements
 Date: Wed, 8 Sep 2010 15:55:31 +0100

 On Wed, Sep 08, 2010 at 10:42:45AM -0400, joel sachs wrote:

 And questions for freebase contributors: Are any of you running a script
 that either a) loads in assertions from sameas.org, or b) deduces sameAs
 relations from DBpedia redirection behaviour?

 Essentially, (b) - they're deduced from Wikipedia rather than DBpedia,
 but it comes down to the same thing.

 I agree with you that it's the wrong thing to do - hopefully one of the
 Freebase Data Team will be along to explain why they do it.

It may be to allow any URL that refers (or referred) to a Wikipedia
page to be mechanically transformed into a valid Freebase URL, but
Wikipedia redirects are a mishmash of valid alternative names,
misspellings, and names of completely separate concepts which were
merged because they weren't big or significant enough to warrant their
own Wikipedia page.

I agree that it would be much better to have a single sameAs between
the concepts and to keep the information from the redirects as
alternate labels (if at all).
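
A minimal sketch of that alternative, in Python, might look like the following. It emits one owl:sameAs triple to the canonical DBpedia URI and turns each redirect into a label. The choice of skos:altLabel for the labels, and the helper names label_from_uri and link_triples, are assumptions for the example rather than anything Freebase or DBpedia does.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_ALT_LABEL = "http://www.w3.org/2004/02/skos/core#altLabel"
DBPEDIA = "http://dbpedia.org/resource/"


def label_from_uri(uri):
    """Derive a plain label from a DBpedia resource URI.

    Simplistic: it ignores percent-encoding and literal escaping,
    which real data would need.
    """
    return uri[len(DBPEDIA):].replace("_", " ")


def link_triples(freebase_uri, canonical, redirects):
    """Build N-Triples lines: one sameAs link plus one label per redirect."""
    lines = [f"<{freebase_uri}> <{OWL_SAME_AS}> <{canonical}> ."]
    for redirect in redirects:
        label = label_from_uri(redirect)
        lines.append(f'<{freebase_uri}> <{SKOS_ALT_LABEL}> "{label}" .')
    return lines


triples = link_triples(
    "http://rdf.freebase.com/ns/guid.9202a8c04000641f8007de24",
    DBPEDIA + "Invasive_species",
    [
        DBPEDIA + "Invasive_plant",
        DBPEDIA + "Invasive_animal",
        DBPEDIA + "Invasive_organism",
    ],
)
```

This keeps the redirect names available for lookup without claiming that each redirect target is the same thing as the Freebase topic.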

Speaking of DBpedia/Freebase sameAs links, the DBpedia side of things
shouldn't be using internal Freebase GUIDs.  They should either be
using the standard IDs or, preferably, the relatively new MIDs i.e.
one of the following:

  http://rdf.freebase.com/rdf/m.0hrk4
  http://rdf.freebase.com/rdf/en.invasive_species

As an aside, Freebase should also be using owl:sameAs to link these
alternate identities together.

Tom