On 12 Sep 2014, at 01:26, Kingsley Idehen <kide...@openlinksw.com> wrote:

> Good place to report these matters. Bottom line, the New York Times Linked 
> Data is problematic. They should be using foaf:focus where they currently use 
> owl:sameAs.
> 
> I fixed this in the last DBpedia instance, via SPARQL 1.1 forward-chaining. 
> I guess I need to make time to repeat the fix.
> 
> DBpedia Team: we need to perform this step next time around, if the New York 
> Times refuses to make this important correction.
> 
> Alternatively, you can make a fixed dump too. Either way, this is a problem 
> that we should fix.


I think it's a better idea to fix this in the dumps than only on one endpoint.

I assume the wrong info is coming from the nytimes_links.nt.gz dump file (9678 
lines).
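
For the foaf:focus fix Kingsley mentions, rewriting the dump itself could be as 
simple as a predicate swap. A minimal sketch (untested, assuming the predicate 
only ever appears as the full owl#sameAs URI in this file; 
nytimes_links_focus.nt.gz is just a made-up output name):

$ zcat nytimes_links.nt.gz \
    | sed 's|<http://www.w3.org/2002/07/owl#sameAs>|<http://xmlns.com/foaf/0.1/focus>|' \
    | gzip > nytimes_links_focus.nt.gz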

These are the data.nytimes.com URIs that occur twice, each linking two 
different things with owl:sameAs:
(I know it's a bit dirty, but the data.nytimes.com URIs are shorter than 47 
chars, and the 2nd column is long enough that the 47 char comparison width 
never reaches into the 3rd column):
$ zcat nytimes_links.nt.gz | sort | uniq -D -w 47 | less
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Harlem> .
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Woods_Hole,_Massachusetts> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Colombia> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/St._Louis,_Missouri> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bar_Harbor,_Maine> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Vancouver> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Montenegro> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Orlando,_Florida> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ann_Arbor,_Michigan> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bucharest> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Brisbane> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Timbuktu> .
...
1102 lines

File here (18 KB):  http://www.dfki.de/~hees/nytimes_links_dups.nt.gz

Did some quick stats: each of those URIs links exactly 2 things, so we have 551 
of them that are problematic:

$ zcat nytimes_links.nt.gz | sort | uniq | cut -d' ' -f1 | sort | uniq -d | less
<http://data.nytimes.com/10037152102685288131>
<http://data.nytimes.com/10219323006478270621>
<http://data.nytimes.com/10943489202025116191>
<http://data.nytimes.com/11974025787996384181>
<http://data.nytimes.com/13330280224726436521>
<http://data.nytimes.com/14192138827082289301>
...
551 lines
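
To double check the "exactly 2" part, one can eyeball the per-subject counts 
directly (a quick sketch, assuming GNU coreutils; the top of the output should 
show a maximum count of 2):

$ zcat nytimes_links.nt.gz | sort -u | cut -d' ' -f1 | sort | uniq -c | sort -rn | head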


This only leaves the lines whose subject prefix is not duplicated:
$ zcat nytimes_links.nt.gz | sort | uniq -u -w 47 | less
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10028178420088332933> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/F._Lee_Bailey> .
<http://data.nytimes.com/10040729966879859333> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Grace_Paley> .
<http://data.nytimes.com/10054942171853816843> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Jesse_McKinley> .
...
8576 lines


File here (194 KB): http://www.dfki.de/~hees/nytimes_links_dups_pruned.nt.gz 
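
In case anyone prefers to regenerate that file rather than download it, it 
should simply be the output of the uniq -u command above, re-compressed 
(untested sketch):

$ zcat nytimes_links.nt.gz | sort | uniq -u -w 47 | gzip > nytimes_links_dups_pruned.nt.gz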


I'm not sure about the rest of that file though, given that more than a tenth 
of it (1102 of 9678 lines) was obviously wrong...


Cheers,
Jörn

