On 12 Sep 2014, at 01:26, Kingsley Idehen <kide...@openlinksw.com> wrote:
> Good place to report these matters. Bottom line, the New York Times
> Linked Data is problematic. They should be using foaf:focus where they
> currently use owl:sameAs.
>
> I kind of fixed this in the last DBpedia instance, via SPARQL 1.1
> forward-chaining. I guess I need to make time to repeat the fix.
>
> DBpedia Team: we need to perform this step next time around, if the
> New York Times refuse to make this important correction.
>
> Alternatively, you can fix the dump too. Either way, this is a problem
> that we should fix.

I think it's a better idea to fix this in the dumps than only on one
endpoint. I assume the wrong info is coming from the nytimes_links.nt.gz
dump file (9678 lines).

These are the doubly occurring data.nytimes.com URIs which link various
wrong things with owl:sameAs (I know it's a bit dirty, but the
data.nytimes.com URIs are shorter than that, and the 2nd column is long
enough that the 47 char width never reaches the 3rd column):

$ zcat nytimes_links.nt.gz | sort | uniq -D -w 47 | less
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Harlem> .
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Woods_Hole,_Massachusetts> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Colombia> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/St._Louis,_Missouri> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bar_Harbor,_Maine> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Vancouver> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Montenegro> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Orlando,_Florida> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ann_Arbor,_Michigan> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bucharest> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Brisbane> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Timbuktu> .
...
1102 lines

File here (18 KB): http://www.dfki.de/~hees/nytimes_links_dups.nt.gz

Did some quick stats: each of those URIs links exactly 2 things, so we
have 551 of them which are problematic:

$ zcat nytimes_links.nt.gz | sort | uniq | cut -d' ' -f1 | sort | uniq -d | less
<http://data.nytimes.com/10037152102685288131>
<http://data.nytimes.com/10219323006478270621>
<http://data.nytimes.com/10943489202025116191>
<http://data.nytimes.com/11974025787996384181>
<http://data.nytimes.com/13330280224726436521>
<http://data.nytimes.com/14192138827082289301>
...
551 lines

This only leaves the lines without the duplicate prefix:

$ zcat nytimes_links.nt.gz | sort | uniq -u -w 47 | less
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10028178420088332933> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/F._Lee_Bailey> .
<http://data.nytimes.com/10040729966879859333> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Grace_Paley> .
<http://data.nytimes.com/10054942171853816843> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Jesse_McKinley> .
...
8576 lines

File here (194 KB): http://www.dfki.de/~hees/nytimes_links_dups_pruned.nt.gz

I'm not sure about the rest of that file though, given that nearly
1/10th of it was obviously wrong...

Cheers,
Jörn

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
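P.S. For anyone who wants to repeat this against a future dump, here is a
sketch of how the two derived files above could be regenerated (assuming
GNU coreutils and gzip). For illustration it first builds a three-triple
toy dump; with the real nytimes_links.nt.gz in the current directory you
would skip that step. The file name nytimes_dup_subjects.txt is my own,
not part of the NYT data.

```shell
# Build a toy dump with two triples sharing one subject plus one clean
# triple (skip this with the real nytimes_links.nt.gz present).
printf '%s\n' \
  '<http://data.nytimes.com/10000000000000000001> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/A> .' \
  '<http://data.nytimes.com/10000000000000000001> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/B> .' \
  '<http://data.nytimes.com/10000000000000000002> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/C> .' \
  | gzip > nytimes_links.nt.gz

# All triples whose subject occurs more than once -- the suspect links.
# -w 47 compares only the first 47 chars: the 46-char subject URI plus
# the following space, so lines are keyed on the subject alone.
zcat nytimes_links.nt.gz | sort | uniq -D -w 47 \
  | gzip > nytimes_links_dups.nt.gz

# The remaining triples, whose subject occurs exactly once.
zcat nytimes_links.nt.gz | sort | uniq -u -w 47 \
  | gzip > nytimes_links_dups_pruned.nt.gz

# Just the problematic subject URIs, one per line.
zcat nytimes_links_dups.nt.gz | cut -d' ' -f1 | sort -u \
  > nytimes_dup_subjects.txt

zcat nytimes_links_dups.nt.gz | wc -l          # 2 suspect triples
zcat nytimes_links_dups_pruned.nt.gz | wc -l   # 1 clean triple
```

The subject list could then also be fed to grep -v -F -f to strip the
bad subjects out of other dumps that reference the same URIs.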