On 9/12/14 4:58 PM, Jörn Hees wrote:
On 12 Sep 2014, at 01:26, Kingsley Idehen <kide...@openlinksw.com> wrote:

Good place to report these matters. Bottom line, the New York Times Linked Data is problematic. They should be using foaf:focus where they currently use owl:sameAs. I know I fixed this in the last DBpedia instance, via SPARQL 1.1 forward-chaining. I guess I need to make time to repeat the fix.

DBpedia Team: we need to perform this step next time around if the New York Times refuses to make this important correction. Alternatively, you can fix the dump too. Either way, this is a problem that we should fix.

I think it's a better idea to fix this in the dumps than only on one endpoint.

Of course. My point is that once it's fixed in the Virtuoso DBMS behind the endpoint, we then make a dump, which becomes the replacement dataset for future efforts.

Links:

[1] http://kingsley.idehen.net/public_home/kidehen/Public/SPARQL-CRUD/nyt_dbpedia_mappings_fix.rq -- SPARQL 1.1 fix
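
For the dump-side variant of this fix, here is a minimal sketch of the predicate swap. It is not the SPARQL 1.1 fix linked above. It assumes the dump is plain N-Triples with one triple per line and single spaces between terms, as in the nytimes_links.nt.gz excerpts in this thread. The function name rewrite_sameas_to_focus and the output file name in the comment are hypothetical; http://xmlns.com/foaf/0.1/focus is the full IRI behind foaf:focus.

```shell
# Hypothetical helper: reads N-Triples on stdin and rewrites
#   <data.nytimes.com subject> owl:sameAs <object> .
# to
#   <data.nytimes.com subject> foaf:focus <object> .
# Lines that do not match the pattern pass through unchanged.
rewrite_sameas_to_focus() {
  sed 's|^\(<http://data\.nytimes\.com/[^>]*>\) <http://www\.w3\.org/2002/07/owl#sameAs> |\1 <http://xmlns.com/foaf/0.1/focus> |'
}

# Possible usage (output file name is an assumption):
#   zcat nytimes_links.nt.gz | rewrite_sameas_to_focus | gzip > nytimes_links_focus.nt.gz
```

Note that this only changes the predicate. It does not decide which of two conflicting targets for the same data.nytimes.com URI is the right one.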
Kingsley

I assume the wrong info is coming from the nytimes_links.nt.gz dump file (9678 lines).

These are the data.nytimes.com URIs that occur twice and link various wrong things with owl:sameAs. (I know it's a bit dirty, but the data.nytimes.com URIs are shorter than that, and the 2nd column is long enough that the 47-char width never reaches the 3rd column.)

$ zcat nytimes_links.nt.gz | sort | uniq -D -w 47 | less
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Harlem> .
<http://data.nytimes.com/10037152102685288131> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Woods_Hole,_Massachusetts> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Colombia> .
<http://data.nytimes.com/10219323006478270621> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/St._Louis,_Missouri> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bar_Harbor,_Maine> .
<http://data.nytimes.com/10943489202025116191> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Vancouver> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Montenegro> .
<http://data.nytimes.com/11974025787996384181> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Orlando,_Florida> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Ann_Arbor,_Michigan> .
<http://data.nytimes.com/13330280224726436521> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Bucharest> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Brisbane> .
<http://data.nytimes.com/14192138827082289301> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Timbuktu> .
...
1102 lines

File here (18 KB): http://www.dfki.de/~hees/nytimes_links_dups.nt.gz

Did some quick stats: each of those URIs links exactly 2 things, so we have 551 of them which are problematic:

$ zcat nytimes_links.nt.gz | sort | uniq | cut -d' ' -f1 | sort | uniq -d | less
<http://data.nytimes.com/10037152102685288131>
<http://data.nytimes.com/10219323006478270621>
<http://data.nytimes.com/10943489202025116191>
<http://data.nytimes.com/11974025787996384181>
<http://data.nytimes.com/13330280224726436521>
<http://data.nytimes.com/14192138827082289301>
...
551 lines

This leaves only the lines without a duplicated prefix:

$ zcat nytimes_links.nt.gz | sort | uniq -u -w 47 | less
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10014285150226506373> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Shane_Mosley> .
<http://data.nytimes.com/10028178420088332933> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/F._Lee_Bailey> .
<http://data.nytimes.com/10040729966879859333> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Grace_Paley> .
<http://data.nytimes.com/10054942171853816843> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Jesse_McKinley> .
...
8576 lines

File here (194 KB): http://www.dfki.de/~hees/nytimes_links_dups_pruned.nt.gz

I'm not sure about the rest of that file though, given that roughly 1/10th of it was obviously wrong...

Cheers,
Jörn
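
The duplicate check above relies on a fixed 47-character comparison width. A minimal sketch of a width-independent variant follows, assuming N-Triples input with one triple per line and the subject as the first space-separated field. dup_subjects is a hypothetical helper name, not something from the original pipeline.

```shell
# Hypothetical helper: reads N-Triples on stdin and prints every subject
# that appears on more than one distinct line (exact duplicate lines are
# collapsed first, so a repeated identical triple does not count).
dup_subjects() {
  sort -u |
    awk '{ n[$1]++ } END { for (s in n) if (n[s] > 1) print s }' |
    sort
}

# Possible usage:
#   zcat nytimes_links.nt.gz | dup_subjects | wc -l
```

Because it keys on the first field instead of a character count, this should keep working if the subject URIs change length; the resulting count can be compared against the figure reported above.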

--
Regards,

Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog 1: http://kidehen.blogspot.com
Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
Twitter Profile: https://twitter.com/kidehen
Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion