[I've pulled out the stuff missing from the canonicalized datasets according to Dimitris.]
> What is missing from canonicalized datasets is mainly:
> - redirects (~7.7M resources)

I'm not sure what information these add. It might have been that they give new
aliases, but many redirects are not aliases, being instead related topics.

> - the new nlp/nif datasets

As far as I can tell, these relate terms to text from Wikipedia, and thus are
not directly domain information.

> - most of categories-based datasets (last time I checked, most categories
>   do not have a wikidata item) (total ~1.7M skos concepts)

The Wikipedia categories have some information, but I'm willing to forgo it.
There is also the big problem of whether the relationship between a category
and its elements is one of instance or subclass.

> - extracted template metadata

I don't think that this is domain information, except perhaps as provenance.

> Also, as you noted some infoboxes produce multiple intermediate resources
> that do not have a wikidata item (~1.7M resources for English)

This is a potential problem. I thought that the number of these things was
quite small, coming only from pages that had multiple infoboxes. What else can
cause these?

So in the end it looks as if I have picked up more or less what I needed, but
it took a lot of analysis that I had been hoping wasn't needed.

peter

On 08/30/2017 11:40 PM, Dimitris Kontokostas wrote:
> Hello Peter,
>
> some minor comments inline
>
> On Thu, Aug 31, 2017 at 3:39 AM, Peter F. Patel-Schneider
> <pfpschnei...@gmail.com> wrote:
>
>     Well what I was trying to do was to figure out just which of the DBpedia
>     files I need to combine to get a maximal set of useful high-quality data.
>
>     I had thought that this should be easy. However, it is not.
>
>     First there is the problem of getting the file table in the dataset
>     section to show up at all.
>
>     There is also the question of whether to look in the core directory or
>     the core-i18n directory. I guess that the core-i18n directory is the
>     place to look, because the files in the dataset section of
>     http://wiki.dbpedia.org/downloads-2016-10 are all from there.
>
>     Then there is the question of whether to use the canonicalized names or
>     the localized names. There are warnings that the files using
>     canonicalized names may be missing some information. But how much
>     information is missing? Every useful Wikipedia page has a Wikidata item
>     for it, so it seems at first that there are no missing Wikipedia items.
>     But then I remembered that pages with multiple mapped infoboxes will
>     produce multiple DBpedia items, so I guess that these are not present.
>     But how many of these are there? My guess is not many, and the benefits
>     of the canonicalized names outweigh the effect of missing some
>     information.
>
> What is missing from canonicalized datasets is mainly:
> - redirects (~7.7M resources)
> - the new nlp/nif datasets
> - most of categories-based datasets (last time I checked, most categories
>   do not have a wikidata item) (total ~1.7M skos concepts)
> - extracted template metadata
>
> Also, as you noted some infoboxes produce multiple intermediate resources
> that do not have a wikidata item (~1.7M resources for English)
>
> I think I missed a few, but imho these are the most important
>
>     Then there is the question of whether simple or commons is the way to
>     go, as one of them might have been in the past. I guess not, because the
>     canonicalized names provide better integration.
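[One rough way to gauge how much the canonicalized files drop, relating to the
question quoted above, is to compare the distinct subjects of a localized dump
with those of its canonicalized counterpart. A minimal sketch in Python,
assuming the dumps are serialized one triple per line as the DBpedia dumps
generally are; the file names are illustrative, not verified:]

    import bz2
    import re

    SUBJECT = re.compile(rb'^<([^>]+)>')  # first IRI on a line, i.e. the subject

    def subjects(path):
        # Collect the distinct subject IRIs of a dump, assuming one triple per line.
        seen = set()
        with bz2.open(path) as f:
            for line in f:
                m = SUBJECT.match(line)
                if m:
                    seen.add(m.group(1))
        return seen

    # Illustrative file names following the 2016-10 naming; adjust to what was downloaded.
    localized = subjects("mappingbased_objects_en.ttl.bz2")
    canonical = subjects("mappingbased_objects_wkd_uris_en.ttl.bz2")
    print(len(localized), "distinct localized subjects")
    print(len(canonical), "distinct canonicalized subjects")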
>     Then there is the question of whether to use only mapping-based
>     information or to include other information. As I'm interested in
>     high-quality information, I chose mapping-based information only. Then
>     there is the question of how to get all the mapping-based information.
>     My guess is that I need "Mappingbased Literals" and "Mappingbased
>     Objects", which should be adequate to pick up all the non-instance
>     triples based on their descriptions. However, I guess that I also need
>     "Geo Coordinates Mappingbased" but that I don't need "Specific
>     Mappingbased Properties". Then I guess that I also need "Instance Types"
>     and "Instance Types Transitive". I also want labels of the information,
>     so I guess I need "Labels" and labels_nmw, wherever that is.
>
>     Then there is the question of which languages to include. My guess is
>     all of them, as I'm using the canonicalized names and the mapping-based
>     results, so everything should combine together correctly. If I get some
>     duplicates (e.g., from labels) that should be benign.
>
> Note that besides label duplicates you will most probably get a lot of other
> duplication that you need to deal with,
> e.g. different birthdates between DBpedia EL, NL & Wikidata:
> https://gist.github.com/jimkont/01f6add8527939c39192bcb3f840eca0
>
> The DBpedia team is working on a fused DBpedia version that will try to
> consolidate these differences.
>
>     So I tried
>       wget -nc -r -np --cut-dirs=3 -A \
>       "*mappingbased*_wkd_*.ttl.bz2","instance_types*wkd_uris_*.ttl.bz2","labels_wkd_uris_*.ttl.bz2","*labels_nmw_*.ttl.bz2" \
>       http://downloads.dbpedia.org/2016-10/core-i18n/
>     which seems to do the trick, but I'm not very confident that I have
>     downloaded everything I need.
>
> Looks like a good pick, but high quality is fitness for use; e.g., as you
> said, some people would use the geo data as well.
> What you might also consider is the `homepages` and `images` datasets, if
> you are interested in those.
>
> In addition to that, the dataid metadata contain the suggested named graphs
> each dataset (in each language) should go into, but this is more aligned
> with the data that is loaded in the main endpoint.
>
> Best,
> Dimitris
>
> --
> Kontokostas Dimitris
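[A minimal sketch of the kind of cross-language duplication Dimitris points
to, once the canonicalized mapping-based literals files are downloaded:
collect the dbo:birthDate values seen for each canonical subject across
languages and report subjects with more than one distinct value. The glob
pattern is an assumption about where wget left the files, and the dumps are
again assumed to be one triple per line:]

    import bz2
    import glob
    import re
    from collections import defaultdict

    TRIPLE = re.compile(rb'^<([^>]+)> <([^>]+)> (.+) \.\s*$')
    BIRTH_DATE = b'http://dbpedia.org/ontology/birthDate'

    # Map each canonicalized subject to the set of birth-date objects seen for it.
    dates = defaultdict(set)
    for path in glob.glob('**/mappingbased_literals_wkd_uris_*.ttl.bz2', recursive=True):
        with bz2.open(path) as f:
            for line in f:
                m = TRIPLE.match(line)
                if m and m.group(2) == BIRTH_DATE:
                    dates[m.group(1)].add(m.group(3))

    conflicts = {s: v for s, v in dates.items() if len(v) > 1}
    print(len(conflicts), 'subjects with more than one distinct dbo:birthDate')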