On 21 April 2014 12:02, Volha Bryl <vo...@informatik.uni-mannheim.de> wrote:
> Hi Christopher,
>
> A curiosity:
>
> On 4/21/2014 3:05 AM, Jona Christopher Sahnwaldt wrote:
>>
>> On 20 April 2014 18:58, Volha Bryl <vo...@informatik.uni-mannheim.de> wrote:
>>>
>>> In fact,
>>> SELECT COUNT(*) WHERE {?x ?y ?z}
>>> executed against DBpedia SPARQL endpoint returns 825,761,509 at the moment.
>>> And actually I am not sure that all datasets available at [5] are loaded
>>> into the endpoint
>>
>> No, only certain datasets are loaded. They are listed here:
>> http://wiki.dbpedia.org/DatasetsLoaded39
>>
>>> so the total number for English can be even bigger.
>>>
>>> Summarizing, [1,2] are good sources for getting numbers of things/instances.
>>> For the number of triples - depends on what you want to count. For types and
>>> properties refer to [1,2], for total number of triples - refer to SPARQL
>>> endpoints for English and some other languages for which the endpoints
>>> exist. Or go through the dumps and count :)
>>
>> The number of lines in each dataset file is listed in this file:
>>
>> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/data/lines-bytes-packed.txt
>>
>> There are a few comment lines in each file, so the number of triples
>> is slightly lower, but not by much.
>>
>> I just counted the lines in all English NT files by the following
>> command. (grep -v is necessary to remove a few files that contain
>> almost the same triples as other files.)
>>
>> grep 'en/.*\.nt' lines-bytes-packed.txt | grep -vE 'unredirected|same_as|see_also|chapters|cleaned' | awk '{sum+=$3} END {print sum}'
>>
>> Result for en: 488 million triples.
>> For all languages: 3.1 billion triples
>
> Why then the triple count according to the endpoint (see the query above) is
> more than 800 mln? From your explanations (not all triples are loaded) it
> should be the other way around.
Good question. I don't know. The number of lines in all files listed in
DatasetsLoaded39 [1] (the same files as in datasets.txt [2] and
linksets.txt [3]) is 341,542,042 - not even half the number given by
COUNT(*).

@OpenLink: can you help? Maybe you guys added some other datasets or
inferred a lot of triples when you loaded the DBpedia datasets? Just
curious.

Details:

cat datasets.txt linksets.txt > loaded.txt
grep -f loaded.txt lines-bytes-packed.txt | awk '{sum+=$3} END {print sum}'

Cheers,
JC

[1] http://wiki.dbpedia.org/DatasetsLoaded39
[2] https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/data/datasets.txt
[3] https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/data/linksets.txt

>
> Cheers,
> Volha

_______________________________________________
Dbpedia-discussion mailing list
Dbpedia-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion