Re: [Dbpedia-discussion] total number of triples and instances in current DBpedia version 3.9

Jona Christopher Sahnwaldt Tue, 22 Apr 2014 08:49:38 -0700

On 21 April 2014 12:02, Volha Bryl <vo...@informatik.uni-mannheim.de> wrote:
> Hi Christopher,
>
> A curiosity:
>
>
> On 4/21/2014 3:05 AM, Jona Christopher Sahnwaldt wrote:
>>
>> On 20 April 2014 18:58, Volha Bryl <vo...@informatik.uni-mannheim.de>
>> wrote:
>>>
>>> In fact,
>>> SELECT COUNT(*) WHERE {?x ?y ?z}
>>> executed against DBpedia SPARQL endpoint returns 825,761,509 at the
>>> moment.
>>> And actually I am not sure that all datasets available at [5] are loaded
>>> into the endpoint
>>
>> No, only certain datasets are loaded. They are listed here:
>> http://wiki.dbpedia.org/DatasetsLoaded39
>>
>>> so the total number for English can be even bigger.
>>>
>>> Summarizing, [1,2] are good sources for getting numbers of
>>> things/instances.
>>> For the number of triples - depends on what you want to count. For types
>>> and
>>> properties refer to [1,2], for total number of triples - refer to SPARQL
>>> endpoints for English and some other languages for which the endpoints
>>> exist. Or go through the dumps and count :)
>>
>> The number of lines in each dataset file is listed in this file:
>>
>>
>> https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/data/lines-bytes-packed.txt
>>
>> There are a few comment lines in each file, so the number of triples
>> is slightly lower, but not by much.
>>
>> I just counted the lines in all English NT files by the following
>> command. (grep -v is necessary to remove a few files that contain
>> almost the same triples as other files.)
>>
>> grep 'en/.*\.nt' lines-bytes-packed.txt  | grep -vE
>> 'unredirected|same_as|see_also|chapters|cleaned' | awk '{sum+=$3} END
>> {print sum}'
>>
>> Result for en: 488 million triples.
>> For all languages: 3.1 billion triples
>
> Why then the triple count according to the endpoint (see the query above) is
> more than 800 mln? From your explanations (not all triples are loaded) it
> should be the other way around.


Good question. I don't know. The number of lines in all files listed
in DatasetsLoaded39 [1] (same files as in datasets.txt [2] and
linksets.txt [3]) is 341,542,042 - not even half the number given by
COUNT(*).

@OpenLink: can you help? Maybe you guys added some other datasets or
inferred a lot of triples when you loaded the DBpedia datasets? Just
curious.

Details:

cat datasets.txt linksets.txt > loaded.txt
grep -f loaded.txt lines-bytes-packed.txt | awk '{sum+=$3} END {print sum}'
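
For anyone who wants to check the pipeline above, here is a minimal sketch on made-up data. The file contents and the "lines:" label are hypothetical; the only thing assumed about the real lines-bytes-packed.txt is that field 3 holds the line count, as in the awk call above. The sketch uses grep -F so that the entries in loaded.txt are matched as fixed strings rather than regular expressions. Note also that an empty line in loaded.txt would make grep -f match every line.

```shell
# Hypothetical sample data, NOT the real file contents.
tmp=$(mktemp -d)

cat > "$tmp/lines-bytes-packed.txt" <<'EOF'
en/labels_en.nt lines: 100
en/labels_en.ttl lines: 100
en/redirects_en.nt lines: 200
de/labels_de.nt lines: 400
EOF

# Hypothetical list of loaded files (stands in for datasets.txt + linksets.txt).
cat > "$tmp/loaded.txt" <<'EOF'
en/labels_en.nt
en/redirects_en.nt
EOF

# Keep only the listed files, then sum field 3 (the line counts).
total=$(grep -F -f "$tmp/loaded.txt" "$tmp/lines-bytes-packed.txt" \
  | awk '{sum+=$3} END {print sum}')

echo "$total"
```

On this sample only the two listed en/*.nt entries are summed; the .ttl and de/ entries are skipped.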

Cheers,
JC

[1] http://wiki.dbpedia.org/DatasetsLoaded39
[2] https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/data/datasets.txt
[3] https://github.com/dbpedia/extraction-framework/blob/master/scripts/src/main/data/linksets.txt

>
> Cheers,
> Volha
