Hi,

I want to load the full Wikidata dump, available at
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2, into
Jena.

I tried it using tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially,
progress (measured by dataset size on disk) is quick, but it slows down
considerably after a couple of hundred GB have been written, and finally, at
around 500GB, it almost grinds to a halt.
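
For reference, the invocation was roughly the following (the database
location and file name are placeholders for the actual paths I used):

    export JVM_ARGS="-Xmx120G"
    tdb2.tdbloader --loc /data/wikidata-tdb2 latest-all.ttl.bz2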

Has anyone ingested Wikidata into Jena before? What are the system
requirements? Is there a specific tdb2.tdbloader configuration that would
speed things up, for example building the indexes after the data ingest?
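
To illustrate the kind of configuration I mean, perhaps one of the loader
variants selected via the --loader option, along these lines (again, the
database location is a placeholder, and I am not sure this is the right knob):

    tdb2.tdbloader --loader=parallel --loc /data/wikidata-tdb2 latest-all.ttl.bz2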

Thanks
Johannes

Johannes Hoffart, Executive Director, Technology Division
Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 
Frankfurt am Main
Email: johannes.hoff...@gs.com | Tel: +49 (0)69 7532 3558

