Hi Johannes,
On 08/06/2020 16:54, Hoffart, Johannes wrote:
Hi,
I want to load the full Wikidata dump, available at
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in
Jena.
I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially,
the progress (measured by dataset size) is quick. It slows down very much after
a couple of 100GB written, and finally, at around 500GB, the progress is almost
halted.
Loading performance is sensitive to the hardware used: plenty of RAM and
a high-performance SSD both matter.
Setting the heap size larger actually slows the process down. The
database indexes are cached outside the heap, in the main OS filesystem
cache, so a heap size of 120G takes memory away from that cache.
A heap size of ~8G should be more than enough.
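As a rough sketch of the heap setting, assuming the tdb2.tdbloader script
is on the PATH and picks up JVM_ARGS as in your run (the TDBLOADER
variable, the function name and the paths are made up for illustration):

```shell
# Command to run; overridable (e.g. TDBLOADER=echo) for a dry run.
# TDBLOADER is a hypothetical helper variable, not something Jena reads.
TDBLOADER="${TDBLOADER:-tdb2.tdbloader}"

# Load one dump file into a TDB2 database directory with a modest Java
# heap, leaving the rest of RAM to the OS filesystem cache.
#   $1 = database directory, $2 = dump file
load_small_heap() {
  JVM_ARGS="-Xmx8G" "$TDBLOADER" --loc "$1" "$2"
}

# Usage (hypothetical paths):
#   load_small_heap /data/tdb2/wikidata latest-all.ttl.bz2
```

The exact figure is a judgment call; the point is only that the heap
should stay small relative to the machine's RAM.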
The other factor is the storage. A large SSD, ideally a local
M.2-connected SSD, is significantly faster.
It can be worthwhile to build the database on a machine spec'ed for
loading and move it elsewhere for query use. The database, once built,
can be file-copied.
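A minimal sketch of the file copy, assuming the load has finished and no
process has the database open while it is copied (the function name and
paths are hypothetical):

```shell
# Copy a finished database directory to a new location, preserving
# the directory layout, timestamps and permissions.
#   $1 = source database directory, $2 = destination directory
copy_database() {
  mkdir -p "$2"
  cp -a "$1"/. "$2"/
}

# Usage (hypothetical paths):
#   copy_database /fast-ssd/tdb2/wikidata /srv/query/tdb2/wikidata
```

For a copy to another machine, rsync or scp over the same directory
would do the same job.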
It will take many hours to load under optimal conditions - it has been
reported it takes over an hour just to count the lines in the
latest-all.ttl.bz2 file using the standard unix tools (no java in
sight!). I'm trying to just parse the file and the parser is taking
hours. There are a lot of warnings (you can ignore them - they are just
warnings, not errors).
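For reference, the line count with standard unix tools could look
something like this, assuming bzip2 is installed (the function name is
made up). On the full dump, expect it to run for a long time:

```shell
# Decompress to stdout and count lines without writing the
# uncompressed file to disk.
#   $1 = path to a .bz2 file
count_lines_bz2() {
  bzip2 -dc "$1" | wc -l
}

# Usage:
#   count_lines_bz2 latest-all.ttl.bz2
```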
latest-truthy is significantly smaller. It is worth getting the process
working on that first (it's only in NT format, but you can just load the
prefixes taken from the TTL version separately).
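One way to pull the prefixes out of the TTL version, as a sketch only: it
assumes the @prefix declarations all sit near the top of the file, and
the function name, the 1000-line cutoff and the file names are
illustrative.

```shell
# Keep only the @prefix lines from the head of a Turtle stream on stdin.
#   $1 = number of leading lines to scan (default 1000, an arbitrary guess)
extract_prefixes() {
  head -n "${1:-1000}" | grep '^@prefix'
}

# Usage (hypothetical file names):
#   bzip2 -dc latest-all.ttl.bz2 | extract_prefixes > prefixes.ttl
#   tdb2.tdbloader --loc DB latest-truthy.nt.bz2 prefixes.ttl
```

Because head stops reading after the cutoff, this does not decompress the
whole dump.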
And check the download of any of these large files - I have had one
truncate in an attempt I made.
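A quick integrity check can be sketched with bzip2's test mode, assuming
bzip2 is installed (the function name is made up). It reads the whole
file, so it is slow on the full dump, but it does catch a truncated
download:

```shell
# Test a .bz2 file without writing any output; bzip2 -t exits non-zero
# if the file is corrupt or ends unexpectedly.
#   $1 = path to a .bz2 file
check_bz2() {
  if bzip2 -t "$1"; then
    echo "OK: $1"
  else
    echo "corrupt or truncated: $1" >&2
    return 1
  fi
}

# Usage:
#   check_bz2 latest-all.ttl.bz2
```

Comparing the file size or a checksum against what the download site
publishes, where available, is a cheaper first check.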
Andy
Did anyone ingest Wikidata into Jena before? What are the system requirements?
Is there a specific tdb2.tdbloader configuration that would speed things up?
For example building an index after data ingest?
tdb2.tdbloader has options for the loader algorithm. --loader=parallel is
probably the fastest if you have the SSD space.
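Something along these lines, assuming tdb2.tdbloader is on the PATH (the
TDBLOADER variable, the function name and the paths are hypothetical):

```shell
# Command to run; overridable (e.g. TDBLOADER=echo) for a dry run.
# TDBLOADER is a hypothetical helper variable, not something Jena reads.
TDBLOADER="${TDBLOADER:-tdb2.tdbloader}"

# Load a dump with the parallel loader and a modest Java heap.
#   $1 = database directory, $2 = dump file
load_parallel() {
  JVM_ARGS="-Xmx8G" "$TDBLOADER" --loader=parallel --loc "$1" "$2"
}

# Usage (hypothetical paths):
#   load_parallel /data/tdb2/wikidata latest-truthy.nt.bz2
```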
Thanks
Johannes
Johannes Hoffart, Executive Director, Technology Division