I've loaded the Wikidata "truthy" dataset with about 6.7 billion triples. The summary stats are:

10:09:29 INFO  Load node table  = 35555 seconds
10:09:29 INFO  Load ingest data = 25165 seconds
10:09:29 INFO  Build index SPO  = 11241 seconds
10:09:29 INFO  Build index POS  = 14100 seconds
10:09:29 INFO  Build index OSP  = 12435 seconds
10:09:29 INFO  Overall          98496 seconds
10:09:29 INFO  Overall          27h 21m 36s
10:09:29 INFO  Triples loaded   = 6756025616
10:09:29 INFO  Quads loaded     = 0
10:09:29 INFO  Overall Rate     68591 tuples per second
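
As a quick sanity check of the figures in the log, here is a minimal Python sketch that recomputes the overall time and rate from the per-phase numbers. The dictionary and variable names are only for illustration; the values are copied from the log lines above.

```python
# Phase durations in seconds, copied from the log above.
phases = {
    "Load node table": 35555,
    "Load ingest data": 25165,
    "Build index SPO": 11241,
    "Build index POS": 14100,
    "Build index OSP": 12435,
}
triples = 6756025616  # "Triples loaded" from the log

# Total runtime and its breakdown into hours, minutes, seconds.
overall = sum(phases.values())
hours, rest = divmod(overall, 3600)
minutes, seconds = divmod(rest, 60)

# Integer division, matching the whole-number rate in the log.
rate = triples // overall

print(f"Overall {overall} seconds = {hours}h {minutes}m {seconds}s")
print(f"Overall rate {rate} tuples per second")

# Share of the total runtime spent in each phase.
for name, secs in phases.items():
    print(f"{name:<17} {100 * secs / overall:5.1f}%")
```

The phases add up to the reported 98496 seconds, and the node table load is the largest single phase.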

This was done on a large machine with 2TB RAM and -threads=48, but even so, it 
looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought HUGE improvements 
over prior versions (unfortunately I cannot find a log, but the load took 
multiple days with 3.x on the same machine).

Two observations:


- As Andy (thanks again for all your help!) already mentioned, gzip 
files apparently load significantly faster than bzip2 files. I saw about 
200,000 vs. 100,000 triples/second in the parse nodes step (though colleagues 
had jobs on the machine too, which might have influenced the results).

- During the extended SPO/POS/OSP sort periods, I saw only one or two 
gzip instances (used in the background), which were perhaps a bottleneck. I 
wonder if using pigz could increase the parallelism.
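
Regarding the gzip vs. bzip2 observation above: to check the difference independently of the loader, something like the following Python sketch could time pure decompression using only the standard library. The sample data and the helper name `time_decompress` are made up for illustration, and the sketch says nothing about how xloader itself reads compressed files.

```python
import bz2
import gzip
import time


def time_decompress(decompress, blob):
    """Return (elapsed seconds, decompressed size) for one decompression call."""
    start = time.perf_counter()
    data = decompress(blob)
    return time.perf_counter() - start, len(data)


# Synthetic N-Triples-like sample (hypothetical data, not Wikidata).
line = b'<http://example.org/s%d> <http://example.org/p> "value %d" .\n'
raw = b"".join(line % (i, i) for i in range(200_000))

# Compress the same input with both codecs.
gz_blob = gzip.compress(raw)
bz_blob = bz2.compress(raw)

gz_secs, gz_size = time_decompress(gzip.decompress, gz_blob)
bz_secs, bz_size = time_decompress(bz2.decompress, bz_blob)

print(f"gzip : {len(gz_blob)} bytes compressed, {gz_secs:.3f} s to decompress")
print(f"bzip2: {len(bz_blob)} bytes compressed, {bz_secs:.3f} s to decompress")
```

The timings depend on the machine and the data, so a test on a slice of the real dump would be more telling than this synthetic sample.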

If you think it would be useful, I am happy to share more details. If I can help 
by running some particular tests on a massively parallel machine, please let me know.

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462
