I've loaded the Wikidata "truthy" dataset with about 6.8 billion triples. The summary stats are:

10:09:29 INFO Load node table = 35555 seconds
10:09:29 INFO Load ingest data = 25165 seconds
10:09:29 INFO Build index SPO = 11241 seconds
10:09:29 INFO Build index POS = 14100 seconds
10:09:29 INFO Build index OSP = 12435 seconds
10:09:29 INFO Overall 98496 seconds
10:09:29 INFO Overall 27h 21m 36s
10:09:29 INFO Triples loaded = 6756025616
10:09:29 INFO Quads loaded = 0
10:09:29 INFO Overall Rate 68591 tuples per second

This was done on a large machine with 2TB RAM and -threads=48, but anyway: it looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought HUGE improvements over prior versions (unfortunately I cannot find a log, but it took multiple days with 3.x on the same machine).

Two observations:

- As Andy (thanks again for all your help!) already mentioned, gzip files apparently load significantly faster than bzip2 files. I saw about 200,000 vs. 100,000 triples/second in the parse nodes step (though colleagues had jobs on the machine too, which might have influenced the results).

- During the extended SPO/POS/OSP sort periods, I saw only one or two gzip instances running in the background, which were perhaps a bottleneck. I wonder if using pigz could extend parallel processing.

If you think it useful, I am happy to share more details. If I can help with running some particular tests on a massively parallel machine, please let me know.

Cheers, Joachim

--
Joachim Neubert
ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462