It may be that SSD is the important factor.

1/ From a while ago, on the "truthy" dump:

https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E

before tdb2.tdbloader was a thing.

2/ I did some (not published) testing of tdb2.tdbloader on a mere 800M triples, with a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and a big AWS server (local NVMe SSD, but virtualized).

The laptop was nearly as fast as the big AWS server.

My assumption was that, as the database grew, RAM caching became less significant and the speed of I/O became dominant.

FYI: when "tdb2.tdbloader --loader=parallel" gets going, it will saturate the I/O.
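
A sketch of that kind of run, with the database directory and input file as placeholders (the heap is passed via JVM_ARGS, which the Jena command-line scripts pick up):

    # placeholders: /ssd/wikidata-tdb2 (SSD-backed database directory), latest-all.ttl.gz (pre-downloaded dump)
    export JVM_ARGS="-Xmx16G"
    tdb2.tdbloader --loc /ssd/wikidata-tdb2 --loader=parallel latest-all.ttl.gz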

----

I don't have access to hardware (or ad hoc AWS machines) at the moment, otherwise I'd give this a try.

Previously, downloading the data to AWS was much faster and much more reliable than to my local setup. That said, I think dumps.wikimedia.org does some rate limiting of downloads as well, or my route to the site ends up on a virtual T3 - I often see the magic number of 5 MBytes/s sustained download speed outside working hours.
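
For reference, a resumable download of the dump mentioned further down the thread might look like this (wget shown; curl -C - -O does the same job):

    # -c resumes a partial download if the connection drops or gets rate limited
    wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2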

    Andy

On 09/06/2020 08:04, Wolfgang Fahl wrote:
Hi Johannes,

thank you for bringing the issue to this mailing list again.

At
https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
there is a question describing the issue, and at
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
there is documentation of my own attempts. There has been some feedback
from a few people in the meantime, but I have no report of a success yet.
The only hints for achieving better performance so far relate to RAM and
disk: using lots of RAM (up to 2 terabytes) and SSDs (also around 2
terabytes) was mentioned. I asked at my local IT center, and a machine
with that much RAM costs around 30-60 thousand EUR, which is definitely
out of my budget. I might invest in a 200 EUR 2 terabyte SSD if I could
be sure it would solve the problem. At this point I doubt it, since the
software keeps crashing on me, and there seem to be bugs in the operating
system, the Java virtual machine and Jena itself that prevent success, as
well as severe performance degradation for multi-billion-triple imports
that makes testing almost impossible, given an estimated completion time
of half a year on the (old but sophisticated) hardware I use daily.

Cheers
   Wolfgang

On 08.06.20 at 17:54, Hoffart, Johannes wrote:
Hi,

I want to load the full Wikidata dump, available at 
https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in 
Jena.

I tried it using tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, 
progress (measured by dataset size) is quick. It slows down considerably after 
a couple of hundred GB have been written, and finally, at around 500GB, 
progress almost grinds to a halt.
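
A minimal way to track that kind of progress, with the database directory as a placeholder:

    # prints the on-disk size of the TDB2 database directory every minute
    watch -n 60 du -sh /data/wikidata-tdb2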

Has anyone ingested Wikidata into Jena before? What are the system requirements? 
Is there a specific tdb2.tdbloader configuration that would speed things up, 
for example building the indexes after the data ingest?

Thanks
Johannes

Johannes Hoffart, Executive Director, Technology Division
Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 
Frankfurt am Main
Email: johannes.hoff...@gs.com | Tel: +49 (0)69 
7532 3558


