Hi Andy,

Thanks to you and the others for the helpful pointers.

I will change the heap settings to see if this at least allows the process to 
finish. For reference, the machine has 128GB of main memory and a regular HDD 
attached.

I also changed the logging settings to see the progress (would be nice to have 
this enabled by default).

Thanks
Johannes

-----Original Message-----
From: Andy Seaborne <a...@apache.org>
Sent: Monday, June 8, 2020 11:43 PM
To: users@jena.apache.org
Subject: Re: Resource requirements and configuration for loading a Wikidata dump

Hi Johannes,

On 08/06/2020 16:54, Hoffart, Johannes wrote:
> Hi,
>
> I want to load the full Wikidata dump, available at 
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
> to use in Jena.
>
> I tried it using tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
> Initially, the progress (measured by dataset size) is quick. It slows down
> considerably after a couple of hundred GB have been written, and finally, at
> around 500GB, progress almost halts.

Loading performance is sensitive to the hardware used. It benefits from large 
RAM and a high-performance SSD.

Setting the heap size larger actually slows the process down. The database 
indexes are cached outside the heap, in the main OS file system cache, so a 
heap size of 120G takes space away from that cache.
A heap size of ~8G should be more than enough.

The other factor is the storage. A large SSD, and best of all an M.2-connected 
local SSD, is significantly faster.

It can be worthwhile to build the database on a machine spec'ed for loading and 
move it elsewhere for query use. The database, once built, can be file-copied.

It will take many hours to load under optimal conditions - it has been reported 
that it takes over an hour just to count the lines in the
latest-all.ttl.bz2 file using the standard unix tools (no java in sight!). I'm 
trying to just parse the file and the parser is taking hours. There are a lot 
of warnings (you can ignore them - they are just warnings, not errors).

latest-truthy is significantly smaller. It is worth getting the process working 
with that first (it's only in NT format, but you can just load the prefixes 
taken from the TTL version separately).
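
One way to pull the prefixes out of the TTL dump is sketched below. The 
function name extract_prefixes is made up for illustration, and the sketch 
assumes that all @prefix lines sit within the first 1000 lines of the dump. 
The file names and the DB directory in the usage comment are placeholders.

```shell
# Sketch: copy the prefix declarations from the head of the Turtle dump
# into a small file that can be loaded alongside the N-Triples data.
# Assumption: every @prefix line appears within the first 1000 lines.
extract_prefixes() {
  # Decompress to stdout, look only at the head, keep the @prefix lines.
  bzip2 -dc "$1" | head -n 1000 | grep -E '^@prefix '
}

# Usage (file names and the DB directory are placeholders):
#   extract_prefixes latest-all.ttl.bz2 > prefixes.ttl
#   tdb2.tdbloader --loc DB prefixes.ttl latest-truthy.nt.bz2
```

Only the start of the compressed file is read, so this should return quickly 
even on the full dump.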

And check the download of any of these large files - I have had one come 
through truncated in an attempt I made.
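
A simple way to check for truncation is to let bzip2 test the file. This is 
only a sketch: the function name check_dump is made up for illustration and 
the file name in the usage comment is a placeholder.

```shell
# Sketch: check that a downloaded .bz2 dump is complete before loading it.
# "bzip2 -t" decompresses the whole file without writing any output and
# exits non-zero if the compressed stream is damaged or ends early.
check_dump() {
  if bzip2 -t "$1" 2>/dev/null; then
    echo "ok: $1"
  else
    echo "damaged or truncated: $1"
    return 1
  fi
}

# Usage (file name is a placeholder):
#   check_dump latest-all.ttl.bz2
```

This has to decompress the entire file, so on the full dump it is likely to 
take a long time, in line with the line-counting times mentioned above.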

     Andy

> Did anyone ingest Wikidata into Jena before? What are the system 
> requirements? Is there a specific tdb2.tdbloader configuration that would 
> speed things up? For example building an index after data ingest?

tdb2.tdbloader has options for the loader algorithm. --loader=parallel is 
probably fastest if you have the SSD space.
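
Put together with the heap advice, an invocation might look like the sketch 
below. The wrapper function load_dump is made up for illustration and the 
paths in the usage comment are placeholders. The -Xmx8G value is the rough 
figure from above, not a tuned setting.

```shell
# Sketch: run the TDB2 bulk loader with a modest heap and the parallel loader.
# JVM_ARGS is picked up by the Jena command-line scripts; a small heap leaves
# most of the RAM to the OS file system cache.
load_dump() {
  db_dir=$1    # directory for the TDB2 database, ideally on a local SSD
  shift        # remaining arguments are the data files to load
  JVM_ARGS="-Xmx8G" tdb2.tdbloader --loader=parallel --loc "$db_dir" "$@"
}

# Usage (paths are placeholders):
#   load_dump /data/wikidata-tdb2 latest-all.ttl.bz2
```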

>
> Thanks
> Johannes
>
> Johannes Hoffart, Executive Director, Technology Division Goldman
> Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
> Frankfurt am Main
> Email: johannes.hoff...@gs.com<mailto:johannes.hoff...@gs.com> | Tel:
> +49 (0)69 7532 3558
> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen |
> Dr. Matthias Bock Vorsitzender des Aufsichtsrats: Dermot McDonogh
> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>
>
> ________________________________
>
> Your Personal Data: We may collect and process information about you
> that may be subject to data protection laws. For more information
> about how we use and disclose your personal data, how we protect your
> information, our legal basis to use your information, your rights and
> who you can contact, please refer to:
> http://www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
>
