Storage of the machine is one 10TB RAID6 SSD.

Cheers, Joachim
> -----Original Message-----
> From: Andy Seaborne <a...@apache.org>
> Sent: Wednesday, 16 February 2022 20:05
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
>
> On 16/02/2022 11:56, Neubert, Joachim wrote:
> > I've loaded the Wikidata "truthy" dataset with 6b triples. Summary stats:
> >
> > 10:09:29 INFO Load node table = 35555 seconds
> > 10:09:29 INFO Load ingest data = 25165 seconds
> > 10:09:29 INFO Build index SPO = 11241 seconds
> > 10:09:29 INFO Build index POS = 14100 seconds
> > 10:09:29 INFO Build index OSP = 12435 seconds
> > 10:09:29 INFO Overall 98496 seconds
> > 10:09:29 INFO Overall 27h 21m 36s
> > 10:09:29 INFO Triples loaded = 6756025616
> > 10:09:29 INFO Quads loaded = 0
> > 10:09:29 INFO Overall Rate 68591 tuples per second
> >
> > This was done on a large machine with 2TB RAM and -threads=48, but
> > anyway: it looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought
> > HUGE improvements over prior versions (unfortunately I cannot find a
> > log, but it took multiple days with 3.x on the same machine).
>
> This is very helpful - faster than Lorenz reported on a 128G / 12 threads
> machine (31h). It does suggest there is effectively a soft upper bound on
> going faster with more RAM and more threads.
>
> That seems likely - disk bandwidth also matters, and because xloader is
> phased between sort and index-writing steps, it is unlikely to be getting
> the best overlap of CPU crunching and I/O.
>
> This all gets into RAID0, or allocating files across different disks.
>
> There comes a point where it gets quite a task to set up the machine.
>
> One other area I think might be easy to improve - more for smaller
> machines - is during data ingest. There, the node table index is being
> randomly read. On smaller-RAM machines, the ingest phase is
> proportionately longer, sometimes a lot.
> An idea I had is calling the madvise system call on the mmap segments to
> tell the kernel the access is random (this requires native code; Java 17
> makes it possible to call madvise(2) directly without needing a C (etc.)
> converter layer).
>
> > If you think it useful, I am happy to share more details.
>
> What was the storage?
>
>      Andy
>
> > Two observations:
> >
> > - As Andy (thanks again for all your help!) already mentioned, gzip
> > files apparently load significantly faster than bzip2 files. I
> > experienced 200,000 vs. 100,000 triples/second in the parse nodes step
> > (though colleagues had jobs on the machine too, which might have
> > influenced the results).
> >
> > - During the extended SPO/POS/OSP sort periods, I saw only one or two
> > gzip instances (used in the background), which perhaps were a
> > bottleneck. I wonder if using pigz could extend parallel processing.
> >
> > If you think it useful, I am happy to share more details. If I can help
> > with running some particular tests on a massively parallel machine,
> > please let me know.
> >
> > Cheers, Joachim
> >
> > --
> > Joachim Neubert
> >
> > ZBW - Leibniz Information Centre for Economics
> > Neuer Jungfernstieg 21
> > 20354 Hamburg
> > Phone +49-40-42834-462