Storage of the machine is one 10TB RAID6 SSD.

Cheers, Joachim
> -----Original Message-----
> From: Andy Seaborne <a...@apache.org>
> Sent: Wednesday, 16 February 2022 20:05
> To: users@jena.apache.org
> Subject: Re: Loading Wikidata
>
> On 16/02/2022 11:56, Neubert, Joachim wrote:
> > I've loaded the Wikidata "truthy" dataset with 6b triples. Summary stats:
> >
> > 10:09:29 INFO Load node table = 35555 seconds
> > 10:09:29 INFO Load ingest data = 25165 seconds
> > 10:09:29 INFO Build index SPO = 11241 seconds
> > 10:09:29 INFO Build index POS = 14100 seconds
> > 10:09:29 INFO Build index OSP = 12435 seconds
> > 10:09:29 INFO Overall 98496 seconds
> > 10:09:29 INFO Overall 27h 21m 36s
> > 10:09:29 INFO Triples loaded = 6756025616
> > 10:09:29 INFO Quads loaded = 0
> > 10:09:29 INFO Overall Rate 68591 tuples per second
> >
> > This was done on a large machine with 2TB RAM and -threads=48, but
> > anyway: it looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT brought
> > HUGE improvements over prior versions (unfortunately I cannot find a
> > log, but it took multiple days with 3.x on the same machine).
>
> This is very helpful - faster than Lorenz reported on a 128G / 12 threads
> machine (31h). It does suggest there is effectively a soft upper bound on
> going faster with more RAM and more threads.
>
> That seems likely - disk bandwidth also matters, and because xloader is
> phased between sort and index-writing steps, it is unlikely to be getting
> the best overlap of CPU crunching and I/O.
>
> This all gets into RAID0, or allocating files across different disks.
>
> There comes a point where it gets quite a task to set up the machine.
>
> One other area I think might be easy to improve - more for smaller
> machines - is during data ingest. There, the node table index is being
> randomly read. On smaller-RAM machines, the ingest phase is
> proportionately longer, sometimes a lot.
> An idea I had is calling the madvise system call on the mmap segments to
> tell the kernel the access is random (this requires native code; Java 17
> makes it possible to call madvise(2) directly without needing a C (etc.)
> converter layer).
>
> > If you think it useful, I am happy to share more details.
>
> What was the storage?
>
>      Andy
>
> > Two observations:
> >
> > - As Andy (thanks again for all your help!) already mentioned, gzip
> > files apparently load significantly faster than bzip2 files. I
> > experienced 200,000 vs. 100,000 triples/second in the parse nodes step
> > (though colleagues had jobs on the machine too, which might have
> > influenced the results).
> >
> > - During the extended SPO/POS/OSP sort periods, I saw only one or two
> > gzip instances (used in the background), which perhaps were a
> > bottleneck. I wonder if using pigz could extend parallel processing.
> >
> > If you think it useful, I am happy to share more details. If I can help
> > with running some particular tests on a massively parallel machine,
> > please let me know.
> >
> > Cheers, Joachim
> >
> > --
> > Joachim Neubert
> >
> > ZBW - Leibniz Information Centre for Economics
> > Neuer Jungfernstieg 21
> > 20354 Hamburg
> > Phone +49-40-42834-462