Thank you Joachim, I suspect Lorenz's machine would now produce similar results with the most recent release, at lower cost and energy consumption. (AMD Ryzen 9 5950X)
On Fri, Feb 18, 2022 at 9:16 AM Neubert, Joachim <j.neub...@zbw.eu> wrote:
> OS is Centos 7.9 in a docker container running on Ubuntu 9.3.0.
>
> CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 core
> (144 cores in total)
>
> Cheers, Joachim
>
> > -----Original Message-----
> > From: Marco Neumann <marco.neum...@gmail.com>
> > Sent: Friday, 18 February 2022 10:00
> > To: users@jena.apache.org
> > Subject: Re: Loading Wikidata
> >
> > Thank you for the effort Joachim, what CPU and OS was used for the
> > load test?
> >
> > Best,
> > Marco
> >
> > On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <j.neub...@zbw.eu>
> > wrote:
> > > Storage of the machine is one 10TB raid6 SSD.
> > >
> > > Cheers, Joachim
> > >
> > > > -----Original Message-----
> > > > From: Andy Seaborne <a...@apache.org>
> > > > Sent: Wednesday, 16 February 2022 20:05
> > > > To: users@jena.apache.org
> > > > Subject: Re: Loading Wikidata
> > > >
> > > > On 16/02/2022 11:56, Neubert, Joachim wrote:
> > > > > I've loaded the Wikidata "truthy" dataset with 6b triples.
> > > > > Summary stats are:
> > > > >
> > > > > 10:09:29 INFO Load node table = 35555 seconds
> > > > > 10:09:29 INFO Load ingest data = 25165 seconds
> > > > > 10:09:29 INFO Build index SPO = 11241 seconds
> > > > > 10:09:29 INFO Build index POS = 14100 seconds
> > > > > 10:09:29 INFO Build index OSP = 12435 seconds
> > > > > 10:09:29 INFO Overall 98496 seconds
> > > > > 10:09:29 INFO Overall 27h 21m 36s
> > > > > 10:09:29 INFO Triples loaded = 6756025616
> > > > > 10:09:29 INFO Quads loaded = 0
> > > > > 10:09:29 INFO Overall Rate 68591 tuples per second
> > > > >
> > > > > This was done on a large machine with 2TB RAM and -threads=48,
> > > > > but anyway: It looks like tdb2.xloader in
> > > > > apache-jena-4.5.0-SNAPSHOT brought HUGE improvements over prior
> > > > > versions (unfortunately I cannot find a log, but it took
> > > > > multiple days with 3.x on the same machine).
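The summary figures quoted above are internally consistent; a quick check in Python, using only the numbers copied from the log:

```python
# Figures reported in the tdb2.xloader summary log above
triples = 6_756_025_616
phase_seconds = {
    "Load node table": 35555,
    "Load ingest data": 25165,
    "Build index SPO": 11241,
    "Build index POS": 14100,
    "Build index OSP": 12435,
}

total = sum(phase_seconds.values())
print(total)                               # 98496 -- matches "Overall 98496 seconds"
print(total == 27 * 3600 + 21 * 60 + 36)   # True  -- matches "Overall 27h 21m 36s"
print(triples // total)                    # 68591 -- matches "Overall Rate 68591 tuples per second"
```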
> > > >
> > > > This is very helpful - faster than Lorenz reported on a 128G / 12
> > > > threads machine (31h). It does suggest there is effectively a soft
> > > > upper bound on going faster by more RAM, more threads.
> > > >
> > > > That seems likely - disk bandwidth also matters, and because
> > > > xloader is phased between sort and index writing steps, it is
> > > > unlikely to be getting the best overlap of CPU crunching and I/O.
> > > >
> > > > This all gets into RAID0, or allocating files across different
> > > > disks.
> > > >
> > > > There comes a point where it gets quite a task to set up the
> > > > machine.
> > > >
> > > > One other area I think might be easy to improve - more for smaller
> > > > machines - is during data ingest. There, the node table index is
> > > > being randomly read. On smaller RAM machines, the ingest phase is
> > > > proportionately longer, sometimes a lot.
> > > >
> > > > An idea I had is calling the madvise system call on the mmap
> > > > segments to tell the kernel the access is random (requires native
> > > > code; Java 17 makes it possible to directly call madvise(2) without
> > > > needing a C (etc.) converter layer).
> > > >
> > > > > If you think it useful, I am happy to share more details.
> > > >
> > > > What was the storage?
> > > >
> > > >     Andy
> > > >
> > > > > Two observations:
> > > > >
> > > > > - As Andy (thanks again for all your help!) already mentioned,
> > > > > gzip files apparently load significantly faster than bzip2
> > > > > files. I experienced 200,000 vs. 100,000 triples/second in the
> > > > > parse nodes step (though colleagues had jobs on the machine too,
> > > > > which might have influenced the results).
> > > > >
> > > > > - During the extended SPO/POS/OSP sort periods, I saw only one
> > > > > or two gzip instances (used in the background), which perhaps
> > > > > were a bottleneck.
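The madvise idea Andy describes can be sketched outside Java: Python's mmap module wraps the same madvise(2) system call. A minimal illustration, where the mapped file is a throwaway stand-in for a randomly-read index file such as the TDB2 node table (not a real TDB2 file):

```python
import mmap
import os
import tempfile

# Create a small scratch file standing in for a memory-mapped index
# (illustrative only; a real loader would map its existing data files).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * mmap.PAGESIZE * 4)
    path = f.name

fd = os.open(path, os.O_RDONLY)
try:
    mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)  # Unix-only signature
    if hasattr(mmap, "MADV_RANDOM"):            # Linux and most Unixes
        # Tell the kernel that page access will be random, so it skips
        # sequential readahead on this mapping.
        mm.madvise(mmap.MADV_RANDOM)
    mm.close()
finally:
    os.close(fd)
    os.unlink(path)
```

In Java 17 the equivalent would go through the (then-incubating) Foreign Function & Memory API rather than a hand-written JNI shim, which is the "without needing a C converter layer" point above.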
> > > > > I wonder if using pigz could extend parallel processing.
> > > > >
> > > > > If you think it useful, I am happy to share more details. If I
> > > > > can help with running some particular tests on a massive
> > > > > parallel machine, please let me know.
> > > > >
> > > > > Cheers, Joachim
> > > > >
> > > > > --
> > > > > Joachim Neubert
> > > > >
> > > > > ZBW - Leibniz Information Centre for Economics
> > > > > Neuer Jungfernstieg 21
> > > > > 20354 Hamburg
> > > > > Phone +49-40-42834-462
> >
> > --
> > ---
> > Marco Neumann
> > KONA

--
---
Marco Neumann
KONA
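On the pigz question: pigz parallelizes compression across cores, but its documentation notes that decompression itself cannot be parallelized (extra threads handle only read, write, and checksum), so gains on the decompress-and-read path may be modest. A sketch of swapping an external decompressor in front of a consumer, falling back to the gzip binary or Python's gzip module when pigz is not installed (the dump file name in the usage comment is illustrative):

```python
import gzip
import shutil
import subprocess

def open_gz_stream(path):
    """Stream-decompress a .gz file, preferring pigz, then the gzip
    binary, then Python's gzip module. Returns a binary file-like
    object yielding the decompressed bytes."""
    tool = shutil.which("pigz") or shutil.which("gzip")
    if tool:
        # External process does the inflate; the consumer reads a pipe.
        proc = subprocess.Popen([tool, "-dc", path],
                                stdout=subprocess.PIPE)
        return proc.stdout
    return gzip.open(path, "rb")  # pure-Python fallback

# usage (hypothetical dump file name):
# with open_gz_stream("latest-truthy.nt.gz") as src:
#     for line in src:
#         ...  # feed the parser
```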