Re: Loading Wikidata

Andrii Berezovskyi Fri, 18 Feb 2022 01:35:08 -0800

May I ask an unrelated question: how do you get the Ubuntu version in such a
format? 'cat /etc/os-release' (or lsb_release, hostnamectl, neofetch) only
gives me the '20.04.3' format or 'Focal'.


On 2022-02-18, 10:17, "Neubert, Joachim" <[email protected]> wrote:

    OS is CentOS 7.9 in a Docker container running on Ubuntu 9.3.0.

    CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 core
    (144 cores in total)

    Cheers, Joachim

    > -----Original Message-----
    > From: Marco Neumann <[email protected]>
    > Sent: Friday, 18 February 2022 10:00
    > To: [email protected]
    > Subject: Re: Loading Wikidata
    > 
    > Thank you for the effort Joachim, what CPU and OS was used for the load
    > test?
    > 
    > Best,
    > Marco
    > 
    > On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <[email protected]>
    > wrote:
    > 
    > > Storage of the machine is one 10TB RAID6 SSD.
    > >
    > > Cheers, Joachim
    > >
    > > > -----Original Message-----
    > > > From: Andy Seaborne <[email protected]>
    > > > Sent: Wednesday, 16 February 2022 20:05
    > > > To: [email protected]
    > > > Subject: Re: Loading Wikidata
    > > >
    > > >
    > > >
    > > > On 16/02/2022 11:56, Neubert, Joachim wrote:
    > > > > I've loaded the Wikidata "truthy" dataset with 6b triples. Summary
    > > > > stats are:
    > > > >
    > > > > 10:09:29 INFO  Load node table  = 35555 seconds
    > > > > 10:09:29 INFO  Load ingest data = 25165 seconds
    > > > > 10:09:29 INFO  Build index SPO  = 11241 seconds
    > > > > 10:09:29 INFO  Build index POS  = 14100 seconds
    > > > > 10:09:29 INFO  Build index OSP  = 12435 seconds
    > > > > 10:09:29 INFO  Overall          98496 seconds
    > > > > 10:09:29 INFO  Overall          27h 21m 36s
    > > > > 10:09:29 INFO  Triples loaded   = 6756025616
    > > > > 10:09:29 INFO  Quads loaded     = 0
    > > > > 10:09:29 INFO  Overall Rate     68591 tuples per second
    > > > >
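The summary figures above are internally consistent, and a few lines of Python show how they relate. This is only a sketch for checking the arithmetic: the numbers are copied from the log, while the names `phases`, `triples` and `summarize` are made up for illustration and are not part of tdb2.xloader.

```python
# Phase durations in seconds, copied from the log lines quoted above.
phases = {
    "Load node table": 35555,
    "Load ingest data": 25165,
    "Build index SPO": 11241,
    "Build index POS": 14100,
    "Build index OSP": 12435,
}
triples = 6_756_025_616  # "Triples loaded" from the log


def summarize(phases, triples):
    """Return (total seconds, 'Hh MMm SSs' string, whole tuples per second)."""
    total = sum(phases.values())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    rate = triples // total  # integer division, fraction dropped
    return total, f"{hours}h {minutes:02d}m {seconds:02d}s", rate


if __name__ == "__main__":
    print(summarize(phases, triples))
```

The five phases add up to the reported 98496 seconds, which is 27h 21m 36s, and integer division of the triple count by that total gives the reported 68591 tuples per second.
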
    > > > > This was done on a large machine with 2TB RAM and -threads=48, but
    > > > > anyway: It looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT
    > > > > brought HUGE improvements over prior versions (unfortunately I
    > > > > cannot find a log, but it took multiple days with 3.x on the same
    > > > > machine).
    > > >
    > > > This is very helpful - faster than Lorenz reported on a 128G /
    > > > 12-thread machine (31h). It does suggest there is effectively a
    > > > soft upper bound on going faster by adding more RAM and more
    > > > threads.
    > > >
    > > > That seems likely - disk bandwidth also matters and, because
    > > > xloader is phased between sort and index writing steps, it is
    > > > unlikely to be getting the best overlap of CPU crunching and I/O.
    > > >
    > > > This all gets into RAID0, or allocating files across different disks.
    > > >
    > > > There comes a point where it gets quite a task to set up the machine.
    > > >
    > > > One other area I think might be easy to improve - more for smaller
    > > > machines - is during data ingest. There, the node table index is
    > > > being randomly read. On smaller RAM machines, the ingest phase is
    > > > proportionately longer, sometimes a lot.
    > > >
    > > > An idea I had is calling the madvise system call on the mmap
    > > > segments to tell the kernel the access is random (requires native
    > > > code; Java17 makes it possible to directly call madvise(2) without
    > > > needing a C (etc) converter layer).
    > > >
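To make the madvise idea concrete, here is a minimal sketch in Python rather than Java, since Python's standard `mmap` module exposes `madvise` directly (Python 3.8+, on platforms that provide the call). It is not how Jena or TDB2 maps its files, and it is not the Java 17 route mentioned above; `map_random_access` is a hypothetical helper name.

```python
import mmap


def map_random_access(path):
    """Map a non-empty file read-only and hint that reads will be random."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after
        # the file object is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # madvise() and MADV_RANDOM are only present where the OS supports them,
    # so guard the call instead of assuming Linux.
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        mm.madvise(mmap.MADV_RANDOM)  # advisory only; the kernel may ignore it
    return mm
```

The hint is advisory: it asks the kernel not to read ahead aggressively around each page fault, which is the behaviour one would want for random lookups in a node table index. Whether it helps the ingest phase in practice would need measuring.
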
    > > >  > If you think it useful, I am happy to share more details.
    > > >
    > > > What was the storage?
    > > >
    > > >      Andy
    > > > >
    > > > > Two observations:
    > > > >
    > > > >
    > > > > -        As Andy (thanks again for all your help!) already
    > > > > mentioned, gzip files apparently load significantly faster than
    > > > > bzip2 files. I experienced 200,000 vs. 100,000 triples/second in
    > > > > the parse nodes step (though colleagues had jobs on the machine
    > > > > too, which might have influenced the results).
    > > > >
    > > > > -        During the extended SPO/POS/OSP sort periods, I saw only
    > > > > one or two gzip instances (used in the background), which perhaps
    > > > > were a bottleneck. I wonder if using pigz could extend parallel
    > > > > processing.
    > > > >
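On the pigz question, a rough sketch of what swapping the compressor could look like. It assumes the background gzip processes come from an external sort that accepts a compression program, as GNU sort does with `--compress-program`; whether xloader's scripts can be pointed at pigz this way is not confirmed in this thread. The helper names and the file paths in the usage line are hypothetical, and the command is only built, never run.

```python
import shutil


def pick_compressor(which=shutil.which):
    """Return 'pigz' when it is found on PATH, otherwise fall back to 'gzip'."""
    return "pigz" if which("pigz") else "gzip"


def sort_command(input_path, output_path, threads, which=shutil.which):
    """Build (but do not run) a GNU sort command line as an argument list."""
    return [
        "sort",
        f"--parallel={threads}",                         # sort threads
        f"--compress-program={pick_compressor(which)}",  # temp-file compressor
        "-o", output_path,
        input_path,
    ]


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    print(" ".join(sort_command("triples.tmp", "triples.sorted", 48)))
```

GNU sort runs the given program with no arguments to compress its temporary files and with `-d` to decompress them, and pigz accepts both forms, so it can stand in for gzip there. Whether that removes the bottleneck observed above is an open question.
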
    > > > > If you think it useful, I am happy to share more details. If I
    > > > > can help with running some particular tests on a massively
    > > > > parallel machine, please let me know.
    > > > >
    > > > > Cheers, Joachim
    > > > >
    > > > > --
    > > > > Joachim Neubert
    > > > >
    > > > > ZBW - Leibniz Information Centre for Economics
    > > > > Neuer Jungfernstieg 21
    > > > > 20354 Hamburg
    > > > > Phone +49-40-42834-462
    > > > >
    > > > >
    > >
    > 
    > 
    > --
    > 
    > 
    > ---
    > Marco Neumann
    > KONA
