Re: Loading Wikidata

Andrii Berezovskyi Fri, 18 Feb 2022 01:49:21 -0800
I see, thanks. Are you sure 9.3.0 is the Ubuntu version and not the GCC version?
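
For context: /proc/version normally reports the kernel release together
with the compiler that built the kernel, so a 9.3.0 in that output would
be the GCC version rather than the Ubuntu release. Below is a minimal
Python sketch for pulling the two apart. The sample string is illustrative
only (it is not taken from Joachim's machine), and the function name
parse_proc_version is made up for this example. Newer kernels format the
compiler part differently, in which case the "gcc" field is simply None.

```python
import re

# Illustrative sample only, NOT copied from the thread: the usual shape of
# /proc/version on an Ubuntu 20.04 host with an older 5.4 kernel.
SAMPLE = (
    "Linux version 5.4.0-99-generic (buildd@example) "
    "(gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) "
    "#112-Ubuntu SMP Thu Feb 3 13:50:55 UTC 2022"
)


def parse_proc_version(text):
    """Return the kernel release and the GCC version found in `text`.

    Either value is None when the corresponding pattern is not present.
    """
    kernel = re.search(r"Linux version (\S+)", text)
    gcc = re.search(r"gcc version (\d+(?:\.\d+)*)", text, re.IGNORECASE)
    return {
        "kernel": kernel.group(1) if kernel else None,
        "gcc": gcc.group(1) if gcc else None,
    }


if __name__ == "__main__":
    # On a real Linux host, read /proc/version instead of SAMPLE.
    print(parse_proc_version(SAMPLE))
```

The distribution release itself is not part of the kernel banner in a
reliable form; /etc/os-release remains the place to read it from.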

> On 18 Feb 2022, at 10:46, Neubert, Joachim <j.neub...@zbw.eu> wrote:
> 
> I used cat /proc/version
> 
>> -----Original Message-----
>> From: Andrii Berezovskyi <andr...@kth.se>
>> Sent: Friday, 18 February 2022 10:35
>> To: users@jena.apache.org
>> Subject: Re: Loading Wikidata
>> 
>> May I ask an unrelated question: how do you get Ubuntu version in such a
>> format? 'cat /etc/os-release' (or lsb_release, hostnamectl, neofetch) only
>> gives me the '20.04.3' format or Focal.
>> 
>> On 2022-02-18, 10:17, "Neubert, Joachim" <j.neub...@zbw.eu> wrote:
>> 
>>    OS is Centos 7.9 in a docker container running on Ubuntu 9.3.0.
>> 
>>    CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 core (144 cores
>> in total)
>> 
>>    Cheers, Joachim
>> 
>>> -----Original Message-----
>>> From: Marco Neumann <marco.neum...@gmail.com>
>>> Sent: Friday, 18 February 2022 10:00
>>> To: users@jena.apache.org
>>> Subject: Re: Loading Wikidata
>>> 
>>> Thank you for the effort Joachim, what CPU and OS was used for the load
>>> test?
>>> 
>>> Best,
>>> Marco
>>> 
>>> On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <j.neub...@zbw.eu>
>>> wrote:
>>> 
>>>> Storage of the machine is one 10TB raid6 SSD.
>>>> 
>>>> Cheers, Joachim
>>>> 
>>>>> -----Original Message-----
>>>>> From: Andy Seaborne <a...@apache.org>
>>>>> Sent: Wednesday, 16 February 2022 20:05
>>>>> To: users@jena.apache.org
>>>>> Subject: Re: Loading Wikidata
>>>>> 
>>>>> 
>>>>> 
>>>>> On 16/02/2022 11:56, Neubert, Joachim wrote:
>>>>>> I've loaded the Wikidata "truthy" dataset with 6b triples. Summary
>>>>>> stats are:
>>>>>> 
>>>>>> 10:09:29 INFO  Load node table  = 35555 seconds
>>>>>> 10:09:29 INFO  Load ingest data = 25165 seconds
>>>>>> 10:09:29 INFO  Build index SPO  = 11241 seconds
>>>>>> 10:09:29 INFO  Build index POS  = 14100 seconds
>>>>>> 10:09:29 INFO  Build index OSP  = 12435 seconds
>>>>>> 10:09:29 INFO  Overall          98496 seconds
>>>>>> 10:09:29 INFO  Overall          27h 21m 36s
>>>>>> 10:09:29 INFO  Triples loaded   = 6756025616
>>>>>> 10:09:29 INFO  Quads loaded     = 0
>>>>>> 10:09:29 INFO  Overall Rate     68591 tuples per second
>>>>>> 
>>>>>> This was done on a large machine with 2TB RAM and -threads=48, but
>>>>>> anyway: It looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT
>>>>>> brought HUGE improvements over prior versions (unfortunately I
>>>>>> cannot find a log, but it took multiple days with 3.x on the same
>>>>>> machine).
>>>>> 
>>>>> This is very helpful - faster than Lorenz reported on a 128G / 12
>>>>> threads machine (31h). It does suggest there is effectively a soft
>>>>> upper bound on going faster by more RAM, more threads.
>>>>> 
>>>>> That seems likely - disk bandwidth also matters, and because xloader
>>>>> is phased between sort and index writing steps, it is unlikely to be
>>>>> getting the best overlap of CPU crunching and I/O.
>>>>> 
>>>>> This all gets into RAID0, or allocating files across different disks.
>>>>> 
>>>>> There comes a point where it gets quite a task to set up the machine.
>>>>> 
>>>>> One other area I think might be easy to improve - more for smaller
>>>>> machines - is during data ingest. There, the node table index is
>>>>> being randomly read. On smaller RAM machines, the ingest phase is
>>>>> proportionately longer, sometimes a lot.
>>>>> 
>>>>> An idea I had is calling the madvise system call on the mmap
>>>>> segments to tell the kernel the access is random (requires native
>>>>> code; Java17 makes it possible to directly call madvise(2) without
>>>>> needing a C (etc) converter layer).
>>>>> 
>>>>>> If you think it useful, I am happy to share more details.
>>>>> 
>>>>> What was the storage?
>>>>> 
>>>>>     Andy
>>>>>> 
>>>>>> Two observations:
>>>>>> 
>>>>>> 
>>>>>> -        As Andy (thanks again for all your help!) already mentioned,
>>>>>> gzip files apparently load significantly faster than bzip2 files. I
>>>>>> experienced 200,000 vs. 100,000 triples/second in the parse nodes
>>>>>> step (though colleagues had jobs on the machine too, which might
>>>>>> have influenced the results).
>>>>>> 
>>>>>> -        During the extended SPO/POS/OSP sort periods, I saw only one
>>>>>> or two gzip instances (used in the background), which perhaps were
>>>>>> a bottleneck. I wonder if using pigz could extend parallel
>>>>>> processing.
>>>>>> 
>>>>>> If you think it useful, I am happy to share more details. If I can
>>>>>> help with running some particular tests on a massively parallel
>>>>>> machine, please let me know.
>>>>>> 
>>>>>> Cheers, Joachim
>>>>>> 
>>>>>> --
>>>>>> Joachim Neubert
>>>>>> 
>>>>>> ZBW - Leibniz Information Centre for Economics
>>>>>> Neuer Jungfernstieg 21
>>>>>> 20354 Hamburg
>>>>>> Phone +49-40-42834-462
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> 
>>> 
>>> ---
>>> Marco Neumann
>>> KONA
>
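
The madvise idea Andy mentions in the quoted message can be tried out
from Python before writing any native Java code. This is a minimal sketch,
assuming Linux and Python 3.8+ (where mmap objects have madvise() and the
mmap module exposes MADV_RANDOM). The helper name map_with_random_advice
and the temporary file in the demo are hypothetical, and this is not how
TDB2 or xloader maps its files.

```python
import mmap
import os
import tempfile


def map_with_random_advice(path):
    """Memory-map `path` read-only and, where supported, hint random access.

    Returns (mapping, advised). `advised` is False on platforms where
    madvise or MADV_RANDOM is not available. The file must be non-empty.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after
        # the file object is closed.
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    advised = False
    if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        # Tell the kernel not to bother with sequential read-ahead.
        mapped.madvise(mmap.MADV_RANDOM)
        advised = True
    return mapped, advised


if __name__ == "__main__":
    # Hypothetical stand-in for an index file: 1 MiB of zero bytes.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * (1 << 20))
        os.close(fd)
        mapped, advised = map_with_random_advice(path)
        print("random-access hint applied:", advised)
        mapped.close()
    finally:
        os.remove(path)
```

Whether the hint actually shortens the ingest phase would have to be
measured; the sketch only shows that the call itself is cheap to add.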