I see, thanks. Are you sure 9.3.0 is the Ubuntu version and not the GCC version?
> On 18 Feb 2022, at 10:46, Neubert, Joachim <j.neub...@zbw.eu> wrote:
>
> I used cat /proc/version
>
>> -----Original Message-----
>> From: Andrii Berezovskyi <andr...@kth.se>
>> Sent: Friday, 18 February 2022 10:35
>> To: users@jena.apache.org
>> Subject: Re: Loading Wikidata
>>
>> May I ask an unrelated question: how do you get the Ubuntu version in such
>> a format? 'cat /etc/os-release' (or lsb_release, hostnamectl, neofetch) only
>> gives me the '20.04.3' format or Focal.
>>
>> On 2022-02-18, 10:17, "Neubert, Joachim" <j.neub...@zbw.eu> wrote:
>>
>> OS is CentOS 7.9 in a Docker container running on Ubuntu 9.3.0.
>>
>> CPU is 4 x Intel(R) Xeon(R) Gold 6254 CPU @ 3.10GHz / 18 core (144 cores
>> in total)
>>
>> Cheers, Joachim
>>
>>> -----Original Message-----
>>> From: Marco Neumann <marco.neum...@gmail.com>
>>> Sent: Friday, 18 February 2022 10:00
>>> To: users@jena.apache.org
>>> Subject: Re: Loading Wikidata
>>>
>>> Thank you for the effort, Joachim. What CPU and OS were used for the load
>>> test?
>>>
>>> Best,
>>> Marco
>>>
>>> On Fri, Feb 18, 2022 at 8:51 AM Neubert, Joachim <j.neub...@zbw.eu>
>>> wrote:
>>>
>>>> Storage of the machine is one 10TB RAID6 SSD.
>>>>
>>>> Cheers, Joachim
>>>>
>>>>> -----Original Message-----
>>>>> From: Andy Seaborne <a...@apache.org>
>>>>> Sent: Wednesday, 16 February 2022 20:05
>>>>> To: users@jena.apache.org
>>>>> Subject: Re: Loading Wikidata
>>>>>
>>>>>
>>>>>
>>>>> On 16/02/2022 11:56, Neubert, Joachim wrote:
>>>>>> I've loaded the Wikidata "truthy" dataset with 6b triples. The summary
>>>>>> stats are:
>>>>>>
>>>>>> 10:09:29 INFO Load node table = 35555 seconds
>>>>>> 10:09:29 INFO Load ingest data = 25165 seconds
>>>>>> 10:09:29 INFO Build index SPO = 11241 seconds
>>>>>> 10:09:29 INFO Build index POS = 14100 seconds
>>>>>> 10:09:29 INFO Build index OSP = 12435 seconds
>>>>>> 10:09:29 INFO Overall 98496 seconds
>>>>>> 10:09:29 INFO Overall 27h 21m 36s
>>>>>> 10:09:29 INFO Triples loaded = 6756025616
>>>>>> 10:09:29 INFO Quads loaded = 0
>>>>>> 10:09:29 INFO Overall Rate 68591 tuples per second
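The figures in the log above can be cross-checked with a few lines of Python. This is only a sketch: the phase names and numbers are copied from the log, while the `summarize` helper and the per-phase percentage breakdown are illustrative additions, not something xloader itself prints. Integer division is used for the rate because that matches the truncated value shown in the log.

```python
# Phase durations in seconds, copied from the xloader summary log.
phases = {
    "Load node table": 35555,
    "Load ingest data": 25165,
    "Build index SPO": 11241,
    "Build index POS": 14100,
    "Build index OSP": 12435,
}
triples = 6756025616  # "Triples loaded" from the log


def summarize(phases, triples):
    """Return total seconds, (h, m, s), tuples/second and per-phase share in %."""
    total = sum(phases.values())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    rate = triples // total  # truncated, as in the log line "Overall Rate"
    shares = {name: round(100 * secs / total, 1) for name, secs in phases.items()}
    return total, (hours, minutes, seconds), rate, shares


if __name__ == "__main__":
    total, hms, rate, shares = summarize(phases, triples)
    print(f"Overall {total} seconds = {hms[0]}h {hms[1]}m {hms[2]}s, {rate} tuples/s")
    for name, pct in shares.items():
        print(f"  {name}: {pct}%")
```

The phase times add up to the reported overall time, and the node table load is the largest single phase at a bit over a third of the run.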
>>>>>>
>>>>>> This was done on a large machine with 2TB RAM and -threads=48, but
>>>>>> anyway: it looks like tdb2.xloader in apache-jena-4.5.0-SNAPSHOT
>>>>>> brought HUGE improvements over prior versions (unfortunately I
>>>>>> cannot find a log, but it took multiple days with 3.x on the same
>>>>>> machine).
>>>>>
>>>>> This is very helpful - faster than Lorenz reported on a 128G / 12
>>>>> threads machine (31h). It does suggest there is effectively a soft
>>>>> upper bound on going faster by adding more RAM and more threads.
>>>>>
>>>>> That seems likely - disk bandwidth also matters, and because xloader
>>>>> is phased between sort and index writing steps, it is unlikely to be
>>>>> getting the best overlap of CPU crunching and I/O.
>>>>>
>>>>> This all gets into RAID0, or allocating files across different disks.
>>>>>
>>>>> There comes a point where it gets quite a task to set up the machine.
>>>>>
>>>>> One other area I think might be easy to improve - more for smaller
>>>>> machines - is during data ingest. There, the node table index is being
>>>>> randomly read. On smaller RAM machines, the ingest phase is
>>>>> proportionately longer, sometimes a lot.
>>>>>
>>>>> An idea I had is calling the madvise system call on the mmap
>>>>> segments to tell the kernel the access is random (requires native
>>>>> code; Java17 makes it possible to directly call madvise(2) without
>>>>> needing a C (etc) converter layer).
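To illustrate the madvise idea outside of Jena, here is a minimal Python sketch. It is not how TDB2 or xloader maps its files, and it does not use the Java 17 route Andy mentions. It only shows the system call being applied to a memory-mapped file through Python's standard `mmap` module (Python 3.8+ on platforms that provide madvise). The helper names `map_with_random_advice` and `demo` and the temporary file are made up for the example.

```python
import mmap
import os
import tempfile


def map_with_random_advice(path):
    """Memory-map `path` read-only and, where supported, advise random access."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after f is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # madvise() and MADV_RANDOM only exist on systems with madvise(2).
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        # Hint to the kernel that pages will be touched in no particular order,
        # so aggressive read-ahead is unlikely to pay off.
        mm.madvise(mmap.MADV_RANDOM)
    return mm


def demo():
    """Create a 1 MiB scratch file, map it with the hint and read two bytes."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(256)) * 4096)
        path = tmp.name
    try:
        mm = map_with_random_advice(path)
        try:
            return mm[0], mm[255], len(mm)
        finally:
            mm.close()
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

The hint is advisory only: reads return the same data with or without it, and whether it helps the ingest phase on a small-RAM machine would have to be measured.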
>>>>>
>>>>>> If you think it useful, I am happy to share more details.
>>>>>
>>>>> What was the storage?
>>>>>
>>>>> Andy
>>>>>>
>>>>>> Two observations:
>>>>>>
>>>>>>
>>>>>> - As Andy (thanks again for all your help!) already mentioned, gzip
>>>>>> files apparently load significantly faster than bzip2 files. I
>>>>>> experienced 200,000 vs. 100,000 triples/second in the parse nodes
>>>>>> step (though colleagues had jobs on the machine too, which might
>>>>>> have influenced the results).
>>>>>>
>>>>>> - During the extended SPO/POS/OSP sort periods, I saw only one or
>>>>>> two gzip instances (used in the background), which perhaps were a
>>>>>> bottleneck. I wonder if using pigz could extend parallel processing.
>>>>>>
>>>>>> If you think it useful, I am happy to share more details. If I can
>>>>>> help with running some particular tests on a massively parallel
>>>>>> machine, please let me know.
>>>>>>
>>>>>> Cheers, Joachim
>>>>>>
>>>>>> --
>>>>>> Joachim Neubert
>>>>>>
>>>>>> ZBW - Leibniz Information Centre for Economics
>>>>>> Neuer Jungfernstieg 21
>>>>>> 20354 Hamburg
>>>>>> Phone +49-40-42834-462
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> ---
>>> Marco Neumann
>>> KONA
>