On Mon, May 22 2023 at 16:18:21 +0100, Andy Seaborne <[email protected]> wrote:
Hello Andy,
Hi Steven,
How are you running xloader? Default settings?
Yes, we use the default settings.
The command line used is the following:
```
tdb2.xloader --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ \
    --tmpdir /nfs/uniprot_tmp/ --threads 30 \
    /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
```
What's the storage being used?
We use SSD-backed block storage from a cloud provider, mounted as an NFS
volume.
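Since the Data phase is heavily I/O-bound, it may be worth ruling out the NFS mount itself as the bottleneck. A rough sequential-throughput check could look like the sketch below (the test file paths are just examples for this setup):

```
# Write throughput to the NFS tmpdir, bypassing the page cache.
# If oflag=direct is not supported on the mount, use conv=fsync instead.
dd if=/dev/zero of=/nfs/uniprot_tmp/dd_test bs=1M count=4096 oflag=direct
# Compare against a local disk, if one is available.
dd if=/dev/zero of=/tmp/dd_test bs=1M count=4096 oflag=direct
rm -f /nfs/uniprot_tmp/dd_test /tmp/dd_test
```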
On 22/05/2023 10:49, Steven Blanchard wrote:
Hello,
I am currently trying to load a very large dataset (54 billion
triples) with the tdb2.xloader command.
The first two steps (Nodes and Terms) completed with an average
load speed of ~120,000 tuples/s.
The third stage (Data) has an average load speed of only 800 tuples/s.
is thet "Avg" is 800 from teh start of the phase or "the average
drops to 800" during the phase?
The Avg is 800 from the start of the phase, and it stays at 800.
This average load speed is incompatible with the amount of data to
be loaded: at 800 tuples/s, 54 billion triples would take over two years.
Looking at the status of the job, it is possible that excessive
demand on memory is slowing the process down severely.
We saw with top that the java process is using a lot of memory:
```
top
#    PID  USER      PR  NI    VIRT    RES    SHR  S  %CPU  %MEM    TIME+  COMMAND
# 867362  sblanch+  20   0  289,0g  90,2g  88,4g  S   3,3  72,1  1102:32  java
```
xloader does not need much Java heap memory.
OK, since our last email we have tried increasing -Xmx, but we saw no
improvement in performance.
That space may be memory-mapped files.
But with free -g, we see that it actually uses very little memory.
```
free -g
#        total  used  free  shared  buff/cache  available
# Mem:     125     3     0       0         121        120
```
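If that space is memory-mapped files, as Andy suggests, it would show up under buff/cache rather than used. One way to check (a sketch; 867362 is the java PID from the top output above):

```
# System-wide page cache and mapped-file pages
grep -E '^(Cached|Mapped)' /proc/meminfo
# Address-space usage of the java process (the last line is the total)
pmap -x 867362 | tail -3
```

If Mapped/Cached account for most of the 121 GB of buff/cache, that memory is the OS page cache over the database and tmp files, not Java heap demand.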
Is there any way to speed up this step? (Give a -Xms to
Java?)
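For reference, if the install uses the stock Jena wrapper scripts (which read JVM options from the JVM_ARGS environment variable), a fixed heap could be tried like this (a sketch only; the 8g values are arbitrary examples, not a recommendation):

```
# Assumes the stock Jena scripts, which pass JVM_ARGS to the java processes
JVM_ARGS="-Xms8g -Xmx8g" tdb2.xloader \
    --loc /nfs/uniprot_tmp/tdb2/UniProt_04_2022/ \
    --tmpdir /nfs/uniprot_tmp/ --threads 30 \
    /nfs/uniprot_tmp/Download/2022_04/uniprotkb_*.rdf
```

Given that increasing -Xmx made no difference above, though, the heap is probably not the limiting factor here.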
Can this significant drop in loading speed at this step be due to
memory usage? Do you know of any other likely bottlenecks in this
loading stage?
For previous loads of smaller datasets, this Data step was not the
limiting factor, and its average speed was even slightly higher than
for the Nodes and Terms steps.
How small is "smaller"?
For example, we loaded the UniRef RDF database (same provider as
UniProt), 12 billion triples, with an average for the Data task of
230,000 tuples/s.
That sounds like what I see when loading.
For information, the machine used has 32 CPUs and 128 GB of RAM.
Thanks for your help,
Regards,
Steven