Re: Understanding the output of Jena TDB Loader

Andy Seaborne Sat, 13 Feb 2021 07:10:51 -0800



On 13/02/2021 13:21, Daniel Hernandez wrote:


Hi,

Thanks Lorenz for your answer.  Regarding a possible spill to disk, my
machine has 256 GB of RAM, and the Java process is taking only 20 G.  I
am not sure if changing the -Xmx Java argument would speed up the
loading process.  I see that tdbloader2 started the process using with
the argument -Xmx1200M.  Maybe is too conservative.


Hi there,

How much data are you loading?

Heap is only used for the node table cache and not index work which isout of heap in memory mapped filesmapped by the virtual memory of the OSprocess so caching is done by the OS filesystem cache machinery. It canmake the OS process look very large even if the heap is only 1.2G.

tdbloader2 may not be the right choice. It is a bit niche but if youhave much less RAM than total data it can be better than tdbloader andit is better if there is rotating disk, not SSD. It has been reported tobe the right choice for several billion for SSD.


--

tdb2.tdbloader is the same or faster that TDB1 loaders.

If you give it --loader=parallel, it'll max out on the I/O subsystem.

And don't use the machine for ohter I/O at the time.

---

Loading slows down because:

* Indexes get deeper
* I/O is jumping around and RAM caching is less effective.

---

Beware of ulimits on the Linux process - you may have a large machinebut can one process make use of it?


    Andy


Best,
Daniel

Lorenz Buehmann writes:

On 13.02.21 12:00, Daniel Hernandez wrote:

Hi,

I am loading an n-triples file using tdbloader2.  I am curious about
what is the meaning of the numbers in the loader output.  The loading
output started as follows:

  09:54:15 INFO -- TDB Bulk Loader Start
  09:54:15 INFO Data Load Phase
  09:54:15 INFO Got 1 data files to load
  09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz
09:54:59 INFO  loader          :: Load: /home/ubuntu/dataset.nq.gz -- 
2021/02/08 09:54:59 UTC
09:55:01 INFO  loader          :: Add: 50,000 Data (Batch: 19,912 / Avg: 19,912)
09:55:03 INFO  loader          :: Add: 100,000 Data (Batch: 23,288 / Avg: 
21,468)
09:55:05 INFO  loader          :: Add: 150,000 Data (Batch: 26,123 / Avg: 
22,824)
09:55:07 INFO  loader          :: Add: 200,000 Data (Batch: 24,987 / Avg: 
23,329)
09:55:09 INFO  loader          :: Add: 250,000 Data (Batch: 25,641 / Avg: 
23,757)
09:55:11 INFO  loader          :: Add: 300,000 Data (Batch: 25,100 / Avg: 
23,971)
09:55:13 INFO  loader          :: Add: 350,000 Data (Batch: 24,213 / Avg: 
24,005)
09:55:15 INFO  loader          :: Add: 400,000 Data (Batch: 24,461 / Avg: 
24,061)
09:55:17 INFO  loader          :: Add: 450,000 Data (Batch: 25,667 / Avg: 
24,230)
09:55:19 INFO  loader          :: Add: 500,000 Data (Batch: 25,879 / Avg: 
24,385)
09:55:19 INFO  loader          ::   Elapsed: 20.50 seconds [2021/02/08 09:55:19 
UTC]
09:55:21 INFO  loader          :: Add: 550,000 Data (Batch: 25,484 / Avg: 
24,481)
09:55:23 INFO  loader          :: Add: 600,000 Data (Batch: 23,419 / Avg: 
24,389)
09:55:25 INFO  loader          :: Add: 650,000 Data (Batch: 25,012 / Avg: 
24,436)
09:55:27 INFO  loader          :: Add: 700,000 Data (Batch: 25,201 / Avg: 
24,489)
09:55:29 INFO  loader          :: Add: 750,000 Data (Batch: 26,288 / Avg: 
24,601)
09:55:31 INFO  loader          :: Add: 800,000 Data (Batch: 25,960 / Avg: 
24,682)
09:55:33 INFO  loader          :: Add: 850,000 Data (Batch: 24,330 / Avg: 
24,661)
09:55:35 INFO  loader          :: Add: 900,000 Data (Batch: 25,813 / Avg: 
24,722)
09:55:37 INFO  loader          :: Add: 950,000 Data (Batch: 26,164 / Avg: 
24,794)
09:55:39 INFO  loader          :: Add: 1,000,000 Data (Batch: 26,357 / Avg: 
24,868)
09:55:39 INFO  loader          ::   Elapsed: 40.21 seconds [2021/02/08 09:55:39 
UTC]

My first questions are:

1) I guess that 600,000 is the number of data loaded at 09:55:23.  What
means data?  Does it mean bytes or triples?

#triples


2) What are the numbers Batch: 23,419 and Avg: 24,389?  I guess that are
associated to the loading speed.


Batch means in the last 50K triples the number of triples per second

Avg means among all triples parsed so far the number of triples per second


After some days of loading the output shows different numbers:

10:21:45 INFO  loader          ::   Elapsed: 433,606.84 seconds [2021/02/13 
10:21:45 UTC]
10:21:48 INFO  loader          :: Add: 505,550,000 Data (Batch: 18,348 / Avg: 
1,165)
10:21:51 INFO  loader          :: Add: 505,600,000 Data (Batch: 18,656 / Avg: 
1,166)
10:22:55 INFO  loader          :: Add: 505,650,000 Data (Batch: 781 / Avg: 
1,165)
10:36:12 INFO  loader          :: Add: 505,700,000 Data (Batch: 62 / Avg: 1,163)
10:36:14 INFO  loader          :: Add: 505,750,000 Data (Batch: 17,543 / Avg: 
1,164)
10:36:17 INFO  loader          :: Add: 505,800,000 Data (Batch: 17,385 / Avg: 
1,164)
10:36:20 INFO  loader          :: Add: 505,850,000 Data (Batch: 17,998 / Avg: 
1,164)
10:36:23 INFO  loader          :: Add: 505,900,000 Data (Batch: 17,170 / Avg: 
1,164)
10:37:12 INFO  loader          :: Add: 505,950,000 Data (Batch: 1,025 / Avg: 
1,164)
10:37:14 INFO  loader          :: Add: 506,000,000 Data (Batch: 18,301 / Avg: 
1,164)
10:37:14 INFO  loader          ::   Elapsed: 434,535.94 seconds [2021/02/13 
10:37:14 UTC]

Now the numbers Batch and Avg are smaller.  Also, it is taking longer to
load each 500,000 data.  At the beginning it takes 20 seconds to load
500,000 data.  Now it is taking 929 seconds.  Why the load speed is
degraded?  In my experience loading big datasets in Jena I always see
that loading start getting slower as much data have been already loaded.

Andy Seaborne will explain better, but in general IO and memory pressure
increase, spill to disk, etc. - those heavy drops to 781 and 62 can also
stem from other workloads on your system


Best regards,
Daniel

Re: Understanding the output of Jena TDB Loader

Reply via email to