On 13.02.21 12:00, Daniel Hernandez wrote:
Hi,
I am loading an n-triples file using tdbloader2. I am curious about
what is the meaning of the numbers in the loader output. The loading
output started as follows:
09:54:15 INFO -- TDB Bulk Loader Start
09:54:15 INFO Data Load Phase
09:54:15 INFO Got 1 data files to load
09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz
09:54:59 INFO loader :: Load: /home/ubuntu/dataset.nq.gz --
2021/02/08 09:54:59 UTC
09:55:01 INFO loader :: Add: 50,000 Data (Batch: 19,912 / Avg: 19,912)
09:55:03 INFO loader :: Add: 100,000 Data (Batch: 23,288 / Avg:
21,468)
09:55:05 INFO loader :: Add: 150,000 Data (Batch: 26,123 / Avg:
22,824)
09:55:07 INFO loader :: Add: 200,000 Data (Batch: 24,987 / Avg:
23,329)
09:55:09 INFO loader :: Add: 250,000 Data (Batch: 25,641 / Avg:
23,757)
09:55:11 INFO loader :: Add: 300,000 Data (Batch: 25,100 / Avg:
23,971)
09:55:13 INFO loader :: Add: 350,000 Data (Batch: 24,213 / Avg:
24,005)
09:55:15 INFO loader :: Add: 400,000 Data (Batch: 24,461 / Avg:
24,061)
09:55:17 INFO loader :: Add: 450,000 Data (Batch: 25,667 / Avg:
24,230)
09:55:19 INFO loader :: Add: 500,000 Data (Batch: 25,879 / Avg:
24,385)
09:55:19 INFO loader :: Elapsed: 20.50 seconds [2021/02/08 09:55:19
UTC]
09:55:21 INFO loader :: Add: 550,000 Data (Batch: 25,484 / Avg:
24,481)
09:55:23 INFO loader :: Add: 600,000 Data (Batch: 23,419 / Avg:
24,389)
09:55:25 INFO loader :: Add: 650,000 Data (Batch: 25,012 / Avg:
24,436)
09:55:27 INFO loader :: Add: 700,000 Data (Batch: 25,201 / Avg:
24,489)
09:55:29 INFO loader :: Add: 750,000 Data (Batch: 26,288 / Avg:
24,601)
09:55:31 INFO loader :: Add: 800,000 Data (Batch: 25,960 / Avg:
24,682)
09:55:33 INFO loader :: Add: 850,000 Data (Batch: 24,330 / Avg:
24,661)
09:55:35 INFO loader :: Add: 900,000 Data (Batch: 25,813 / Avg:
24,722)
09:55:37 INFO loader :: Add: 950,000 Data (Batch: 26,164 / Avg:
24,794)
09:55:39 INFO loader :: Add: 1,000,000 Data (Batch: 26,357 / Avg:
24,868)
09:55:39 INFO loader :: Elapsed: 40.21 seconds [2021/02/08 09:55:39
UTC]
My first questions are:
1) I guess that 600,000 is the number of data loaded at 09:55:23. What
means data? Does it mean bytes or triples?
#triples
2) What are the numbers Batch: 23,419 and Avg: 24,389? I guess that are
associated to the loading speed.
Batch means in the last 50K triples the number of triples per second
Avg means among all triples parsed so far the number of triples per second
After some days of loading the output shows different numbers:
10:21:45 INFO loader :: Elapsed: 433,606.84 seconds [2021/02/13
10:21:45 UTC]
10:21:48 INFO loader :: Add: 505,550,000 Data (Batch: 18,348 / Avg:
1,165)
10:21:51 INFO loader :: Add: 505,600,000 Data (Batch: 18,656 / Avg:
1,166)
10:22:55 INFO loader :: Add: 505,650,000 Data (Batch: 781 / Avg:
1,165)
10:36:12 INFO loader :: Add: 505,700,000 Data (Batch: 62 / Avg: 1,163)
10:36:14 INFO loader :: Add: 505,750,000 Data (Batch: 17,543 / Avg:
1,164)
10:36:17 INFO loader :: Add: 505,800,000 Data (Batch: 17,385 / Avg:
1,164)
10:36:20 INFO loader :: Add: 505,850,000 Data (Batch: 17,998 / Avg:
1,164)
10:36:23 INFO loader :: Add: 505,900,000 Data (Batch: 17,170 / Avg:
1,164)
10:37:12 INFO loader :: Add: 505,950,000 Data (Batch: 1,025 / Avg:
1,164)
10:37:14 INFO loader :: Add: 506,000,000 Data (Batch: 18,301 / Avg:
1,164)
10:37:14 INFO loader :: Elapsed: 434,535.94 seconds [2021/02/13
10:37:14 UTC]
Now the numbers Batch and Avg are smaller. Also, it is taking longer to
load each 500,000 data. At the beginning it takes 20 seconds to load
500,000 data. Now it is taking 929 seconds. Why the load speed is
degraded? In my experience loading big datasets in Jena I always see
that loading start getting slower as much data have been already loaded.
Andy Seaborne will explain better, but in general IO and memory pressure
increase, spill to disk, etc. - those heavy drops to 781 and 62 can also
stem from other workloads on your system
Best regards,
Daniel