On 13.02.21 12:00, Daniel Hernandez wrote: > Hi, > > I am loading an n-triples file using tdbloader2. I am curious about > what is the meaning of the numbers in the loader output. The loading > output started as follows: > > 09:54:15 INFO -- TDB Bulk Loader Start > 09:54:15 INFO Data Load Phase > 09:54:15 INFO Got 1 data files to load > 09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz > 09:54:59 INFO loader :: Load: /home/ubuntu/dataset.nq.gz -- > 2021/02/08 09:54:59 UTC > 09:55:01 INFO loader :: Add: 50,000 Data (Batch: 19,912 / Avg: > 19,912) > 09:55:03 INFO loader :: Add: 100,000 Data (Batch: 23,288 / Avg: > 21,468) > 09:55:05 INFO loader :: Add: 150,000 Data (Batch: 26,123 / Avg: > 22,824) > 09:55:07 INFO loader :: Add: 200,000 Data (Batch: 24,987 / Avg: > 23,329) > 09:55:09 INFO loader :: Add: 250,000 Data (Batch: 25,641 / Avg: > 23,757) > 09:55:11 INFO loader :: Add: 300,000 Data (Batch: 25,100 / Avg: > 23,971) > 09:55:13 INFO loader :: Add: 350,000 Data (Batch: 24,213 / Avg: > 24,005) > 09:55:15 INFO loader :: Add: 400,000 Data (Batch: 24,461 / Avg: > 24,061) > 09:55:17 INFO loader :: Add: 450,000 Data (Batch: 25,667 / Avg: > 24,230) > 09:55:19 INFO loader :: Add: 500,000 Data (Batch: 25,879 / Avg: > 24,385) > 09:55:19 INFO loader :: Elapsed: 20.50 seconds [2021/02/08 > 09:55:19 UTC] > 09:55:21 INFO loader :: Add: 550,000 Data (Batch: 25,484 / Avg: > 24,481) > 09:55:23 INFO loader :: Add: 600,000 Data (Batch: 23,419 / Avg: > 24,389) > 09:55:25 INFO loader :: Add: 650,000 Data (Batch: 25,012 / Avg: > 24,436) > 09:55:27 INFO loader :: Add: 700,000 Data (Batch: 25,201 / Avg: > 24,489) > 09:55:29 INFO loader :: Add: 750,000 Data (Batch: 26,288 / Avg: > 24,601) > 09:55:31 INFO loader :: Add: 800,000 Data (Batch: 25,960 / Avg: > 24,682) > 09:55:33 INFO loader :: Add: 850,000 Data (Batch: 24,330 / Avg: > 24,661) > 09:55:35 INFO loader :: Add: 900,000 Data (Batch: 25,813 / Avg: > 24,722) > 09:55:37 INFO loader :: Add: 950,000 Data (Batch: 26,164 / Avg: > 24,794) > 09:55:39 INFO loader :: Add: 1,000,000 Data (Batch: 26,357 / Avg: > 24,868) > 09:55:39 INFO loader :: Elapsed: 40.21 seconds [2021/02/08 > 09:55:39 UTC] > > My first questions are: > > 1) I guess that 600,000 is the number of data loaded at 09:55:23. What > means data? Does it mean bytes or triples? #triples > > 2) What are the numbers Batch: 23,419 and Avg: 24,389? I guess that are > associated to the loading speed.
Batch means in the last 50K triples the number of triples per second Avg means among all triples parsed so far the number of triples per second > > After some days of loading the output shows different numbers: > > 10:21:45 INFO loader :: Elapsed: 433,606.84 seconds [2021/02/13 > 10:21:45 UTC] > 10:21:48 INFO loader :: Add: 505,550,000 Data (Batch: 18,348 / Avg: > 1,165) > 10:21:51 INFO loader :: Add: 505,600,000 Data (Batch: 18,656 / Avg: > 1,166) > 10:22:55 INFO loader :: Add: 505,650,000 Data (Batch: 781 / Avg: > 1,165) > 10:36:12 INFO loader :: Add: 505,700,000 Data (Batch: 62 / Avg: > 1,163) > 10:36:14 INFO loader :: Add: 505,750,000 Data (Batch: 17,543 / Avg: > 1,164) > 10:36:17 INFO loader :: Add: 505,800,000 Data (Batch: 17,385 / Avg: > 1,164) > 10:36:20 INFO loader :: Add: 505,850,000 Data (Batch: 17,998 / Avg: > 1,164) > 10:36:23 INFO loader :: Add: 505,900,000 Data (Batch: 17,170 / Avg: > 1,164) > 10:37:12 INFO loader :: Add: 505,950,000 Data (Batch: 1,025 / Avg: > 1,164) > 10:37:14 INFO loader :: Add: 506,000,000 Data (Batch: 18,301 / Avg: > 1,164) > 10:37:14 INFO loader :: Elapsed: 434,535.94 seconds [2021/02/13 > 10:37:14 UTC] > > Now the numbers Batch and Avg are smaller. Also, it is taking longer to > load each 500,000 data. At the beginning it takes 20 seconds to load > 500,000 data. Now it is taking 929 seconds. Why the load speed is > degraded? In my experience loading big datasets in Jena I always see > that loading start getting slower as much data have been already loaded. Andy Seaborne will explain better, but in general IO and memory pressure increase, spill to disk, etc. - those heavy drops to 781 and 62 can also stem from other workloads on your system > > Best regards, > Daniel
