Hi,
Thanks Lorenz for your answer. Regarding a possible spill to disk, my machine has 256 GB of RAM, and the Java process is taking only 20 G. I am not sure if changing the -Xmx Java argument would speed up the loading process. I see that tdbloader2 started the process using with the argument -Xmx1200M. Maybe is too conservative. Best, Daniel Lorenz Buehmann writes: > On 13.02.21 12:00, Daniel Hernandez wrote: >> Hi, >> >> I am loading an n-triples file using tdbloader2. I am curious about >> what is the meaning of the numbers in the loader output. The loading >> output started as follows: >> >> 09:54:15 INFO -- TDB Bulk Loader Start >> 09:54:15 INFO Data Load Phase >> 09:54:15 INFO Got 1 data files to load >> 09:54:15 INFO Data file 1: /home/ubuntu/dataset.nq.gz >> 09:54:59 INFO loader :: Load: /home/ubuntu/dataset.nq.gz -- >> 2021/02/08 09:54:59 UTC >> 09:55:01 INFO loader :: Add: 50,000 Data (Batch: 19,912 / Avg: >> 19,912) >> 09:55:03 INFO loader :: Add: 100,000 Data (Batch: 23,288 / Avg: >> 21,468) >> 09:55:05 INFO loader :: Add: 150,000 Data (Batch: 26,123 / Avg: >> 22,824) >> 09:55:07 INFO loader :: Add: 200,000 Data (Batch: 24,987 / Avg: >> 23,329) >> 09:55:09 INFO loader :: Add: 250,000 Data (Batch: 25,641 / Avg: >> 23,757) >> 09:55:11 INFO loader :: Add: 300,000 Data (Batch: 25,100 / Avg: >> 23,971) >> 09:55:13 INFO loader :: Add: 350,000 Data (Batch: 24,213 / Avg: >> 24,005) >> 09:55:15 INFO loader :: Add: 400,000 Data (Batch: 24,461 / Avg: >> 24,061) >> 09:55:17 INFO loader :: Add: 450,000 Data (Batch: 25,667 / Avg: >> 24,230) >> 09:55:19 INFO loader :: Add: 500,000 Data (Batch: 25,879 / Avg: >> 24,385) >> 09:55:19 INFO loader :: Elapsed: 20.50 seconds [2021/02/08 >> 09:55:19 UTC] >> 09:55:21 INFO loader :: Add: 550,000 Data (Batch: 25,484 / Avg: >> 24,481) >> 09:55:23 INFO loader :: Add: 600,000 Data (Batch: 23,419 / Avg: >> 24,389) >> 09:55:25 INFO loader :: Add: 650,000 Data (Batch: 25,012 / Avg: >> 24,436) >> 09:55:27 INFO loader :: Add: 700,000 Data (Batch: 25,201 / Avg: >> 24,489) >> 09:55:29 INFO loader :: Add: 750,000 Data (Batch: 26,288 / Avg: >> 24,601) >> 09:55:31 INFO loader :: Add: 800,000 Data (Batch: 25,960 / Avg: >> 24,682) >> 09:55:33 INFO loader :: Add: 850,000 Data (Batch: 24,330 / Avg: >> 24,661) >> 09:55:35 INFO loader :: Add: 900,000 Data (Batch: 25,813 / Avg: >> 24,722) >> 09:55:37 INFO loader :: Add: 950,000 Data (Batch: 26,164 / Avg: >> 24,794) >> 09:55:39 INFO loader :: Add: 1,000,000 Data (Batch: 26,357 / Avg: >> 24,868) >> 09:55:39 INFO loader :: Elapsed: 40.21 seconds [2021/02/08 >> 09:55:39 UTC] >> >> My first questions are: >> >> 1) I guess that 600,000 is the number of data loaded at 09:55:23. What >> means data? Does it mean bytes or triples? > #triples >> >> 2) What are the numbers Batch: 23,419 and Avg: 24,389? I guess that are >> associated to the loading speed. > > Batch means in the last 50K triples the number of triples per second > > Avg means among all triples parsed so far the number of triples per second > >> >> After some days of loading the output shows different numbers: >> >> 10:21:45 INFO loader :: Elapsed: 433,606.84 seconds [2021/02/13 >> 10:21:45 UTC] >> 10:21:48 INFO loader :: Add: 505,550,000 Data (Batch: 18,348 / >> Avg: 1,165) >> 10:21:51 INFO loader :: Add: 505,600,000 Data (Batch: 18,656 / >> Avg: 1,166) >> 10:22:55 INFO loader :: Add: 505,650,000 Data (Batch: 781 / Avg: >> 1,165) >> 10:36:12 INFO loader :: Add: 505,700,000 Data (Batch: 62 / Avg: >> 1,163) >> 10:36:14 INFO loader :: Add: 505,750,000 Data (Batch: 17,543 / >> Avg: 1,164) >> 10:36:17 INFO loader :: Add: 505,800,000 Data (Batch: 17,385 / >> Avg: 1,164) >> 10:36:20 INFO loader :: Add: 505,850,000 Data (Batch: 17,998 / >> Avg: 1,164) >> 10:36:23 INFO loader :: Add: 505,900,000 Data (Batch: 17,170 / >> Avg: 1,164) >> 10:37:12 INFO loader :: Add: 505,950,000 Data (Batch: 1,025 / Avg: >> 1,164) >> 10:37:14 INFO loader :: Add: 506,000,000 Data (Batch: 18,301 / >> Avg: 1,164) >> 10:37:14 INFO loader :: Elapsed: 434,535.94 seconds [2021/02/13 >> 10:37:14 UTC] >> >> Now the numbers Batch and Avg are smaller. Also, it is taking longer to >> load each 500,000 data. At the beginning it takes 20 seconds to load >> 500,000 data. Now it is taking 929 seconds. Why the load speed is >> degraded? In my experience loading big datasets in Jena I always see >> that loading start getting slower as much data have been already loaded. > Andy Seaborne will explain better, but in general IO and memory pressure > increase, spill to disk, etc. - those heavy drops to 781 and 62 can also > stem from other workloads on your system >> >> Best regards, >> Daniel
