Hi, I want to share a couple of experiments that consisted of loading the Freebase RDF data dump into TDB using tdbloader2 and tdbloader3.

tdbloader2
==========

This is how I ran tdbloader2 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

export JVM_ARGS="-Xmx4096m -server"
export TMPDIR=/mnt/data/tmp
tdbloader2 --loc /mnt/data/freebase /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples: ~12 hours (43,000 seconds, i.e. ~14,400 triples/s overall speed)

This is the log:

Mar 7 13:11:37 ip-10-54-167-166 build: 13:11:37 -- TDB Bulk Loader Start
Mar 7 13:11:37 ip-10-54-167-166 build: 13:11:37 Data phase
Mar 7 13:11:39 ip-10-54-167-166 build: Load: /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/07 13:11:38 UTC
Mar 7 13:11:42 ip-10-54-167-166 build: Add: 50,000 Data (Batch: 16,550 / Avg: 16,550)
Mar 7 13:11:43 ip-10-54-167-166 build: Add: 100,000 Data (Batch: 39,184 / Avg: 23,272)
[...]
Mar 7 19:13:51 ip-10-54-167-166 build: Add: 618,450,000 Data (Batch: 53,078 / Avg: 28,457)
Mar 7 19:17:01 ip-10-54-167-166 CRON[7725]: pam_unix(cron:session): session opened for user root by (uid=0)
Mar 7 19:17:01 ip-10-54-167-166 CRON[7727]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 7 19:17:01 ip-10-54-167-166 CRON[7725]: pam_unix(cron:session): session closed for user root
Mar 7 19:24:44 ip-10-54-167-166 build: Total: 618,465,279 tuples : 22,385.15 seconds : 27,628.37 tuples/sec [2012/03/07 19:24:44 UTC]
Mar 7 19:24:45 ip-10-54-167-166 build: 19:24:44 Index phase
Mar 7 19:24:45 ip-10-54-167-166 build: 19:24:45 Index SPO
Mar 7 21:03:18 ip-10-54-167-166 build: 21:03:18 Build SPO
Mar 7 21:14:24 ip-10-54-167-166 build: 21:14:24 Index POS
Mar 7 23:38:28 ip-10-54-167-166 build: 23:38:28 Build POS
Mar 7 23:49:03 ip-10-54-167-166 build: 23:49:03 Index OSP
Mar 8 00:56:13 ip-10-54-167-166 build: 00:56:13 Build OSP
Mar 8 01:08:17 ip-10-54-167-166 build: 01:08:17 Index phase end
Mar 8 01:08:59 ip-10-54-167-166 build: 01:08:59 -- TDB Bulk Loader Finish
Mar 8 01:08:59 ip-10-54-167-166 build: 01:08:59 -- 43000 seconds

tdbloader3
==========

This is how I ran tdbloader3 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

java -Djava.io.tmpdir=/mnt/data/tmp -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx12288M cmd.tdbloader3 --no-stats --compression --spill-size 10000000 --loc /mnt/data/freebase /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples:

Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec

This is the log:

Mar 6 11:43:59 ip-10-53-130-32 build: INFO Load: /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/06 11:43:59 UTC
Mar 6 11:44:00 ip-10-53-130-32 build: INFO Add: 50,000 tuples (Batch: 35,335 / Avg: 35,335)
Mar 6 11:44:01 ip-10-53-130-32 build: INFO Add: 100,000 tuples (Batch: 68,212 / Avg: 46,554)
[...]
Mar 6 15:32:38 ip-10-53-130-32 build: INFO Add: 618,450,000 tuples (Batch: 89,766 / Avg: 45,079)
Mar 6 15:32:38 ip-10-53-130-32 build: INFO Node Table (1/3): building nodes.dat and sorting hash|id ...
Mar 6 17:24:46 ip-10-53-130-32 build: INFO Add: 50,000 records for node table (1/3) phase (Batch: 7 / Avg: 7)
Mar 6 17:24:47 ip-10-53-130-32 build: INFO Add: 100,000 records for node table (1/3) phase (Batch: 82,236 / Avg: 14)
[...]
Mar 6 21:23:09 ip-10-53-130-32 build: INFO Add: 1,855,350,000 records for node table (1/3) phase (Batch: 216,450 / Avg: 88,220)
Mar 6 21:23:09 ip-10-53-130-32 build: INFO Total: 1,855,395,837 tuples : 21,031.01 seconds : 88,221.91 tuples/sec [2012/03/06 21:23:09 UTC]
Mar 6 21:23:40 ip-10-53-130-32 build: INFO Node Table (2/3): generating input data using node ids...
Mar 6 23:00:17 ip-10-53-130-32 build: INFO Add: 50,000 records for node table (2/3) phase (Batch: 8 / Avg: 8)
Mar 6 23:00:17 ip-10-53-130-32 build: INFO Add: 100,000 records for node table (2/3) phase (Batch: 96,899 / Avg: 17)
[...]
Mar 7 01:04:18 ip-10-53-130-32 build: INFO Add: 618,450,000 records for node table (2/3) phase (Batch: 95,969 / Avg: 46,718)
Mar 7 01:04:18 ip-10-53-130-32 build: INFO Total: 618,463,448 tuples : 13,237.97 seconds : 46,718.90 tuples/sec [2012/03/07 01:04:18 UTC]
Mar 7 01:04:23 ip-10-53-130-32 build: INFO Node Table (3/3): building node table B+Tree index (i.e. node2id.dat and node2id.idn files)...
Mar 7 01:04:38 ip-10-53-130-32 build: INFO Add: 50,000 records for node table (3/3) phase (Batch: 3,511 / Avg: 3,511)
Mar 7 01:04:38 ip-10-53-130-32 build: INFO Add: 100,000 records for node table (3/3) phase (Batch: 375,939 / Avg: 6,958)
[...]
Mar 7 01:07:21 ip-10-53-130-32 build: INFO Add: 149,050,000 records for node table (3/3) phase (Batch: 980,392 / Avg: 838,537)
Mar 7 01:07:24 ip-10-53-130-32 build: INFO Total: 149,066,002 tuples : 180.42 seconds : 826,225.75 tuples/sec [2012/03/07 01:07:24 UTC]
Mar 7 01:07:27 ip-10-53-130-32 build: INFO Index: creating SPO index...
Mar 7 01:08:14 ip-10-53-130-32 build: INFO Add: 50,000 records to SPO (Batch: 1,065 / Avg: 1,065)
Mar 7 01:08:15 ip-10-53-130-32 build: INFO Add: 100,000 records to SPO (Batch: 54,764 / Avg: 2,090)
[...]
Mar 7 01:18:47 ip-10-53-130-32 build: INFO Add: 618,450,000 records to SPO (Batch: 1,020,408 / Avg: 908,977)
Mar 7 01:18:50 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples: 682.99 seconds : 905,528.69 tuples/sec [2012/03/07 01:18:50 UTC]
Mar 7 01:18:50 ip-10-53-130-32 build: INFO Index: creating GSPO index...
Mar 7 01:18:50 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.12 seconds : 0.00 tuples/sec [2012/03/07 01:18:50 UTC]
Mar 7 01:18:56 ip-10-53-130-32 build: INFO Index: sorting data for POS index...
Mar 7 01:18:57 ip-10-53-130-32 build: INFO Add: 50,000 records to POS (Batch: 210,084 / Avg: 210,084)
Mar 7 01:18:57 ip-10-53-130-32 build: INFO Add: 100,000 records to POS (Batch: 1,724,137 / Avg: 374,531)
[...]
Mar 7 01:47:03 ip-10-53-130-32 build: INFO Add: 618,450,000 records to POS (Batch: 4,545,454 / Avg: 366,790)
Mar 7 01:47:03 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 1,686.18 seconds : 366,783.97 tuples/sec [2012/03/07 01:47:03 UTC]
Mar 7 01:47:03 ip-10-53-130-32 build: INFO Index: creating POS index...
Mar 7 01:47:41 ip-10-53-130-32 build: INFO Add: 50,000 records to POS (Batch: 1,321 / Avg: 1,321)
Mar 7 01:47:41 ip-10-53-130-32 build: INFO Add: 100,000 records to POS (Batch: 1,086,956 / Avg: 2,639)
[...]
Mar 7 01:57:37 ip-10-53-130-32 build: INFO Add: 618,450,000 records to POS (Batch: 1,162,790 / Avg: 974,417)
Mar 7 01:57:42 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 638.92 seconds : 967,976.50 tuples/sec [2012/03/07 01:57:42 UTC]
Mar 7 01:57:47 ip-10-53-130-32 build: INFO Index: sorting data for OSP index...
Mar 7 01:57:47 ip-10-53-130-32 build: INFO Add: 50,000 records to OSP (Batch: 373,134 / Avg: 373,134)
Mar 7 01:57:47 ip-10-53-130-32 build: INFO Add: 100,000 records to OSP (Batch: 549,450 / Avg: 444,444)
[...]
Mar 7 02:26:23 ip-10-53-130-32 build: INFO Add: 618,450,000 records to OSP (Batch: 4,166,666 / Avg: 360,257)
Mar 7 02:26:23 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 1,716.69 seconds : 360,264.44 tuples/sec [2012/03/07 02:26:23 UTC]
Mar 7 02:26:23 ip-10-53-130-32 build: INFO Index: creating OSP index...
Mar 7 02:27:02 ip-10-53-130-32 build: INFO Add: 50,000 records to OSP (Batch: 1,284 / Avg: 1,284)
Mar 7 02:27:03 ip-10-53-130-32 build: INFO Add: 100,000 records to OSP (Batch: 364,963 / Avg: 2,560)
[...]
Mar 7 02:37:18 ip-10-53-130-32 build: INFO Add: 618,450,000 records to OSP (Batch: 1,020,408 / Avg: 944,877)
Mar 7 02:37:22 ip-10-53-130-32 build: INFO Total: 618,463,449 tuples : 658.94 seconds : 938,578.94 tuples/sec [2012/03/07 02:37:22 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for GPOS index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.03 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating GPOS index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for GOSP index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating GOSP index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for POSG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating POSG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for OSPG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating OSPG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: sorting data for SPOG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Index: creating SPOG index...
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar 7 02:37:27 ip-10-53-130-32 build: INFO Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec [2012/03/07 02:37:27 UTC]

tdbloader3 leverages most of the code and tricks of tdbloader2 (and it would not have been possible without that, kudos to Andy), but it replaces the UNIX sort with a pure Java implementation of an external sort (which can work with binary files). This stuff was contributed by Stephen, so kudos to Stephen. The algorithm is the same as in a MapReduce implementation (a.k.a. tdbloader4), and the aim is to have only good I/O going on (i.e. no disk seeks) while building the indexes.

Paolo
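
P.S. For readers who have not met an external sort before, here is a minimal sketch of the general technique on a binary file of 64-bit values. It is not the tdbloader3 code: the class name ExternalSortSketch, its methods and the spillSize parameter are made up for illustration, and the real loader sorts richer records. The idea is only to show why the I/O stays sequential: the input is read once front to back, each in-memory chunk is sorted and spilled as a run, and the runs are then merged with a heap while every file is again read or written in order.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSortSketch {

    /** Sorts the longs stored in 'input' into 'output', keeping at most 'spillSize' values in memory. */
    public static void sort(Path input, Path output, int spillSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        // Phase 1: sequential read; sort each chunk in memory and spill it as a sorted run.
        try (DataInputStream in = open(input)) {
            long[] buffer = new long[spillSize];
            int n;
            while ((n = fill(in, buffer)) > 0) {
                Arrays.sort(buffer, 0, n);
                Path run = Files.createTempFile("run", ".bin");
                try (DataOutputStream out = create(run)) {
                    for (int i = 0; i < n; i++) out.writeLong(buffer[i]);
                }
                runs.add(run);
            }
        }
        // Phase 2: k-way merge; each run is read sequentially, the output is written sequentially.
        merge(runs, output);
        for (Path run : runs) Files.delete(run);
    }

    /** A read position in one sorted run; 'head' is the next value not yet written out. */
    private static final class Cursor {
        final DataInputStream in;
        long head;
        Cursor(DataInputStream in) { this.in = in; }
        boolean advance() throws IOException {
            try { head = in.readLong(); return true; }
            catch (EOFException endOfRun) { in.close(); return false; }
        }
    }

    static void merge(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparingLong((Cursor c) -> c.head));
        for (Path run : runs) {
            Cursor c = new Cursor(open(run));
            if (c.advance()) heap.add(c);
        }
        try (DataOutputStream out = create(output)) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();          // smallest head among all runs
                out.writeLong(c.head);
                if (c.advance()) heap.add(c);    // re-insert with its next value
            }
        }
    }

    /** Reads up to buffer.length longs and returns how many were read. */
    static int fill(DataInputStream in, long[] buffer) throws IOException {
        int n = 0;
        try {
            while (n < buffer.length) { buffer[n] = in.readLong(); n++; }
        } catch (EOFException endOfInput) {
            // fewer values than the buffer holds: return what we have
        }
        return n;
    }

    static DataInputStream open(Path p) throws IOException {
        return new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
    }

    static DataOutputStream create(Path p) throws IOException {
        return new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(p)));
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("in", ".bin");
        Path out = Files.createTempFile("out", ".bin");
        try (DataOutputStream w = create(in)) {
            for (long v : new long[] {42, 7, 19, 3, 88, 1, 56}) w.writeLong(v);
        }
        sort(in, out, 3); // a tiny spill size forces several runs
        long[] sorted = new long[7];
        try (DataInputStream r = open(out)) { fill(r, sorted); }
        System.out.println(Arrays.toString(sorted));
    }
}
```

A larger spillSize means fewer and longer runs, so the merge has fewer files to interleave, at the cost of more heap. That is the usual trade-off with this kind of sort; whether --spill-size in tdbloader3 maps onto it exactly is an assumption on my side.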
