Re: JENA Loader Benchmarks
Hi Marco,

that reminds me of previous discussions in Nov./Dec. 2017, one regarding general performance titled "tdb2.tdbloader performance" [1, 2] and then, as a follow-up, "Report on loading wikidata" [3]. Maybe you can also have a look at them; some people like Dick and Andy also did some kind of (light-weight) performance benchmark.

[1] https://lists.apache.org/thread.html/a5a2751a4fc4387c3db929b95927a95cbc4d0116664c7f3d32dca576@%3Cusers.jena.apache.org%3E
[2] https://lists.apache.org/thread.html/34b53d7ee75e484cdbcc2ac75e075e6d7321ba1ee4a143c58c95b793@%3Cusers.jena.apache.org%3E
[3] https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412@%3Cusers.jena.apache.org%3E

> absolutely it does, preferably NVMe SSD. tdbloaders are almost a showcase
> themselves for good up-to-date hardware.
>
> if possible I'd like to load the wikidata dataset* at some point to see
> where 57GB fits in terms of tdb. The wikidata team is currently looking at
> new solutions that can go beyond blazegraph. And I get the impression that
> they have not yet actively considered giving jena tdb a try.
>
> * https://dumps.wikimedia.org/wikidatawiki/entities/
>
> On Fri, Jun 14, 2019 at 11:47 PM Martynas Jusevičius wrote:
>
>> What about SSD disks, don't they make a difference?
>>
>> On Sat, Jun 15, 2019 at 12:36 AM Marco Neumann wrote:
>>
>>> that did the trick Andy, very good. it might be a good idea to add this
>>> to the distribution in jena-log4j.properties
>>>
>>> I am getting these numbers for a midsize dedicated server, very nice
>>> numbers indeed Andy. well done!
>>>
>>> 00:24:53 INFO loader :: Loader = LoaderPhased
>>> 00:24:53 INFO loader :: Start: ../../public_html/lotico.ttl.gz
>>> 00:24:55 INFO loader :: Add: 500,000 lotico.ttl.gz (Batch: 237,755 / Avg: 237,755)
>>> 00:24:56 INFO loader :: Add: 1,000,000 lotico.ttl.gz (Batch: 305,250 / Avg: 267,308)
>>> 00:24:58 INFO loader :: Add: 1,500,000 lotico.ttl.gz (Batch: 313,087 / Avg: 281,004)
>>> 00:25:00 INFO loader :: Add: 2,000,000 lotico.ttl.gz (Batch: 328,299 / Avg: 291,502)
>>> 00:25:01 INFO loader :: Add: 2,500,000 lotico.ttl.gz (Batch: 341,763 / Avg: 300,336)
>>> 00:25:03 INFO loader :: Add: 3,000,000 lotico.ttl.gz (Batch: 337,381 / Avg: 305,935)
>>> 00:25:04 INFO loader :: Add: 3,500,000 lotico.ttl.gz (Batch: 318,877 / Avg: 307,719)
>>> 00:25:06 INFO loader :: Add: 4,000,000 lotico.ttl.gz (Batch: 295,857 / Avg: 306,184)
>>> 00:25:07 INFO loader :: Add: 4,500,000 lotico.ttl.gz (Batch: 327,225 / Avg: 308,388)
>>> 00:25:09 INFO loader :: Add: 5,000,000 lotico.ttl.gz (Batch: 349,406 / Avg: 312,051)
>>> 00:25:09 INFO loader :: Elapsed: 16.02 seconds [2019/06/15 00:25:09 CEST]
>>> 00:25:11 INFO loader :: Add: 5,500,000 lotico.ttl.gz (Batch: 285,062 / Avg: 309,388)
>>> 00:25:13 INFO loader :: Add: 6,000,000 lotico.ttl.gz (Batch: 203,665 / Avg: 296,559)
>>> 00:25:16 INFO loader :: Add: 6,500,000 lotico.ttl.gz (Batch: 189,393 / Avg: 284,190)
>>>
>>> on another machine that sits somewhere in the Azure infrastructure,
>>> tdbloader doesn't look as good. even with decent hardware it seems to
>>> die a slow death of memory exhaustion at 16GB. it started off at 70kT/s
>>> and is now down to 17kT/s and still going.
>>>
>>> lesson learned: big iron and big memory is the way to go with Jena
>>> tdbloaders.
>>>
>>> On Fri, Jun 14, 2019 at 10:53 PM Andy Seaborne wrote:
>>>
>>>> These messages are logged (to logger "org.apache.jena.tdb2.loader") -
>>>> do you have log4j.properties in the current working directory?
>>>>
>>>> Do you get any output?
>>>>
>>>> INFO Loader = LoaderParallel
>>>> INFO Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
>>>> INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
>>>> INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
>>>> INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
>>>> INFO Add: 2,000,000 bsbm-5m.nt.gz (Batch: 204,248 / Avg: 178,189)
>>>> INFO Add: 2,500,000 bsbm-5m.nt.gz (Batch: 202,101 / Avg: 182,508)
>>>> INFO Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
>>>> INFO Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
>>>> INFO Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
>>>> INFO Add: 4,500,000 bsbm-5m.nt.gz (Batch: 152,765 / Avg: 177,081)
>>>> INFO Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
>>>> INFO Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]
>>>> INFO Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.63s (Avg: 174,644)
>>>> INFO Finish - index SPO
>>>> INFO Finish - index POS
>>>> INFO Finish - inde
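A note on reading the loader output quoted above: the "Avg" column appears to be a cumulative rate (total tuples so far divided by total elapsed time), not the arithmetic mean of the "Batch" rates. This is an inference from the numbers in the logs, not something stated in the thread. The sketch below is a minimal check of that reading in Python; the function names `parse_add_lines` and `cumulative_averages` are made up for illustration, and it assumes each "Batch" figure is tuples-in-batch divided by seconds-for-batch.

```python
import re

# Matches progress lines such as:
#   "INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)"
ADD_LINE = re.compile(
    r"Add:\s+([\d,]+)\s+\S+\s+\(Batch:\s+([\d,]+)\s+/\s+Avg:\s+([\d,]+)\)"
)


def parse_add_lines(log_text):
    """Return (total_tuples, batch_rate, logged_avg) for every 'Add:' line."""
    rows = []
    for match in ADD_LINE.finditer(log_text):
        rows.append(tuple(int(g.replace(",", "")) for g in match.groups()))
    return rows


def cumulative_averages(rows):
    """Recompute the running average from the batch rates alone.

    Assumption (not confirmed by the thread): a batch rate is
    tuples-in-batch / seconds-for-batch, so the time spent on a batch
    is batch_size / batch_rate.
    """
    elapsed = 0.0
    previous_total = 0
    averages = []
    for total, batch_rate, _logged_avg in rows:
        batch_size = total - previous_total
        elapsed += batch_size / batch_rate
        averages.append(total / elapsed)
        previous_total = total
    return averages


# First three progress lines from the BSBM run quoted in the thread.
LOG = """\
INFO Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
INFO Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
INFO Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
"""

if __name__ == "__main__":
    rows = parse_add_lines(LOG)
    for (total, _rate, logged), recomputed in zip(rows, cumulative_averages(rows)):
        print(f"{total:>10,}  logged avg {logged:,}  recomputed {recomputed:,.0f}")
```

Because slow batches take more wall-clock time, they pull the cumulative figure down more than fast batches pull it up. That is why a run that degrades (like the Azure one described above) can show a much lower current batch rate than its "Avg" suggests.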
Re: JENA Loader Benchmarks
Very good, thank you for the links, Lorenz!

Marco

On Sat, Jun 15, 2019 at 8:10 AM Lorenz B. <buehm...@informatik.uni-leipzig.de> wrote:

> Hi Marco,
>
> that reminds me of previous discussions in Nov./Dec. 2017, one
> regarding general performance titled "tdb2.tdbloader performance" [1, 2]
> and then, as a follow-up, "Report on loading wikidata" [3]. Maybe you can
> also have a look at them; some people like Dick and Andy also did some
> kind of (light-weight) performance benchmark.
>
> [1] https://lists.apache.org/thread.html/a5a2751a4fc4387c3db929b95927a95cbc4d0116664c7f3d32dca576@%3Cusers.jena.apache.org%3E
> [2] https://lists.apache.org/thread.html/34b53d7ee75e484cdbcc2ac75e075e6d7321ba1ee4a143c58c95b793@%3Cusers.jena.apache.org%3E
> [3] https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412@%3Cusers.jena.apache.org%3E