Re: JENA Loader Benchmarks

2019-06-15 Thread Lorenz B.
Hi Marco,

that reminds me of previous discussions in Nov./Dec. 2017, one
regarding general performance titled "tdb2.tdbloader performance" [1, 2]
and then, as a follow-up, "Report on loading wikidata" [3]. Maybe you can
also have a look at them; some people like Dick and Andy also did some
kind of (light-weight) performance benchmark.

[1]
https://lists.apache.org/thread.html/a5a2751a4fc4387c3db929b95927a95cbc4d0116664c7f3d32dca576@%3Cusers.jena.apache.org%3E
[2]
https://lists.apache.org/thread.html/34b53d7ee75e484cdbcc2ac75e075e6d7321ba1ee4a143c58c95b793@%3Cusers.jena.apache.org%3E
[3]
https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412@%3Cusers.jena.apache.org%3E

> absolutely it does, preferably NVMe SSD. tdbloaders are almost a showcase
> themselves for good up-to-date hardware.
>
> if possible I'd like to load the wikidata dataset* at some point to see
> where 57GB fits in terms of tdb. The wikidata team is currently looking at
> new solutions that can go beyond blazegraph. And I get the impression that
> they have not yet actively considered giving jena tdb a try.
>
> https://dumps.wikimedia.org/wikidatawiki/entities/
>
>
> On Fri, Jun 14, 2019 at 11:47 PM Martynas Jusevičius wrote:
>
>> What about SSD disks, don't they make a difference?
>>
>> On Sat, Jun 15, 2019 at 12:36 AM Marco Neumann wrote:
>>> that did the trick Andy, very good. It might be a good idea to add this to
>>> the distribution in jena-log4j.properties.
>>>
>>> I am getting these numbers for a midsize dedicated server, very nice
>>> numbers indeed Andy. well done!
>>>
>>> 00:24:53 INFO  loader   :: Loader = LoaderPhased
>>> 00:24:53 INFO  loader   :: Start: ../../public_html/lotico.ttl.gz
>>> 00:24:55 INFO  loader   :: Add: 500,000 lotico.ttl.gz (Batch: 237,755 / Avg: 237,755)
>>> 00:24:56 INFO  loader   :: Add: 1,000,000 lotico.ttl.gz (Batch: 305,250 / Avg: 267,308)
>>> 00:24:58 INFO  loader   :: Add: 1,500,000 lotico.ttl.gz (Batch: 313,087 / Avg: 281,004)
>>> 00:25:00 INFO  loader   :: Add: 2,000,000 lotico.ttl.gz (Batch: 328,299 / Avg: 291,502)
>>> 00:25:01 INFO  loader   :: Add: 2,500,000 lotico.ttl.gz (Batch: 341,763 / Avg: 300,336)
>>> 00:25:03 INFO  loader   :: Add: 3,000,000 lotico.ttl.gz (Batch: 337,381 / Avg: 305,935)
>>> 00:25:04 INFO  loader   :: Add: 3,500,000 lotico.ttl.gz (Batch: 318,877 / Avg: 307,719)
>>> 00:25:06 INFO  loader   :: Add: 4,000,000 lotico.ttl.gz (Batch: 295,857 / Avg: 306,184)
>>> 00:25:07 INFO  loader   :: Add: 4,500,000 lotico.ttl.gz (Batch: 327,225 / Avg: 308,388)
>>> 00:25:09 INFO  loader   :: Add: 5,000,000 lotico.ttl.gz (Batch: 349,406 / Avg: 312,051)
>>> 00:25:09 INFO  loader   ::   Elapsed: 16.02 seconds [2019/06/15 00:25:09 CEST]
>>> 00:25:11 INFO  loader   :: Add: 5,500,000 lotico.ttl.gz (Batch: 285,062 / Avg: 309,388)
>>> 00:25:13 INFO  loader   :: Add: 6,000,000 lotico.ttl.gz (Batch: 203,665 / Avg: 296,559)
>>> 00:25:16 INFO  loader   :: Add: 6,500,000 lotico.ttl.gz (Batch: 189,393 / Avg: 284,190)
>>>
>>> on another machine that sits somewhere in the Azure infrastructure,
>>> tdbloader doesn't look as good; even with decent hardware it seems to die a
>>> slow death of memory exhaustion at 16GB. It started off with 70kT/s and is now
>>> down to 17kT/s and still going.
>>>
>>> lesson learned: big iron and big memory is the way to go with Jena
>>> tdbloaders.
>>>
>>>
>>>
>>>
>>> On Fri, Jun 14, 2019 at 10:53 PM Andy Seaborne  wrote:
>>>
 These messages are logged (to logger "org.apache.jena.tdb2.loader") - do
 you have log4j.properties in the current working directory?

 Do you get any output?

 INFO  Loader = LoaderParallel
 INFO  Start: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz
 INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
 INFO  Add: 1,000,000 bsbm-5m.nt.gz (Batch: 189,753 / Avg: 157,604)
 INFO  Add: 1,500,000 bsbm-5m.nt.gz (Batch: 205,676 / Avg: 170,920)
 INFO  Add: 2,000,000 bsbm-5m.nt.gz (Batch: 204,248 / Avg: 178,189)
 INFO  Add: 2,500,000 bsbm-5m.nt.gz (Batch: 202,101 / Avg: 182,508)
 INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
 INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
 INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
 INFO  Add: 4,500,000 bsbm-5m.nt.gz (Batch: 152,765 / Avg: 177,081)
 INFO  Add: 5,000,000 bsbm-5m.nt.gz (Batch: 158,881 / Avg: 175,076)
 INFO  Elapsed: 28.56 seconds [2019/06/14 22:51:37 BST]
 INFO  Finished: /home/afs/Datasets/BSBM/bsbm-5m.nt.gz: 5,000,599 tuples in 28.63s (Avg: 174,644)
 INFO  Finish - index SPO
 INFO  Finish - index POS
 INFO  Finish - inde
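
For anyone comparing runs like the two above, the progress lines are regular enough to pull apart with a small script. The following is only a rough sketch in Python, not part of Jena: the names `PROGRESS`, `parse_progress` and `slowdown` are made up for illustration, and the pattern assumes each "Add:" line is on one line (not re-wrapped by a mail client) and that the Batch and Avg figures are tuples per second, as the "Finished" line suggests.

```python
import re

# Matches loader progress lines of the form seen in this thread, e.g.
#   INFO  Add: 500,000 bsbm-5m.nt.gz (Batch: 134,770 / Avg: 134,770)
# Assumption: the whole entry sits on a single line.
PROGRESS = re.compile(
    r"Add:\s+([\d,]+)\s+(\S+)\s+\(Batch:\s+([\d,]+)\s+/\s+Avg:\s+([\d,]+)\)"
)


def to_int(text):
    """Convert a comma-grouped number such as '1,500,000' to an int."""
    return int(text.replace(",", ""))


def parse_progress(log_text):
    """Return (tuples_loaded, batch_rate, avg_rate) for each progress line."""
    rows = []
    for match in PROGRESS.finditer(log_text):
        total, _source, batch, avg = match.groups()
        rows.append((to_int(total), to_int(batch), to_int(avg)))
    return rows


def slowdown(rows):
    """Ratio of the fastest batch rate seen to the most recent batch rate."""
    peak = max(batch for _, batch, _ in rows)
    last = rows[-1][1]
    return peak / last


# Three lines copied from the BSBM run quoted above.
sample = """\
INFO  Add: 3,000,000 bsbm-5m.nt.gz (Batch: 206,953 / Avg: 186,173)
INFO  Add: 3,500,000 bsbm-5m.nt.gz (Batch: 183,621 / Avg: 185,804)
INFO  Add: 4,000,000 bsbm-5m.nt.gz (Batch: 151,423 / Avg: 180,676)
"""

rows = parse_progress(sample)
for total, batch, avg in rows:
    print(f"{total:>10,} tuples  batch={batch:,}/s  avg={avg:,}/s")
print(f"peak/last batch rate: {slowdown(rows):.2f}")
```

A ratio well above 1 on a long load would be one way to spot the kind of gradual slowdown described for the Azure machine, though the script says nothing about the cause.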

Re: JENA Loader Benchmarks

2019-06-15 Thread Marco Neumann
Very good, thank you for the links Lorenz!

Marco
