Re: Testing tdb2.xloader

2022-01-11 Thread Øyvind Gjesdal
I'm debugging now and think I've found some possible culprits of the slow data ingest stage on my setup. In the ingest data stage, I see 100% disk use read, with only 2.5 MB/s, and the ps command also shows my processor spending time waiting for IO. In the

Re: Testing tdb2.xloader

2021-12-28 Thread Andy Seaborne
Excellent news! Updated: https://www.w3.org/wiki/LargeTripleStores Andy On 28/12/2021 10:11, Marco Neumann wrote: Ok here is another successful tdb2 load. this time with the full wikidata download (20211222_latest-all.nt.gz 172G ) file. counting 16,733,395,878 triples and a total of

Re: Testing tdb2.xloader

2021-12-28 Thread Marco Neumann
OK, here is another successful tdb2 load, this time with the full Wikidata download file (20211222_latest-all.nt.gz, 172G), counting 16,733,395,878 triples and a total of "103h 45m 15s" for the entire load. I think with the right hardware this time could easily be compressed quite a bit.
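For comparison with other runs, the two figures reported above can be turned into an average load rate. This is only a minimal sketch: the helper names `parse_duration` and `load_rate` are made up for illustration and are not part of Jena or xloader, and the result is an overall average, not the per-phase rates that xloader logs.

```python
def parse_duration(text: str) -> int:
    """Convert a duration such as '103h 45m 15s' into seconds."""
    seconds_per_unit = {"h": 3600, "m": 60, "s": 1}
    # Each whitespace-separated part is a number followed by a one-letter unit.
    return sum(int(part[:-1]) * seconds_per_unit[part[-1]] for part in text.split())


def load_rate(triples: int, duration: str) -> float:
    """Average triples per second over the whole load."""
    return triples / parse_duration(duration)


# Figures as reported in the message above.
full_rate = load_rate(16_733_395_878, "103h 45m 15s")
print(f"{full_rate:,.0f} triples/s")
```

With these inputs the average works out to a little under 45,000 triples per second.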

Re: Testing tdb2.xloader

2021-12-21 Thread Marco Neumann
Thank you Andy. Found it in revisions somewhere. Just finished another run with truthy: http://lotico.com/temp/LOG-1214 Will now increase RAM before running an additional load with increased thread count. Marco On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne wrote: > gists are git repos: so the

Re: Testing tdb2.xloader

2021-12-21 Thread Andy Seaborne
gists are git repos: so the file is there ... somewhere: https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log Andy On 19/12/2021 17:56, Marco Neumann wrote: Thank you Lorenz,

Re: Testing tdb2.xloader

2021-12-19 Thread Marco Neumann
Thank you Lorenz, unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated in github On Sun, Dec 19, 2021 at 9:46 AM LB wrote: > I edited the Gist [1] and put the default stats there. Takes ~4min to > compute the stats. > > Findings: > > - for Wikidata we have to extend those stats

Re: Testing tdb2.xloader

2021-12-19 Thread Andy Seaborne
On 19/12/2021 09:46, LB wrote: I edited the Gist [1] and put the default stats there. Takes ~4min to compute the stats. Findings: - for Wikidata we have to extend those stats with the stats for wdt:P31 property as Wikidata does use this property as their own rdf:type relation. It is

Re: Testing tdb2.xloader

2021-12-19 Thread Andy Seaborne
I've updated: https://www.w3.org/wiki/LargeTripleStores#Apache_Jena_TDB_.286.6B.29 for Lorenz's first run. Andy On 16/12/2021 08:49, LB wrote: 39h 53m 27s 04:14:28 INFO  Triples loaded   = 6.610.055.778 04:14:28 INFO  Quads loaded = 0 04:14:28 INFO  Overall Rate 46.028 tuples per

Re: Testing tdb2.xloader

2021-12-19 Thread LB
I edited the Gist [1] and put the default stats there. Takes ~4min to compute the stats. Findings: - for Wikidata we have to extend those stats with the stats for wdt:P31 property as Wikidata does use this property as their own rdf:type relation. It is indeed trivial, just execute select

Re: Testing tdb2.xloader

2021-12-18 Thread Andy Seaborne
https://gist.github.com/afs/c97ebc7351478bce2989b79c9195ef11

Dell XPS13 (2021 edition) 32G RAM 4 core 1T SSD disk
Jena 4.3.1
Data: wikidata-20211208-truthy-BETA.nt.gz

14:47:09 INFO Load node table = 39976 seconds
14:47:09 INFO Load ingest data = 17 seconds
14:47:09 INFO Build index SPO

Re: Testing tdb2.xloader

2021-12-18 Thread Andy Seaborne
Hi Lorenz, On 18/12/2021 08:09, LB wrote: Good morning, loading of Wikidata truthy is done, this time I didn't forget to keep logs: https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3 I'm a bit surprised that this time it was 8h faster than last time, 31h vs 39h.

Re: Testing tdb2.xloader

2021-12-18 Thread Marco Neumann
Good morning Lorenz, maybe time to get a few query benchmark tests? :) What does tdb2.tdbstats report? Marco On Sat, Dec 18, 2021 at 8:09 AM LB wrote: > Good morning, > > loading of Wikidata truthy is done, this time I didn't forget to keep > logs: >

Re: Testing tdb2.xloader

2021-12-18 Thread LB
Good morning, loading of Wikidata truthy is done, this time I didn't forget to keep logs: https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3 I'm a bit surprised that this time it was 8h faster than last time, 31h vs 39h. Not sure if a) there was something else on the

Re: Testing tdb2.xloader

2021-12-17 Thread Andy Seaborne
On 16/12/2021 10:08, Marco Neumann wrote: thank you Lorenz, I am running this test myself now again with a larger disk. You may want to consider running a full load of wikidata as well. The timing info and disk space you have should be sufficient. Full Wikidata (WD). I've tried to gather a

Re: Testing tdb2.xloader

2021-12-16 Thread Andy Seaborne
On 16/12/2021 10:52, Andy Seaborne wrote: ... I am getting a slow down during data ingestion. However, your summary figures don't show that in the ingest phase. The whole logs may have the signal in it but less pronounced. My working assumption is now that it is random access to the node

Re: Testing tdb2.xloader

2021-12-16 Thread Andy Seaborne
On 16/12/2021 12:32, LB wrote: I couldn't get access to the full log as the output was too verbose for the screen and I forgot to pipe into a file ... Yes - familiar ... Maybe xloader should capture its logging. I can confirm the triples.tmp.gz size was something around 35-40G if I

Re: Testing tdb2.xloader

2021-12-16 Thread LB
I couldn't get access to the full log as the output was too verbose for the screen and I forgot to pipe into a file ... I can confirm the triples.tmp.gz size was something around 35-40G if I remember correctly. I'm rerunning the load now to a) keep logs and b) see if increasing the number of

Re: Testing tdb2.xloader

2021-12-16 Thread Andy Seaborne
Awesome! I'm really pleased to hear the news. That's better than I feared at this scale! How big is triples.tmp.gz? 2* that size, and the database size is the peak storage space used. My estimate is about 40G making 604G overall. I'd appreciate having the whole log file. Could you email it
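The peak-storage estimate in the message above is simple arithmetic: twice the size of triples.tmp.gz plus the final database size. A minimal sketch follows; the function name `peak_storage_gb` is made up for illustration, the 40G figure is the estimate quoted above, and the 524G figure is the Data-0001 total from the directory listing posted elsewhere in this thread.

```python
def peak_storage_gb(tmp_gz_gb: float, database_gb: float) -> float:
    """Peak disk use as described above: 2 * triples.tmp.gz + database size."""
    return 2 * tmp_gz_gb + database_gb


# 40G estimated for triples.tmp.gz, 524G reported for Data-0001.
print(peak_storage_gb(40, 524))
```

These inputs reproduce the 604G overall figure given in the message.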

Re: Testing tdb2.xloader

2021-12-16 Thread Marco Neumann
thank you Lorenz, I am running this test myself now again with a larger disk. You may want to consider running a full load of wikidata as well. The timing info and disk space you have should be sufficient. Did we figure out a place to post the parser messages? Marco On Thu, Dec 16, 2021 at

Re: Testing tdb2.xloader

2021-12-16 Thread LB
Sure

wikidata-tdb/Data-0001:
total 524G
-rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
-rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
-rw-r--r-- 1

Re: Testing tdb2.xloader

2021-12-16 Thread Marco Neumann
Thank you Lorenz, can you please post a directory list for Data-0001 with file sizes. On Thu, Dec 16, 2021 at 8:49 AM LB wrote: > Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed: > > Server: > > AMD Ryzen 9 5950X (16C/32T) > 128 GB DDR4 ECC RAM > 2 x 3.84 TB NVMe SSD >

Re: Testing tdb2.xloader

2021-12-16 Thread LB
Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:
AMD Ryzen 9 5950X (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD

Environment:
- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1

Command:

Re: Testing tdb2.xloader

2021-12-14 Thread Marco Neumann
The more tests we have on different machines the better. :) Personally I'd say if you have a choice go for PCIe 4.0 NVMe SSDs and stay away from SATA < III SSDs. Also for the tests SSD RAID isn't necessary. These components have become extremely affordable in recent years and really should be

Re: Testing tdb2.xloader

2021-12-14 Thread Andy Seaborne
On 14/12/2021 10:38, Øyvind Gjesdal wrote: Hi Marco, Very useful to compare with your log on the different runs. Still working with configuration to see if I can get the ingest data stage to be usable for hdd. It looks like I get close to the performance of your run on the earlier stages,

Re: Testing tdb2.xloader

2021-12-14 Thread Øyvind Gjesdal
Hi Marco, Very useful to compare with your log on the different runs. Still working with configuration to see if I can get the ingest data stage to be usable for hdd. It looks like I get close to the performance of your run on the earlier stages, while ingest data is still very much too slow.

log4j implications [Was: Testing tdb2.xloader]

2021-12-12 Thread Andy Seaborne
4.3.1 will contain the fixed log4j 2.15.0. No special mitigations necessary. Jena uses log4j2 via the slf4j adapter from Apache Logging (log4j-slf4j-impl). 2.15.0 should be compatible in Jena usage with 2.14.* for Jena 4.x. From the download, replace log4j-(api|core|log4j-slf4j-impl) with

Re: Testing tdb2.xloader

2021-12-12 Thread Marco Neumann
Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability? On Sun, Dec 12, 2021 at 1:24 PM Marco Neumann wrote: > As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB > SSD drive and an old laptop. > > I also have a contact who has just set up a new datacenter in

Re: Testing tdb2.xloader

2021-12-12 Thread Marco Neumann
As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB SSD drive and an old laptop. I also have a contact who has just set up a new datacenter in Ireland. I may be able to run a few tests on much bigger machines as well. Otherwise I am very happy with the iron in Finland, as long

Re: Testing tdb2.xloader

2021-12-12 Thread Andy Seaborne
On 11/12/2021 22:02, Marco Neumann wrote: Thank you Øyvind for sharing, great to see more tests in the wild. I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy dataset and quickly ran out of disk space. It finished the job but did not write any of the indexes to disk due to

Re: Testing tdb2.xloader

2021-12-12 Thread Andy Seaborne
Hi, Øyvind, This is all very helpful feedback. Thank you. On 11/12/2021 21:45, Øyvind Gjesdal wrote: I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy dump downloaded 2021-12-09. This is the 4.3.0 xloader? There are improvements in 4.3.1. Since that release was

Re: Testing tdb2.xloader

2021-12-12 Thread Marco Neumann
Øyvind, looks like the above was the wrong log from a prior sharding experiment. This is the correct log file for the truthy dataset. http://www.lotico.com/temp/LOG-98085 On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann wrote: > Thank you Øyvind for sharing, great to see more tests in the

Re: Testing tdb2.xloader

2021-12-11 Thread Marco Neumann
Thank you Øyvind for sharing, great to see more tests in the wild. I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy dataset and quickly ran out of disk space. It finished the job but did not write any of the indexes to disk due to lack of space. no error messages.

Testing tdb2.xloader

2021-12-11 Thread Øyvind Gjesdal
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy dump downloaded 2021-12-09. The instance is a vm created on the Norwegian Research and Education Cloud, an openstack cloud provider. Instance type: 32 GB memory 4 CPU The storage used for dump + temp files is mounted