I'm debugging now and think I've found some possible culprits for the
slow "ingest data" stage on my setup.
In the ingest data stage I see the disk at 100% utilization on reads,
yet only 2.5 MB/s, and ps also shows the processor spending its time
waiting for I/O.
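A quick way to watch this, assuming the standard sysstat tools are installed:

    # Per-device utilization and throughput at 5-second intervals; high %util
    # with low rMB/s and high r_await points at seek-bound random reads.
    iostat -dxm 5
    # The "wa" column here is the share of CPU time spent waiting on I/O:
    vmstat 5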
Excellent news!
Updated: https://www.w3.org/wiki/LargeTripleStores
Andy
On 28/12/2021 10:11, Marco Neumann wrote:
Ok here is another successful tdb2 load, this time with the full wikidata
download file (20211222_latest-all.nt.gz, 172G).
Counting 16,733,395,878 triples and a total of "103h 45m 15s" for the
entire load.
I think with the right hardware this load time could easily be compressed
quite a bit.
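For reference, that is an average of roughly 45k triples/s over the whole load:

    # 16,733,395,878 triples / (103h 45m 15s = 373,515 s)
    awk 'BEGIN { printf "%.0f triples/s\n", 16733395878/(103*3600 + 45*60 + 15) }'
    # -> 44797 triples/s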
Thank you Andy. Found it in the revisions somewhere.
just finished another run with truthy
http://lotico.com/temp/LOG-1214
will now increase RAM before running an additional load with increased
thread count.
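As a command-line sketch of that next run (the paths are made up, and
--threads assumes the option available in newer xloader versions):

    tdb2.xloader --loc /data/wikidata-tdb \
                 --tmpdir /fast-ssd/tmp \
                 --threads 16 \
                 latest-all.nt.gz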
Marco
On Tue, Dec 21, 2021 at 8:48 AM Andy Seaborne wrote:
gists are git repos: so the file is there ... somewhere:
https://gist.githubusercontent.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3/raw/9049cf8b559ce685b4293fca10d8b1c07cc79c43/tdb2_xloader_wikidata_truthy.log
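One way to dig it out, since a gist is a clonable git repo (gist id taken
from the URL above):

    git clone https://gist.github.com/e3619d53cf4c158c4e4902fd7d6ed7c3.git
    cd e3619d53cf4c158c4e4902fd7d6ed7c3
    # Earlier versions of the file live in the history:
    git log --oneline tdb2_xloader_wikidata_truthy.log
    git show <commit>:tdb2_xloader_wikidata_truthy.log | less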
Andy
On 19/12/2021 17:56, Marco Neumann wrote:
Thank you Lorenz,
unfortunately the tdb2_xloader_wikidata_truthy.log is now truncated on
GitHub
I've updated:
https://www.w3.org/wiki/LargeTripleStores#Apache_Jena_TDB_.286.6B.29
for Lorenz's first run.
Andy
On 16/12/2021 08:49, LB wrote:
39h 53m 27s
04:14:28 INFO Triples loaded = 6.610.055.778
04:14:28 INFO Quads loaded = 0
04:14:28 INFO Overall Rate 46.028 tuples per
I edited the Gist [1] and put the default stats there. Takes ~4min to
compute the stats.
Findings:
- for Wikidata we have to extend those stats with the stats for the wdt:P31
property, as Wikidata uses this property as its own rdf:type
relation. It is indeed trivial, just execute
select
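(The query text is cut off in the archive. As a sketch, a count of that
shape can be run against the loaded database with tdb2.tdbquery:)

    # Count wdt:P31 triples, the figure the stats file needs.
    tdb2.tdbquery --loc wikidata-tdb \
      'SELECT (COUNT(*) AS ?c) { ?s <http://www.wikidata.org/prop/direct/P31> ?o }'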
https://gist.github.com/afs/c97ebc7351478bce2989b79c9195ef11
Dell XPS13 (2021 edition)
32G RAM
4 core
1T SSD disk
Jena 4.3.1
Data:
wikidata-20211208-truthy-BETA.nt.gz
14:47:09 INFO Load node table = 39976 seconds
14:47:09 INFO Load ingest data = 17 seconds
14:47:09 INFO Build index SPO
Hi Lorenz,
On 18/12/2021 08:09, LB wrote:
Good morning,
loading of Wikidata truthy is done, this time I didn't forget to keep
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h.
good morning Lorenz,
Maybe time to run a few query benchmark tests? :)
What does tdb2.tdbstats report?
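For reference, the stats generator can be run against the finished database
along these lines (a sketch; as I understand it, TDB2 picks up stats.opt
from the Data-0001 directory):

    # Generate optimizer statistics, then install them in the database.
    tdb2.tdbstats --loc wikidata-tdb > stats.opt
    mv stats.opt wikidata-tdb/Data-0001/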
Marco
On Sat, Dec 18, 2021 at 8:09 AM LB
wrote:
Good morning,
loading of Wikidata truthy is done, this time I didn't forget to keep
logs:
https://gist.github.com/LorenzBuehmann/e3619d53cf4c158c4e4902fd7d6ed7c3
I'm a bit surprised that this time it was 8h faster than last time, 31h
vs 39h. Not sure if a) there was something else on the
On 16/12/2021 10:08, Marco Neumann wrote:
thank you Lorenz, I am running this test myself now again with a larger
disk. You may want to consider running a full load of wikidata as well. The
timing info and disk space you have should be sufficient.
Full Wikidata (WD).
I've tried to gather a
On 16/12/2021 10:52, Andy Seaborne wrote:
...
I am getting a slowdown during data ingestion. However, your summary
figures don't show that in the ingest phase. The whole logs may have the
signal in them, but less pronounced.
My working assumption is now that it is random access to the node
On 16/12/2021 12:32, LB wrote:
I couldn't get access to the full log as the output was too verbose for
the screen and I forgot to pipe into a file ...
Yes - familiar ...
Maybe xloader should capture its logging.
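Until it does, piping through tee keeps a full copy without hiding the
console output (file names illustrative):

    # Keep the whole log on disk while still watching progress on screen.
    tdb2.xloader --loc wikidata-tdb wikidata-latest-truthy.nt.bz2 2>&1 | tee xloader.log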
I couldn't get access to the full log as the output was too verbose for
the screen and I forgot to pipe into a file ...
I can confirm the triples.tmp.gz size was something around 35-40G if I
remember correctly.
I am rerunning the load now to a) keep logs and b) see if increasing the
number of
Awesome!
I'm really pleased to hear the news.
That's better than I feared at this scale!
How big is triples.tmp.gz? Twice that size plus the database size is the
peak storage space used. My estimate is about 40G, making 604G overall.
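With the final database at 524G and triples.tmp.gz around 40G that works
out as:

    # peak ≈ final database size + 2 × triples.tmp.gz
    echo "$((524 + 2 * 40))G"    # -> 604G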
I'd appreciate having the whole log file. Could you email it
thank you Lorenz, I am running this test myself now again with a larger
disk. You may want to consider running a full load of wikidata as well. The
timing info and disk space you have should be sufficient.
Did we figure out a place to post the parser messages?
Marco
Sure
wikidata-tdb/Data-0001:
total 524G
-rw-r--r-- 1 24 Dez 15 05:41 GOSP.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
-rw-r--r-- 1 24 Dez 15 05:41 GPOS.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
-rw-r--r-- 1
Thank you Lorenz, can you please post a directory listing for Data-0001
with file sizes.
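For example (location name as in Lorenz's setup):

    # Human-readable sizes for the TDB2 data files.
    ls -lh wikidata-tdb/Data-0001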
On Thu, Dec 16, 2021 at 8:49 AM LB
wrote:
Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:
Server:
AMD Ryzen 9 5950X (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD
Environment:
- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1
Command:
The more tests we have on different machines the better. :)
Personally I'd say if you have a choice go for a PCIe 4.0 NVMe SSD and
stay away from SSDs below SATA III. Also, SSD RAID isn't necessary for
these tests.
These components have become extremely affordable in recent years and
really should be
On 14/12/2021 10:38, Øyvind Gjesdal wrote:
Hi Marco,
Very useful to compare with your log on the different runs. Still working
on configuration to see if I can get the ingest data stage to be usable
on an HDD. It looks like I get close to the performance of your run in the
earlier stages, while ingest data is still very much too slow.
4.3.1 will contain the fixed log4j 2.15.0. No special mitigations necessary.
Jena uses log4j2 via the slf4j adapter from Apache Logging
(log4j-slf4j-impl). 2.15.0 should be compatible in Jena usage with
2.14.* for Jena 4.x.
From the download, replace log4j-(api|core|slf4j-impl) with the 2.15.0
versions.
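As a sketch (directory and jar paths here are illustrative, not from the
thread):

    # Swap the log4j 2 jars in an existing Jena distribution's lib/ directory.
    cd apache-jena-4.2.0/lib
    rm log4j-api-*.jar log4j-core-*.jar log4j-slf4j-impl-*.jar
    cp /path/to/log4j-api-2.15.0.jar /path/to/log4j-core-2.15.0.jar \
       /path/to/log4j-slf4j-impl-2.15.0.jar .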
Does 4.3.1 already contain the mitigation for the Log4j2 vulnerability?
On Sun, Dec 12, 2021 at 1:24 PM Marco Neumann
wrote:
As Andy mentioned, I will give the 4.3.1 xloader a try with the new 4TB SSD
drive and an old laptop.
I also have a contact who has just set up a new datacenter in Ireland. I
may be able to run a few tests on much bigger machines as well. Otherwise I
am very happy with the iron in Finland. As long
Hi, Øyvind,
This is all very helpful feedback. Thank you.
On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.
This is the 4.3.0 xloader?
There are improvements in 4.3.1. Since that release was
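A quick way to check which release the tools on the PATH come from (riot
ships in the same distribution):

    # Prints the Jena/ARQ/RIOT version the command was built with.
    riot --version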
Øyvind, looks like the above was the wrong log from a prior sharding
experiment.
This is the correct log file for the truthy dataset.
http://www.lotico.com/temp/LOG-98085
On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann
wrote:
Thank you Øyvind for sharing, great to see more tests in the wild.
I did the test with a 1TB SSD / RAID1 / 64GB / ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did not
write any of the indexes to disk due to lack of space; no error messages.
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.
The instance is a vm created on the Norwegian Research and Education Cloud,
an openstack cloud provider.
Instance type:
32 GB memory
4 CPU
The storage used for dump + temp files is mounted