Hi, Øyvind,
This is all very helpful feedback. Thank you.
On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the
wikidata truthy
dump downloaded 2021-12-09.
This is the 4.3.0 xloader?
There are improvements in 4.3.1. Since that release was going out, the
development version -- which includes using less temporary space -- got
merged in. It has had some testing.
It compresses the triples.tmp and intermediate sort files in the index
stage, making the peak disk usage much smaller.
The instance is a vm created on the Norwegian Research and
Education Cloud,
an openstack cloud provider.
Instance type:
32 GB memory
4 CPU
I'm using a similar setup on a 7-year-old desktop machine with a SATA disk.
I haven't got a machine I can dedicate to the multi-day load. I'll
try to find a way to at least push it through building the node table.
Loading the first 1B of truthy:
1B triples, 40k TPS, 06h 54m 10s
The database is 81G and building needs an additional 11.6G of
workspace, for a total of 92G (+ the data file).
While smaller, it seems bz2 files are much slower to decompress, so
I've been using gz files.
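A quick way to compare decompression speed on your own hardware (the
.bz2 file name here is an assumption):

    # gz vs bz2: decompress and discard, compare wall-clock times
    time zcat  latest-truthy.nt.gz  > /dev/null
    time bzcat latest-truthy.nt.bz2 > /dev/null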
My current best guess for 6.4B truthy is
Temp 96G
Database 540G
Data 48G
Total: 684G -- peak disk needed
based on scaling up 1B truthy. Personally, I would make sure there
was more space. Also - I don't know if the shape of the data is
sufficiently uniform to make scaling predictable. The time doesn't
scale so simply.
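As a rough back-of-envelope, assuming linear scaling of the 1B-triple
figures (the numbers above round up for headroom):

    # 6.4x linear scale-up of the 1B-triple run
    echo $(( 81 * 64 / 10 ))     # database: ~518G
    echo $(( 12 * 64 / 10 ))     # temp: ~76G (from 11.6G, rounded to 12)
    echo $(( 518 + 76 + 48 ))    # + 48G data file: ~642G before any margin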
This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
Compression reduces the size of triples.tmp -- and of the related sort
temporary files, which add up to the same again -- to about 1/6 of the size.
The storage used for dump + temp files is mounted as a separate 900GB
volume on /var/fuseki/databases. The type of storage is described as
*mass-storage-default*: Storage backed by spinning hard drives,
available to everybody and is the default type.
with ext4 configured. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and
can be overcommitted.
"overcommitted" may be a problem.
While it's not "tdb2 loader parallel", it does use continuous CPU in
several threads.
For memory - "it's complicated".
The java parts only need, say, 2G. The sort is set to "buffer 50%
--parallel=2" and the java pipes into sort; that's another thread. I
think the effective peak is 3 active threads and they'll all be at
100% for some of the time.
So it's going to need 50% of RAM + 2G for a java process, + OS.
It does not need space for memory mapped files (they aren't used at
all in the loading process and I/O is sequential).
If that triggers over commitment swap out, the performance may go
down a lot.
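For illustration only -- roughly the shape of the pipeline, not the
exact command line xloader builds (the Java parse step and file names
here are placeholders):

    # a Java parse step (~2G heap) streams text into GNU sort,
    # which buffers 50% of RAM and uses 2 threads
    java -Xmx2G -cp ... ParseStep latest-truthy.nt.gz \
      | sort --buffer-size=50% --parallel=2 \
             --temporary-directory=/var/fuseki/databases/tmp \
      > sorted.tmp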
For disk - if that is physically remote, it should not be a problem
(famous last words). I/O is sequential and in large continuous
chunks - typical for batch processing jobs.
OS for the instance is a clean Rocky Linux image, with no services
except
jena/fuseki installed. The systemd service
set up for fuseki is stopped.
jena and fuseki version is 4.3.0.
openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
sharing)
Just FYI: Java 17 is a little faster. Some Java improvements have
improved RDF parsing speed by up to 10%; in xloader that's not
significant to the overall time.
I'm running from a tmux session to avoid connectivity issues and to
capture
the output.
I use
tdb2.xloader .... |& tee LOG-FILE-NAME
to capture the logs and see them. ">&" and "tail -f" would achieve
much the same effect.
I think the output is stored in memory and not on disk.
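With the redirect form the log also ends up in a file on disk, e.g.:

    # same effect with a redirect plus tail
    tdb2.xloader .... >& LOG-FILE-NAME &
    tail -f LOG-FILE-NAME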
On the first run I tried to have the tmpdir on the root partition, to
separate the temp dir and data dir, but with only 19 GB free, the
tmpdir was soon disk full. For the second (current) run all
directories are under /var/fuseki/databases.
Yes - after making that mistake myself, the new version ignores
system TMPDIR. Using --tmpdir is best but otherwise it defaults to
the data directory.
$JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
--tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz
The import is so far at the "ingest data" stage where it has really
slowed
down.
FYI: The first line of ingest is always very slow. It is not
measuring the start point correctly.
Current output is:
20:03:43 INFO  Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
The earlier first pass also slows down, and that step should run at a
fairly constant-ish speed once everything settles down.
Some notes:
* There is a (time/info) lapse in the output log between the end of
'parse' and the start of 'index' for Terms. It is unclear to me what
is happening in the 1h 13m between the lines.
There is "sort" going on. "top" should show it.
For each index there is also a very long pause for exactly the same
reason. It would be good to have something go "tick" and log a
message occasionally.
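Until then, one way to confirm progress is to watch the sort process
and the growth of its work files under the --tmpdir, e.g.:

    # sort's work files live under the --tmpdir given to xloader
    top -b -n 1 | grep sort
    watch -n 60 du -sh /var/fuseki/databases/tmp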
22:33:46 INFO  Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
22:33:52 INFO  Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
* The "ingest data" stage really slows down: at the current rate, if
I calculated correctly (rough check below), it looks like
PKG.CmdxIngestData has 10 days left before it finishes.
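Roughly, using the total from the parse line and the average rate from
the ingest line above:

    # remaining triples / average ingest rate
    echo $(( (6560468631 - 502000000) / 7593 ))         # ~798,000 seconds
    echo $(( (6560468631 - 502000000) / 7593 / 86400 )) # ~9 days at the average rate
    # at the recent batch rate (3,356 TPS) it is closer to 21 days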
Ouch.
* When I saw sort running in the background for the first parts of
the job,
I looked at the `sort` command. I noticed from some online sources
that
setting the environment variable LC_ALL=C improves speed for
`sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for
`sort`.
*** WARNING *** The locale specified by the environment affects
sort order. Set LC_ALL=C to get the traditional sort order that
uses native byte values.
It shouldn't matter but, yes, better to set it and export it in the
control script and propagate it to forked processes.
The sort is doing a binary sort, except that because it is a text
sort program the binary is turned into hex (!!). Hex is in the ASCII
subset and should be locale safe.
But better to set LC_ALL=C.
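i.e. in the control script that launches the load, something like:

    # set and export before launching; forked processes (including sort) inherit it
    export LC_ALL=C
    $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy \
        --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz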
Andy
Links:
https://access.redhat.com/solutions/445233
https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
Best regards,
Øyvind