I couldn't get access to the full log as the output was too verbose for the screen and I forgot to pipe it into a file ...

I can confirm the triples.tmp.gz size was something around 35-40G if I remember correctly.

I'm rerunning the load now to a) keep logs and b) see if increasing the number of threads for the parallel sort changes anything (though I don't think so).

In ~40h I will provide an update and share the logs.


Wikidata full is 143GB BZip2 compressed; I'm wondering how large TDB would be on disk in that case. It is ~4 x the size of the truthy dump (both compressed), so I'd guess we get something between 1.8T and 2.3T - clearly, the number of nodes shouldn't increase linearly, as many statements are still about nodes from the truthy dump. On the other hand, Wikidata has its statements-about-statements, i.e. we get lots of statement identifiers.
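
A rough back-of-the-envelope, assuming (probably too simply) that on-disk size scales with the compressed dump size:

    524G (truthy TDB2) x 4  ~  2.1T

which sits in the middle of that range.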

Once I have time and enough resources on the server, I'll give it a try. For now I'm interested to see the final result.


On 16.12.21 11:52, Andy Seaborne wrote:
Awesome!
I'm really pleased to hear the news.

That's better than I feared at this scale!

How big is triples.tmp.gz? Twice that size plus the database size is the peak storage space used. My estimate is about 40G, making 604G overall (524G + 2 x 40G).

I'd appreciate having the whole log file. Could you email it to me?

Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a modern portable with 4 cores and a single notional 1TB SSD. If the estimate is right, it will fit. More good news.


I am getting a slowdown during data ingestion. However, your summary figures don't show that in the ingest phase. The full logs may have the signal in them, but less pronounced.

My working assumption is now that it is random access to the node table. Your results point to it not being a CPU issue but to my setup saturating the I/O path. While the portable has an NVMe SSD, it has probably not got the same I/O bandwidth as a server-class machine.
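
One way to check that guess (assuming the sysstat package is installed) is to watch per-device utilisation while the ingest phase runs:

    iostat -x 5

If %util sits near 100% and the request queue keeps growing while the CPUs are mostly idle, it is the I/O path rather than the CPU.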

I'm not sure what to do about this other than run with a much bigger node table cache for the ingestion phase. Trading some of the memory-mapped file area for a bigger cache should be a win. While I hadn't noticed it before, it is probably visible in the logs of smaller loads on closer inspection, and experimenting on a small dataset is a lot easier.


I'm also watching the CPU temperature. When the graphics aren't active, the fans aren't even on. After a few minutes of active screen the fans spin up but the temperatures are still well within the limits. The machine is raised up by 1cm to give good airflow. And I keep the door shut to keep the cats away.

    Andy

Inline ...

On 16/12/2021 08:49, LB wrote:
Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD

Nice.

Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2

I found .gz to be slightly faster than .bz2. This may be because .gz is better supported by the Java runtime, or just the fact that bz2 is designed for best compression.
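
If you already have the .bz2, a one-off recompression is cheap compared to the load time, something like (using pigz, if installed, to parallelise the gzip side):

    bzcat latest-truthy.nt.bz2 | pigz > latest-truthy.nt.gz

Plain gzip works too, just more slowly.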



Log summary:

04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall          143607 seconds
04:14:28 INFO  Overall          39h 53m 27s

Less than 2 days :-)

04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded     = 0
04:14:28 INFO  Overall Rate     46.028 tuples per second


Disk space usage according to

du -sh datasets/wikidata-tdb

  is

524G    datasets/wikidata-tdb



During loading I could see ~90GB of RAM occupied (50% of total memory went to sort and it used 2 threads - is it intended to stick to 2 threads with --parallel=2?)

It is fixed at two for the sort currently.

There may be some benefit in making this configurable, but previously I've found that more threads do not seem to yield much benefit - though you have a lot more threads! Experiment required.
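
If you want to experiment outside the loader, GNU sort can be run by hand with a different thread count - a sketch only, with hypothetical file names, not the exact command xloader builds:

    LC_ALL=C sort --buffer-size=50% --parallel=8 -o sorted.tmp unsorted.tmp

Whether more threads help will likely depend on how quickly the disk can feed the merge.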



Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:
Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.

This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out anyway, the development version, which uses less temporary space, got merged in. It has had some testing.

It compresses the triples.tmp and intermediate sort files in the index stage making the peak usage much smaller.

The instance is a vm created on the Norwegian Research and Education Cloud,
an openstack cloud provider.

Instance type:
32 GB memory
4 CPU

I'm using a similar setup on a 7-year-old desktop machine with a SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try to find a way to at least push it through building the node table.

Loading the first 1B of truthy:

1B triples , 40kTPS , 06h 54m 10s

The database is 81G and building needs an additional 11.6G for workspace, for a total of 92G (+ the data file).

While smaller, it seems bz2 files are much slower to decompress, so I've been using gz files.

My current best guess for 6.4B truthy is

Temp        96G
Database   540G
Data        48G
Total:     684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was more space. Also - I don't know if the shape of the data is sufficiently uniform to make scaling predictable.  The time doesn't scale so simply.

This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp - and of the related sort temporary files, which add up to the same again - to about 1/6 of the size.

The storage used for the dump + temp files is mounted as a separate 900GB volume on /var/fuseki/databases. The type of storage is described as:

  *mass-storage-default*: Storage backed by spinning hard drives,
  available to everybody and is the default type.

with ext4 configured. At the moment I don't have access to the faster volume type mass-storage-ssd. CPU and memory are not dedicated, and can be overcommitted.

"overcommitted" may be a problem.

While it's not "tdb2 loader parallel", it does keep several threads continuously busy.

For memory - "it's complicated".

The Java parts only need, say, 2G. The sort is set to use a 50% memory buffer and --parallel=2, and the Java process pipes into sort, which is another thread. I think the effective peak is 3 active threads and they'll all be at 100% for some of the time.

So it's going to need 50% of RAM + 2G for a Java process, + OS - on a 32G instance, that's roughly 16G for sort plus a few GB on top.

It does not need space for memory-mapped files (they aren't used at all in the loading process and I/O is sequential).

If that triggers overcommitment and swapping, performance may go down a lot.

For disk - if that is physically remote, it should not be a problem (famous last words). I/O is sequential and in large continuous chunks - typical for batch processing jobs.

OS for the instance is a clean Rocky Linux image, with no services except jena/fuseki installed. The systemd service set up for fuseki is stopped.
Jena and Fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

Just FYI: Java 17 is a little faster. Some Java improvements have improved RDF parsing speed by up to 10%; in xloader that is not significant to the overall time.

I'm running from a tmux session to avoid connectivity issues and to capture the output.

I use

tdb2.xloader .... |& tee LOG-FILE-NAME

to capture the logs and see them. ">&" and "tail -f" would achieve much the same effect.
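
For example, roughly equivalent with a plain redirect (the log then lives on disk rather than in the tmux scrollback):

    tdb2.xloader .... >& LOG-FILE-NAME &
    tail -f LOG-FILE-NAME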

I think the output is stored in memory and not on disk.
On the first run I tried to have the tmpdir on the root partition, to separate the temp dir and the data dir, but with only 19 GB free, the tmpdir was soon full. For the second (current) run all directories are under /var/fuseki/databases.

Yes - after making that mistake myself, the new version ignores system TMPDIR.  Using --tmpdir is best but otherwise it defaults to the data directory.


  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz

The import is so far at the "ingest data" stage where it has really slowed
down.

FYI: The first line of ingest is always very slow. It is not measuring the start point correctly.


Current output is:

20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

The earlier first pass also slows down, though that should be a fairly constant-speed step once everything settles down.

Some notes:

* There is a (time/info) lapse in the output log between the end of 'parse' and the start of 'index' for Terms. It is unclear to me what is happening in the 1h 13m between the lines.

There is "sort" going on. "top" should show it.

For each index there is also a very long pause for exactly the same reason. It would be good to have something go "tick" and log a message occasionally.
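
In the meantime, a crude way to watch those quiet phases from another shell is to check that sort is running and that its temporary files are still growing, e.g.

    top -b -n 1 | grep sort
    watch -n 60 du -sh /var/fuseki/databases/tmp

(using whatever directory was passed to --tmpdir).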


22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)

* The ingest data step really slows down: at the current rate, if I calculated correctly, it looks like PKG.CmdxIngestData has 10 days left before it finishes.

Ouch.

* When I saw sort running in the background for the first parts of the job, I looked at the `sort` command. I noticed from some online sources that setting the environment variable LC_ALL=C improves speed for `sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`.

        *** WARNING *** The locale specified by the environment affects
        sort order.  Set LC_ALL=C to get the traditional sort order that
        uses native byte values.

It shouldn't matter but, yes, it is better to set and export it in the control script so it propagates to the forked processes.

The sort is doing a binary sort, except that because it is a text sort program, the binary is turned into hex (!!). Hex is in the ASCII subset and should be locale-safe.

But better to set LC_ALL=C.
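
Until then, exporting it before invoking the loader should be enough for the forked sort processes to inherit it, e.g.

    export LC_ALL=C
    $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz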

    Andy



Links:
https://access.redhat.com/solutions/445233
https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort

Best regards,
Øyvind
