Awesome!
I'm really pleased to hear the news.

That's better than I feared at this scale!

How big is triples.tmp.gz? Twice that size, plus the database size, is the peak storage space used. My estimate is about 40G, making 604G overall (2 x 40G + 524G).

I'd appreciate having the whole log file. Could you email it to me?

Currently, I'm trying the 2021-12-08 truthy (from gz, not bz2) on a modern portable with 4 cores and a single notional 1TB SSD. If the estimate is right, it will fit. More good news.


I am getting a slowdown during data ingestion. However, your summary figures don't show that in the ingest phase. The full logs may have the signal in them, but less pronounced.

My working assumption is now that it is random access to the node table. Your results point to it not being a CPU issue, but to my setup saturating the I/O path. While the portable has an NVMe SSD, it probably does not have the same I/O bandwidth as a server-class machine.

I'm not sure what to do about this other than run with a much bigger node table cache for the ingestion phase. Trading some of the memory-mapped file space for a bigger cache should be a win. While I hadn't noticed it before, it is probably visible in logs of smaller loads on closer inspection. Experimenting on a small dataset is a lot easier.


I'm also watching the CPU temperature. When the graphics aren't active, the fans aren't even on. After a few minutes of active screen the fans spin up but the temperatures are still well within the limits. The machine is raised up by 1cm to give good airflow. And I keep the door shut to keep the cats away.

    Andy

Inline ...

On 16/12/2021 08:49, LB wrote:
Loading of latest WD truthy dump (6.6 billion triples) Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD

Nice.

Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2

I found .gz to be slightly faster than .bz2. This may be because .gz is better supported by the Java runtime, or simply because bz2 is designed for best compression.
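A quick way to compare the decompression cost outside of Java is to time the decompressors on their own (a sketch; file names as used elsewhere in this thread):

    time zcat  latest-truthy.nt.gz  > /dev/null    # gzip: decompression only
    time bzcat latest-truthy.nt.bz2 > /dev/null    # bzip2: decompression only

bzcat typically takes several times longer, which puts a floor under the load time whatever the Java side does.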



Log summary:

04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall          143607 seconds
04:14:28 INFO  Overall          39h 53m 27s

Less than 2 days :-)

04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded     = 0
04:14:28 INFO  Overall Rate     46.028 tuples per second


Disk space usage according to

du -sh datasets/wikidata-tdb

  is

524G    datasets/wikidata-tdb



During loading I could see ~90GB of RAM occupied (50% of total memory goes to sort and it used 2 threads - is it intended to stick to 2 threads with --parallel 2?)

It is fixed at two for the sort currently.

There may be some benefit in making this configurable, but previously I've found that more threads do not seem to yield much benefit - though you have a lot more threads! Experiment required.
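The sort stage can be approximated outside the loader with GNU sort to test thread counts (a sketch, not the exact xloader invocation; the input and tmpdir names are illustrative):

    # Compare wall-clock time for different --parallel settings on the same input.
    time sort --buffer-size=50% --parallel=2 -T /data/tmp/tdb input.txt > /dev/null
    time sort --buffer-size=50% --parallel=4 -T /data/tmp/tdb input.txt > /dev/null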



Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:
Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the wikidata truthy
dump downloaded 2021-12-09.

This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out, the development version, which among other things uses less temporary space, got merged in. It has had some testing.

It compresses triples.tmp and the intermediate sort files in the index stage, making the peak usage much smaller.

The instance is a vm created on the Norwegian Research and Education Cloud,
an openstack cloud provider.

Instance type:
32 GB memory
4 CPU

I'm using something similar on a 7-year-old desktop machine with a SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try to find a way to at least push it through building the node table.

Loading the first 1B of truthy:

1B triples, 40kTPS, 06h 54m 10s

The database is 81G and building needs an additional 11.6G of workspace, for a total of 92G (+ the data file).

While smaller, it seems bz2 files are much slower to decompress, so I've been using gz files.

My current best guess for 6.4B truthy is

Temp        96G
Database   540G
Data        48G
Total:     684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was more space. Also - I don't know if the shape of the data is sufficiently uniform to make scaling predictable.  The time doesn't scale so simply.

This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- and of the related sort temporary files, which add up to the same again -- to about 1/6 of the size.

The storage used for dump + temp files is mounted as a separate 900GB volume on /var/fuseki/databases. The type of storage is described as:

  *mass-storage-default*: Storage backed by spinning hard drives,
  available to everybody and is the default type.

with ext4 configured. At the moment I don't have access to the faster volume type mass-storage-ssd. CPU and memory are not dedicated, and can be overcommitted.

"overcommitted" may be a problem.

While it's not "tdb2 loader parallel", it does use CPU continuously in several threads.

For memory - "it's complicated".

The java parts only need, say, 2G. The sort is set to "--buffer-size=50% --parallel=2", and the java process pipes into sort; that's another thread. I think the effective peak is 3 active threads, and they'll all be at 100% for some of the time.

So it's going to need 50% of RAM, + 2G for a java process, + OS.
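For example, on the 32 GB instance here that works out roughly as:

    sort buffer:  50% of 32G = 16G
    java process: ~2G
    OS + rest:    out of the remaining ~14G

so the load fits, provided nothing else on the host claims the memory first.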

It does not need space for memory-mapped files (they aren't used at all in the loading process, and I/O is sequential).

If that triggers over commitment swap out, the performance may go down a lot.

For disk - if that is physically remote, it should not be a problem (famous last words). I/O is sequential and in large continuous chunks - typical for batch processing jobs.

OS for the instance is a clean Rocky Linux image, with no services except jena/fuseki installed. The systemd service set up for fuseki is stopped. Jena and Fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

Just FYI: Java 17 is a little faster. Some Java improvements have increased RDF parsing speed by up to 10%; in xloader that's not significant to the overall time.

I'm running from a tmux session to avoid connectivity issues and to capture the output.

I use

tdb2.xloader .... |& tee LOG-FILE-NAME

to capture the logs and see them. ">&" and "tail -f" would achieve much the same effect.
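For example, the same capture using redirection instead of tee (same placeholders as above):

    tdb2.xloader .... >& LOG-FILE-NAME &
    tail -f LOG-FILE-NAME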

I think the output is stored in memory and not on disk.
On the first run I tried to have the tmpdir on the root partition, to separate the temp dir and data dir, but with only 19 GB free, the tmpdir soon filled the disk. For the second (current) run all directories are under /var/fuseki/databases.

Yes - after making that mistake myself, the new version ignores the system TMPDIR. Using --tmpdir is best, but otherwise it defaults to the data directory.


  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz

The import is so far at the "ingest data" stage where it has really slowed
down.

FYI: The first line of ingest is always very slow. It is not measuring the start point correctly.


Current output is:

20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 /
Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

The earlier first pass also slows down; that step should run at a fairly constant speed once everything settles down.

Some notes:

* There is a (time/info) lapse in the output log between the end of 'parse' and the start of 'index' for Terms. It is unclear to me what is happening in the 1h 13m between the lines.

There is "sort" going on. "top" should show it.

For each index there is also a very long pause, for exactly the same reason. It would be good to have something go "tick" and log a message occasionally.


22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10
22:33:46 CET]
22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds :
6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 /
Avg: 237,755)

* The ingest data step really slows down: at the current rate, if I calculated correctly, it looks like PKG.CmdxIngestData has 10 days left before it finishes.

Ouch.

* When I saw sort running in the background for the first parts of the job,
I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves speed for `sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`.

        *** WARNING *** The locale specified by the environment affects
        sort order.  Set LC_ALL=C to get the traditional sort order that
        uses native byte values.

It shouldn't matter but, yes, it's better to set it and export it in the control script so it propagates to forked processes.

The sort is doing a binary sort, except that because it is a text sort program, the binary is turned into hex (!!). Hex is in the ASCII subset and should be locale-safe.

But better to set LC_ALL=C.
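For example, a control script can set it once and it will propagate to the forked sort (a sketch; the command is as earlier in the thread):

    export LC_ALL=C
    tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb \
        --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2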

    Andy



Links:
https://access.redhat.com/solutions/445233
https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort

Best regards,
Øyvind
