Hi, Øyvind,
This is all very helpful feedback. Thank you.
On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an openstack vm, loading the
wikidata truthy
dump downloaded 2021-12-09.
This is the 4.3.0 xloader?
There are improvements in 4.3.1. Since that release was going out, the
development version -- which includes using less temporary space -- got
merged in. It has had some testing.
It compresses the triples.tmp and intermediate sort files in the index
stage, making the peak disk usage much smaller.
The instance is a vm created on the Norwegian Research and
Education Cloud,
an openstack cloud provider.
Instance type:
32 GB memory
4 CPU
I'm using a similar setup on a 7-year-old desktop machine with a SATA disk.
I haven't got a machine I can dedicate to the multi-day load. I'll
try to find a way to at least push it through building the node table.
Loading the first 1B of truthy:
1B triples, 40k TPS, 06h 54m 10s
The database is 81G and building needs an additional 11.6G of
workspace, for a total of 92G (+ the data file).
While smaller, it seems bz2 files are much slower to decompress, so
I've been using gz files.
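A quick way to compare decompression speed on your own hardware (the
.bz2 file name here is an assumption):

    # gz vs bz2: decompress and discard, compare wall-clock times
    time zcat  latest-truthy.nt.gz  > /dev/null
    time bzcat latest-truthy.nt.bz2 > /dev/null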
My current best guess for 6.4B truthy is
Temp 96G
Database 540G
Data 48G
Total: 684G -- peak disk needed
based on scaling up 1B truthy. Personally, I would make sure there
was more space. Also - I don't know if the shape of the data is
sufficiently uniform to make scaling predictable. The time doesn't
scale so simply.
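As a rough back-of-envelope, assuming linear scaling of the 1B-triple
figures (the numbers above round up for headroom):

    # 6.4x linear scale-up of the 1B-triple run
    echo $(( 81 * 64 / 10 ))     # database: ~518G
    echo $(( 12 * 64 / 10 ))     # temp: ~76G (from 11.6G, rounded to 12)
    echo $(( 518 + 76 + 48 ))    # + 48G data file: ~642G before any margin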
This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.
Compression reduces the size of triples.tmp -- and of the related sort
temporary files, which add up to the same again -- to about 1/6 of the size.
The storage used for dump + temp files is mounted as a separate 900GB
volume on /var/fuseki/databases. The type of storage is described as
*mass-storage-default*: Storage backed by spinning hard drives,
available to everybody and is the default type.
with ext4 configured. At the moment I don't have access to the faster
volume type mass-storage-ssd. CPU and memory are not dedicated, and
can be overcommitted.
"overcommitted" may be a problem.
While it's not "tdb2 loader parallel", it does use continuous CPU in
several threads.
For memory - "it's complicated".
The java parts only need, say, 2G. The sort is set to "buffer 50%
--parallel=2" and the java pipes into sort; that's another thread. I
think the effective peak is 3 active threads and they'll all be at
100% for some of the time.
So it's going to need 50% of RAM + 2G for a java process, + OS.
It does not need space for memory mapped files (they aren't used at
all in the loading process and I/O is sequential).
If that triggers over commitment swap out, the performance may go
down a lot.
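For illustration only -- roughly the shape of the pipeline, not the
exact command line xloader builds (the Java parse step and file names
here are placeholders):

    # a Java parse step (~2G heap) streams text into GNU sort,
    # which buffers 50% of RAM and uses 2 threads
    java -Xmx2G -cp ... ParseStep latest-truthy.nt.gz \
      | sort --buffer-size=50% --parallel=2 \
             --temporary-directory=/var/fuseki/databases/tmp \
      > sorted.tmp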
For disk - if that is physically remote, it should not be a problem
(famous last words). I/O is sequential and in large continuous
chunks - typical for batch processing jobs.
OS for the instance is a clean Rocky Linux image, with no services
except
jena/fuseki installed. The systemd service
set up for fuseki is stopped.
jena and fuseki version is 4.3.0.
openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode,
sharing)
Just FYI: Java 17 is a little faster. Some Java improvements have
improved RDF parsing speed by up to 10%; in xloader that's not
significant to the overall time.
I'm running from a tmux session to avoid connectivity issues and to
capture
the output.
I use
tdb2.xloader .... |& tee LOG-FILE-NAME
to capture the logs and see them. ">&" and "tail -f" would achieve
much the same effect.
I think the output is stored in memory and not on disk.
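With the redirect form the log also ends up in a file on disk, e.g.:

    # same effect with a redirect plus tail
    tdb2.xloader .... >& LOG-FILE-NAME &
    tail -f LOG-FILE-NAME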
On the first run I tried to have the tmpdir on the root partition, to
separate the temp dir and data dir, but with only 19 GB free, the
tmpdir was soon disk full. For the second (current) run all
directories are under /var/fuseki/databases.
Yes - after making that mistake myself, the new version ignores
system TMPDIR. Using --tmpdir is best but otherwise it defaults to
the data directory.
$JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy
--tmpdir
/var/fuseki/databases/tmp latest-truthy.nt.gz
The import is so far at the "ingest data" stage where it has really
slowed
down.
FYI: The first line of ingest is always very slow. It is not
measuring the start point correctly.
Current output is:
20:03:43 INFO  Data :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)
See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
The earlier first pass also slows down, and that step should run at a
fairly constant-ish speed once everything settles down.
Some notes:
* There is a (time/info) lapse in the output log between the end of
'parse' and the start of 'index' for Terms. It is unclear to me what
is happening in the 1h 13m between the lines.
There is "sort" going on. "top" should show it.
For each index there is also a very long pause for exactly the same
reason. It would be good to have something go "tick" and log a
message occasionally.
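Until then, one way to confirm progress is to watch the sort process
and the growth of its work files under the --tmpdir, e.g.:

    # sort's work files live under the --tmpdir given to xloader
    top -b -n 1 | grep sort
    watch -n 60 du -sh /var/fuseki/databases/tmp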
22:33:46 INFO  Terms :: Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
22:33:52 INFO  Terms :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)
* The "ingest data" stage really slows down: at the current rate, if
I calculated correctly (rough check below), it looks like
PKG.CmdxIngestData has 10 days left before it finishes.
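Roughly, using the total from the parse line and the average rate from
the ingest line above:

    # remaining triples / average ingest rate
    echo $(( (6560468631 - 502000000) / 7593 ))         # ~798,000 seconds
    echo $(( (6560468631 - 502000000) / 7593 / 86400 )) # ~9 days at the average rate
    # at the recent batch rate (3,356 TPS) it is closer to 21 days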
Ouch.
* When I saw sort running in the background for the first parts of
the job,
I looked at the `sort` command. I noticed from some online sources
that
setting the environment variable LC_ALL=C improves speed for
`sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for
`sort`.
*** WARNING *** The locale specified by the environment affects
sort order. Set LC_ALL=C to get the traditional sort order that
uses native byte values.
It shouldn't matter but, yes, better to set it and export it in the
control script and propagate it to forked processes.
The sort is doing a binary sort, except that because it is a text
sort program the binary is turned into hex (!!). Hex is in the ASCII
subset and should be locale safe.
But better to set LC_ALL=C.
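i.e. in the control script that launches the load, something like:

    # set and export before launching; forked processes (including sort) inherit it
    export LC_ALL=C
    $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy \
        --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz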
Andy
Links:
https://access.redhat.com/solutions/445233
https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram
https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
Best regards,
Øyvind