Sure

wikidata-tdb/Data-0001:
total 524G
-rw-r--r-- 1   24 Dez 15 05:41 GOSP.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GOSP.idn
-rw-r--r-- 1   24 Dez 15 05:41 GPOS.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GPOS.idn
-rw-r--r-- 1   24 Dez 15 05:41 GPU.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GPU.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GPU.idn
-rw-r--r-- 1   24 Dez 15 05:41 GSPO.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 GSPO.idn
-rw-r--r-- 1    0 Dez 15 05:41 journal.jrnl
-rw-r--r-- 1   24 Dez 15 05:41 nodes.bpt
-rw-r--r-- 1  36G Dez 15 05:41 nodes.dat
-rw-r--r-- 1   16 Dez 15 05:41 nodes-data.bdf
-rw-r--r-- 1  44G Dez 15 05:41 nodes-data.obj
-rw-r--r-- 1 312M Dez 15 05:41 nodes.idn
-rw-r--r-- 1   24 Dez 15 05:41 OSP.bpt
-rw-r--r-- 1 148G Dez 16 04:14 OSP.dat
-rw-r--r-- 1   24 Dez 15 05:41 OSPG.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 OSPG.idn
-rw-r--r-- 1 528M Dez 16 04:14 OSP.idn
-rw-r--r-- 1   24 Dez 15 05:41 POS.bpt
-rw-r--r-- 1 148G Dez 15 21:17 POS.dat
-rw-r--r-- 1   24 Dez 15 05:41 POSG.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 POSG.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 POSG.idn
-rw-r--r-- 1 528M Dez 15 21:17 POS.idn
-rw-r--r-- 1   24 Dez 15 05:41 prefixes.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.dat
-rw-r--r-- 1   16 Dez 15 05:41 prefixes-data.bdf
-rw-r--r-- 1    0 Dez 14 12:21 prefixes-data.obj
-rw-r--r-- 1 8,0M Dez 14 12:21 prefixes.idn
-rw-r--r-- 1   24 Dez 15 05:41 SPO.bpt
-rw-r--r-- 1 148G Dez 15 11:25 SPO.dat
-rw-r--r-- 1   24 Dez 15 05:41 SPOG.bpt
-rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.dat
-rw-r--r-- 1 8,0M Dez 14 12:21 SPOG.idn
-rw-r--r-- 1 528M Dez 15 11:25 SPO.idn
-rw-r--r-- 1    8 Dez 15 21:17 tdb.lock

On 16.12.21 10:27, Marco Neumann wrote:
Thank you Lorenz, can you please post a directory list for Data-0001 with file sizes?


On Thu, Dec 16, 2021 at 8:49 AM LB <conpcompl...@googlemail.com.invalid>
wrote:

Loading of the latest WD truthy dump (6.6 billion triples), Bzip2 compressed:

Server:

AMD Ryzen 9 5950X  (16C/32T)
128 GB DDR4 ECC RAM
2 x 3.84 TB NVMe SSD


Environment:

- Ubuntu 20.04.3 LTS
- OpenJDK Runtime Environment (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
- Jena 4.3.1


Command:

tools/apache-jena-4.3.1/bin/tdb2.xloader --tmpdir /data/tmp/tdb --loc datasets/wikidata-tdb datasets/latest-truthy.nt.bz2

Log summary:

04:14:28 INFO  Load node table  = 36600 seconds
04:14:28 INFO  Load ingest data = 25811 seconds
04:14:28 INFO  Build index SPO  = 20688 seconds
04:14:28 INFO  Build index POS  = 35466 seconds
04:14:28 INFO  Build index OSP  = 25042 seconds
04:14:28 INFO  Overall          143607 seconds
04:14:28 INFO  Overall          39h 53m 27s
04:14:28 INFO  Triples loaded   = 6.610.055.778
04:14:28 INFO  Quads loaded     = 0
04:14:28 INFO  Overall Rate     46.028 tuples per second
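(As a cross-check: the phase times add up to the overall figure - 36,600 + 25,811 + 20,688 + 35,466 + 25,042 = 143,607 seconds = 39h 53m 27s - and 6,610,055,778 triples / 143,607 s ≈ 46,028 tuples/s; the log prints numbers with '.' as the thousands separator.)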

Disk space usage according to du -sh datasets/wikidata-tdb is:

524G    datasets/wikidata-tdb

During loading I could see ~90 GB of RAM occupied (50% of total memory went to sort, and it used 2 threads - is it intended to stick to 2 threads with --parallel 2?)


Cheers,
Lorenz


On 12.12.21 13:07, Andy Seaborne wrote:
Hi, Øyvind,

This is all very helpful feedback. Thank you.

On 11/12/2021 21:45, Øyvind Gjesdal wrote:
I'm trying out tdb2.xloader on an OpenStack VM, loading the Wikidata truthy dump downloaded 2021-12-09.
This is the 4.3.0 xloader?

There are improvements in 4.3.1. Since that release was going out anyway, the development version, which uses less temporary space, got merged in. It has had some testing.

It compresses the triples.tmp and intermediate sort files in the index stage, making the peak usage much smaller.

The instance is a VM created on the Norwegian Research and Education Cloud, an OpenStack cloud provider.

Instance type:
32 GB memory
4 CPU
I'm using something similar on a 7-year-old desktop machine, SATA disk.

I haven't got a machine I can dedicate to the multi-day load. I'll try to find a way to at least push it through building the node table.

Loading the first 1B of truthy:

1B triples, 40k TPS, 06h 54m 10s

The database is 81G, and building needs an additional 11.6G for workspace, for a total of 92G (+ the data file).

While smaller, it seems bz2 files are much slower to decompress, so I've been using gz files.
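A one-off recompression is one way to get a gz copy to load from - for example (assuming bzcat and gzip are on the path; the filename follows the one used in this thread):

bzcat latest-truthy.nt.bz2 | gzip > latest-truthy.nt.gz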

My current best guess for 6.4B truthy is

Temp        96G
Database   540G
Data        48G
Total:     684G  -- peak disk needed

based on scaling up 1B truthy. Personally, I would make sure there was
more space. Also - I don't know if the shape of the data is
sufficiently uniform to make scaling predictable.  The time doesn't
scale so simply.

This is the 4.3.1 version - the 4.3.0 uses a lot more disk space.

Compression reduces the size of triples.tmp -- and the related sort temporary files, which add up to the same again -- to about 1/6 of the size.

The storage used for the dump + temp files is mounted as a separate 900 GB volume on /var/fuseki/databases. The type of storage is described as

   *mass-storage-default*: Storage backed by spinning hard drives, available to everybody and is the default type.

with ext4 configured. At the moment I don't have access to the faster volume type mass-storage-ssd. CPU and memory are not dedicated, and can be overcommitted.
"overcommitted" may be a problem.

While it's not "tdb2 loader parallel", it does use continuous CPU across several threads.

For memory - "it's complicated".

The Java parts only need, say, 2G. The sort is set to "buffer 50% --parallel=2", and the Java process piping into sort is another thread. I think the effective peak is 3 active threads, and they'll all be at 100% for some of the time.

So it's going to need 50% of RAM, + 2G for a Java process, + OS.
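For illustration, with the standard GNU sort spelling of those options (not necessarily the exact command xloader builds), the pipeline is along the lines of:

... | sort --buffer-size=50% --parallel=2 --temporary-directory="$TMPDIR" | ...

so the big memory consumer is sort's buffer rather than the JVM heap.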

It does not need space for memory-mapped files (they aren't used at all in the loading process, and I/O is sequential).

If that triggers overcommitment swap-out, the performance may go down a lot.

For disk - if that is physically remote, it should not be a problem (famous last words). I/O is sequential and in large continuous chunks - typical for batch processing jobs.

The OS for the instance is a clean Rocky Linux image, with no services except Jena/Fuseki installed. The systemd service set up for Fuseki is stopped. The Jena and Fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)
Just FYI: Java 17 is a little faster. Some Java improvements have increased RDF parsing speed by up to 10%; in xloader that is not significant to the overall time.

I'm running from a tmux session to avoid connectivity issues and to capture the output.
I use

tdb2.xloader .... |& tee LOG-FILE-NAME

to capture the logs and see them. ">&" and "tail -f" would achieve much the same effect.
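For example (LOG-FILE-NAME is again just a placeholder):

tdb2.xloader .... >& LOG-FILE-NAME &
tail -f LOG-FILE-NAME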

I think the output is stored in memory and not on disk.
On the first run I tried to have the tmpdir on the root partition, to separate the temp dir and data dir, but with only 19 GB free, the tmpdir disk was soon full. For the second (current) run, all directories are under /var/fuseki/databases.
Yes - after making that mistake myself, the new version ignores the system TMPDIR. Using --tmpdir is best, but otherwise it defaults to the data directory.

   $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz

The import is so far at the "ingest data" stage, where it has really slowed down.
FYI: The first line of ingest is always very slow. It is not measuring
the start point correctly.

Current output is:

20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab
The earlier first pass also slows down, and that step should run at a fairly constant-ish speed once everything settles down.

Some notes:

* There is a (time/info) lapse in the output log between the end of 'parse' and the start of 'index' for Terms. It is unclear to me what is happening in the 1h 13m between the lines.
There is "sort" going on. "top" should show it.

For each index there is also a very long pause for exactly the same reason. It would be good to have something go "tick" and log a message occasionally.

22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)

* The ingest data step really slows down: at the current rate, if I calculated correctly, it looks like PKG.CmdxIngestData has 10 days left before it finishes.
Ouch.

* When I saw sort running in the background for the first parts of
the job,
I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves speed for `sort`.
Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`.

         *** WARNING *** The locale specified by the environment affects
         sort order.  Set LC_ALL=C to get the traditional sort order that
         uses native byte values.
It shouldn't matter but, yes, better to set it and export it in the control script so it propagates to forked processes.

The sort is doing a binary sort, except that because it is a text sort program, the binary is turned into hex (!!). Hex is in the ASCII subset and should be locale-safe.

But better to set LC_ALL=C.
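For example, exporting it before launching the loader is enough - child processes, including the sort started from Java, inherit the environment:

export LC_ALL=C
$JENA_HOME/bin/tdb2.xloader --tmpdir ... --loc ... latest-truthy.nt.gz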

     Andy


Links:
https://access.redhat.com/solutions/445233

https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram

https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort

Best regards,
Øyvind

