On 14/12/2021 10:38, Øyvind Gjesdal wrote:
Hi Marco,

Very useful to compare with your log on the different runs. I'm still working
with the configuration to see if I can get the ingest data stage to be usable
on HDD. It looks like I get close to the performance of your run on the
earlier stages, while ingest data is still very much too slow. Using SSD may
be necessary for a real-world large import to complete? I'll request some SSD
storage as well, and hope there's quota for me :)

The access patterns should (tm!) be spinning-disk friendly. There is no random IO updating B+trees directly.

All the B+Trees are written "bottom-up" by specially writing blocks of the right layout to disk, not via the B+Tree runtime code which would be top down via "add record" style access.
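
For illustration only (not the actual TDB2 code; the names below are invented), the bottom-up idea is roughly: the already-sorted records are packed into leaf blocks appended in sequence, then each parent level is built from the first key of every block below it, so each block is written exactly once, sequentially:

    // Sketch only -- not Jena's code. Bottom-up bulk load of a B+Tree-like
    // index from an already-sorted key stream: leaves are packed and appended
    // in order, and each parent level holds the first key of every child block.
    import java.util.ArrayList;
    import java.util.List;

    public class BottomUpBulkLoadSketch {
        static final int KEYS_PER_BLOCK = 4; // tiny for illustration; real blocks hold far more

        public static void main(String[] args) {
            List<Long> sortedKeys = new ArrayList<>();           // pretend: output of the sort phase
            for (long k = 1; k <= 20; k++) sortedKeys.add(k);

            List<List<Long>> level = packIntoBlocks(sortedKeys); // level 0 = leaves, written sequentially
            int levelNo = 0;
            while (level.size() > 1) {
                System.out.printf("level %d: %d blocks appended%n", levelNo, level.size());
                List<Long> separators = new ArrayList<>();       // one separator key per child block
                for (List<Long> block : level) separators.add(block.get(0));
                level = packIntoBlocks(separators);
                levelNo++;
            }
            System.out.printf("level %d: root %s%n", levelNo, level.get(0));
        }

        static List<List<Long>> packIntoBlocks(List<Long> keys) {
            List<List<Long>> blocks = new ArrayList<>();
            for (int i = 0; i < keys.size(); i += KEYS_PER_BLOCK)
                blocks.add(new ArrayList<>(keys.subList(i, Math.min(i + KEYS_PER_BLOCK, keys.size()))));
            return blocks;
        }
    }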

    Andy

Maybe I could also test different distros, to see if some of the default OS
settings affect the import.

Best regards,
Øyvind

On Sun, Dec 12, 2021 at 10:21 AM Marco Neumann <marco.neum...@gmail.com> wrote:

Øyvind, looks like the above was the wrong log from a prior sharding
experiment.

This is the correct log file for the truthy dataset.

http://www.lotico.com/temp/LOG-98085



On Sat, Dec 11, 2021 at 10:02 PM Marco Neumann <marco.neum...@gmail.com>
wrote:

Thank you Øyvind for sharing, great to see more tests in the wild.

I did the test with a 1TB SSD / RAID1 / 64GB / Ubuntu and the truthy
dataset and quickly ran out of disk space. It finished the job but did not
write any of the indexes to disk due to lack of space. No error messages.
http://www.lotico.com/temp/LOG-95239

I have now ordered a new 4TB SSD drive to rerun the test, possibly with the
full Wikidata dataset.

I personally have had the best experience with dedicated hardware so far (it
can be in the data center); shared or dedicated virtual compute engines did
not deliver as expected. And I have not seen great benefits from data center
grade multicore CPUs, but I think they will help during runtime in multi-user
settings (e.g. Fuseki).

Best,
Marco

On Sat, Dec 11, 2021 at 9:45 PM Øyvind Gjesdal <oyvin...@gmail.com>
wrote:

I'm trying out tdb2.xloader on an OpenStack VM, loading the Wikidata truthy
dump downloaded 2021-12-09.

The instance is a VM created on the Norwegian Research and Education Cloud,
an OpenStack cloud provider.

Instance type:
32 GB memory
4 CPU

The storage used for the dump + temp files is a separate 900GB volume,
formatted with ext4 and mounted on /var/fuseki/databases. The volume type is
described as
  *mass-storage-default*: Storage backed by spinning hard drives,
  available to everybody and is the default type.
At the moment I don't have access to the faster volume type mass-storage-ssd.
CPU and memory are not dedicated, and can be overcommitted.

The OS for the instance is a clean Rocky Linux image, with no services
except Jena/Fuseki installed. The systemd service set up for Fuseki is
stopped. The Jena and Fuseki version is 4.3.0.

openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.13+8-LTS, mixed mode, sharing)

I'm running from a tmux session to avoid connectivity issues and to capture
the output. I think the output is stored in memory and not on disk.
On the first run I tried to have the tmpdir on the root partition, to
separate the temp dir and the data dir, but with only 19 GB free, the tmpdir
soon filled the disk. For the second (current) run all directories are under
/var/fuseki/databases.

  $JENA_HOME/bin/tdb2.xloader --loc /var/fuseki/databases/wd-truthy \
      --tmpdir /var/fuseki/databases/tmp latest-truthy.nt.gz

The import is currently at the "ingest data" stage, where it has really
slowed down.

Current output is:

20:03:43 INFO  Data            :: Add: 502,000,000 Data (Batch: 3,356 / Avg: 7,593)

See full log so far at
https://gist.github.com/OyvindLGjesdal/c1f61c0f7d3ab5808144d9455cd383ab

Some notes:

* There is a (time/info) lapse in the output log between the end of
'parse' and the start of 'index' for Terms. It is unclear to me what is
happening in the 1h13m between these lines:

22:33:46 INFO  Terms           ::   Elapsed: 50,720.20 seconds [2021/12/10 22:33:46 CET]
22:33:52 INFO  Terms           :: == Parse: 50726.071 seconds : 6,560,468,631 triples/quads 129,331 TPS
23:46:13 INFO  Terms           :: Add: 1,000,000 Index (Batch: 237,755 / Avg: 237,755)

* The ingest data stage really slows down: at the current rate, if I
calculated correctly, it looks like PKG.CmdxIngestData has about 10 days
left before it finishes (rough check below).
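
  Rough check (my own arithmetic, assuming the logged average of ~7,593
  triples/s holds for the rest of the stage):

    6,560,468,631 - 502,000,000 ≈ 6,058,000,000 triples remaining
    6,058,000,000 / 7,593 ≈ 798,000 seconds ≈ 9-10 days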

* When I saw sort running in the background during the first parts of the
job, I looked at the `sort` command. I noticed from some online sources that
setting the environment variable LC_ALL=C improves the speed of `sort`. Could
this be set on the ProcessBuilder for the `sort` process? Could it
break/change something? I see the warning from the man page for `sort`
(quoted below; a sketch of what I mean follows the links).

        *** WARNING *** The locale specified by the environment affects
        sort order.  Set LC_ALL=C to get the traditional sort order that
        uses native byte values.

Links:
https://access.redhat.com/solutions/445233


https://unix.stackexchange.com/questions/579251/how-to-use-parallel-to-speed-up-sort-for-big-files-fitting-in-ram


https://stackoverflow.com/questions/7074430/how-do-we-sort-faster-using-unix-sort
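
For illustration, this is the sort of thing I mean (a minimal sketch only,
assuming the loader launches `sort` through ProcessBuilder; the file names
here are made up):

    // Sketch only: run an external sort with LC_ALL=C so it compares raw
    // byte values instead of locale collation, which is typically much faster.
    import java.io.IOException;

    public class SortWithCLocale {
        public static void main(String[] args) throws IOException, InterruptedException {
            ProcessBuilder pb = new ProcessBuilder("sort", "input.txt", "-o", "sorted.txt");
            pb.environment().put("LC_ALL", "C"); // force byte-order collation
            pb.inheritIO();                      // show sort's output/errors on this terminal
            int exit = pb.start().waitFor();
            System.out.println("sort exited with " + exit);
        }
    }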

Best regards,
Øyvind



--


---
Marco Neumann
KONA



--


---
Marco Neumann
KONA

