[
https://issues.apache.org/jira/browse/JENA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528493#comment-17528493
]
R Pope commented on JENA-2314:
------------------------------
Kia ora anō, hello again,
After scrapping Singularity and its peculiarities, we went back to the
system engineers at our HPC provider for help. Our local HPC infrastructure,
NeSI, uses GPFS, a parallel file system that works very poorly with memory
mapping. This significantly hampered the performance of the loader,
regardless of which one we used (parallel, default, etc.).
The suggested fix for running in slurm:
cd ${TMPDIR:?}
mkdir tmp
export TMPDIR=$TMPDIR/tmp
export JVM_ARGS="-Djava.io.tmpdir=$TMPDIR -Xmx4G -XshowSettings:vm -Xss512M"
ln -s /your/path/to/{n-quads,triplestore} ./
tdb2.tdbloader -v --loader=parallel {n-quads}
$TMPDIR is [NeSI's temporary files location
|https://support.nesi.org.nz/hc/en-gb/articles/207765367-Java], so effectively
we have bypassed GPFS by using a symlink and passing Java a local version of
/tmp, as opposed to one on the parallel file system. Well, that's my (very
limited) understanding of it. There is probably nothing actionable for Jena,
but it is something to consider on parallel file systems in HPC environments.
The average speed-up was from about 500 TPS to about 120,000 TPS.
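For anyone wanting to check their own setup before a long load, here is a
minimal sketch of how one might test which file system backs a directory. It
assumes GNU coreutils stat (the -f -c %T options). The function names
fs_type and is_shared_fs, and the list of file system names treated as
"shared", are illustrative assumptions and not anything from NeSI or Jena.

```shell
#!/bin/sh
# Sketch only: helper names and the list of "shared" types are assumptions.

# Print the file system type backing a directory (GNU coreutils stat).
fs_type() {
  stat -f -c %T "$1"
}

# Succeed (exit 0) if the given type name looks like a parallel or network
# file system, where memory-mapped files may perform poorly.
is_shared_fs() {
  case "$1" in
    gpfs|lustre|nfs*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It could be used at the top of a job script, for example:
is_shared_fs "$(fs_type "${TMPDIR:-/tmp}")" && echo "TMPDIR is on a shared file system" >&2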
Thanks a lot for the help :)
> tdb2.tdbloader performance issue
> --------------------------------
>
> Key: JENA-2314
> URL: https://issues.apache.org/jira/browse/JENA-2314
> Project: Apache Jena
> Issue Type: Question
> Components: TDB2
> Affects Versions: Jena 4.0.0, Jena 4.2.0, Jena 4.4.0
> Environment: Java maximum memory: 12884901888
> symbol:http://jena.apache.org/ARQ#regexImpl =
> symbol:http://jena.apache.org/ARQ#javaRegex
> symbol:http://jena.apache.org/ARQ#registryFunctions =
> org.apache.jena.sparql.function.FunctionRegistry@1536602f
> symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
> symbol:http://jena.apache.org/ARQ#registryPropertyFunctions =
> org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@4ebea12c
> symbol:http://jena.apache.org/ARQ#stageGenerator =
> org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@2a1edad4
> symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
> symbol:http://jena.apache.org/ARQ#strictSPARQL = false
> 13:02:36 INFO loader :: Loader = LoaderParallel
> 13:02:36 INFO loader :: Start: 6 files
> 13:02:48 INFO loader :: Add: 500,000 bdmhistoricalrecords.nq
> (Batch: 40,361 / Avg: 40,361)
> 13:03:00 INFO loader :: Add: 1,000,000 bdmhistoricalrecords.nq
> (Batch: 44,907 / Avg: 42,513)
> 13:03:10 INFO loader :: Add: 1,500,000 bdmhistoricalrecords.nq
> (Batch: 47,980 / Avg: 44,191)
> 13:03:25 INFO loader :: Add: 2,000,000 bdmhistoricalrecords.nq
> (Batch: 32,486 / Avg: 40,539)
> 13:33:06 INFO loader :: Add: 2,500,000 bdmhistoricalrecords.nq
> (Batch: 280 / Avg: 1,366)
> 14:30:30 INFO loader :: Add: 3,000,000 bdmhistoricalrecords.nq
> (Batch: 145 / Avg: 568)
> 14:52:29 INFO loader :: Add: 3,500,000 bdmhistoricalrecords.nq
> (Batch: 378 / Avg: 530)
> Reporter: R Pope
> Priority: Major
>
> Kia ora, Hi there,
> We have been using tdb2.tdbloader to load ~400,000,000 triples into our
> triplestore - all the data is in nq format, having previously been converted
> from JSON-LD. The files we are loading range from ~10GB to ~50GB producing a
> triplestore ~180GB including a text index. We run the loader in an HPC
> environment so we can request as much memory as we need, often using 1TB to
> do the load. The job is run in a Singularity image (similar to docker) and
> slurm is the chosen workload manager.
> All that aside, the load typically takes ~12-16 hours, and no more than 24
> hours, with --loader=parallel and an average rate of ~5,000 triples per
> second. We haven't needed to run the loader since October 2021, and on
> recently running the load job again we are getting a grand average of
> ~500 triples per second. We haven't been able to wait and see if it even
> finishes.
> Has anyone else experienced such a big performance loss with tdb2.tdbloader
> in the current or recent versions of jena? Apart from the potential
> investigation that can be done on the slurm/HPC side does anyone have advice
> around performance?
> Thanks in advance