[ 
https://issues.apache.org/jira/browse/JENA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507489#comment-17507489
 ] 

Rob Vesse commented on JENA-2314:
---------------------------------

Some HPC-specific thoughts that you might want to experiment with.

The IO in some of the TDB loaders ends up being pretty random, so much of your 
slowdown may be down to IO patterns.  [~andy] knows more about that aspect, and 
I suspect he'll drop a comment shortly with some general suggestions, since 
there are different loaders that you could try using.

On the HPC-specific side, I assume your IO is going to some parallel file system?

One technique that has worked well in the past for avoiding some IO bottlenecks 
is to do the IO via a loopback mount, i.e. create one large file on the parallel 
file system and mount it into the container as a loopback mount.  This 
tends to have a bigger impact when processes are manipulating lots of small 
files, which TDB loaders don't tend to do, so it may have minimal impact in your 
case.
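
As a rough illustration of the loopback idea, the sketch below only builds the 
commands one might run; it does not execute them.  The image path, size, mount 
point and the choice of ext4 are all assumptions for illustration, mounting 
normally needs root (or site-specific tooling), and how the mounted directory 
is then bound into the container depends on your site and is not shown.

```python
import shlex
from typing import List


def loopback_mount_commands(image_path: str, size: str, mount_point: str) -> List[List[str]]:
    """Build (but do not run) commands for a file-backed loopback mount.

    All three arguments are hypothetical examples supplied by the caller.
    """
    return [
        # Create one large sparse file on the parallel file system.
        ["truncate", "-s", size, image_path],
        # Put a local file system inside it; -F is needed because the
        # target is a regular file rather than a block device.
        ["mkfs.ext4", "-F", image_path],
        # Mount the file via a loop device (usually requires root).
        ["mount", "-o", "loop", image_path, mount_point],
    ]


if __name__ == "__main__":
    # Hypothetical paths and size; adjust to the expected database size.
    for cmd in loopback_mount_commands("/scratch/tdb.img", "200G", "/mnt/tdb"):
        print(shlex.join(cmd))
```

The TDB2 database directory would then live under the mount point, so the 
parallel file system sees IO against one large file rather than against the 
individual index files.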

Another consideration is how the input and output directories (and the files 
therein) are striped on the parallel file system.  I would suggest looking at 
how the existing input and output is striped (stripe count and size) and 
whether there's capacity on your filesystem to increase that.  This is probably 
more important on the output side where IO is going to be more random as it 
builds the indexes.  A larger stripe count should spread the random IO more 
evenly over the file system, potentially mitigating some of the IO slowdown you 
appear to be encountering.
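
The thread does not say which parallel file system is in use.  Assuming it is 
Lustre, inspecting and changing striping could look roughly like the sketch 
below, which only builds the `lfs` command lines rather than running them.  The 
directory name, stripe count and stripe size are hypothetical; other file 
systems have their own tools.

```python
import shlex
from typing import List


def getstripe_command(path: str) -> List[str]:
    """Command to show the current stripe count and size (Lustre assumed)."""
    return ["lfs", "getstripe", path]


def setstripe_command(directory: str, stripe_count: int, stripe_size: str) -> List[str]:
    """Command to set the default striping for new files in a directory.

    -c is the stripe count and -S the stripe size (for example "4M").
    """
    return ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, directory]


if __name__ == "__main__":
    # Hypothetical output directory and settings, for illustration only.
    out_dir = "/scratch/tdb2-output"
    print(shlex.join(getstripe_command(out_dir)))
    print(shlex.join(setstripe_command(out_dir, 8, "4M")))
```

Whether 8 stripes of 4M is sensible depends entirely on how many storage 
targets your file system has, so treat those values as placeholders.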

Bear in mind that changing the stripe count and size for a file/directory 
typically only affects newly created files unless you migrate the files, 
whether implicitly by copying them to a directory with the desired stripe 
settings or by using an explicit migrate command to have the parallel file 
system migrate the data in-place.
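
Again assuming Lustre (an assumption, not something stated in this thread), an 
explicit migration of existing files might be sketched as below.  The function 
only builds one `lfs migrate` command per file; the file names and stripe 
settings are hypothetical.

```python
import shlex
from typing import Iterable, List


def migrate_commands(files: Iterable[str], stripe_count: int, stripe_size: str) -> List[List[str]]:
    """Build one in-place restripe command per existing file (Lustre assumed)."""
    return [
        ["lfs", "migrate", "-c", str(stripe_count), "-S", stripe_size, path]
        for path in files
    ]


if __name__ == "__main__":
    # Hypothetical input files that were created with the old striping.
    existing = ["/scratch/input/a.nq", "/scratch/input/b.nq"]
    for cmd in migrate_commands(existing, 8, "4M"):
        print(shlex.join(cmd))
```

The implicit alternative is simply to copy the files into a directory that 
already has the desired stripe settings, so the copies are created with them.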

> tdb2.tdbloader performace issue
> -------------------------------
>
>                 Key: JENA-2314
>                 URL: https://issues.apache.org/jira/browse/JENA-2314
>             Project: Apache Jena
>          Issue Type: Question
>          Components: TDB2
>    Affects Versions: Jena 4.0.0, Jena 4.2.0, Jena 4.4.0
>         Environment: Java maximum memory: 12884901888
> symbol:http://jena.apache.org/ARQ#regexImpl = 
> symbol:http://jena.apache.org/ARQ#javaRegex
> symbol:http://jena.apache.org/ARQ#registryFunctions = 
> org.apache.jena.sparql.function.FunctionRegistry@1536602f
> symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
> symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = 
> org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@4ebea12c
> symbol:http://jena.apache.org/ARQ#stageGenerator = 
> org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@2a1edad4
> symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
> symbol:http://jena.apache.org/ARQ#strictSPARQL = false
> 13:02:36 INFO  loader          :: Loader = LoaderParallel
> 13:02:36 INFO  loader          :: Start: 6 files
> 13:02:48 INFO  loader          :: Add: 500,000 bdmhistoricalrecords.nq 
> (Batch: 40,361 / Avg: 40,361)
> 13:03:00 INFO  loader          :: Add: 1,000,000 bdmhistoricalrecords.nq 
> (Batch: 44,907 / Avg: 42,513)
> 13:03:10 INFO  loader          :: Add: 1,500,000 bdmhistoricalrecords.nq 
> (Batch: 47,980 / Avg: 44,191)
> 13:03:25 INFO  loader          :: Add: 2,000,000 bdmhistoricalrecords.nq 
> (Batch: 32,486 / Avg: 40,539)
> 13:33:06 INFO  loader          :: Add: 2,500,000 bdmhistoricalrecords.nq 
> (Batch: 280 / Avg: 1,366)
> 14:30:30 INFO  loader          :: Add: 3,000,000 bdmhistoricalrecords.nq 
> (Batch: 145 / Avg: 568)
> 14:52:29 INFO  loader          :: Add: 3,500,000 bdmhistoricalrecords.nq 
> (Batch: 378 / Avg: 530)
>            Reporter: R Pope
>            Priority: Major
>
> Kia ora, Hi there,
> We have been using tdb2.tdbloader to load ~400,000,000 triples into our 
> triplestore - all the data is in nq format being previously converted from 
> JSONLD. The files we are loading range from ~10GB to ~50GB producing a 
> triplestore ~180GB including a text index. We run the loader in an HPC 
> environment so we can request as much memory as we need, often using 1TB to 
> do the load. The job is run in a Singularity image (similar to docker) and 
> slurm is the chosen workload manager.
> All that aside, the load typically takes ~12-16hours but no more than 24 
> hours with --loader=parallel and an average rate of ~5,000 triples per 
> second. We haven't needed to run the loader since October 2021, so upon 
> recently running the load job again we are getting a grand average of about 
> ~500 triples per second. Haven't been able to wait and see if it even finishes.
> Has anyone else experienced such a big performance loss with tdb2.tdbloader 
> in the current or recent versions of jena? Apart from the potential 
> investigation that can be done on the slurm/HPC side does anyone have advice 
> around performance?
> Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
