[jira] [Commented] (JENA-2314) tdb2.tdbloader performace issue

Andy Seaborne (Jira) Wed, 16 Mar 2022 03:29:06 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17507472#comment-17507472
 ]


Andy Seaborne commented on JENA-2314:
-------------------------------------

5000 triples/s is very slow, let alone 500 TPS.
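
For scale, here is a rough back-of-the-envelope estimate. It assumes the ~400,000,000 triples quoted in the report and a constant load rate, which real loads do not have, so treat the numbers as indicative only. The function name is purely illustrative.

```python
def load_hours(triples: int, triples_per_second: float) -> float:
    """Hours needed to load `triples` at a constant rate."""
    return triples / triples_per_second / 3600


TRIPLES = 400_000_000  # dataset size quoted in the report

for rate in (5_000, 500):
    hours = load_hours(TRIPLES, rate)
    print(f"{rate:>5} triples/s: {hours:6.1f} hours ({hours / 24:.1f} days)")
```

At 5,000 triples/s that is roughly 22 hours; at 500 triples/s it is more than 9 days.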

Jena 4.0.0 was released in 2021-03, so presumably the current situation arises 
with the same code as was used in October 2021.

There haven't been any significant changes to the loader except the new 
"tdb2.xloader", which is normally for datasets larger than 400M.

Have you tried the other loaders, 'phased' (the default) and 'sequential'? 
Each loader has different I/O patterns, and "parallel" does a lot of random I/O 
to memory mapped files. (I don't know HPC environments or Singularity.) 
"parallel" does best with local SSDs, and less well with spinning disk.
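
One way to compare the loaders is to run each on the same sample of the data into a fresh database directory and compare the reported rates. The sketch below only builds and prints the command lines; it does not run them. The directory and file names are hypothetical, and the helper name is illustrative. `--loader` and `--loc` are the tdb2.tdbloader options for choosing the loader and the database location.

```python
import shlex

# Loader names taken from this thread; "phased" is the default.
LOADERS = ("phased", "sequential", "parallel")


def loader_command(loader: str, db_dir: str, files: list[str]) -> list[str]:
    """Build a tdb2.tdbloader command line for one loader strategy."""
    if loader not in LOADERS:
        raise ValueError(f"unknown loader: {loader}")
    return ["tdb2.tdbloader", f"--loader={loader}", "--loc", db_dir, *files]


# Hypothetical paths: a fresh database directory per loader, one sample file.
for name in LOADERS:
    cmd = loader_command(name, f"/scratch/DB-{name}", ["sample.nq"])
    print(shlex.join(cmd))
```

Using a separate, empty database directory for each run keeps the comparison fair, since loading into an existing database behaves differently from an initial load.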

If the data has a high proportion of long literals, loading is slower, "long" 
being thousands of bytes.
{quote}Java maximum memory: 12884901888
{quote}
12G: the loaders don't use much Java heap. Setting this to, say, 4G should be 
enough. If the container is limited to about this size of RAM, then memory 
mapped I/O performance will be poor, because the memory mapped files need RAM 
outside the Java heap.
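
As a small worked check, the reported maximum memory converts to exactly 12 GiB. The second helper sketches how a smaller heap might be requested. It assumes the Jena launch script honours a `JVM_ARGS` environment variable, which should be checked against the script actually in use. Both helper names are illustrative.

```python
def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n / 2**30


def heap_env(gib: int) -> dict[str, str]:
    """Environment entry requesting a maximum Java heap of `gib` GiB.

    Assumption: the launch script reads JVM_ARGS; verify before relying on it.
    """
    return {"JVM_ARGS": f"-Xmx{gib}G"}


reported = 12_884_901_888  # "Java maximum memory" from the report
print(bytes_to_gib(reported))  # 12.0
print(heap_env(4))
```

Whatever RAM the container has beyond the heap is what the operating system can use to cache the memory mapped database files.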

> tdb2.tdbloader performace issue
> -------------------------------
>
>                 Key: JENA-2314
>                 URL: https://issues.apache.org/jira/browse/JENA-2314
>             Project: Apache Jena
>          Issue Type: Question
>          Components: TDB2
>    Affects Versions: Jena 4.0.0, Jena 4.2.0, Jena 4.4.0
>         Environment: Java maximum memory: 12884901888
> symbol:http://jena.apache.org/ARQ#regexImpl = 
> symbol:http://jena.apache.org/ARQ#javaRegex
> symbol:http://jena.apache.org/ARQ#registryFunctions = 
> org.apache.jena.sparql.function.FunctionRegistry@1536602f
> symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
> symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = 
> org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@4ebea12c
> symbol:http://jena.apache.org/ARQ#stageGenerator = 
> org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@2a1edad4
> symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
> symbol:http://jena.apache.org/ARQ#strictSPARQL = false
> 13:02:36 INFO  loader          :: Loader = LoaderParallel
> 13:02:36 INFO  loader          :: Start: 6 files
> 13:02:48 INFO  loader          :: Add: 500,000 bdmhistoricalrecords.nq 
> (Batch: 40,361 / Avg: 40,361)
> 13:03:00 INFO  loader          :: Add: 1,000,000 bdmhistoricalrecords.nq 
> (Batch: 44,907 / Avg: 42,513)
> 13:03:10 INFO  loader          :: Add: 1,500,000 bdmhistoricalrecords.nq 
> (Batch: 47,980 / Avg: 44,191)
> 13:03:25 INFO  loader          :: Add: 2,000,000 bdmhistoricalrecords.nq 
> (Batch: 32,486 / Avg: 40,539)
> 13:33:06 INFO  loader          :: Add: 2,500,000 bdmhistoricalrecords.nq 
> (Batch: 280 / Avg: 1,366)
> 14:30:30 INFO  loader          :: Add: 3,000,000 bdmhistoricalrecords.nq 
> (Batch: 145 / Avg: 568)
> 14:52:29 INFO  loader          :: Add: 3,500,000 bdmhistoricalrecords.nq 
> (Batch: 378 / Avg: 530)
>            Reporter: R Pope
>            Priority: Major
>
> Kia ora, Hi there,
> We have been using tdb2.tdbloader to load ~400,000,000 triples into our 
> triplestore - all the data is in nq format, having previously been converted from 
> JSONLD. The files we are loading range from ~10GB to ~50GB producing a 
> triplestore ~180GB including a text index. We run the loader in an HPC 
> environment so we can request as much memory as we need, often using 1TB to 
> do the load. The job is run in a Singularity image (similar to docker) and 
> slurm is the chosen workload manager.
> All that aside, the load typically takes ~12-16hours but no more than 24 
> hours with --loader=parallel and an average rate of ~5,000 triples per 
> second. We haven't needed to run the loader since October 2021, so upon 
> recently running the load job again we are getting a grand average of 
> ~500 triples per second. Haven't been able to wait and see if it even finishes.
> Has anyone else experienced such a big performance loss with tdb2.tdbloader 
> in the current or recent versions of jena? Apart from the potential 
> investigation that can be done on the slurm/HPC side does anyone have advice 
> around performance?
> Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

