[jira] [Commented] (JENA-2314) tdb2.tdbloader performance issue

R Pope (Jira) Wed, 23 Mar 2022 13:29:07 -0700


    [ 
https://issues.apache.org/jira/browse/JENA-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511476#comment-17511476
 ]


R Pope commented on JENA-2314:
------------------------------

Thank you Andy and Rob for your responses. Unfortunately, no luck.

My inkling here is that as Rob mentioned, there's a huge I/O bottleneck 
happening with the way that Singularity (a container built from our Docker 
image) is accessing the parallel file system (while running a job) in our HPC 
environment. Which as Rob will know, the compute nodes (the machines you get 
given to run big workloads) are often located elsewhere from the login node 
(where you request work from). We are going the Singularity route since their 
system engineers suggest that it will be easier as opposed to using a native 
Java and getting jena & fuseki in the appropriate places to be used in a 
compute node.

Our project has relied on jena and particularly HPC for the last 2 years.. 
unfortunately however it will be coming to an end soon so there isn't real 
capacity to investigate an appropriate setup. 

Using our local machines to do the load will do, which is much faster. The 
queries, updates and indexing sections of our pipeline will continue to run in 
our HPC environment. There is definitely an appetite to solve this problem, 
however government funding for big projects in New Zealand being the way they 
are, it may be a while until we are able to fix it. 

In conclusion, as Andy had alluded to, the problem does not lie with Jena, but 
the way it is used by Singularity, inside all these environments where there is 
little visibility, for now.

Thanks team :)

> tdb2.tdbloader performance issue
> --------------------------------
>
>                 Key: JENA-2314
>                 URL: https://issues.apache.org/jira/browse/JENA-2314
>             Project: Apache Jena
>          Issue Type: Question
>          Components: TDB2
>    Affects Versions: Jena 4.0.0, Jena 4.2.0, Jena 4.4.0
>         Environment: Java maximum memory: 12884901888
> symbol:http://jena.apache.org/ARQ#regexImpl = 
> symbol:http://jena.apache.org/ARQ#javaRegex
> symbol:http://jena.apache.org/ARQ#registryFunctions = 
> org.apache.jena.sparql.function.FunctionRegistry@1536602f
> symbol:http://jena.apache.org/ARQ#constantBNodeLabels = true
> symbol:http://jena.apache.org/ARQ#registryPropertyFunctions = 
> org.apache.jena.sparql.pfunction.PropertyFunctionRegistry@4ebea12c
> symbol:http://jena.apache.org/ARQ#stageGenerator = 
> org.apache.jena.tdb2.solver.StageGeneratorDirectTDB@2a1edad4
> symbol:http://jena.apache.org/ARQ#enablePropertyFunctions = true
> symbol:http://jena.apache.org/ARQ#strictSPARQL = false
> 13:02:36 INFO  loader          :: Loader = LoaderParallel
> 13:02:36 INFO  loader          :: Start: 6 files
> 13:02:48 INFO  loader          :: Add: 500,000 bdmhistoricalrecords.nq 
> (Batch: 40,361 / Avg: 40,361)
> 13:03:00 INFO  loader          :: Add: 1,000,000 bdmhistoricalrecords.nq 
> (Batch: 44,907 / Avg: 42,513)
> 13:03:10 INFO  loader          :: Add: 1,500,000 bdmhistoricalrecords.nq 
> (Batch: 47,980 / Avg: 44,191)
> 13:03:25 INFO  loader          :: Add: 2,000,000 bdmhistoricalrecords.nq 
> (Batch: 32,486 / Avg: 40,539)
> 13:33:06 INFO  loader          :: Add: 2,500,000 bdmhistoricalrecords.nq 
> (Batch: 280 / Avg: 1,366)
> 14:30:30 INFO  loader          :: Add: 3,000,000 bdmhistoricalrecords.nq 
> (Batch: 145 / Avg: 568)
> 14:52:29 INFO  loader          :: Add: 3,500,000 bdmhistoricalrecords.nq 
> (Batch: 378 / Avg: 530)
>            Reporter: R Pope
>            Priority: Major
>
> Kia ora, Hi there,
> We have been using tdb2.tdbloader to load ~400,000,000 triples into our 
> triplestore - all the data is in nq format being previoiusly converted from 
> JSONLD. The files we are loading range from ~10GB to ~50GB producing a 
> triplestore ~180GB including a text index. We run the loader in an HPC 
> environment so we can request as much memory as we need, often using 1TB to 
> do the load. The job is run in a Singularity image (similar to docker) and 
> slurm is the chosen workload manager.
> All that aside, the load typically takes ~12-16hours but no more than 24 
> hours with --loader=parallel and an average rate of ~5,000 triples per 
> second. We haven't needed to run the loader since October 2021, so upon 
> recently running the load job again we are getting a grand average of about 
> ~500triples per second. Haven't been able to wait and see if it even finishes.
> Has anyone else experienced such a big performance loss with tdb2.tdbloader 
> in the current or recent versions of jena? Apart from the potential 
> investigation that can be done on the slurm/HPC side does anyone have advice 
> around performance?
> Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (JENA-2314) tdb2.tdbloader performance issue

Reply via email to