I guess the best way forward for our case is to run separate Jena
instances and use federated queries? The downside is that more RAM is needed.
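For what it's worth, a federated query across two such instances might look like this (the endpoint URL and the data are hypothetical):

```sparql
# Join local data with data held in a second, separately managed
# Fuseki instance (the SERVICE endpoint URL is a placeholder).
SELECT ?s ?label
WHERE {
  ?s ?p ?o .                                  # local dataset
  SERVICE <http://host-b:3030/ds2/sparql> {   # remote instance
    ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label
  }
}
```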
On 18/11/2024 16:06, Andy Seaborne wrote:
CLEAR and DROP are done inside a transaction. The SPARQL Update
implementation is skewed towards many small operations.
Looking at TDB2, it isn't just updating the graph-leading indexes
(GSPO, GPOS, GOSP). To make the union default graph work, there are
also graph-trailing indexes (SPOG, POSG, OSPG). It's write
amplification.
There is a tradeoff: an efficient union default graph vs ease of deletion.
A different design could store each graph separately, optimizing for
whole-graph operations.
A way around this, in the future, might be to employ server reload
(WIP): prepare a new dataset offline and do a zero-downtime reload of
the configuration.
Andy
On 14/11/2024 11:47, Mikael Pesonen wrote:
Thanks for the explanation! Yes, it's mainly happening on disk; CPU
usage is at 0.5%. Are there any strategies to get around this?
Somehow compartmentalize the data so that deleting one compartment
would be more efficient?
On 14/11/2024 13:30, Rob @ DNR wrote:
Details matter here, e.g. what storage layer is in use? How big is
the graph being deleted? How many other graphs (and triples) are in
the server as a whole? You say a curl request, so can we assume
Fuseki? Are there other secondary indices involved, e.g. Jena Text?
---
Most Jena storage, i.e. TDB/TDB2, is quad-oriented behind the scenes,
so when you issue a CLEAR GRAPH <uri> (or a DROP GRAPH <uri>), what
happens internally is that it must scan each index and delete all
quads with the relevant <uri> in the graph position of the quad.
For indexes where the graph component comes later in the order, e.g.
SPOG, these quads could be scattered across the entire index,
affecting many blocks on disk and meaning the whole index needs to be
read.
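To make the operation under discussion concrete, a sketch (the graph URI is a placeholder):

```sparql
# CLEAR removes all triples in the named graph; DROP removes the graph
# itself. Internally, either one behaves much like the pattern delete
# below, which must find and remove matching quads in every index:
#
#   DELETE WHERE { GRAPH <http://example.org/graph1> { ?s ?p ?o } }

CLEAR GRAPH <http://example.org/graph1>
```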
For TDB2, which uses copy-on-write data structures, this might also
end up effectively rewriting every single block in the index, which
for large datasets could take an exceedingly long time.
If you have secondary indices involved, e.g. Jena Text, then it also
potentially has to issue the corresponding delete requests to those
indices as well.
---
So my guess would be that, if you happened to look at your server's
resource consumption while the CLEAR GRAPH is ongoing, you would see
a lot of disk I/O.
Rob
From: Mikael Pesonen <[email protected]>
Date: Thursday, 14 November 2024 at 09:21
To: [email protected] <[email protected]>
Subject: Slow clear graph
A curl command has now been running for over 24 hours with Jena; what
could cause that? Shouldn't CLEAR GRAPH always complete in a few
seconds? Isn't it an inexpensive operation?
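For context, the request in question is presumably something like the following (host, dataset name, and graph URI are placeholders):

```shell
# POST a SPARQL Update to a Fuseki update endpoint. Fuseki accepts a
# raw update body with Content-Type: application/sparql-update.
curl -X POST \
     -H 'Content-Type: application/sparql-update' \
     --data 'CLEAR GRAPH <http://example.org/graph1>' \
     'http://localhost:3030/dataset/update'
```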
--
Lingsoft - 30 years of Leading Language Management
www.lingsoft.fi
Speech Applications - Language Management - Translation - Reader's and Writer's
Tools - Text Tools - E-books and M-books
Mikael Pesonen
Semantic Technologies
e-mail: [email protected]
Tel. +358 2 279 3300
Time zone: GMT+2
Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND
Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND