Re: SPAM-LOW: Slow clear graph

Andy Seaborne Mon, 18 Nov 2024 06:06:42 -0800

CLEAR and DROP are done inside a transaction. The SPARQL Update implementation is skewed to many, small operations.

Looking at TDB2, that isn't just updating the G??? indexes. To make union default graph work, there are also ???G indexes. It's write amplification.
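
To make the write amplification concrete, here is a minimal Python sketch of a toy quad store. It is not TDB2 code: the six index orders (three G??? and three ???G) are an assumption for illustration, and the names `INDEX_ORDERS`, `build_indexes` and `clear_graph` are hypothetical. Each quad is stored once per index, so clearing a graph removes one entry per quad from every index.

```python
# Index orders assumed for illustration: three with G first, three with G last.
INDEX_ORDERS = ["GSPO", "GPOS", "GOSP", "SPOG", "POSG", "OSPG"]


def build_indexes(quads):
    """Store every quad once per index, fields permuted into that index's order.

    Each quad is a dict with the keys "G", "S", "P" and "O".
    """
    return {
        order: sorted(tuple(q[field] for field in order) for q in quads)
        for order in INDEX_ORDERS
    }


def clear_graph(indexes, graph):
    """Delete all entries for `graph` from every index.

    Returns a dict mapping index order to the number of entries removed.
    """
    removed = {}
    for order, entries in indexes.items():
        g_pos = order.index("G")  # where the graph field sits in this index
        kept = [e for e in entries if e[g_pos] != graph]
        removed[order] = len(entries) - len(kept)
        indexes[order] = kept
    return removed


if __name__ == "__main__":
    quads = [
        {"G": g, "S": f"s{i}", "P": "p", "O": "o"}
        for g in ("g1", "g2")
        for i in range(5)
    ]
    indexes = build_indexes(quads)
    removed = clear_graph(indexes, "g1")
    print(removed)                # entries removed per index
    print(sum(removed.values()))  # total index entries removed for 5 quads
```

In this toy model, deleting N quads means N removals in each of the six indexes, which is the amplification in question.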


There is a tradeoff - efficient union default graph vs ease of deletion.

A different design could store each graph separately, optimizing for whole-graph operations.
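
A minimal Python sketch of what such a per-graph design could look like, purely to illustrate the tradeoff. The class `PerGraphStore` and its methods are hypothetical and do not correspond to any Jena API. Dropping a graph becomes a single map removal, while the union default graph has to be assembled from every graph on demand.

```python
class PerGraphStore:
    """Toy store keeping one set of triples per named graph (illustrative only)."""

    def __init__(self):
        self.graphs = {}  # graph name -> set of (s, p, o) triples

    def add(self, graph, s, p, o):
        self.graphs.setdefault(graph, set()).add((s, p, o))

    def drop_graph(self, graph):
        """Remove a whole graph in one step; returns how many triples it held."""
        return len(self.graphs.pop(graph, ()))

    def union_default_graph(self):
        """Merge all graphs, dropping duplicate triples.

        This is the expensive side of the tradeoff: every graph is visited.
        """
        merged = set()
        for triples in self.graphs.values():
            merged |= triples
        return merged


if __name__ == "__main__":
    store = PerGraphStore()
    store.add("g1", "s1", "p", "o")
    store.add("g1", "s2", "p", "o")
    store.add("g2", "s1", "p", "o")  # same triple as in g1
    print(len(store.union_default_graph()))  # duplicates collapse in the union
    print(store.drop_graph("g1"))            # whole-graph delete, no index scan
```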

A way round this, in the future, might be to employ server reload (WIP). Prepare a new dataset offline and do a zero-downtime reload of the configuration.


    Andy

On 14/11/2024 11:47, Mikael Pesonen wrote:

Thanks for the explanation! Yes, it's mainly happening on disk; CPU usage is at 0.5%. Are there any strategies to get around this? Somehow compartmentalize the data so that deleting one compartment would be more efficient?

On 14/11/2024 13:30, Rob @ DNR wrote:
Details matter here, e.g. what storage layer is in use? How big is the graph being deleted? How many other graphs (and triples) are in the server as a whole? You say a curl request, so can we assume Fuseki? Are there other secondary indices involved, e.g. Jena Text?
---
Most Jena storage, i.e. TDB/TDB2, is quad-oriented behind the scenes, so when you issue a CLEAR GRAPH <uri> (or a DROP GRAPH <uri>), what happens internally is that it must scan each index and delete all quads with the relevant <uri> in the graph position of the quad. For indexes where graph is later in the order, e.g. SPOG, these quads could be scattered across the entire index, affecting many blocks on disk, meaning the whole index needs to be read.
For TDB2, which uses copy-on-write data structures, this might also end up effectively having to rewrite every single block in the index, which for large datasets could take an exceedingly long time.
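
The scattering effect can be sketched in a few lines of Python. This is a simplified model, not how TDB2 lays out its B+trees: an index is just a sorted list of quads cut into fixed-size blocks, integer ids stand in for nodes, and `blocks_touched` is a hypothetical helper. The data set assumes every graph mentions the same subjects, which is the worst case for a graph-last index.

```python
FIELD_POS = {"G": 0, "S": 1, "P": 2, "O": 3}  # position of each field in a (g, s, p, o) quad


def blocks_touched(quads, order, graph, block_size):
    """Count the index blocks holding at least one quad of `graph`.

    `order` is an index order such as "GSPO" or "SPOG"; the index is modelled
    as the sorted, permuted quads split into blocks of `block_size` entries.
    """
    entries = sorted(tuple(q[FIELD_POS[f]] for f in order) for q in quads)
    g_pos = order.index("G")
    return len({i // block_size for i, e in enumerate(entries) if e[g_pos] == graph})


if __name__ == "__main__":
    # 50 graphs x 200 subjects = 10,000 quads, i.e. 100 blocks of 100 entries.
    quads = [(g, s, 0, 0) for g in range(50) for s in range(200)]
    print(blocks_touched(quads, "GSPO", graph=7, block_size=100))  # graph-first index
    print(blocks_touched(quads, "SPOG", graph=7, block_size=100))  # graph-last index
```

With this data the graph-first index keeps the graph's 200 quads together in 2 blocks, whereas the graph-last index has them spread over all 100 blocks, so a copy-on-write store would end up rewriting every one of them.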
If you have secondary indices involved, e.g. Jena Text, then it is also potentially having to make the relevant delete requests to those indices as well.
---
So, my guess would be that you have a lot of disk IO happening on your server if you happened to look at its resource consumption while the CLEAR GRAPH is ongoing?
Rob


From: Mikael Pesonen <[email protected]>
Date: Thursday, 14 November 2024 at 09:21
To: [email protected] <[email protected]>
Subject: SPAM-LOW: Slow clear graph
The curl command has now been running for over 24 hours with Jena; what could cause
that? Shouldn't clear graph always be done in a few seconds? It's not an
expensive operation?
