CLEAR and DROP are done inside a transaction. The SPARQL Update implementation is skewed towards many small operations.

Looking at TDB2, that isn't just updating the G??? indexes. To make union default graph work, there are also ???G indexes. It's write amplification.
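To make the write amplification concrete, here is a rough sketch in Python (not Jena code). It assumes the six quad index orderings that TDB2 uses by default, as far as I know; the names `QUAD_INDEXES`, `index_keys` and `entries_touched` are made up for illustration only.

```python
# Index orderings assumed for illustration; an actual database may differ.
QUAD_INDEXES = ["GSPO", "GPOS", "GOSP", "SPOG", "POSG", "OSPG"]


def index_keys(quad):
    """Return the key stored in each index for one quad (g, s, p, o)."""
    g, s, p, o = quad
    slot = {"G": g, "S": s, "P": p, "O": o}
    return {name: tuple(slot[c] for c in name) for name in QUAD_INDEXES}


def entries_touched(n_quads):
    """Number of index entries removed when n_quads quads are deleted."""
    return n_quads * len(QUAD_INDEXES)


keys = index_keys(("g1", "s1", "p1", "o1"))
print(keys["SPOG"])                # the same quad, keyed with the graph last
print(entries_touched(1_000_000))  # one logical delete per quad, once per index
```

Each deleted quad has to be removed from every ordering, so the number of index updates is a multiple of the number of quads in the graph.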

There is a tradeoff: efficient union default graph vs ease of deletion.

A different design could have each graph separately, optimizing for whole graph operations.
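As a toy illustration of that tradeoff (a sketch only, not how TDB2 or any Jena storage layer is implemented), the hypothetical `PerGraphStore` class below keeps each graph in its own container. Dropping a graph is then cheap, but a union default graph query has to visit every graph.

```python
class PerGraphStore:
    """Toy dataset that keeps every named graph in its own container."""

    def __init__(self):
        self.graphs = {}  # graph name -> set of (s, p, o) triples

    def add(self, g, s, p, o):
        self.graphs.setdefault(g, set()).add((s, p, o))

    def drop(self, g):
        # Whole-graph delete: one container is discarded, nothing else is scanned.
        self.graphs.pop(g, None)

    def union_find(self, s=None, p=None, o=None):
        # Union default graph: every graph is consulted and duplicates are merged.
        pattern = (s, p, o)
        out = set()
        for triples in self.graphs.values():
            for t in triples:
                if all(want is None or want == got for want, got in zip(pattern, t)):
                    out.add(t)
        return out


store = PerGraphStore()
store.add("g1", "s1", "p", "o1")
store.add("g2", "s1", "p", "o1")
store.add("g2", "s2", "p", "o2")
store.drop("g1")
print(sorted(store.union_find(p="p")))
```

The cost has moved rather than disappeared: `drop` no longer depends on the size of the dataset, while `union_find` now grows with the number of graphs.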

A way round this, in the future, might be to employ server reload (WIP). Prepare a new dataset offline and do a zero-downtime reload of the configuration.

    Andy

On 14/11/2024 11:47, Mikael Pesonen wrote:
Thanks for the explanation! Yes, it's mainly happening on disk; CPU usage is at 0.5%. Are there any strategies to get around this? Could the data somehow be compartmentalized so that deleting one compartment would be more efficient?

On 14/11/2024 13:30, Rob @ DNR wrote:
Details matter here, e.g. what storage layer is in use? How big is the graph being deleted? How many other graphs (and triples) are in the server as a whole? You mention a curl request, so can we assume Fuseki? Are there other secondary indices involved, e.g. Jena Text?

---

Most Jena storage, i.e. TDB/TDB2, is quad-oriented behind the scenes, so when you issue a CLEAR GRAPH <uri> (or a DROP GRAPH <uri>), internally it must scan each index and delete all quads that have <uri> in the graph position. For indexes where the graph comes later in the order, e.g. SPOG, these quads could be scattered across the entire index and affect many blocks on disk, meaning the whole index needs to be read.

For TDB2, which uses copy-on-write data structures, this might also end up effectively rewriting every single block in the index, which for large datasets could take an exceedingly long time.

If you have secondary indices involved, e.g. Jena Text, then it is also potentially having to make the relevant delete requests to those indices as well.
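The scattering effect can be sketched with a small simulation. This is a Python toy model, not Jena code: the block size, the synthetic data and the `blocks_touched` helper are all assumptions chosen for illustration. It models an index as a sorted list of keys cut into fixed-size blocks and counts how many blocks hold at least one quad from the graph being cleared.

```python
import random

BLOCK_SIZE = 100  # hypothetical number of index entries per disk block


def blocks_touched(quads, order, graph, block_size=BLOCK_SIZE):
    """Return (blocks containing quads of `graph`, total blocks) for one index.

    `order` is a key ordering such as "GSPO" or "SPOG"; quads are (g, s, p, o).
    """
    pos = {"G": 0, "S": 1, "P": 2, "O": 3}
    keyed = sorted(tuple(q[pos[c]] for c in order) for q in quads)
    g_at = order.index("G")
    touched = {i // block_size for i, key in enumerate(keyed) if key[g_at] == graph}
    total = (len(keyed) + block_size - 1) // block_size
    return len(touched), total


rng = random.Random(42)
# 20 graphs drawing on the same pool of subjects, so subjects interleave across graphs.
quads = list({
    (
        f"g{rng.randrange(20)}",
        f"s{rng.randrange(5000):05d}",
        f"p{rng.randrange(10)}",
        f"o{rng.randrange(5000):05d}",
    )
    for _ in range(50_000)
})

for order in ("GSPO", "SPOG"):
    hit, total = blocks_touched(quads, order, "g7")
    print(f"{order}: {hit} of {total} blocks contain quads from g7")
```

With the graph first in the key, the quads to delete sit in one contiguous run of blocks; with the graph last, they are spread over most of the index, which is the case where a copy-on-write structure ends up rewriting nearly everything.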

---

So my guess is that, if you look at the server's resource consumption while the CLEAR GRAPH is ongoing, you will see a lot of disk IO.

Rob


From: Mikael Pesonen <[email protected]>
Date: Thursday, 14 November 2024 at 09:21
To: [email protected] <[email protected]>
Subject: SPAM-LOW: Slow clear graph
A curl command has now been running for over 24 hours with Jena; what could cause
that? Shouldn't clear graph always be done in a few seconds? It's not an
expensive operation?

