CLEAR and DROP are done inside a transaction. The SPARQL Update
implementation is skewed towards many small operations.
Looking at TDB2, that isn't just updating the G??? indexes. To make the
union default graph work, there are also ???G indexes. It's write amplification.
There is a tradeoff - efficient union default graph vs ease of deletion.
A different design could store each graph separately, optimizing for
whole-graph operations.
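To make the tradeoff concrete, here is a minimal Python sketch of the "each graph separately" idea. It is a toy model only: the class `PerGraphStore` and its methods are hypothetical names for illustration, not Jena or TDB2 code. Dropping a graph discards one container, while a union default graph query has to visit every graph and remove duplicates.

```python
from collections import defaultdict


class PerGraphStore:
    """Toy dataset keeping one triple set per named graph (hypothetical design)."""

    def __init__(self):
        self.graphs = defaultdict(set)

    def add(self, g, s, p, o):
        self.graphs[g].add((s, p, o))

    def drop_graph(self, g):
        # Whole-graph delete: discard one container, no per-quad index updates.
        return len(self.graphs.pop(g, ()))

    def union_match(self, s=None, p=None, o=None):
        # Union default graph: every graph is consulted and triples that
        # occur in more than one graph are reported only once.
        seen = set()
        for triples in self.graphs.values():
            for t in triples:
                if t in seen:
                    continue
                if all(want is None or want == got
                       for want, got in zip((s, p, o), t)):
                    seen.add(t)
                    yield t


store = PerGraphStore()
store.add("g1", "a", "knows", "b")
store.add("g2", "a", "knows", "b")
store.add("g2", "a", "knows", "c")

removed = store.drop_graph("g2")            # cheap: one dictionary removal
remaining = sorted(store.union_match(s="a"))  # costly: scans all graphs
print(removed, remaining)
```

The cost moves rather than disappears: deletion becomes cheap, but every union default graph lookup now pays for the number of graphs.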
A way round this, in the future, might be to employ server reload (WIP).
Prepare a new dataset offline and do a zero-downtime reload of the
configuration.
Andy
On 14/11/2024 11:47, Mikael Pesonen wrote:
Thanks for the explanation! Yes, it's mainly happening on disk; CPU usage
is at 0.5%. Are there any strategies to get around this? Somehow
compartmentalize the data so that deleting one compartment would be more
efficient?
On 14/11/2024 13:30, Rob @ DNR wrote:
Details matter here, e.g. what storage layer is in use? How big is the
graph being deleted? How many other graphs (and triples) are in the
server as a whole? You say a curl request, so can we assume Fuseki?
Are there other secondary indices involved, e.g. Jena Text?
---
Most Jena storage, i.e. TDB/TDB2, is quad-oriented behind the scenes,
so when you issue a CLEAR GRAPH <uri> (or a DROP GRAPH <uri>), what
happens internally is that it must scan each index and delete all
quads with the relevant <uri> in the graph position of the quad. For
indexes where graph is later in the order, e.g. SPOG, these quads could
be scattered across the entire index, affecting many blocks on disk,
which means the whole index needs to be read.
For TDB2, which uses copy-on-write data structures, this might also end
up effectively having to rewrite every single block in the index, which
for large datasets could take an exceedingly long time.
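The scattering effect can be shown with a toy model in Python. This is a sketch under stated assumptions, not TDB2's actual B+tree layout: each index is modelled as a sorted list cut into fixed-size blocks, `BLOCK_SIZE` is an arbitrary illustrative value, the data is random, and `blocks_touched` is a hypothetical helper. It counts how many blocks contain at least one quad of the graph being cleared.

```python
import random

BLOCK_SIZE = 100  # entries per block; illustrative, not a TDB2 setting


def blocks_touched(quads, order, graph):
    """Return (blocks holding a quad of `graph`, total blocks) for one index.

    `order` is a permutation of "GSPO" giving the sort order of the index.
    """
    pos = {"G": 0, "S": 1, "P": 2, "O": 3}
    index = sorted(quads, key=lambda q: tuple(q[pos[c]] for c in order))
    touched = {i // BLOCK_SIZE for i, q in enumerate(index) if q[0] == graph}
    total = (len(index) + BLOCK_SIZE - 1) // BLOCK_SIZE
    return len(touched), total


random.seed(0)
# 20 graphs with about 5,000 quads each; subjects are shared across graphs.
quads = list({
    (f"g{g}",
     f"s{random.randrange(20000)}",
     f"p{random.randrange(10)}",
     f"o{random.randrange(1000)}")
    for g in range(20)
    for _ in range(5000)
})

gspo = blocks_touched(quads, "GSPO", "g7")
spog = blocks_touched(quads, "SPOG", "g7")
print("GSPO (touched, total):", gspo)
print("SPOG (touched, total):", spog)
```

In the graph-first index the quads of one graph sit in a contiguous run of blocks, while in the graph-last index nearly every block holds at least one of them. Real data where subjects are mostly local to one graph would scatter less than this random sample.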
If you have secondary indices involved, e.g. Jena Text, then it is
also potentially having to make the relevant delete requests to those
indices as well.
---
So my guess is that you would see a lot of disk IO happening on your
server if you looked at its resource consumption while the
CLEAR GRAPH is ongoing?
Rob
From: Mikael Pesonen <[email protected]>
Date: Thursday, 14 November 2024 at 09:21
To: [email protected] <[email protected]>
Subject: Slow clear graph
The curl command has now been running for over 24 hours with Jena. What
could cause that? Shouldn't clear graph always be done in a few seconds?
It's not an expensive operation, is it?
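The exact command is not shown in the thread. As a rough sketch, a CLEAR GRAPH sent over the SPARQL 1.1 Protocol is an HTTP POST with the update text as the body and `Content-Type: application/sparql-update`. The endpoint URL and graph URI below are assumptions for illustration (a local Fuseki-style `/ds/update` endpoint), and `clear_graph_request` is a hypothetical helper. The request is only built here, not sent.

```python
import urllib.request


def clear_graph_request(update_endpoint, graph_uri):
    """Build (but do not send) a SPARQL Update request that clears one graph."""
    body = f"CLEAR GRAPH <{graph_uri}>".encode("utf-8")
    return urllib.request.Request(
        update_endpoint,
        data=body,
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


# Hypothetical endpoint and graph name, not taken from the thread.
req = clear_graph_request(
    "http://localhost:3030/ds/update",
    "http://example.org/graph1",
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would actually submit the update.
```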