I guess the best way forward for our case is to run separate Jena
instances and use federated queries? The downside is that more RAM is needed.
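For what it's worth, a federated query across two such instances might look like this (the endpoint URL and the data are hypothetical):

```sparql
# Join local data with data held in a second, separately managed
# Fuseki instance (the SERVICE endpoint URL is a placeholder).
SELECT ?s ?label
WHERE {
  ?s ?p ?o .                                  # local dataset
  SERVICE <http://host-b:3030/ds2/sparql> {   # remote instance
    ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label
  }
}
```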
On 18/11/2024 16:06, Andy Seaborne wrote:
CLEAR and DROP are done inside a transaction. The SPARQL Update
implementation is skewed towards many small operations.
Looking at TDB2, it isn't just updating the graph-leading indexes
(GSPO, GPOS, GOSP). To make the union default graph work, there are
also graph-trailing indexes (SPOG, POSG, OSPG). It's write
amplification.
There is a tradeoff: an efficient union default graph vs ease of deletion.
A different design could store each graph separately, optimizing for
whole-graph operations.
A way around this, in the future, might be to employ server reload
(WIP): prepare a new dataset offline and do a zero-downtime reload of
the configuration.
Andy
On 14/11/2024 11:47, Mikael Pesonen wrote:
Thanks for the explanation! Yes, it's mainly happening on disk; CPU
usage is at 0.5%. Are there any strategies to get around this?
Somehow compartmentalize the data so that deleting one compartment
would be more efficient?
On 14/11/2024 13:30, Rob @ DNR wrote:
Details matter here, e.g. what storage layer is in use? How big is
the graph being deleted? How many other graphs (and triples) are in
the server as a whole? You say a curl request, so can we assume
Fuseki? Are there other secondary indices involved, e.g. Jena Text?
---
Most Jena storage, i.e. TDB/TDB2, is quad-oriented behind the scenes,
so when you issue a CLEAR GRAPH <uri> (or a DROP GRAPH <uri>), what
happens internally is that it must scan each index and delete all
quads with the relevant <uri> in the graph position of the quad.
For indexes where the graph component comes later in the order, e.g.
SPOG, these quads could be scattered across the entire index,
affecting many blocks on disk and meaning the whole index needs to be
read.
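To make the operation under discussion concrete, a sketch (the graph URI is a placeholder):

```sparql
# CLEAR removes all triples in the named graph; DROP removes the graph
# itself. Internally, either one behaves much like the pattern delete
# below, which must find and remove matching quads in every index:
#
#   DELETE WHERE { GRAPH <http://example.org/graph1> { ?s ?p ?o } }

CLEAR GRAPH <http://example.org/graph1>
```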
For TDB2, which uses copy-on-write data structures, this might also
end up effectively rewriting every single block in the index, which
for large datasets could take an exceedingly long time.
If you have secondary indices involved, e.g. Jena Text, then it also
potentially has to issue the corresponding delete requests to those
indices as well.
---
So my guess would be that, if you happened to look at your server's
resource consumption while the CLEAR GRAPH is ongoing, you would see
a lot of disk I/O.
Rob
From: Mikael Pesonen <[email protected]>
Date: Thursday, 14 November 2024 at 09:21
To: [email protected] <[email protected]>
Subject: Slow clear graph
A curl command has now been running for over 24 hours with Jena; what
could cause that? Shouldn't CLEAR GRAPH always complete in a few
seconds? Isn't it an inexpensive operation?
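For context, the request in question is presumably something like the following (host, dataset name, and graph URI are placeholders):

```shell
# POST a SPARQL Update to a Fuseki update endpoint. Fuseki accepts a
# raw update body with Content-Type: application/sparql-update.
curl -X POST \
     -H 'Content-Type: application/sparql-update' \
     --data 'CLEAR GRAPH <http://example.org/graph1>' \
     'http://localhost:3030/dataset/update'
```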
--
Lingsoft - 30 years of Leading Language Management
www.lingsoft.fi
Speech Applications - Language Management - Translation - Reader's and Writer's
Tools - Text Tools - E-books and M-books
Mikael Pesonen
Semantic Technologies
e-mail: [email protected]
Tel. +358 2 279 3300
Time zone: GMT+2
Helsinki Office
Eteläranta 10
FI-00130 Helsinki
FINLAND
Turku Office
Kauppiaskatu 5 A
FI-20100 Turku
FINLAND