Hi Milorad,

I'm afraid I don't have a very satisfying response to give you. We never found a way to solve the problem, just ways to work around it:

- We have a DB compaction scheduled nightly, during the time when we have the fewest users (a sketch of how such a job can be scripted follows after this list).
- We make sure to have enough disk space available to last until the next scheduled compaction.
- If we have an internal process that we know will make a large number of updates, we make sure to batch it, so that only an acceptable number of updates is performed before it suspends until the next compaction has happened.
- We have our data logically compartmentalized into named graphs; if a process will perform many updates, wherever possible we design it so that it only affects a single graph. That way we can run it on a separate instance with a mirror of our production database, and once the process has finished, we simply upload the graph in question from the other instance to prod. (Though this is admittedly dangerous in terms of data integrity.)
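In case it's useful, here is roughly how such a nightly job could be scripted against Fuseki's admin API (POSTing to the /$/compact/{dataset} endpoint with deleteOld=true and then polling /$/tasks until the task finishes). The server URL and dataset name are placeholders, and the JSON field names are from memory, so please treat this as a sketch rather than a drop-in script:

#!/usr/bin/env python3
# Sketch of a nightly compaction job (run e.g. from cron).
# Assumes the Fuseki admin endpoints are enabled and reachable; the base URL
# and dataset name below are placeholders -- adjust them to your setup.
import json
import time
import urllib.request

FUSEKI_ADMIN = "http://localhost:3030"  # placeholder admin base URL
DATASET = "ds"                          # placeholder dataset name

def start_compaction(delete_old: bool = True) -> str:
    """Kick off a TDB2 compaction via the admin API and return the task id."""
    url = f"{FUSEKI_ADMIN}/$/compact/{DATASET}"
    if delete_old:
        url += "?deleteOld=true"  # remove the old Data-NNNN directory once compaction is done
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["taskId"]

def wait_for(task_id: str, poll_seconds: int = 30) -> None:
    """Poll the admin task endpoint until the compaction task reports as finished."""
    while True:
        with urllib.request.urlopen(f"{FUSEKI_ADMIN}/$/tasks/{task_id}") as resp:
            task = json.load(resp)
        if task.get("finished"):
            return
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for(start_compaction())

We run the equivalent of this during the low-traffic window; the same kind of polling is also how a batched process can check that a compaction has completed before it resumes its updates.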
For the longer term, we have considered and discarded two solutions: 1) swapping Fuseki out for a different triple store, or 2) changing our application so that the triplestore is no longer the source of truth, but just a secondary DB for SPARQL querying. Both would be substantial changes, and each comes with some cost and complication. Instead, what we're doing now is splitting some functionality of our current (monolithic) system out into separate components that will no longer rely on Fuseki, in the hope that this will reduce the number of updates to the point where the disk size won't be such an issue anymore.

Best,
Balduin

On Thu, Dec 26, 2024 at 3:48 PM Milorad Tosic <mbto...@yahoo.com.invalid> wrote:

> Hi Balduin,
>
> We have a similar problem. Could you let me know the status of your progress?
>
> Thanks
>
> Milorad
>
> On 4/22/2024 5:22 PM, Balduin Landolt wrote:
> > Hello,
> >
> > we're running Fuseki 5.0.0 (but previously the last 4.x versions behaved
> > essentially the same) with roughly 40 million triples (and growing).
> > Not sure what configuration is relevant, but we have the default graph as
> > the union graph.
> > Also, we use Fuseki as our main database, not just as a "view on our data",
> > so we do quite a bit of updating on the data all the time.
> >
> > Lately, we've been having more and more issues with servers running out of
> > disk space because Fuseki's database grew pretty rapidly.
> > This can be solved by compacting the DB, but with our data and hardware
> > this takes ca. 15 minutes, during which Fuseki does not accept any update
> > queries, so for the production system we can't really do this outside of
> > nighttime hours when (hopefully) no one uses the system anyway.
> >
> > Some things we've noticed:
> > - A subset of our data (I think ~20 million triples) takes up 6 GB in
> > compacted state and is ca. 5 GB when dumped to a .trig file. But when
> > uploading the same .trig file to an empty DB, it grows to ca. 25 GB.
> > - Dropping graphs does not free up disk space.
> > - A sequence of e.g. 10k queries, each updating only a small number of
> > triples (maybe 1-10 or so) on the full dataset, seems to grow the DB size
> > a lot, like 10s to 100s of GB (I don't have numbers on this one, but it
> > was substantial).
> >
> > My question is:
> > Would that kind of growth in disk usage be expected? Are other people
> > having similar issues? Are there strategies to mitigate this? Maybe some
> > configuration that could be tweaked?
> >
> > Best & thanks in advance,
> > Balduin