Hi Milorad,

I'm afraid I don't have a very satisfying response to give you. We never found a way to solve the problem, just ways to work around it:

- We have a DB compaction scheduled nightly, during the time when we have the fewest users (a sketch of how such a job can be scripted follows after this list).
- We make sure to have enough disk space available to last until the next scheduled compaction.
- If we have an internal process that we know will make a large number of updates, we make sure to batch it, so that only an acceptable number of updates is performed before it suspends until the next compaction has happened.
- We have our data logically compartmentalized into named graphs; if a process will perform many updates, wherever possible we design it so that it only affects a single graph. That way we can run it on a separate instance with a mirror of our production database, and once the process has finished, we simply upload the graph in question from the other instance to prod. (Though this is admittedly dangerous in terms of data integrity.)
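In case it's useful, here is roughly how such a nightly job could be scripted against Fuseki's admin API (POSTing to the /$/compact/{dataset} endpoint with deleteOld=true and then polling /$/tasks until the task finishes). The server URL and dataset name are placeholders, and the JSON field names are from memory, so please treat this as a sketch rather than a drop-in script:

#!/usr/bin/env python3
# Sketch of a nightly compaction job (run e.g. from cron).
# Assumes the Fuseki admin endpoints are enabled and reachable; the base URL
# and dataset name below are placeholders -- adjust them to your setup.
import json
import time
import urllib.request

FUSEKI_ADMIN = "http://localhost:3030"  # placeholder admin base URL
DATASET = "ds"                          # placeholder dataset name

def start_compaction(delete_old: bool = True) -> str:
    """Kick off a TDB2 compaction via the admin API and return the task id."""
    url = f"{FUSEKI_ADMIN}/$/compact/{DATASET}"
    if delete_old:
        url += "?deleteOld=true"  # remove the old Data-NNNN directory once compaction is done
    req = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["taskId"]

def wait_for(task_id: str, poll_seconds: int = 30) -> None:
    """Poll the admin task endpoint until the compaction task reports as finished."""
    while True:
        with urllib.request.urlopen(f"{FUSEKI_ADMIN}/$/tasks/{task_id}") as resp:
            task = json.load(resp)
        if task.get("finished"):
            return
        time.sleep(poll_seconds)

if __name__ == "__main__":
    wait_for(start_compaction())

We run the equivalent of this during the low-traffic window; the same kind of polling is also how a batched process can check that a compaction has completed before it resumes its updates.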
For the longer term, we have considered and discarded two solutions: 1) swapping Fuseki out for a different triple store, or 2) changing our application so that the triplestore is no longer the source of truth, but just a secondary DB for SPARQL querying. Both would be substantial changes, and each comes with some cost and complication. Instead, what we're doing now is splitting some functionality of our current (monolithic) system out into separate components that will no longer rely on Fuseki, in the hope that this will reduce the number of updates to the point where the disk size won't be such an issue anymore.

Best,
Balduin

On Thu, Dec 26, 2024 at 3:48 PM Milorad Tosic <mbto...@yahoo.com.invalid> wrote:

> Hi Balduin,
>
> We have a similar problem. Could you let me know the status of your progress?
>
> Thanks
>
> Milorad
>
> On 4/22/2024 5:22 PM, Balduin Landolt wrote:
> > Hello,
> >
> > we're running Fuseki 5.0.0 (but previously the last 4.x versions behaved
> > essentially the same) with roughly 40 million triples (and growing).
> > Not sure what configuration is relevant, but we have the default graph as
> > the union graph.
> > Also, we use Fuseki as our main database, not just as a "view on our data",
> > so we do quite a bit of updating on the data all the time.
> >
> > Lately, we've been having more and more issues with servers running out of
> > disk space because Fuseki's database grew pretty rapidly.
> > This can be solved by compacting the DB, but with our data and hardware
> > this takes ca. 15 minutes, during which Fuseki does not accept any update
> > queries, so for the production system we can't really do this outside of
> > nighttime hours when (hopefully) no one uses the system anyway.
> >
> > Some things we've noticed:
> > - A subset of our data (I think ~20 million triples) takes up 6 GB in
> > compacted state and is ca. 5 GB when dumped to a .trig file. But when
> > uploading the same .trig file to an empty DB, it grows to ca. 25 GB.
> > - Dropping graphs does not free up disk space.
> > - A sequence of e.g. 10k queries, each updating only a small number of
> > triples (maybe 1-10 or so) on the full dataset, seems to grow the DB size
> > a lot, like 10s to 100s of GB (I don't have numbers on this one, but it
> > was substantial).
> >
> > My question is:
> > Would that kind of growth in disk usage be expected? Are other people
> > having similar issues? Are there strategies to mitigate this? Maybe some
> > configuration that could be tweaked?
> >
> > Best & thanks in advance,
> > Balduin