Hi Andy,

Thank you very much for the answers.

Regards,
Vinay Mahamuni

On Fri, 28 Jan 2022 at 03:28, Andy Seaborne <a...@apache.org> wrote:

> Hi Vinay,
>
>
> On 27/01/2022 06:14, Vinay Mahamuni wrote:
> > Hello,
> >
> > I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I
> > am using Jena RDFConnection to connect to the Fuseki server. I am
> > sending 50k triples in one update. This is mostly new data (only a few
> > triples will match existing data). The data consists of instances based
> > on an ontology. Please have a look at the attached file showing how
> > much the on-disk storage grows with each update. For 1.5 million triples,
> > it took around 1.2 GB. We want to store a few billion triples, so the
> > bytes-per-triple ratio won't be good for our use case.
> >
> > When I used the tdb2.tdbcompact tool, the data volume shrank to 400 MB.
> > But this extra step needs to be performed manually to optimise the
> > storage.
>
> It can be triggered by an admin process with e.g. "cron".
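>
> A minimal sketch of such a trigger, assuming the Fuseki admin endpoint is
> enabled, the server is on localhost:3030 and the dataset service name is
> "/ds" (all placeholders): it POSTs to the admin "compact" operation, and a
> scheduled job could run it periodically.
>
>     import java.net.URI;
>     import java.net.http.HttpClient;
>     import java.net.http.HttpRequest;
>     import java.net.http.HttpResponse;
>
>     public class CompactDataset {
>         public static void main(String[] args) throws Exception {
>             // Ask the running Fuseki server to compact the TDB2 dataset "ds".
>             HttpClient http = HttpClient.newHttpClient();
>             HttpRequest req = HttpRequest.newBuilder()
>                     .uri(URI.create("http://localhost:3030/$/compact/ds"))
>                     .POST(HttpRequest.BodyPublishers.noBody())
>                     .build();
>             HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
>             System.out.println("Compact request status: " + resp.statusCode());
>         }
>     }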
>
> It doesn't have to be done very often unless your volume of 50k triple
> transactions is very high - in which case I suggest batching them into
> larger units.
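>
> A client-side sketch of that batching with RDFConnection, assuming a local
> Fuseki endpoint at http://localhost:3030/ds; the 500k threshold is just a
> placeholder:
>
>     import org.apache.jena.rdf.model.Model;
>     import org.apache.jena.rdf.model.ModelFactory;
>     import org.apache.jena.rdfconnection.RDFConnection;
>
>     public class BatchedLoad {
>         private static final int BATCH_TRIPLES = 500_000;   // placeholder threshold
>         private final Model buffer = ModelFactory.createDefaultModel();
>
>         // Accumulate incoming triples; only send once the buffer is large.
>         public void add(Model incoming) {
>             buffer.add(incoming);
>             if (buffer.size() >= BATCH_TRIPLES)
>                 flush();
>         }
>
>         public void flush() {
>             if (buffer.isEmpty())
>                 return;
>             try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds")) {
>                 conn.load(buffer);   // one HTTP request for the whole batch
>             }
>             buffer.removeAll();
>         }
>     }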
>
> >
> > My questions are as follows:
> >
> >  1. Why do 30 update queries of 50k triples each take 3 times more disk
> >     space than a single update query of 1500k triples? The data stored is
> >     the same, but the space consumed is more in the first case.
>
> TDB2 uses an MVCC/copy-on-write scheme for transaction isolation. It
> gives a very high isolation guarantee (serializable).
>
> That means there is a per-transaction space overhead here, which is
> recovered by compact. It can't be recovered at write time because the old
> data may still be in use by read transactions seeing the pre-write state.
>
> Compact is similar (though not identical) to PostgreSQL's VACUUM.
>
> Note that all additional space is recovered by "compact". The active
> directory is the highest-numbered "Data-NNNN". You can delete the earlier
> ones once the "compact" has finished, as reported in the server log. Or zip
> them and keep them as backups - Fuseki has released them and does not
> touch them.  Caution: on MS Windows, due to a long-standing (10+ year)
> Java JDK issue, the server has to be stopped and restarted to properly
> release old files.
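>
> A sketch of that clean-up step, assuming the directory layout described
> above (zero-padded "Data-NNNN" names, so lexical order matches age) and a
> placeholder database path; run it only after "compact" has finished and,
> on Windows, after the server has been restarted:
>
>     import java.io.IOException;
>     import java.nio.file.*;
>     import java.util.Comparator;
>     import java.util.List;
>     import java.util.stream.Collectors;
>     import java.util.stream.Stream;
>
>     public class DeleteOldGenerations {
>         public static void main(String[] args) throws IOException {
>             Path dbDir = Paths.get("/fuseki/databases/ds");   // placeholder location
>             List<Path> generations;
>             try (Stream<Path> entries = Files.list(dbDir)) {
>                 generations = entries
>                         .filter(p -> p.getFileName().toString().matches("Data-\\d+"))
>                         .sorted()
>                         .collect(Collectors.toList());
>             }
>             // Keep the last (active) generation, delete the earlier ones.
>             for (Path old : generations.subList(0, Math.max(0, generations.size() - 1))) {
>                 try (Stream<Path> walk = Files.walk(old)) {
>                     walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
>                 }
>                 System.out.println("Deleted " + old);
>             }
>         }
>     }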
>
> It doesn't matter whether it was one large write transaction or 100
> write transactions: the compacted database will be the same size. The
> database will have grown more for 100 writes than for 1, but more space is
> then recovered, and the new data storage is the same size once you delete
> the now-unused storage areas.
>
> >  2. Is there any other way to solve this memory problem?
>
> Schedule "compact", delete the old data storage.
>
> If the updates are a stream of additions that don't read the database,
> write them to one big file instead (N-Triples or Turtle: just write
> everything, concatenated, to a single file).
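>
> For example, a small sketch that appends each incoming batch to a growing
> N-Triples file (the file name is a placeholder; N-Triples is line-based,
> so plain appending is safe):
>
>     import java.io.FileOutputStream;
>     import java.io.OutputStream;
>     import org.apache.jena.rdf.model.Model;
>     import org.apache.jena.riot.Lang;
>     import org.apache.jena.riot.RDFDataMgr;
>
>     public class AppendBatches {
>         // Append one batch of triples to the accumulating N-Triples file.
>         public static void append(Model batch) throws Exception {
>             try (OutputStream out = new FileOutputStream("pending-updates.nt", true)) {  // true = append
>                 RDFDataMgr.write(out, batch, Lang.NTRIPLES);
>             }
>         }
>     }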
>
> You can also consider, instead of loading into Fuseki, using the bulk
> loader tdb2.tdbloader to build the database offline, then putting it in
> place and starting Fuseki. The bulk loader is significantly faster when
> sizes get into the hundreds of millions of triples.
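>
> A sketch of the offline build, assuming the jena-cmds classes are on the
> classpath; "DB2" and "data.nt" are placeholder paths, and the same thing
> can be run directly with the tdb2.tdbloader script from the Jena
> distribution:
>
>     public class OfflineLoad {
>         public static void main(String[] args) {
>             // Build the TDB2 database offline with the bulk loader,
>             // then point Fuseki at the "DB2" directory and start it.
>             tdb2.tdbloader.main(new String[] { "--loc=DB2", "data.nt" });
>         }
>     }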
>
> >  3. What are the existing strategies that can be used to optimise the
> >     storage space while writing data?
> >  4. Is there any new development going on to use less space for
> >     write/update queries?
>
> Just plans that need resources!
>
> It would be nice to have server-side transactions spanning several updates
> (which is beyond what the SPARQL protocol can do).
>
> --
>
> I've tried TDB with other storage systems (e.g. RocksDB), but the ability
> to write the on-disk format directly is useful - it is what makes the bulk
> loader work.
>
> --
>
> There are other issues as well in your use case.
>
> It also depends on the data: if many triples have unique literals/URIs,
> the node table is proportionately large.
>
>      Andy
>
> >
> >
> > Thanks,
> > Vinay Mahamuni
>
