Hi Andy,

Thank you very much for the answers.
Regards,
Vinay Mahamuni

On Fri, 28 Jan 2022 at 03:28, Andy Seaborne <a...@apache.org> wrote:
> Hi Vinay,
>
> On 27/01/2022 06:14, Vinay Mahamuni wrote:
> > Hello,
> >
> > I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I am using the Jena RDFConnection to connect to the Fuseki server. I am sending 50k triples in one update. This is mostly new data (only a few triples will match existing data). The data are instances based on an ontology. Please have a look at the attached file showing how much disk space increases with each update. For 1.5 million triples, it took around 1.2GB. We want to store a few billion triples, so this bytes/triple ratio won't be good for our use case.
> >
> > When I used the tdb2.tdbcompact tool, the data volume shrank to 400MB. But this extra step needs to be performed manually to optimise the storage.
>
> It can be triggered by an admin process with e.g. "cron".
>
> It doesn't have to be done very often unless your volume of 50k-triple transactions is very high - in which case I suggest batching them into larger units.
>
> > My questions are as follows:
> >
> > 1. Why do 30 update queries of 50k triples each take 3 times more disk space than a single update query of 1500k triples? The data stored is the same, but the space consumed is more in the first case.
>
> TDB2 uses an MVCC/copy-on-write scheme for transaction isolation. It gives a very high isolation guarantee (serializable).
>
> That means there is a per-transaction overhead here which is recovered by compact. The space can't be reclaimed at the time of the write because the old data may still be in use by read transactions seeing the pre-write state.
>
> Compact is similar (but not identical) to PostgreSQL VACUUM.
>
> Note that all additional space is recovered by "compact". The active directory is the highest-numbered "Data-NNNN". You can delete the earlier ones once the "compact" has finished, as logged in the server log. Or zip them and keep them as backups - Fuseki has released them and does not touch them. Caution: on MS Windows, due to a long-standing (10+ year) Java JDK issue, the server has to be stopped and restarted to properly release old files.
>
> It doesn't matter whether it was one large write transaction or 100 write transactions: the compacted database will be the same size. The database will have grown more for 100 writes than for 1, but more space is recovered, and the new data storage is the same size once you delete the now-unused storage areas.
>
> > 2. Is there any other way to solve this storage problem?
>
> Schedule "compact", delete the old data storage.
>
> If the updates are a stream of additions that do not read the database, write a big file (N-Triples, Turtle: just write everything concatenated into a single file).
>
> You can also consider, instead of loading into Fuseki, using the bulk loader tdb2.tdbloader to build the database offline, then putting it in place and starting Fuseki. The bulk loader is significantly faster when sizes get into the hundreds of millions of triples.
>
> > 3. What are the existing strategies that can be used to optimise the storage while writing data?
> > 4. Is there any new development going on to use less storage for write/update queries?
>
> Just plans that need resources!
>
> It would be nice to have server-side transactions over several updates (which is beyond what the SPARQL protocol can do).
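As a rough sketch of the "trigger compact from an admin process" suggestion above: Fuseki exposes compaction through its admin interface, so a small program run from cron can request it. The sketch below assumes Fuseki is at http://localhost:3030 with a dataset named "ds", that the /$/compact/* admin endpoint is reachable from the host running the job, and that this Fuseki version accepts a deleteOld=true parameter to remove the superseded Data-NNNN directories; drop that parameter if it is not supported.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Minimal sketch: ask Fuseki to compact the TDB2 database behind the
    // "ds" dataset. Host, dataset name and the deleteOld parameter are
    // assumptions for illustration, not part of the thread above.
    public class CompactTrigger {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:3030/$/compact/ds?deleteOld=true"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Compaction runs as a server-side task; the response only tells
            // us whether the request was accepted.
            System.out.println(response.statusCode() + " " + response.body());
        }
    }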
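For the "batching them into larger units" suggestion, a minimal client-side sketch with RDFConnection: collect several small chunks into one Model and send them in a single request, so the server performs one write transaction (one set of copy-on-write blocks) instead of many. The endpoint URL and the incomingChunks list are placeholders for whatever produces the data in the real application.

    import java.util.List;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdfconnection.RDFConnection;
    import org.apache.jena.rdfconnection.RDFConnectionFactory;

    // Minimal sketch: accumulate several small batches client-side and send
    // them as one larger request instead of one request per 50k triples.
    public class BatchedLoad {
        public static void main(String[] args) {
            List<Model> incomingChunks = List.of();   // placeholder: real chunks go here

            Model batch = ModelFactory.createDefaultModel();
            for (Model chunk : incomingChunks) {
                batch.add(chunk);                     // accumulate client-side
            }

            try (RDFConnection conn =
                     RDFConnectionFactory.connect("http://localhost:3030/ds")) {
                // One request, so one server-side write transaction.
                conn.load(batch);
            }
        }
    }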
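And for the "write a big file" / offline bulk load route, a minimal sketch that appends each batch to a single N-Triples file instead of sending it to Fuseki. N-Triples is purely line-based, so concatenation is safe. The file name and the batches list are placeholders.

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.util.List;

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    // Minimal sketch: append every batch to one N-Triples file so the data
    // can later be bulk loaded offline before Fuseki is started.
    public class AppendToNTriples {
        public static void main(String[] args) throws Exception {
            List<Model> batches = List.of();          // placeholder: real batches go here

            try (OutputStream out = new FileOutputStream("all-data.nt", true)) { // append mode
                for (Model batch : batches) {
                    RDFDataMgr.write(out, batch, Lang.NTRIPLES);
                }
            }
        }
    }

Once the file is complete, something like "tdb2.tdbloader --loc <database-directory> all-data.nt" builds the database offline before Fuseki is started (exact flags may vary by version).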
> --
>
> I've tried TDB with other storage systems (e.g. RocksDB), but the ability to write the on-disk format directly is useful - it makes the bulk loader work.
>
> --
>
> There are other issues as well in your use case.
>
> It also depends on the data: if many triples have unique literals/URIs, the node table is proportionately large.
>
> Andy
>
> > Thanks,
> > Vinay Mahamuni