Hey Andy,

The basic approach looks sound and I like the simple text-based format; see my notes later on about maybe having a binary serialization as well.

How do you envisage incremental backups being implemented in practice? You suggest in the document that you would take a full RDF dump and then compute the RDF delta from a previous backup. Speaking from the experience of having done this as part of one of the experiments in my PhD, this can be very complex and time-consuming, especially if you need to take care of BNode isomorphism.

I assume from some of the other discussion on BNodes that you expect IDs to remain stable across dumps. That implies a requirement that the database be able to dump RDF using consistent BNode IDs (either internal IDs or some stable round-trippable IDs). Taking ARQ as an example, the existing NQuads/TriG writers do not do this, so there would need to be an option for those writers to support it.

Even without any concerns about BNode isomorphism, comparing two RDF dumps to create a delta could be a very time-consuming operation, and recording the deltas as changes happen may be far more efficient. Of course, depending on the exact use case, the dump-and-compute-delta approach may be acceptable.

My main criticism is of the "Minimise actions" section: there needs to be a more solid clarification of the definitions and of when minimization can and should happen. For example:

"When written in minimise form the RDF Delta can be run backwards, to undo a change. This only works when real changes are recorded because otherwise knowing a triple is added does not mean it was not there before."

While I agree it is necessary to record real changes for deltas to be reverse applied, I'm not convinced they have to be in minimized form (at least based on how the definition of minimized form reads right now). If only real changes are recorded then deltas will be in a minimal form.
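
To make the point about real changes concrete, here is a minimal Python sketch. It is not Jena code and not the format from the wiki page: the class, the function names and the ("A"/"R", triple) log layout are all hypothetical, and triples are plain tuples of strings. It only assumes that an add is logged when the triple was absent and a remove is logged when the triple was present.

```python
# Hypothetical sketch: names and the (op, triple) log layout are illustrative,
# not taken from the RDF Delta document or from Jena.

class RecordingGraph:
    """A set of triples that logs only changes that alter its contents."""

    def __init__(self, triples=()):
        self.triples = set(triples)
        self.log = []  # ("A" | "R", triple) entries in the order applied

    def add(self, triple):
        if triple not in self.triples:   # real change: triple was absent
            self.triples.add(triple)
            self.log.append(("A", triple))

    def remove(self, triple):
        if triple in self.triples:       # real change: triple was present
            self.triples.remove(triple)
            self.log.append(("R", triple))


def reverse_log(log):
    """Invert a log of real changes: flip each action and reverse the order."""
    flip = {"A": "R", "R": "A"}
    return [(flip[op], triple) for op, triple in reversed(log)]


def apply_log(triples, log):
    """Return a new set with the log applied to a copy of `triples`."""
    result = set(triples)
    for op, triple in log:
        if op == "A":
            result.add(triple)
        else:
            result.discard(triple)
    return result
```

Under those assumptions, applying `reverse_log(g.log)` to the current state gets back to the starting state even when the log still contains a remove and re-add of the same triple, i.e. without any further minimization. A log that records an add of a triple that was already present does not reverse correctly, which is the case the quoted sentence warns about.
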
Yet it is not entirely clear whether, by your definition, the following delta would be considered minimal:

A <http://s> <http://p> <http://o>
R <http://s> <http://p> <http://o>
A <http://s> <http://p> <http://o>

I'm assuming that your intention was that such deltas should not be minimized, but perhaps this needs to be made clearer in the document.

On the topic of related work: I think I may have mentioned previously that I've done some research work internally here at YarcData on a general-purpose binary serialization for Triples, Quads and Tuples. It could likely be extended fairly trivially to carry a binary encoding of the deltas as well, which may save space. For ballpark comparison purposes, compression is roughly equivalent to GZipping raw NTriples, with the key advantage being that the format is significantly faster to process, even in its current prototype single-threaded implementation (the design was written to take advantage of parallelism). There are a bunch of further optimizations that I had ideas for but never got as far as implementing because of a lack of management support for the concept.

There has been some discussion of open sourcing this work (likely as a contributed Experimental module to Jena) so that it could be developed outside of the company. If this sounds like it may be of interest, I will broach the subject with relevant management again and see whether this can happen in the near future.

Rob

On 6/18/13 7:26 AM, "Andy Seaborne" <a...@apache.org> wrote:

>I started writing up a format for transferring changes between dataset
>copies (copies in time and in location).
>
>https://cwiki.apache.org/confluence/display/JENA/RDF+Delta
>
>Still rough and ready but I hope it gives a general impression of the
>format and usage.
>
>Comments, thoughts, discussion here on dev@ please.
>
> Andy