Hey Andy

The basic approach looks sound and I like the simple text-based format;
see my notes later on about maybe having a binary serialization as well.

How do you envisage incremental backups being implemented in practice?  You
suggest in the document that you would take a full RDF dump and then
compute the RDF delta from a previous backup.  Speaking from the experience
of having done this as part of one of the experiments in my PhD, this can
be very complex and time consuming, especially if you need to take care
of BNode isomorphism.  I assume from some of the other discussion on
BNodes that you expect IDs to remain stable across dumps, so there is an
implicit requirement here that the database be able to dump RDF using
consistent BNode IDs (either internal IDs or some stable round-trippable
IDs).  Taking ARQ as an example, the existing NQuads/TriG writers do not
do this, so those writers would need an option to support it.

Even without any concerns about BNode isomorphism, comparing two RDF dumps
to create a delta could be a very time consuming operation, and
recording the deltas as changes happen may be far more efficient.  Of
course, depending on the exact use case, the RDF dump and compute delta
approach may be acceptable.
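
To make the dump-and-compare approach concrete, here is a rough sketch in
Python (not Jena code; the function names are made up for illustration).
It assumes one triple per line and that BNode labels are stable across
dumps, so plain string comparison is enough.  Without stable labels this
simple set difference would not work and you would be into isomorphism
checking.

```python
def compute_delta(old_dump, new_dump):
    """Return (removed, added) between two line-based RDF dumps.

    Assumption: each non-empty line is one triple, and blank node labels
    are stable across dumps, so identical triples are identical strings.
    """
    old = {line.strip() for line in old_dump.splitlines() if line.strip()}
    new = {line.strip() for line in new_dump.splitlines() if line.strip()}
    removed = sorted(old - new)  # present before, missing now
    added = sorted(new - old)    # missing before, present now
    return removed, added


def format_delta(removed, added):
    """Render the changes as "R ..." / "A ..." rows, removals first."""
    return [f"R {t}" for t in removed] + [f"A {t}" for t in added]
```

Note that this sketch holds both dumps in memory at once, which is part
of why I would expect the approach to be costly for large datasets.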

My main criticism is of the "Minimise actions" section: there needs to be
a more solid clarification of the definitions and of when minimization can
and should happen.

For example:

"When written in minimise form the RDF Delta can be run backwards, to undo
a change. This only works when real changes are recorded because otherwise
knowing a triple is added does not mean it was not there before."

While I agree it is necessary to record real changes for deltas to be
reverse applied, I'm not convinced they have to be in minimized form (at
least based on how the definition of minimized form reads right now).  If
only real changes are recorded then deltas will be in a minimal form.

Yet it is not entirely clear whether, by your definition, the following
delta would be considered minimal:

A <http://s> <http://p> <http://o>
R <http://s> <http://p> <http://o>
A <http://s> <http://p> <http://o>

I'm assuming that your intention was that such deltas should not be
minimized but perhaps this needs to be made clearer in the document.
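
To illustrate the distinction I'm drawing, here is a small Python sketch
(again not Jena code; record, reverse and apply_delta are hypothetical
names, and the dataset is modelled as a plain set of triple strings).
It records only real changes, and shows that the A/R/A delta above can
be run backwards as it stands, provided each row was a real change.

```python
def record(dataset, ops):
    """Apply ops to dataset (a set) and return only the real changes.

    An add of a triple already present, or a remove of a triple that is
    absent, changes nothing and is therefore not recorded.
    """
    real = []
    for action, triple in ops:
        if action == "A" and triple not in dataset:
            dataset.add(triple)
            real.append((action, triple))
        elif action == "R" and triple in dataset:
            dataset.remove(triple)
            real.append((action, triple))
    return real


def reverse(delta):
    """Build the undo delta: flip A/R and reverse the row order."""
    flip = {"A": "R", "R": "A"}
    return [(flip[action], triple) for action, triple in reversed(delta)]


def apply_delta(dataset, delta):
    """Apply a delta to dataset without recording anything."""
    for action, triple in delta:
        if action == "A":
            dataset.add(triple)
        else:
            dataset.discard(triple)
```

Starting from an empty dataset, recording the three rows above keeps all
three (each is a real change), and applying reverse() of that delta
returns the dataset to empty.  By contrast, if the triple was already
present, a blindly logged "A" row would reverse to an "R" and wrongly
delete it, which is the case the quoted paragraph warns about.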

On the topic of related work:

I think I may have mentioned previously that I've done some research work
internally here at YarcData on a general purpose binary serialization for
Triples, Quads and Tuples.  It could likely be extended fairly trivially
to carry a binary encoding of the deltas as well, which may save space.
For ballpark comparison purposes, compression is roughly equivalent to
GZipping raw NTriples, with the key advantage being that the format is
significantly faster to process, even in its current prototype
single-threaded implementation (the format was designed to take advantage
of parallelism).  There are a bunch of further optimizations that I had
ideas for but never got as far as implementing because of a lack of
management support for the concept.

There has been some discussion of open sourcing this work (likely as a
contributed Experimental module to Jena) so that it could be developed
outside of the company.  If this sounds like it may be of interest, I will
broach the subject with the relevant management again and see whether
this can happen in the near future.

Rob


On 6/18/13 7:26 AM, "Andy Seaborne" <a...@apache.org> wrote:

>I started writing up a format for transferring changes between dataset
>copies (copies in time and in location).
>
>https://cwiki.apache.org/confluence/display/JENA/RDF+Delta
>
>Still rough and ready but I hope it gives a general impression of the
>format and usage.
>
>Comments, thoughts, discussion here on dev@ please.
>
>       Andy
