RDF Patch - experiences suggesting changes

Andy Seaborne Thu, 13 Oct 2016 08:33:33 -0700

I've been using a modified RDF Patch for the data exchanged to keep multiple datasets synchronized.

My primary use case is having multiple copies of the datasets for a high availability solution. It has to be a general solution for any data.


There are some changes to the format that this work has highlighted.

[RDF Patch - v1]
https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets in-step, we also need to record changes to prefixes. While they don't change the meaning of the data, application developers and users like prefixes.


2/ Remove the in-data prefixes feature.

RDF Patch has the feature to define prefixes in the data, using @prefix, and use them for prefix names later in the data.

This seems to have no real advantage, it can slow things down (cf. N-Triples parsing is faster than Turtle parsing - prefixes are part of that), and it generally complicates the data form.

When including "add"/"delete" prefixes on the dataset (1) it also makes it quite confusing.

Whether the "R" ("repeat entry from previous row") feature should also be removed is an open question.


3/ Record transaction boundaries.

(A.3 in RDF Patch v1)
http://afs.github.io/rdf-patch/#transaction-boundaries

Having the transaction boundaries recorded means that they can be replayed when applying the patch. While often a patch will be one transaction, patches can be consolidated by concatenation.


There are 3 operations:

TB, TC, TA - Transaction Begin, Commit, Abort.

Abort is useful to include because knowing whether a transaction in a patch is going to commit or abort means waiting until the end. That would require buffering client-side, or buffering server-side (or not writing the patch to a file), and having a means to discard a patch stream.

Instead, allow a transaction to record an abort, and say that aborted transactions in patches can be discarded downstream.
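
As an illustration of discarding aborted transactions downstream, here is a minimal sketch in Python. It assumes one operation per line with the operation code (TB, TC, TA, QA, ...) as the first token; the function name drop_aborted is hypothetical and not part of any RDF Patch library.

```python
def drop_aborted(lines):
    """Yield patch lines, omitting every transaction that ends in TA.

    Assumption: one operation per line, operation code first.
    A transaction still open at end of input is dropped as well.
    """
    buffered = None  # None means "outside a transaction"
    for line in lines:
        tokens = line.split(maxsplit=1)
        op = tokens[0] if tokens else ""
        if op == "TB":
            buffered = [line]          # start buffering a new transaction
        elif op == "TC" and buffered is not None:
            buffered.append(line)
            yield from buffered        # committed: pass it on
            buffered = None
        elif op == "TA" and buffered is not None:
            buffered = None            # aborted: discard what was buffered
        elif buffered is not None:
            buffered.append(line)
        else:
            yield line                 # header lines etc. pass straight through


patch = [
    "TB",
    "QA <s> <p> <o1> <g> .",
    "TA",
    "TB",
    "QD <s> <p> <o2> <g> .",
    "TC",
]
kept = list(drop_aborted(patch))
```

The writer never has to buffer: it streams TB, the changes, and then TA if things go wrong. Only a consumer that wants a clean log needs to hold one transaction at a time.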


4/ Reversibility is a patch feature.

The RDF Patch v1 document includes "canonical patch" (section 9)
http://afs.github.io/rdf-patch/#canonical-patches

Such a patch is reversible (it can undo changes) if the adds and deletes are recorded only if they lead to a real change. "Add quad" must mean "there was no quad in the set before". But this only makes sense if the whole patch has this property.

RDF Patches are in general entries in a "redo log" - you can apply the patch over and over again and it will end up in the same state (they are idempotent).

A reversible patch is also an "undo log" entry and if you apply it in reverse order, it acts to undo the patch played forwards.

Testing whether a triple or quad is already present while performing updates is not cheap, and in some cases, where the patch is being computed without reference to an existing dataset, it may not be possible.

What would be useful is to label the patch itself to say whether it is reversible.
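
To make the redo/undo distinction concrete, here is a small sketch in Python. It models a dataset as a set of tuples and a patch as a list of ("A", quad) / ("D", quad) pairs; that representation and the function names are assumptions for illustration only.

```python
def apply(dataset, changes):
    """Return a new set with the adds ("A") and deletes ("D") applied in order."""
    result = set(dataset)
    for op, quad in changes:
        if op == "A":
            result.add(quad)
        else:
            result.discard(quad)
    return result


def reverse(changes):
    """Play the patch backwards: reverse the order and swap add/delete."""
    swap = {"A": "D", "D": "A"}
    return [(swap[op], quad) for op, quad in reversed(changes)]


def records_only_real_changes(dataset, changes):
    """True if every add was of an absent quad and every delete of a present one."""
    current = set(dataset)
    for op, quad in changes:
        present = quad in current
        if (op == "A" and present) or (op == "D" and not present):
            return False
        current = apply(current, [(op, quad)])
    return True


before = {("s", "p", "o", "g")}

# Every entry is a real change against `before`, so the patch is reversible.
patch = [("A", ("s", "p", "o2", "g")), ("D", ("s", "p", "o", "g"))]
after = apply(before, patch)
redo_again = apply(after, patch)            # same state: redo is idempotent
undone = apply(after, reverse(patch))       # back to `before`

# Adds a quad that was already present: fine as redo, wrong as undo.
sloppy = [("A", ("s", "p", "o", "g"))]
sloppy_undone = apply(apply(before, sloppy), reverse(sloppy))
```

The last case shows why the label has to apply to the whole patch: undoing the sloppy patch deletes a quad that was there before the patch was applied.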


5/ "RDF Git"

A patch should be able to record where it can be applied. If RDF Patch is being used to keep two datasets in-step, then some checking is needed to know that the patch can be applied to a copy because it is a patch created from the previous version.

So give each version of the dataset a UUID, then record the old ("parent") UUID and the new UUID in the patch.

If the version is checked and enforced, we get a chain of versions and patches that lead from one state to another without risk of concurrent changes getting mixed in.

This is like git - a patch can be accepted if the versions align, otherwise it is rejected (more a git repo not accepting a push than a merge conflict).

Or some system may want to apply any patch and so create a tree of changes. For the use case of keeping two datasets in-step, that's not what is wanted but other use cases may be better served by having the primary version chain sorted out by higher level software; a patch may be a "proposed change".
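
A sketch of the strict "versions must align" case, in Python. The Replica class, its accept method and the VersionMismatch exception are hypothetical names invented for this example; changes are modelled as ("A", quad) / ("D", quad) pairs.

```python
import uuid


class VersionMismatch(Exception):
    """Raised when a patch's parent is not the replica's current version."""


class Replica:
    def __init__(self, version):
        self.version = version
        self.quads = set()

    def accept(self, parent, version, changes):
        # Like a git repo refusing a push: only a patch made from our
        # current version may be applied.
        if parent != self.version:
            raise VersionMismatch(f"at {self.version}, patch expects {parent}")
        for op, quad in changes:
            if op == "A":
                self.quads.add(quad)
            else:
                self.quads.discard(quad)
        self.version = version


v0, v1, v2 = (str(uuid.uuid4()) for _ in range(3))
replica = Replica(v0)
replica.accept(v0, v1, [("A", ("s", "p", "o", "g"))])   # v0 -> v1: accepted

try:
    # A concurrent patch also made from v0 no longer lines up.
    replica.accept(v0, v2, [("D", ("s", "p", "o", "g"))])
    rejected = False
except VersionMismatch:
    rejected = True
```

A system that wants a tree of changes would instead store the rejected patch as a branch from v0 rather than raising.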


6/ Packets of change.

To have 4 (label a patch with reversible) and 5 (the version details), there needs to be somewhere to put the information. Having it in the patch itself means that the whole unit can be stored in a file. If it is in the protocol, like HTTP for ETags, then the information becomes separated. That is not to say that it can't also be in the protocol but it needs support in the data format.


7/ Checksum

Another feature to add to the packet is a checksum. A hash (which one? git uses SHA-1) from the start of the packet header, including the initial version (UUID), the version on applying the patch (UUID) and the changes (i.e. start of packet to after the DOT of the last line of change), makes the packet robust to editing after creating it. This is like git, which uses the hash as the "object id".


So a patch packet for a single transaction:

PARENT <UUID>
VERSION <UUID>
REVERSIBLE           optional
TB
QA ...
QD ...
PA ...
PD ...
TC
H <sha1sum>

where QA and QD are "quad add" and "quad delete", and PA and PD are "add prefix" and "delete prefix".
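
A rough sketch of sealing and checking such a packet, in Python with SHA-1. It hashes every line before the H line, encoded as UTF-8 with "\n" line endings; that is only one reading of "start of packet to the last line of change" (it includes TC), and the helper names seal and verify, the UUID placeholders and the example IRIs are all made up for illustration.

```python
import hashlib


def seal(lines):
    """Join the packet lines and append an "H <sha1>" line covering them."""
    body = "".join(line + "\n" for line in lines)
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
    return body + f"H {digest}\n"


def verify(packet):
    """Recompute the hash over everything before the final H line."""
    body, _, last = packet.rstrip("\n").rpartition("\n")
    if not last.startswith("H "):
        return False
    expected = hashlib.sha1((body + "\n").encode("utf-8")).hexdigest()
    return last[2:] == expected


packet = seal([
    "PARENT <uuid:00000000-0000-0000-0000-000000000001>",   # placeholder UUIDs
    "VERSION <uuid:00000000-0000-0000-0000-000000000002>",
    "REVERSIBLE",
    "TB",
    "QA <http://example/s> <http://example/p> <http://example/o> <http://example/g> .",
    "PA ex <http://example/> .",
    "TC",
])

intact = verify(packet)
tampered = verify(packet.replace("QA", "QD", 1))   # edited after sealing
```

Because the parent and new version lines are inside the hashed span, changing either UUID after the fact invalidates the packet just as editing a change line does.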


        Andy


RDF Patch - updated library
work in progress (does not have "packets").

https://github.com/afs/rdf-delta/tree/master/rdf-patch
