Re: RDF Patch - experiences suggesting changes

A. Soroka Fri, 14 Oct 2016 04:00:07 -0700

Thoughts in-line. (Incidentally, my immediate interest in RDF Patch is pretty 
similar; robustness via distribution, but there's also a smaller, more 
theoretical interest for me in automatically "shredding" or "sharding" datasets 
across networks for higher persistence and query throughput.)

---
A. Soroka
The University of Virginia Library

> On Oct 13, 2016, at 11:32 AM, Andy Seaborne <a...@seaborne.org> wrote:
> 
> ...
> 1/ Record changes to prefixes
> 
> Just handling quads/triples isn't enough - to keep two datasets in-step, we 
> also need to record changes to prefixes.  While they don't change the meaning 
> of the data, application developers and users like prefixes.

Boo, hiss, but I can see your point. My worry would be the inevitable 
semantic overloading that will come with it. But I guess that cake has already 
been baked by all the other RDF formats except N-Triples.

> 2/ Remove the in-data prefixes feature.
> 
> RDF Patch has the feature to define prefixes in the data and use them for 
> prefix names later in the data using @prefix.
> 
> This seems to have no real advantage, it can slow things down (cf. N-Triples 
> parsing is faster than Turtle parsing - prefixes are part of that), and it 
> generally complicates the data form.
> 
> When including "add"/"delete" prefixes on the dataset (1) it also makes it 
> quite confusing.
> 
> Whether the "R" for "repeat" entry from previous row should also be removed 
> is an open question.

I would agree with removing R, and the reason is that it doesn't remove lines. 
In other words, the abbreviation it offers is pretty minimal. On the other 
hand, it is relatively cheap to implement (4 slots of state) so I wouldn't 
argue very much to remove it.
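
To illustrate how little state R needs, here is a minimal sketch in Python. It assumes each row has already been split into an operation code plus up to four terms; that list layout and the name `expand_repeats` are made up for illustration and are not the RDF Patch grammar.

```python
def expand_repeats(rows):
    """Replace each "R" with the term from the same slot of the previous row.

    A row is [op, subject, predicate, object, graph]; this list layout is an
    assumption for illustration, not the RDF Patch grammar.
    """
    previous = [None, None, None, None]  # the four slots of state
    expanded = []
    for op, *terms in rows:
        resolved = []
        for slot, term in enumerate(terms):
            if term == "R":
                if previous[slot] is None:
                    raise ValueError(f"R in slot {slot} has nothing to repeat")
                term = previous[slot]
            resolved.append(term)
        previous[:len(resolved)] = resolved  # remember for the next row
        expanded.append([op, *resolved])
    return expanded


rows = [
    ["QA", "<s>", "<p>", "<o1>", "<g>"],
    ["QA", "R", "R", "<o2>", "R"],
]
expanded = expand_repeats(rows)
```

The expanded patch has exactly as many rows as the abbreviated one, which is the sense in which R saves characters but never lines.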

> 3/ Record transaction boundaries.
> 
> (A.3 in RDF Patch v1)
> http://afs.github.io/rdf-patch/#transaction-boundaries
> 
> Having the transaction boundaries recorded means that they can be replayed 
> when applying the patch.  While often a patch will be one transaction, 
> patches can be consolidated by concatenation.
> 
> There are 3 operations:
> 
> TB, TC, TA - Transaction Begin, Commit, Abort.
> 
> Abort is useful to include because to know whether a transaction in a patch 
> is going to commit or abort means waiting until the end.  That could be 
> buffering client-side, or buffering server-side (or not writing the patch to 
> a file) and having a means to discard a patch stream.
> 
> Instead, allow a transaction to record an abort, and say that aborted 
> transactions in patches can be discarded downstream.

This is very good stuff. It would be nice to include a definition of 
"transaction-compact" in which no TA may appear. It would enable RDF Patch 
readers to make a very convenient assumption. 
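
A rough sketch of what producing a "transaction-compact" patch could look like, in Python. It assumes one operation per line with the code (TB, TC, TA, ...) as the first whitespace-separated token, and it assumes an unterminated transaction at the end of the stream is dropped; both are my assumptions, and `compact_transactions` is a hypothetical name.

```python
def compact_transactions(lines):
    """Yield only the lines of committed transactions, dropping aborted ones."""
    buffered = None  # None means we are outside any transaction
    for line in lines:
        op = line.split(maxsplit=1)[0] if line.strip() else ""
        if op == "TB":
            buffered = [line]  # start buffering a new transaction
        elif op == "TC" and buffered is not None:
            buffered.append(line)
            yield from buffered  # committed: emit the whole transaction
            buffered = None
        elif op == "TA" and buffered is not None:
            buffered = None  # aborted: discard everything since TB
        elif buffered is not None:
            buffered.append(line)
        else:
            yield line  # outside a transaction: pass through unchanged
    # A transaction still open at end of input is never emitted (assumption).


patch = [
    "TB",
    "QA <s> <p> <o> <g>",
    "TA",
    "TB",
    "QD <s> <p> <o> <g>",
    "TC",
]
compacted = list(compact_transactions(patch))
```

The buffering here is exactly the cost that allowing TA in the stream pushes downstream; a reader handed a transaction-compact patch could skip it.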

> 4/ Reversibility is a patch feature.
> 
> The RDF Patch v1 document includes "canonical patch" (section 9)
> http://afs.github.io/rdf-patch/#canonical-patches
> 
> Such a patch is reversible (it can undo changes) if the adds and deletes are 
> recorded only if they lead to a real change.  "Add quad" must mean "there was 
> no quad in the set before".  But this only makes sense if the whole patch has 
> this property.
> ...
> What would be useful is to label the patch itself to say whether it is 
> reversible.

Just a thought: you could change BEGIN to permit "flags". So you could have:

BEGIN REVERSIBLE
patch
patch
patch
END

and you get "canonicity" on a per-transaction level. A patch could optionally 
make explicit its wrapping BEGIN and END for this kind of use.
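
To make the whole-patch requirement concrete, here is a minimal sketch in Python. It models a dataset as a set of opaque quad strings and a patch as (op, quad) pairs, using the QA/QD row names from the packet example in this thread; `apply_patch` and `reverse_patch` are hypothetical helpers, not part of RDF Patch.

```python
def apply_patch(quads, rows):
    """Return a new set with the adds and deletes applied in order."""
    result = set(quads)
    for op, quad in rows:
        if op == "QA":
            result.add(quad)
        elif op == "QD":
            result.discard(quad)
        else:
            raise ValueError(f"unknown operation {op!r}")
    return result


def reverse_patch(rows):
    """Swap adds and deletes and replay them in the opposite order."""
    swap = {"QA": "QD", "QD": "QA"}
    return [(swap[op], quad) for op, quad in reversed(rows)]


base = {"q1"}

# Every row is a real change, so undoing the patch restores the dataset.
canonical = [("QA", "q2"), ("QD", "q1")]
restored = apply_patch(apply_patch(base, canonical), reverse_patch(canonical))

# "QA q1" is a no-op on base, so the reversed patch deletes too much.
redundant = [("QA", "q1")]
damaged = apply_patch(apply_patch(base, redundant), reverse_patch(redundant))
```

In the second case the undo removes a quad that was present before the patch, which is why a single redundant row is enough to make a flag like REVERSIBLE untrue for the whole transaction.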

> 5/ "RDF Git"
> 
> A patch should be able to record where it can be applied.  If RDF Patch is 
> being used to keep two datasets in-step, then some checking is needed to know 
> that the patch can be applied to a copy because it was created from the 
> previous version.
> 
> So give each version of the dataset a UUID for a version then record the old 
> ("parent") UUID and the new UUID in the patch.
> ...
> Or some system may want to apply any patch and so create a tree of changes.  
> For the use case of keeping two datasets in-step, that's not what is wanted 
> but other use cases may be better served by having the primary version chain 
> sorted out by higher level software; a patch may be a "proposed change".

Yes, the roaring success of Git (and other DVCS) may imply that letting patches 
be pure changes (not connected to particular versions of the dataset, just 
"isolated" deltas) is the right way to think about them. The word "patch", 
itself, is usefully suggestive. That doesn't mean avoiding any versioning info, 
just making clear that datasets have versions, and the UUIDs associated with a 
given patch refer to where it _came from_, but you can still apply it to 
whatever you want (like cherry-picking Git commits).

Or another way to think about it: any dataset is just the sum of a series of 
patches (a random dataset with no history has an implicit history of one 
"virtual" patch with nothing but adds). So those UUIDs are roughly equivalent 
to a series of some patch IDs. So I _think_ you could alternatively assign just 
patch IDs and record a "parent" patch ID and a "self" patch ID for each patch. 
Then the question "Am I supposed to be able to use this patch on this dataset?" 
is answerable if you know the patch ID of the last patch applied. Not too 
different from dataset version UUIDs but it avoids introducing the notion of 
dataset version in favor of "pure changes".
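
A small sketch of that check in Python, under the assumption that each patch carries a "parent" and a "self" ID and that the very first ("virtual") patch has no parent. `PatchHeader` and `follows_on` are names I made up for illustration.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PatchHeader:
    parent: Optional[uuid.UUID]  # None for the first ("virtual") patch
    self_id: uuid.UUID


def follows_on(last_applied, header):
    """True if the patch was created directly on top of the last applied one."""
    return header.parent == last_applied


first = PatchHeader(parent=None, self_id=uuid.uuid4())
second = PatchHeader(parent=first.self_id, self_id=uuid.uuid4())

in_step = follows_on(first.self_id, second)  # second continues the chain
out_of_step = follows_on(None, second)       # second was not made from an empty history
```

A system keeping two datasets in-step would refuse a patch when the check fails; a system that wants cherry-picking could treat the same result as advisory and apply the patch anyway.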

> 
> 6/ Packets of change.
> 
> To have 4 (label a patch with reversible) and 5 (the version details), there 
> needs to be somewhere to put the information. Having it in the patch itself 
> means that the whole unit can be stored in a file.  If it is in the protocol, 
> like ETags in HTTP, then the information becomes separated.  That is not to 
> say that it can't also be in the protocol but it needs support in the data 
> format.

As long as the sort of information we are thinking about makes sense on a 
per-transaction basis, it could go where I suggest above: as "metadata" on BEGIN.

> So a patch packet for a single transaction:
> 
> PARENT <UUID>
> VERSION <UUID>
> REVERSIBLE           optional
> TB
> QA ...
> QD ...
> PA ...
> PD ...
> TC
> H <sha1sum>
> 
> where QA and QD are "quad add" "quad delete", and "PA" "PD" are "add prefix" 
> and "delete prefix"

I'm suggesting something more like:

TB PARENT <UUID> VERSION <UUID> REVERSIBLE
QA ...
QD ...
PA ...
PD ...
TC H <sha1sum>

Or even just positionally:

TB <UUID> <UUID> REVERSIBLE
QA ...
QD ...
PA ...
PD ...
TC <sha1sum>
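
For what it's worth, reading the keyword form of the TB line takes very little code. This is only a sketch in Python: it assumes whitespace-separated tokens, IDs written between angle brackets, and that every flag is optional so that a bare TB stays valid. `parse_tb` is a hypothetical name.

```python
def parse_tb(line):
    """Parse 'TB [PARENT <id>] [VERSION <id>] [REVERSIBLE]' into a dict."""
    tokens = line.split()
    if not tokens or tokens[0] != "TB":
        raise ValueError("not a TB line")
    meta = {"parent": None, "version": None, "reversible": False}
    i = 1
    while i < len(tokens):
        key = tokens[i]
        if key in ("PARENT", "VERSION"):
            if i + 1 >= len(tokens):
                raise ValueError(f"{key} is missing its value")
            meta[key.lower()] = tokens[i + 1].strip("<>")  # drop the brackets
            i += 2
        elif key == "REVERSIBLE":
            meta["reversible"] = True
            i += 1
        else:
            raise ValueError(f"unknown flag {key!r}")
    return meta


header = parse_tb("TB PARENT <a> VERSION <b> REVERSIBLE")
plain = parse_tb("TB")
```

The positional form would be shorter still to parse, at the cost of making it harder to leave out one ID while keeping the other.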

I'll add a further point that isn't in response to your thoughts:

You have a section:

> Binary Format
> An alternative wire format for efficient processing.
> (Need to quantify the gains, if any).

You might consider getting rid of R (or even ANY) and just concentrating on 
extreme clarity and speed of parsing for the basic format, leaving all 
abbreviation to an additional binary format that offers compactness. If that 
doesn't make sense, the real point I'm offering is that you have two values in 
hand: parsing efficiency and compactness. It might be difficult to balance the 
two within both a basic and a binary form and still offer any real advantage to 
using binary. But if you separate the values, it might clarify for the user 
when to use basic and when to use binary. Maybe not. Just a thought...

