On 14/10/16 15:22, Rob Vesse wrote:
Thanks for sending this out

Another use case that springs to mind is for write ahead logging
particularly for reversible patches.

Yes. Whether it has to be reversible depends on how it relates to the
journal. If it is the journal, it only has to be a replayable log. If the commit journal is separate, it may need to be reversible.

On the subject of prefixes I agree that being able to record prefix
definitions is useful and I am strongly in favour of not using
them to compact the data. As you say it actually makes reading and
writing the data slower as well as requiring additional state to be
recorded during processing.

I like the use of transaction boundaries. I also like A.Soroka’s
suggestion on making the reversible flag be applied to transaction
begin rather than to the patch as a whole, though I don’t see any
problem with supporting both forms. I think reversible patches are an
essential feature.

I don't understand what capabilities are enabled by transaction granularity if there are multiple transactions in a single patch. Concrete examples of where it helps?

However, I've normally been working with one transaction per patch anyway.

Allowing multiple transactions per patch is for making a collection of (semantically) related changes into a unit, by consolidating small patches into "today's changes" (c.f. git squash).

Leaving the transaction boundaries in gives internal checkpoints, not just one big transaction. It also makes the consolidated patch decomposable (unlike squash).

Internal checkpoints are useful not just for keeping the transaction manageable but also to be able to restart a very large update if it fails part way through for system reasons (server power cut, user reboots laptop by accident, ...). Imagine keeping a DBpedia copy up to date.
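
A minimal sketch of the restart idea in Python. Everything here is an assumption for illustration: rows are plain strings, "TB" and "TC" stand alone on a row, and `split_transactions`, `resume` and the `apply_txn` callback are hypothetical names, not part of RDF Patch. A real system would persist the committed count durably rather than return it.

```python
def split_transactions(rows):
    """Group patch rows into lists, one list per TB ... TC block."""
    txns, current = [], None
    for row in rows:
        if row == "TB":
            current = []                      # start collecting a new transaction
        elif row == "TC" and current is not None:
            txns.append(current)              # transaction complete: keep it
            current = None
        elif current is not None:
            current.append(row)               # a change row inside a transaction
    return txns                               # an unfinished trailing block is ignored


def resume(rows, already_committed, apply_txn):
    """Apply the transactions that were not committed before the failure."""
    txns = split_transactions(rows)
    for index in range(already_committed, len(txns)):
        apply_txn(txns[index])
        already_committed = index + 1         # the checkpoint moves forward
    return already_committed


rows = ["TB", "QA a", "TC", "TB", "QA b", "TC", "TB", "QA c", "TC"]
applied = []
# Pretend the first transaction was committed before the power cut.
done = resume(rows, 1, applied.append)
```

With the boundaries kept, only the second and third transactions are replayed; a single squashed transaction would have to be restarted from the beginning.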

For the version control aspect I would be tempted to not constrain it
to UUID and simply say that it is an identifier for the parent state
to which the patch is applied. This will then allow people the
freedom to use hash algorithms, simple counters etc or any other
version identification scheme they desire. I might even be tempted
to suggest that it should be a URI so that people can use identifiers
in their own name spaces to reduce the chance of collisions.

As long as the ref is globally unique (so not counters without uniquifier).

I mentioned UUIDs really to turn up the contrast. It is not naming a web resource if it is a version. The web resource is mutable - it's the dataset. If someone wants to use http: versions for a way-back-database, that's cool, but making that the way for systems that don't have temporal capabilities (the majority) gets into philosophical debates.

And to keep patches protocol independent.

I have separate work on a protocol for keeping two datasets synced (soft consistency).

I can see the value of supporting meta data about the patch both
within it and in any protocol used to communicate it. Checksums are
fine although if you include this then you probably need to define
exactly how each checksum should be calculated.

Yes.


As for some of the other suggestions you have received:

- I would be strongly against including an ANY term. As soon as you
get into wild cards you may as well just use SPARQL Update. Plus the
meaning of the wild card is dependent on the dataset to which it is
applied which completely defeats the purpose of being a canonical
description of changes.
- I am strongly for including the REPEAT term.
This has the potential to offer significant compression particularly
if the system producing the patch chooses to group changes by subject
and predicate à la Turtle and most other syntaxes.

These two together seem a bit contradictory. The advantage of ANY, with versions, is that it is a form of compression.

Without a version, I agree that it is stepping towards a higher level language for changes.


The compression by subject/predicate leaves me with mixed feelings - compression after hashing would treat them as more orthogonal. compressing even with R

My rule of thumb is x8 to x10 compression of N-Triples/N-Quads. That's not all coming from same-subject etc. I assume it comes from effectively spotting the namespaces and making them compression tokens.

- Having a term for the default graph could prove useful

        Andy


Rob


On 13/10/2016 16:32, "Andy Seaborne" <andy.seabo...@gmail.com on
behalf of a...@seaborne.org> wrote:

I've been using modified RDF Patch for the data exchanged to keep
multiple datasets synchronized.

My primary use case is having multiple copies of the datasets for a
high availability solution.  It has to be a general solution for any
data.

There are some changes to the format that this work has highlighted.

[RDF Patch - v1] https://afs.github.io/rdf-patch/


1/ Record changes to prefixes

Just handling quads/triples isn't enough - to keep two datasets
in-step, we also need to record changes to prefixes.  While they
don't change the meaning of the data, application developers and
users like prefixes.

2/ Remove the in-data prefixes feature.

RDF Patch has the feature to define prefixes in the data and use them
 for prefix names later in the data using @prefix.

This seems to have no real advantage, it can slow things down (c.f.
N-Triples parsing is faster than Turtle parsing - prefixes are part of
that), and it generally complicates the data form.

When including "add"/"delete" prefixes on the dataset (1) it also
makes it quite confusing.

Whether the "R" for "repeat" entry from previous row should also be
removed is an open question.

3/ Record transaction boundaries.

(A.3 in RDF Patch v1)
http://afs.github.io/rdf-patch/#transaction-boundaries

Having the transaction boundaries recorded means that they can be
replayed when applying the patch.  While often a patch will be one
transaction, patches can be consolidated by concatenation.

There are 3 operations:

TB, TC, TA - Transaction Begin, Commit, Abort.

Abort is useful to include because to know whether a transaction in a
 patch is going to commit or abort means waiting until the end.  That
 could be buffering client-side, or buffering server-side (or not
writing the patch to a file) and having a means to discard a patch
stream.

Instead, allow a transaction to record an abort, and say that aborted
 transactions in patches can be discarded downstream.
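
A minimal sketch of discarding aborted transactions downstream, in Python. The row syntax (an operation code as the first token of each row) and the function name `drop_aborted` are assumptions for illustration only. Only one transaction is buffered at a time, so the producer can stream rows without knowing in advance whether the transaction will commit.

```python
def drop_aborted(lines):
    """Yield rows of transactions that end in TC; drop those that end in TA."""
    pending = None                  # rows of the open transaction, None outside one
    for line in lines:
        op = line.split(maxsplit=1)[0] if line.strip() else ""
        if op == "TB":
            pending = [line]
        elif op == "TC" and pending is not None:
            pending.append(line)
            yield from pending      # committed: release the buffered rows
            pending = None
        elif op == "TA" and pending is not None:
            pending = None          # aborted: discard everything since TB
        elif pending is not None:
            pending.append(line)
        else:
            yield line              # header or other row outside a transaction


patch = [
    "TB", "QA <s1> <p> <o> <g> .", "TA",
    "TB", "QA <s2> <p> <o> <g> .", "TC",
]
kept = list(drop_aborted(patch))
```

A transaction that is still open when the stream ends is never released, which treats a truncated stream the same way as an abort.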

4/ Reversibility is a patch feature.

The RDF Patch v1 document includes "canonical patch" (section 9)
http://afs.github.io/rdf-patch/#canonical-patches

Such a patch is reversible (it can undo changes) if the adds and
deletes are recorded only if they lead to a real change.  "Add quad"
must mean "there was no quad in the set before".  But this only makes
sense if the whole patch has this property.

RDF Patches are in general entries in a "redo log" - you can apply
the patch over and over again and it will end up in the same state
(they are idempotent).

A reversible patch is also an "undo log" entry and if you apply it in
 reverse order, it acts to undo the patch played forwards.
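
A minimal sketch of playing a patch backwards, in Python. It assumes rows are `(op, argument)` pairs, a dataset is a plain set of quads, and that reversing means reading the rows in reverse order while swapping add with delete and TB with TC; the names `SWAP`, `reverse_patch` and `apply_patch` are hypothetical. Prefix rows are swapped but not applied, to keep the sketch short.

```python
# Each operation and its opposite; TB and TC swap so the boundaries stay well-formed.
SWAP = {"QA": "QD", "QD": "QA", "PA": "PD", "PD": "PA", "TB": "TC", "TC": "TB"}


def reverse_patch(rows):
    """Build the undo patch: reversed row order, each operation inverted."""
    return [(SWAP[op], arg) for op, arg in reversed(rows)]


def apply_patch(quads, rows):
    """Apply quad adds and deletes to a copy of the set of quads."""
    result = set(quads)
    for op, arg in rows:
        if op == "QA":
            result.add(arg)
        elif op == "QD":
            result.discard(arg)
    return result


before = {"q1", "q2"}

# Every row is a real change, so the patch is reversible.
patch = [("TB", None), ("QD", "q1"), ("QA", "q3"), ("TC", None)]
after = apply_patch(before, patch)
restored = apply_patch(after, reverse_patch(patch))

# "q2" was already present, so this add is not a real change.
sloppy = [("QA", "q2")]
not_restored = apply_patch(apply_patch(before, sloppy), reverse_patch(sloppy))
```

Here `restored` equals `before`, while `not_restored` has lost "q2": undoing an add that changed nothing deletes a quad that was there all along, which is why every row has to record a real change.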

Testing whether a triple or quad is already present while performing
 updates is not cheap - and in some cases where the patch is being
computed without reference to an existing dataset may not be
possible.

What would be useful is to label the patch itself to say whether it
is reversible.

5/ "RDF Git"

A patch should be able to record where it can be applied.  If RDF
Patch is being used to keep two datasets in-step, then some checking
is needed to know that the patch can be applied to a copy because it
is a patch created from the previous version.

So give each version of the dataset a UUID for a version then record
the old ("parent") UUID and the new UUID in the patch.

If the version is checked and enforced, we get a chain of versions and
patches that lead from one state to another without risk of
concurrent changes getting mixed in.

This is like git - a patch can be accepted if the versions align
otherwise it is rejected (more a git repo not accepting a push than a
 merge conflict).
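
A minimal sketch of the accept-or-reject check, in Python. The `Replica` class, its `accept` method and the `VersionMismatch` exception are hypothetical names for illustration; the only idea taken from the text is that a patch names its parent version and the version it produces.

```python
import uuid


class VersionMismatch(Exception):
    """The patch was not created from the replica's current version."""


class Replica:
    def __init__(self, version):
        self.version = version      # identifier of the current dataset state
        self.applied = []           # changes accepted so far, in order

    def accept(self, parent, new_version, changes):
        # Like a repo refusing a push: the patch must start from our state.
        if parent != self.version:
            raise VersionMismatch(f"expected {self.version}, got {parent}")
        self.applied.append(changes)
        self.version = new_version


v0, v1, v2 = uuid.uuid4(), uuid.uuid4(), uuid.uuid4()
replica = Replica(v0)
replica.accept(v0, v1, ["QA ..."])          # parent matches: accepted

try:
    replica.accept(v0, v2, ["QD ..."])      # stale parent: rejected
    rejected = False
except VersionMismatch:
    rejected = True
```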

Or some system may want to apply any patch and so create a tree of
changes.  For the use case of keeping two datasets in-step, that's
not what is wanted but other use cases may be better served by having
the primary version chain sorted out by higher level software; a
patch may be a "proposed change".

6/ Packets of change.

To have 4 (label a patch with reversible) and 5 (the version
details), there needs to be somewhere to put the information. Having
it in the patch itself means that the whole unit can be stored in a
file.  If it is in the protocol, like HTTP for E-tags then the
information becomes separated.  That is not to say that it can't also
be in the protocol but it needs support in the data format.

7/ Checksum

Another feature to add to the packet is a checksum. A hash (which
one? git uses SHA1) from start of packet header, including the
initial version (UUID), the version on applying the patch (UUID) and
the changes (i.e. start of packet to after the DOT of the last line
of change), makes the packet robust to editing after creating it.
Like git; git uses it as the "object id".

So a patch packet for a single transaction:

PARENT <UUID>
VERSION <UUID>
REVERSIBLE           optional
TB
QA ...
QD ...
PA ...
PD ...
TC
H <sha1sum>

where QA and QD are "quad add" and "quad delete", and PA and PD are
"add prefix" and "delete prefix".
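
A minimal sketch of sealing and checking such a packet, in Python. The details are assumptions, since the exact calculation is not defined here: every row before the H row is hashed, rows are joined with a newline, encoded as UTF-8, and SHA-1 is used as in git. The names `seal` and `verify` and the placeholder identifiers in the example are hypothetical.

```python
import hashlib


def seal(rows):
    """Return the rows followed by an H row carrying their SHA-1 digest."""
    body = "".join(row + "\n" for row in rows)
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
    return rows + [f"H <{digest}>"]


def verify(packet):
    """Recompute the digest over all rows before H and compare with the H row."""
    *body, last = packet
    return seal(body)[-1] == last


packet = seal([
    "PARENT <uuid-of-old-version>",
    "VERSION <uuid-of-new-version>",
    "REVERSIBLE",
    "TB",
    "QA <s> <p> <o> <g> .",
    "TC",
])

# Editing any row after the packet was created breaks the check.
tampered = list(packet)
tampered[4] = "QD <s> <p> <o> <g> ."
```

`verify(packet)` is true and `verify(tampered)` is false. Any interoperable definition would have to pin down the byte-level details (line endings, encoding, which rows are covered) that this sketch simply picks.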

Andy


[RDF Patch - v1] https://afs.github.io/rdf-patch/

RDF Patch - updated library work in progress (does not have
"packets").

https://github.com/afs/rdf-delta/tree/master/rdf-patch




