Re: RDF Delta - recording changes to RDF Datasets

Andy Seaborne Thu, 20 Jun 2013 06:15:40 -0700

I think the use of N-Quads order is better, given N-Quads exists. I always think of quads as G-S-P-O (but I have no idea why!) and it just got written that way because.

The format does really need to be parsed in complete rows before deciding what to do with a row so, caveat very large literals (VLL), batching by graph isn't greatly affected.


VLL (Very Long Literals) of themselves could do with special handling.

But at the same time, I'd like to assume subjects-as-literals which means they are not necessarily in the final object slot in GSPO order when you could imagine special handling enabled by G-first.


Added a comments/todo section to not lose any of these points.

    Andy


On 19/06/13 00:56, Rob Vesse wrote:

The format already allows arbitrarily sized tuples (well in the current
form it is capped at 255 columns per tuple) though it assumes that this
will be used to convey SPARQL results and thus currently requires that
column headers be provided.  Both those restrictions would be fairly easy
to remove.

I will raise the issue of open sourcing with management again and see if I
get any traction.

On the subject of column ordering, I can see benefits of putting the <g>
field first in that it may make it easier to batch operations on a single
graph, though I don't think putting it at the end to align with NQuads
precludes this; you just require slightly more lookahead to determine
whether to continue adding statements to your batch.
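
A minimal sketch of that lookahead point, not taken from the RDF Delta
document: with the graph in the last column, each row is read in full and
only then compared with the graph of the current batch. The class name
GraphBatcher and the String[] row layout {s, p, o, g} are assumptions made
for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups consecutive rows that share the same graph.
final class GraphBatcher {
    // Each row is assumed to be {s, p, o, g}, i.e. graph last as in N-Quads.
    static List<List<String[]>> batchByGraph(List<String[]> rows) {
        List<List<String[]>> batches = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        String currentGraph = null;
        for (String[] row : rows) {
            // The graph is only known once the whole row has been read.
            String graph = row[3];
            if (!current.isEmpty() && !graph.equals(currentGraph)) {
                batches.add(current);          // graph changed: close the batch
                current = new ArrayList<>();
            }
            currentGraph = graph;
            current.add(row);
        }
        if (!current.isEmpty()) {
            batches.add(current);              // flush the final batch
        }
        return batches;
    }
}
```

Only consecutive rows are grouped here, so a graph that reappears later in
the stream starts a new batch.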

Rob



On 6/18/13 4:41 PM, "Stephen Allen" <sal...@apache.org> wrote:

On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne <a...@apache.org> wrote:

On 18/06/13 22:13, Rob Vesse wrote:

Hey Andy


Hi Rob - thanks for the comments - really appreciate feedback -

The basic approach looks sound and I like the simple text based format,
see my notes later on about maybe having a binary serialization as
well.


A binary form would be excellent for this and for NT and NQ.  One of the
speed limitations is parsing and Turtle is slower than NT (this isn't just
a Jena effect).  gzip is neutral for reading but slows down writing.  So a
fast file format would be quite useful to add to the tool box.


How do you envisage incremental backups being implemented in practice?  You
suggest in the document that you would take a full RDF dump and then
compute the RDF delta from a previous backup.  Talking from the experience
of having done this as part of one of my experiments in my PhD, this can be
very complex and time consuming to do, especially if you need to take care
of BNode isomorphism.  I assume from some of the other discussion on
BNodes that you assume that IDs will remain stable across dumps, thus
there is an implicit requirement here that the database be able to dump
RDF using consistent BNode IDs (either internal IDs or some stable
round-trippable IDs).  Taking ARQ as an example, the existing NQuads/TriG
writers do not do this so there would need to be an option for those
writers to be able to support this.


Shh, don't tell anyone but n-quads and n-triples outputs do dump
recoverable bNode labels :-)  TriG and Turtle do not - they try to be
pretty.  The readers need a tweak to recover them but the label->Node code
has an option for various label policies and recover id from label is one
of them.  This is not exposed formally - it's strictly illegal for RDF
syntaxes.  Or use <_:label> URIs.

I have prototyped a wrapper dataset that records changes as they happen
driven off add(quad) and delete(quad).  This produces the RDF Delta (sp!)
form so couple to xtn and you can have a "live incremental backup".

A strict after-the-event delta would be prohibitively expensive.


Even without any concerns of BNode isomorphism, comparing two RDF dumps to
create a delta could be a potentially very time consuming operation and
recording the deltas as changes happen may be far more efficient.  Of
course depending on the exact use case the RDF dump and compute delta
approach may be acceptable.


It isn't a delta in the set theory A\B sense - nor is it a diff (it's not
reversible without the additional condition).  "delta" and "diff" are both
names I've toyed with - "RDF changes" might better capture the idea.  Or
"RDF Changes Log".


My main criticism is on the "Minimise actions" section, there needs to be
a more solid clarification of definitions and when minimization can and
should happen.


Yes - it isn't as well covered in the doc.

Logically - or generally - in the event-generating dataset wrapper:

         if ( contains(g,s,p,o) ) {
             record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
             return ;
         }

         add(g,s,p,o) ;
         record(QuadAction.ADD,g,s,p,o) ;        // Action.

https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder

but implementations like TDB can do it without the contains() as the
indexes already return true/false for whether a change occurred or not.
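
As a rough illustration of that last point (a sketch only, not the code in
the recorder project above and not TDB's API): java.util.Set already
reports whether add() or remove() changed the set, so a wrapper can record
real changes without a separate contains() call. The class name
RecordingQuadSet and the "A"/"R" row prefixes are assumptions for this
example, and unlike the pseudocode above it simply drops no-op calls
rather than recording them as NO_ADD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical recorder: a quad is a List of four strings (g, s, p, o) and
// an in-memory Set stands in for the real dataset.
final class RecordingQuadSet {
    private final Set<List<String>> quads = new HashSet<>();
    private final List<String> log = new ArrayList<>();

    void add(String g, String s, String p, String o) {
        // Set.add returns true only if the quad was not already present.
        if (quads.add(Arrays.asList(g, s, p, o))) {
            log.add("A " + g + " " + s + " " + p + " " + o);
        }
    }

    void delete(String g, String s, String p, String o) {
        // Set.remove returns true only if the quad was actually removed.
        if (quads.remove(Arrays.asList(g, s, p, o))) {
            log.add("R " + g + " " + s + " " + p + " " + o);
        }
    }

    // The recorded rows, oldest first; only real changes appear here.
    List<String> log() {
        return log;
    }
}
```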

For example:

"When written in minimise form the RDF Delta can be run backwards, to
undo a change. This only works when real changes are recorded because
otherwise knowing a triple is added does not mean it was not there before."

While I agree it is necessary to record real changes for deltas to be
reverse applied I'm not convinced they have to be in minimized form (at
least based on how the definition of minimized form reads right now), if
only real changes are recorded then deltas will be in a minimal form.

Yet it is not entirely clear whether, by your definition, the following
delta would be considered minimal:

A <http://s> <http://p> <http://o>
R <http://s> <http://p> <http://o>
A <http://s> <http://p> <http://o>


If the dataset did not originally contain <http://s> <http://p> <http://o>
then that is minimal.  Each row makes a real change; it's the fact that
graphs/datasets are sets of triples/quads that means the real change is
needed.
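
To make the reversibility point concrete, here is a small sketch under
stated assumptions: rows look like the A/R example above, each row is a
real change, and the dataset is modelled as a plain Set of statement
strings. ChangeLogReplay and its method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical replay helper for rows of the form "A <s> <p> <o>" / "R <s> <p> <o>".
final class ChangeLogReplay {
    // Apply the rows in order.
    static void apply(Set<String> data, List<String> rows) {
        for (String row : rows) {
            applyRow(data, row.charAt(0), row.substring(2));
        }
    }

    // Undo the rows: walk the log backwards and invert each action.
    // This restores the starting state only if every row was a real change.
    static void undo(Set<String> data, List<String> rows) {
        for (int i = rows.size() - 1; i >= 0; i--) {
            String row = rows.get(i);
            applyRow(data, inverse(row.charAt(0)), row.substring(2));
        }
    }

    private static char inverse(char action) {
        switch (action) {
            case 'A': return 'R';
            case 'R': return 'A';
            default: throw new IllegalArgumentException("Unknown action: " + action);
        }
    }

    private static void applyRow(Set<String> data, char action, String statement) {
        if (action == 'A') {
            data.add(statement);
        } else if (action == 'R') {
            data.remove(statement);
        } else {
            throw new IllegalArgumentException("Unknown action: " + action);
        }
    }
}
```

If the first "A" row had been recorded for a triple that was already
present, undo() would remove a triple that existed before the log started,
which is why only real changes can be run backwards.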


I'm assuming that your intention was that such deltas should not be
minimized but perhaps this needs to be more clear in the document.


There is no reason not to allow the redundant first two A-D to be removed
but it's not required.
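
One way such optional removal could look, as a sketch rather than anything
defined in the document: in a log of real changes the actions on any one
statement alternate, so an add/remove pair on the same statement cancels
out. ChangeLogCompactor is a hypothetical name, and rows are assumed to
use the A/R form from the example above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical compactor for a log that records only real changes.
final class ChangeLogCompactor {
    static List<String> compact(List<String> rows) {
        Map<String, Character> lastAction = new HashMap<>();      // for validation
        Map<String, Character> pending = new LinkedHashMap<>();   // uncancelled actions
        for (String row : rows) {
            char action = row.charAt(0);
            String statement = row.substring(2);
            if (action != 'A' && action != 'R') {
                throw new IllegalArgumentException("Unknown action: " + action);
            }
            // Real changes on one statement must alternate between A and R.
            Character previous = lastAction.put(statement, action);
            if (previous != null && previous == action) {
                throw new IllegalArgumentException("Not a real-change log: " + row);
            }
            // An action following its opposite cancels it; otherwise it is pending.
            if (pending.remove(statement) == null) {
                pending.put(statement, action);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Character> entry : pending.entrySet()) {
            result.add(entry.getValue() + " " + entry.getKey());
        }
        return result;
    }
}
```

Applied to the three-row example above, this leaves just the final "A"
row. The relative order of rows for different statements may change, which
does not affect the resulting dataset because each statement is added or
removed independently.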


  On the topic of related work:


I think I may have mentioned previously that I've done some research work
internally here at YarcData on a general purpose binary serialization for
Triples, Quads and Tuples which likely could be fairly trivially extended
to carry a binary encoding of the deltas as well which may save space.
For ball park comparison purposes compression is roughly equivalent to
GZipping raw NTriples with the key advantage being that the format is
significantly faster to process even in its current prototype single
threaded implementation (the design was written to take advantage of
parallelism).  There are a bunch of further optimizations that I had ideas
for that I never got as far as implementing because of lack of management
support for the concept.


My experience is that the cost of writing gzip is an appreciable slowdown.
If your binary form removes that cost it would help full backups quite a
lot.


There has been some discussion of open sourcing this work (likely as a
contributed Experimental module to Jena) so that it could be developed
outside of the company, if this sounds like it may be of interest I will
broach the subject with relevant management again and see whether this can
happen in the near future.


Please do.  I find the style of having a text form and a binary form makes
system building easier.  Text files to debug; binary for production.

We can add e.g. .ntz and .nqz to the known formats -- modules can add
language, syntaxes, parsers and writers.  The JSON-LD module does, so I
know it does work from outside; all the built-in ones actually register
themselves the same way and have no specials.

Rob:

I would definitely be interested in a binary format for both triples and
quads.  In fact, if it could be generalized to handle arbitrarily sized RDF
tuples, that would be even better.  I would like to replace the current
text-based solution used for the spill-to-disk functionality.

Andy:
I like what you've done and think it could be very useful.  One suggestion:
the order of the tuples should be <s> <p> <o> <g> to match the N-Quads
format [1].


-Stephen

[1] http://www.w3.org/TR/n-quads/
