Andy

Comments inline:

On 19/06/2014 17:06, "Andy Seaborne" <a...@apache.org> wrote:

>Lizard needs to do network transfer of RDF data.  Rather than just doing
>something specific to Lizard, I've started on a general binary RDF
>module using Apache Thrift.
>
>== RDF-Thrift
>Work in Progress :: https://github.com/afs/rdf-thrift/
>
>Discussion welcome.
>
>
>The current plan is to have three supported abstractions:
>
>1. StreamRDF
>2. SPARQL Result Sets
>3. RDF patch (which is very like StreamRDF but with A and D markers).
>
>A first pass for StreamRDF is done, including some attempts to reduce
>object churn when crossing the abstraction boundaries.  Abstraction is
>all very well, but repeated conversion of data structures can slow
>things down.
>
>Using StreamRDF means that prefix compression can be done.
>
>See
>   https://github.com/afs/rdf-thrift/blob/master/RDF.thrift
>for the encoding at the moment for just RDF.

Looks like a sane encoding from what I understand of Thrift
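
For anyone following along without opening the repo, the general shape of
such an encoding is a tagged union per RDF term plus a struct per tuple,
with a prefix-name form to support the prefix compression mentioned above.
The names below are illustrative guesses, not necessarily what RDF.thrift
actually declares:

```thrift
// Illustrative sketch only - see RDF.thrift in the repo for the real thing.
struct RDF_IRI        { 1: required string iri }
struct RDF_PrefixName { 1: required string prefix;  2: required string localName }
struct RDF_BNode      { 1: required string label }
struct RDF_Literal    { 1: required string lex;
                        2: optional string langtag;
                        3: optional string datatype }

// One term = exactly one of these variants.
union RDF_Term {
  1: RDF_IRI        iri
  2: RDF_BNode      bnode
  3: RDF_Literal    literal
  4: RDF_PrefixName prefixName   // prefix-compressed IRI
}

struct RDF_Triple {
  1: required RDF_Term S;
  2: required RDF_Term P;
  3: required RDF_Term O;
}
```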

>
>== In Jena
>
>There are a number of places this might be useful:
>
>1/ Fuseki and "application/sparql-results+thrift", "application/x-thrift"
>
>(oh dear, "application/x-thrift", "x-" is not encouraged any more due to
>the transition problem c.f. "application/x-www-form-urlencoded")
>
>2/ Hadoop-RDF
>
>This is currently using N-Triples/N-Quads.  Rob - presumably this would
>be useful eventually.  AbstractNodeTupleWritable /
>AbstractNLineFileInputFormat look about right to me, but that's from
>code-reading not code-doing.

Yes and No

The concerns on Hadoop are somewhat different.  It is
advantageous/required that the Hadoop code has direct control over the
binary serialisation because of the contract for Writable.  This is
needed both to support serialisation and deserialisation of values and
in order to optionally provide direct comparisons on the binary
representation of terms, which has substantial performance benefits
because it avoids unnecessarily deserialising terms.

It is unclear to me whether using RDF Thrift would allow this or not,
or whether the overhead of Thrift would be greater overall.

Certainly it would be possible to support an RDF Thrift-based binary
RDF input & output format regardless of how the Writables are defined.
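
To illustrate the raw-comparison point: if the byte order of the
serialised form matches the term order, keys can be sorted without ever
deserialising them, which is what Hadoop's RawComparator contract
enables.  A minimal plain-Java sketch (no Hadoop types; the length-prefix
framing here is made up for the example):

```java
import java.nio.charset.StandardCharsets;

// Sketch: compare two serialised terms directly on their bytes,
// as a Hadoop RawComparator would, without deserialising either one.
public class RawTermCompare {
    // Hypothetical framing: 4-byte big-endian length, then UTF-8 bytes.
    static byte[] encode(String term) {
        byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[4 + utf8.length];
        out[0] = (byte) (utf8.length >>> 24);
        out[1] = (byte) (utf8.length >>> 16);
        out[2] = (byte) (utf8.length >>> 8);
        out[3] = (byte) utf8.length;
        System.arraycopy(utf8, 0, out, 4, utf8.length);
        return out;
    }

    // Unsigned lexicographic comparison over the payload bytes only.
    static int compareRaw(byte[] a, byte[] b) {
        int lenA = a.length - 4, lenB = b.length - 4;
        int n = Math.min(lenA, lenB);
        for (int i = 0; i < n; i++) {
            int d = (a[4 + i] & 0xFF) - (b[4 + i] & 0xFF);
            if (d != 0) return d;
        }
        return lenA - lenB;
    }

    public static void main(String[] args) {
        byte[] x = encode("http://example/a");
        byte[] y = encode("http://example/b");
        // For plain ASCII IRIs, byte order matches string order.
        System.out.println(compareRaw(x, y) < 0);
    }
}
```

Whether a Thrift-generated serialisation preserves that ordering
property is exactly the open question above.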

>
>(I know you/Cray have some internal binary RDF)

Yes, though the intent of that format is somewhat different.  It was
designed to be a parallel-friendly, RDF-specific compression format:
besides a global header at the start of the stream, it is block
oriented, such that each block is entirely independent of the others
and requires only the data in the global header and itself in order to
permit decompression.

For small data there will be little/no benefit; for large data the
compression achieved is roughly equivalent to gzipped N-Triples, with
the primary advantage that it is substantially faster to produce
(about 5x) and potentially even faster given a good parallel
implementation.  Of course, what we have is mostly just a prototype
and it hasn't been heavily optimised, so there may be more performance
to be had.
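
The block-independence property is the whole trick: each block is
compressed separately, so any block can be decompressed alone (and in
parallel).  A minimal sketch of the idea using plain java.util.zip,
with each "block" deflated independently (this is not Cray's format,
just an illustration of the structure):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: compress blocks independently so any one of them can be
// decompressed without touching the others.
public class BlockCompress {
    static byte[] deflate(byte[] in) {
        Deflater d = new Deflater();
        d.setInput(in);
        d.finish();
        byte[] buf = new byte[in.length * 2 + 64];
        int n = d.deflate(buf);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    static String inflate(byte[] in) throws Exception {
        Inflater inf = new Inflater();
        inf.setInput(in);
        byte[] buf = new byte[8192];
        int n = inf.inflate(buf);
        inf.end();
        return new String(buf, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Two "blocks" of N-Triples-ish lines, compressed separately.
        String[] blocks = {
            "<http://example/s1> <http://example/p> \"a\" .\n",
            "<http://example/s2> <http://example/p> \"b\" .\n"
        };
        List<byte[]> compressed = new ArrayList<>();
        for (String b : blocks)
            compressed.add(deflate(b.getBytes(StandardCharsets.UTF_8)));
        // Decompress block 1 on its own - no dependence on block 0.
        System.out.println(inflate(compressed.get(1)).equals(blocks[1]));
    }
}
```

Gzipping a whole N-Triples file, by contrast, produces one stream that
must be decompressed sequentially from the start.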

>
>3/ Data bags and spill to disk
>
>4/ RDF patch
>
>5/ TDB (v2 - it would be a disk change) could usefully use the RDF
>term encoding for the node table.

Would this actually save much space?

It looks like you'd only save a few bytes, because you still have to
store the bulk of the term encoding; you just lose some of the surface
syntax that something like an N-Triples encoding would give you.
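
A back-of-envelope example of the point (the binary framing assumed
here is illustrative, not RDF Thrift's actual wire format): for an IRI,
N-Triples spends two bytes on angle brackets where a binary encoding
spends roughly a type tag plus a length prefix, so the payload
dominates either way.

```java
import java.nio.charset.StandardCharsets;

// Rough size comparison for one IRI term in the node table.
public class NodeTableSizes {
    public static void main(String[] args) {
        String iri = "http://example/some/longish/resource";
        int payload  = iri.getBytes(StandardCharsets.UTF_8).length;
        int ntriples = payload + 2;      // '<' ... '>'
        int binary   = payload + 1 + 1;  // assumed: 1-byte tag + 1-byte length
        System.out.println(ntriples - binary);
    }
}
```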

Rob

>
>6/ Files.  Add to RIOT as a new syntax (a fairly direct access to
>StreamRDF+Thrift) which then helps TDB loading.
>
>7/ Caching result sets in queries in Fuseki.
>
>In an ideal world, the Thrift format could be shared across toolkits.
>There is nothing Jena specific about the wire encoding.
>
>== Thrift vs Protocol Buffer(+netty)
>
>The Lizard prototype currently uses Protocol Buffer + netty.  Doing RDF
>Thrift has a way to learn about Thrift.
>
>All the reviews and comparisons on the interweb seem to be borne out:
>there isn't a huge difference between the two.
>
>Thrift's initial entry costs are higher (documentation is still weak;
>the maven artifact does not have a maven-compatible source artifact
>(!!!), so you have to mangle one yourself, which isn't hard; the
>source is there but in a non-standard form).
>
>Thrift has its own networking; I'm unlikely to use the service (RPC)
>layer from Thrift in Lizard itself, as it is not fully streaming, but
>driving the next layer down directly is quite easy (as it is in PB+N).
>
>Protocol Buffers does not have a network layer, it's just the byte
>encoding, but Netty comes with built-in protocol buffer handling
>(PB+N).  That works fine as well, and I have gone back and found the
>equivalent functionality to what I have used in RDF Thrift.
>
>For binary RDF and its general use, Thrift's wider language coverage
>is a plus point.
>
>       Andy



