Hi Stephen,
We have resisted making Node serializable. It has usually been
requested for RPC purposes, in conjunction with serializable graphs,
and it is better to use RDF syntaxes for that. In the current design,
automatic Java serialization should not end up pulling in a lot of
other stuff, but (long term, no plans) Node may become an interface so
that things like Parliament can have Nodes carrying internal data
around (no, Nodes will not be tied to storage layers - that would
break use of inference and probably lots of other things). To repeat:
there are no plans to make Node an interface - an experiment has been
done.
ObjectOutputStreams might be too complicated. In writing to an
ObjectOutputStream, there may be more than one Java object written per
Node (e.g. for literals, the lexical form and the datatype), and the
whole shared-reference mechanism of ObjectOutputStream may or may not
be a win. It needs to keep state during writing to do that, which
might be a loss when the objective is to have beyond-memory data
structures.
And ObjectOutputStream.writeUTF is limited to a 2-byte length prefix.
While 64K strings seem like a lot, for a general mechanism it's a bit
of a pain (Uniprot has 69KB literals; Clerezza have been experimenting
with 3MB literals).
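(For illustration only - a minimal, hypothetical sketch of sidestepping
the writeUTF limit with a 4-byte length prefix; this is not existing
Jena code:)

import java.io.*;

// Hypothetical workaround: DataOutput.writeUTF uses a 2-byte length
// prefix, so encoded strings over 64KB throw UTFDataFormatException.
// Writing a 4-byte length and the raw UTF-8 bytes removes the ceiling.
class LongStrings {
    static void writeLongString(DataOutputStream out, String s)
            throws IOException {
        byte[] bytes = s.getBytes("UTF-8");
        out.writeInt(bytes.length);   // 4-byte length: up to 2GB
        out.write(bytes);
    }

    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, "UTF-8");
    }
}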
XML is not fast to parse - a lot of bytes, and the layering of the
processing in order to reuse a standard parser can incur costs. StAX
is better, and its streaming model would suit this use, but there
seems to me to be little value in XML internally - this isn't about
transfer between applications. Instead, JSON or RIOT (which parse
faster) can be used.
> However,
> having serializable binding objects has the potential benefit of being
> useful for other parts of query execution that could be memory bound
> (sort, join, distinct, group by, etc.).
I agree - a general mechanism for spill-to-disk iterators of bindings
could be so useful that even an implementation specific to this case
is worthwhile. The functionality already in Java looks to be not that
helpful, although using DataOutputStream as one implementation of a
"BindingOutputStream" would be interesting to compare to doing it with
the parser functionality.
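To make that concrete, here is a rough, hypothetical sketch of what a
DataOutputStream-based "BindingOutputStream" might look like (it
assumes ARQ's Binding.vars()/.get() and borrows FmtUtils, mentioned
below, for the node text):

import java.io.*;
import java.util.Iterator;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.util.FmtUtils;

// Hypothetical: one row per binding, alternating var and value,
// terminated by ".", so rows with different variables stay
// self-describing.
class BindingOutputStream {
    private final DataOutputStream out;

    BindingOutputStream(OutputStream os) {
        this.out = new DataOutputStream(new BufferedOutputStream(os));
    }

    void write(Binding binding) throws IOException {
        for (Iterator<Var> iter = binding.vars(); iter.hasNext(); ) {
            Var var = iter.next();
            out.writeUTF("?" + var.getVarName());
            // Turtle-ish rendering of the node; note the writeUTF size
            // caveat above for very large literals.
            out.writeUTF(FmtUtils.stringForNode(binding.get(var)));
        }
        out.writeUTF(".");   // row terminator
    }

    void close() throws IOException { out.close(); }
}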
There are some building blocks already.
The RIOT parser suite is built on top of a general tokenizer that
understands Turtle-style tokens and a few extras. The extras were put
in for this sort of situation, for extensibility. The tokenizer is
tuned for speed, so it natively recognizes IRIs and literals with
language tags, for example.
Writing a binding could be done as one row of alternating var and
value (because bindings may have different variables in them in
different rows). Alternatively, a table with declared columns could be
done. The possible columns can be calculated from the syntax of a
query, although this isn't easily available currently.
As well as the usual RDF tokens (IRIs, literals, bnodes), it does
variables and "symbols", where symbols are things that can be used to
extend the language.
A simple run-length-encoding compression scheme would keep the space
down: use "*" to mean "same as the row before in this position", and
prefixes can be used to compress URIs:
?s <http://example/> ?p <http://example/p> ?o 123 .
* ?p <http://example/p1> ?o "hello" .
Further token replacement of common strings (e.g. "<http://") would
also get the size down quite easily. That also compresses numerical
data (the datatype is in the syntax, not an explicit declaration).
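A sketch of that "*" scheme over already-formatted rows (positional
comparison against the previous row; all the names here are made up):

import java.util.List;

// Made-up sketch of the "*" run-length idea: emit "*" for a var/value
// pair identical to the pair in the same position of the previous
// row. Each pair is {var, formatted value}.
class RowCompressor {
    private List<String[]> prev = null;

    String encodeRow(List<String[]> row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            String var = row.get(i)[0], value = row.get(i)[1];
            boolean same = prev != null && i < prev.size()
                && prev.get(i)[0].equals(var)
                && prev.get(i)[1].equals(value);
            if (same)
                sb.append("* ");   // same as the row before
            else
                sb.append(var).append(' ').append(value).append(' ');
        }
        prev = row;
        return sb.append('.').toString();
    }
}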
The fact that the output is sort of "human readable" helps debugging :-)
Aside: in working on RIOT, I have found that reading from gzip streams
is slightly slower than working on the uncompressed data, even though
the uncompressed route involves more I/O bytes. If it's across a
network, I'm sure the reverse would be true. But compression is a lot
more expensive than decompression for gzip. My guess is that gzip
compression will not be a win.
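(A crude sketch of such a comparison - placeholder file names, not the
actual measurement code:)

import java.io.*;
import java.util.zip.GZIPInputStream;

// Drain the same data once plain and once gzip-compressed and compare
// wall-clock time.
class GzipReadTiming {
    static void drain(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        while (in.read(buf) != -1) { /* discard */ }
        in.close();
    }

    public static void main(String[] args) throws IOException {
        long t0 = System.currentTimeMillis();
        drain(new BufferedInputStream(new FileInputStream("data.nt")));
        long t1 = System.currentTimeMillis();
        drain(new GZIPInputStream(new FileInputStream("data.nt.gz")));
        long t2 = System.currentTimeMillis();
        System.out.println("plain: " + (t1-t0) + "ms, gzip: " + (t2-t1) + "ms");
    }
}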
The output side of write-node-to-stream is something I've been meaning
to do better for a while. There is FmtUtils, which can turn RDF terms
into Strings; it really should have been RDF term to output, where the
output might be a wrapped StringBuilder or ByteArrayOutputStream.
Unfortunately, FmtUtils does the job quite well, even if it makes a
copy, and it has the advantage that it can provide the length of the
output before the actual output, which is sometimes needed, or at
least convenient (c.f. TDB NodecSSE).
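The shape it should have had is roughly this (a hypothetical
interface; the adapter still goes via FmtUtils and so still copies):

import java.io.IOException;
import java.io.Writer;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.util.FmtUtils;

// Hypothetical "RDF term to output" shape: stream the formatted term
// to a Writer (which may wrap a StringBuilder or a
// ByteArrayOutputStream).
interface NodeWriter {
    void write(Node node, Writer out) throws IOException;

    // Placeholder implementation that still builds a String internally.
    static class ViaFmtUtils implements NodeWriter {
        public void write(Node node, Writer out) throws IOException {
            out.write(FmtUtils.stringForNode(node));
        }
    }
}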
There is some code in an experimental system to write streams of RIOT
tokens. It may even do the binding/row stuff. I can't look it up very
easily at the moment, because some SF services like "view SVN" are offline.
So something in the "option 1" class would be very valuable.
As for integration with TDB, which works in NodeIds, not Nodes: things
like sort need the full node anyway.
> - Does it matter that Bindings coming out of the deserializer
> will be flat and lose any notion of their original types?
No, it shouldn't matter. The parent structure is used because, in
query processing, adding bindings is done by sharing the previous
results, avoiding a copy and saving some space.
> - Should the Binding that comes out of the deserializer be
> immutable or should it properly implement .add() and .addAll()
> (for the SPARQL Update case it can definitely be immutable,
> but I'm not sure if it needs to be elsewhere in the query
> execution process)?
Immutable is probably fine. It's not possible to "set" a binding
currently, only to add to one; once a variable is assigned, it can't
be changed. The general style is that some stage creates a binding or
extends its input and finishes its work; the binding is not changed
after that stage.
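Put together, the deserialized side could be something like this flat,
immutable copy (a hypothetical class, using only Binding.vars() and
.get()):

import java.util.*;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;

// Hypothetical sketch of a flat, immutable binding: copy the var->node
// pairs once (flattening any parent chain) and support lookup only.
class BindingFlat {
    private final Map<Var, Node> map;

    BindingFlat(Binding src) {
        Map<Var, Node> m = new HashMap<Var, Node>();
        for (Iterator<Var> it = src.vars(); it.hasNext(); ) {
            Var v = it.next();
            m.put(v, src.get(v));   // get() consults parents as needed
        }
        this.map = Collections.unmodifiableMap(m);
    }

    Iterator<Var> vars() { return map.keySet().iterator(); }
    Node get(Var var)    { return map.get(var); }
}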
Andy
On 27/01/11 23:35, Stephen Allen wrote:
1) Serialize the Binding objects as they are generated and before they are
applied to the triple template(s). Two methods of doing so are:
1a) Create a Serializable Binding object and use Java's ObjectOutputStream.
Here I could check to see if the Binding object implemented Serializable
already and just write it out, or copy it into a new Serializable Binding
object if it didn't. This would allow stores to serialize the binding
object themselves, which could be of benefit to systems like TDB which would
store its internal NodeIds instead of Nodes (some mechanism of passing the
serialized Binding object type and other important objects, such as the
NodeTable reference, around the serialization gap would probably be needed).
I would also have to make new Serializable Node objects to parallel the Node
subclasses (or modify the existing ones to use Serializable instead of
Object in the "label" field).
1b) Implement a custom serializer for Binding and Node_* objects. Could be
binary or XML based. Maybe leveraging the
com.hp.hpl.jena.sparql.resultset.XMLOutputResultSet class if we wanted to
use XML.
2) Serialize the generated Triples after applying the Bindings to the
insert/delete templates. This has the benefit of using a slightly modified
N-Quads serializer/deserializer (changed to restore blank nodes back to
their internal Jena IDs). A further optimization would be to wrap this in a
compressed input/output stream.
I'm not sure which approach would be better for space efficiency; I guess it
would really depend on the specific query as to whether the list of bindings
or list of triples would be larger or smaller. As of now it seems like 2)
would be slightly easier to implement since I wouldn't have to create a
serializer/deserializer. However, it has the drawback of being less general
and also forcing the generated triples to be materialized to Nodes and would
mean that store implementations would not be able to leverage it if they
wanted to generate triples of NodeIds when applying the templates. Also it
could be fragile in relying on internal blank node ids passing through the
RDF writer and reader. 1a) does not look too difficult if I can make Node
serializable, but then this change affects both Jena and ARQ. However,
having serializable binding objects has the potential benefit of being
useful for other parts of query execution that could be memory bound (sort,
join, distinct, group by, etc.).
I would like to tackle option 1a), but I have a few questions:
- I want to make sure that there would be no major adverse effects from
making the Node classes Serializable and the Node label field Serializable.
- The Binding.getParent() method. What is this used for? I think I can
ignore this and store just the results of .vars(), and results of .get(var)
for each variable since these will retrieve any required info from the
parents as necessary.
- Does it matter that Bindings coming out of the deserializer will be flat
and lose any notion of their original types?
- Should the Binding that comes out of the deserializer be immutable or
should it properly implement .add() and .addAll() (for the SPARQL Update
case it can definitely be immutable, but I'm not sure if it needs to be
elsewhere in the query execution process)?