Hi Stephen,
We have resisted making Node serializable. It has usually been
requested for RPC purposes, in conjunction with serializable graphs,
and it is better to use RDF syntaxes for that. In the current design,
automatic Java serialization should not end up pulling in a lot of
other stuff, but (long term, no plans) Node may become an interface so
that things like Parliament can have Nodes carrying internal data
around (no, Nodes will not be tied to storage layers - that would
break use of inference and probably lots of other things). To repeat:
there are no plans to make Node an interface - an experiment has been
done.
ObjectOutputStreams might be too complicated. In writing to an
ObjectOutputStream, there may be more than one Java object written per
Node (e.g. for literals, the lexical form and the datatype), and the
whole shared-reference mechanism of ObjectOutputStream may or may not
be a win. It needs to keep state during writing to do that, which
might be a loss when the objective is to have beyond-memory data
structures.
And ObjectOutputStream.writeUTF is limited to a 2-byte length prefix.
While 64K strings seem like a lot, for a general mechanism it's a bit
of a pain (Uniprot has 69KB literals; Clerezza have been experimenting
with 3MB literals).
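(For illustration only - a minimal, hypothetical sketch of sidestepping
the writeUTF limit with a 4-byte length prefix; this is not existing
Jena code:)

import java.io.*;

// Hypothetical workaround: DataOutput.writeUTF uses a 2-byte length
// prefix, so encoded strings over 64KB throw UTFDataFormatException.
// Writing a 4-byte length and the raw UTF-8 bytes removes the ceiling.
class LongStrings {
    static void writeLongString(DataOutputStream out, String s)
            throws IOException {
        byte[] bytes = s.getBytes("UTF-8");
        out.writeInt(bytes.length);   // 4-byte length: up to 2GB
        out.write(bytes);
    }

    static String readLongString(DataInputStream in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return new String(bytes, "UTF-8");
    }
}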
XML is not fast to parse - a lot of bytes, and the layering of the
processing in order to reuse a standard parser can incur costs. StAX
is better, and its streaming model would suit this use, but there
seems to me to be little value in XML internally - this isn't about
transfer between applications. Instead, JSON or RIOT (which parse
faster) can be used.
> However,
> having serializable binding objects has the potential benefit of being
> useful for other parts of query execution that could be memory bound
> (sort, join, distinct, group by, etc.).
I agree - a general mechanism for spill-to-disk iterators of bindings
could be so useful that even an implementation specific to this case
is worthwhile. The functionality already in Java looks to be not that
helpful, although using DataOutputStream as one implementation of a
"BindingOutputStream" would be interesting to compare to doing it with
the parser functionality.
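To make that concrete, here is a rough, hypothetical sketch of what a
DataOutputStream-based "BindingOutputStream" might look like (it
assumes ARQ's Binding.vars()/.get() and borrows FmtUtils, mentioned
below, for the node text):

import java.io.*;
import java.util.Iterator;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.util.FmtUtils;

// Hypothetical: one row per binding, alternating var and value,
// terminated by ".", so rows with different variables stay
// self-describing.
class BindingOutputStream {
    private final DataOutputStream out;

    BindingOutputStream(OutputStream os) {
        this.out = new DataOutputStream(new BufferedOutputStream(os));
    }

    void write(Binding binding) throws IOException {
        for (Iterator<Var> iter = binding.vars(); iter.hasNext(); ) {
            Var var = iter.next();
            out.writeUTF("?" + var.getVarName());
            // Turtle-ish rendering of the node; note the writeUTF size
            // caveat above for very large literals.
            out.writeUTF(FmtUtils.stringForNode(binding.get(var)));
        }
        out.writeUTF(".");   // row terminator
    }

    void close() throws IOException { out.close(); }
}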
There are some building blocks already.
The RIOT parser suite is built on top of a general tokenizer that
understands Turtle-style tokens and a few extras. The extras were put
in for this sort of situation, for extensibility. The tokenizer is
tuned for speed, so it natively recognizes IRIs and literals with
language tags, for example.
Writing a binding could be done as one row of alternating var and
value (because bindings may have different variables in them in
different rows). Alternatively, a table with declared columns could be
done. The possible columns can be calculated from the syntax of a
query, although this isn't easily available currently.
As well as the usual RDF tokens (IRIs, literals, bnodes), it does
variables and "symbols", where symbols are things that can be used to
extend the language.
A simple run-length-encoding compression scheme would keep the space
down: use "*" to mean "same as the row before in this position", and
prefixes can be used to compress URIs:
?s <http://example/> ?p <http://example/p> ?o 123 .
* ?p <http://example/p1> ?o "hello" .
Further token replacement of common strings (e.g. "<http://") would
also get the size down quite easily. That also compresses numerical
data (the datatype is in the syntax, not an explicit declaration).
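A sketch of that "*" scheme over already-formatted rows (positional
comparison against the previous row; all the names here are made up):

import java.util.List;

// Made-up sketch of the "*" run-length idea: emit "*" for a var/value
// pair identical to the pair in the same position of the previous
// row. Each pair is {var, formatted value}.
class RowCompressor {
    private List<String[]> prev = null;

    String encodeRow(List<String[]> row) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            String var = row.get(i)[0], value = row.get(i)[1];
            boolean same = prev != null && i < prev.size()
                && prev.get(i)[0].equals(var)
                && prev.get(i)[1].equals(value);
            if (same)
                sb.append("* ");   // same as the row before
            else
                sb.append(var).append(' ').append(value).append(' ');
        }
        prev = row;
        return sb.append('.').toString();
    }
}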
The fact that the output is sort of "human readable" helps debugging :-)
Aside: in working on RIOT, I have found that reading from gzip streams
is slightly slower than working on the uncompressed data, even though
the uncompressed route involves more I/O bytes. If it's across a
network, I'm sure the reverse would be true. But compression is a lot
more expensive than decompression for gzip. My guess is that gzip
compression will not be a win.
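(A crude sketch of such a comparison - placeholder file names, not the
actual measurement code:)

import java.io.*;
import java.util.zip.GZIPInputStream;

// Drain the same data once plain and once gzip-compressed and compare
// wall-clock time.
class GzipReadTiming {
    static void drain(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        while (in.read(buf) != -1) { /* discard */ }
        in.close();
    }

    public static void main(String[] args) throws IOException {
        long t0 = System.currentTimeMillis();
        drain(new BufferedInputStream(new FileInputStream("data.nt")));
        long t1 = System.currentTimeMillis();
        drain(new GZIPInputStream(new FileInputStream("data.nt.gz")));
        long t2 = System.currentTimeMillis();
        System.out.println("plain: " + (t1-t0) + "ms, gzip: " + (t2-t1) + "ms");
    }
}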
The output side of write-node-to-stream is something I've been meaning
to do better for a while. There is FmtUtils, which can turn RDF terms
into Strings; it really should have been RDF term to output, where the
output might be a wrapped StringBuilder or ByteArrayOutputStream.
Unfortunately, FmtUtils does the job quite well, even if it makes a
copy, and it has the advantage that it can provide the length of the
output before the actual output, which is sometimes needed, or at
least convenient (c.f. TDB NodecSSE).
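The shape it should have had is roughly this (a hypothetical
interface; the adapter still goes via FmtUtils and so still copies):

import java.io.IOException;
import java.io.Writer;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.util.FmtUtils;

// Hypothetical "RDF term to output" shape: stream the formatted term
// to a Writer (which may wrap a StringBuilder or a
// ByteArrayOutputStream).
interface NodeWriter {
    void write(Node node, Writer out) throws IOException;

    // Placeholder implementation that still builds a String internally.
    static class ViaFmtUtils implements NodeWriter {
        public void write(Node node, Writer out) throws IOException {
            out.write(FmtUtils.stringForNode(node));
        }
    }
}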
There is some code in an experimental system to write streams of RIOT
tokens. It may even do the binding/row stuff. I can't look it up very
easily at the moment, because some SF services like "view SVN" are offline.
So something in the "option 1" class would be very valuable.
As for integration with TDB, which works in NodeIds, not Nodes: things
like sort need the full node anyway.
> - Does it matter that Bindings coming out of the deserializer
> will be flat and lose any notion of their original types?
No, it shouldn't matter. The parent structure is used because, in
query processing, adding bindings is done by sharing the previous
results, avoiding a copy and saving some space.
> - Should the Binding that comes out of the deserializer be
> immutable or should it properly implement .add() and .addAll()
> (for the SPARQL Update case it can definitely be immutable,
> but I'm not sure if it needs to be elsewhere in the query
> execution process)?
Immutable is probably fine. It's not possible to "set" a binding
currently, only to add to one; once a variable is assigned, it can't
be changed. The general style is that some stage creates a binding or
extends its input and finishes its work; the binding is not changed
after that stage.
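Put together, the deserialized side could be something like this flat,
immutable copy (a hypothetical class, using only Binding.vars() and
.get()):

import java.util.*;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;

// Hypothetical sketch of a flat, immutable binding: copy the var->node
// pairs once (flattening any parent chain) and support lookup only.
class BindingFlat {
    private final Map<Var, Node> map;

    BindingFlat(Binding src) {
        Map<Var, Node> m = new HashMap<Var, Node>();
        for (Iterator<Var> it = src.vars(); it.hasNext(); ) {
            Var v = it.next();
            m.put(v, src.get(v));   // get() consults parents as needed
        }
        this.map = Collections.unmodifiableMap(m);
    }

    Iterator<Var> vars() { return map.keySet().iterator(); }
    Node get(Var var)    { return map.get(var); }
}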
Andy
On 27/01/11 23:35, Stephen Allen wrote:
1) Serialize the Binding objects as they are generated and before they are
applied to the triple template(s). Two methods of doing so are:
1a) Create a Serializable Binding object and use Java's ObjectOutputStream.
Here I could check to see if the Binding object implemented Serializable
already and just write it out, or copy it into a new Serializable Binding
object if it didn't. This would allow stores to serialize the binding
object themselves, which could be of benefit to systems like TDB which would
store its internal NodeIds instead of Nodes (some mechanism of passing the
serialized Binding object type and other important objects, such as the
NodeTable reference, around the serialization gap would probably be needed).
I would also have to make new Serializable Node objects to parallel the Node
subclasses (or modify the existing ones to use Serializable instead of
Object in the "label" field).
1b) Implement a custom serializer for Binding and Node_* objects. Could be
binary or XML based. Maybe leveraging the
com.hp.hpl.jena.sparql.resultset.XMLOutputResultSet class if we wanted to
use XML.
2) Serialize the generated Triples after applying the Bindings to the
insert/delete templates. This has the benefit of using a slightly modified
N-Quads serializer/deserializer (changed to restore blank nodes back to
their internal Jena IDs). A further optimization would be to wrap this in a
compressed input/output stream.
I'm not sure which approach would be better for space efficiency; I guess it
would really depend on the specific query as to whether the list of bindings
or list of triples would be larger or smaller. As of now it seems like 2)
would be slightly easier to implement since I wouldn't have to create a
serializer/deserializer. However, it has the drawback of being less general
and also forcing the generated triples to be materialized to Nodes and would
mean that store implementations would not be able to leverage it if they
wanted to generate triples of NodeIds when applying the templates. Also it
could be fragile in relying on internal blank node ids passing through the
RDF writer and reader. 1a) does not look too difficult if I can make Node
serializable, but then this change affects both Jena and ARQ. However,
having serializable binding objects has the potential benefit of being
useful for other parts of query execution that could be memory bound (sort,
join, distinct, group by, etc.).
I would like to tackle option 1a), but I have a few questions:
- I want to make sure that there would be no major adverse effects from
making the Node classes Serializable and the Node label field Serializable.
- The Binding.getParent() method. What is this used for? I think I can
ignore this and store just the results of .vars(), and results of .get(var)
for each variable since these will retrieve any required info from the
parents as necessary.
- Does it matter that Bindings coming out of the deserializer will be flat
and lose any notion of their original types?
- Should the Binding that comes out of the deserializer be immutable or
should it properly implement .add() and .addAll() (for the SPARQL Update
case it can definitely be immutable, but I'm not sure if it needs to be
elsewhere in the query execution process)?