Andy,

I have started implementing the serializer (SinkBindingOutput), using org.openjena.riot.SinkQuadOutput as a guide and OutputLangUtils to print out the variables/values. I created the deserializer (LangBindings) by extending org.openjena.riot.lang.LangNTuple. I'm using the paired var/value format you described below. For now I'll start with a straightforward implementation with no compression, but I like your ideas in this area. I'll try to do some measurements to see whether any of the compression schemes are beneficial.
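The core of the serializer currently looks something like this (just a sketch - error handling is rough, I'm going through FmtUtils strings rather than OutputLangUtils' direct output methods for the moment, and I'm assuming the usual Sink send/flush/close shape):

    import java.io.IOException ;
    import java.io.Writer ;
    import java.util.Iterator ;

    import org.openjena.atlas.lib.Sink ;

    import com.hp.hpl.jena.graph.Node ;
    import com.hp.hpl.jena.sparql.core.Var ;
    import com.hp.hpl.jena.sparql.engine.binding.Binding ;
    import com.hp.hpl.jena.sparql.util.FmtUtils ;

    // One row per binding: alternating var/value pairs, terminated by "."
    public class SinkBindingOutput implements Sink<Binding>
    {
        private final Writer out ;

        public SinkBindingOutput(Writer out) { this.out = out ; }

        public void send(Binding binding)
        {
            try {
                for ( Iterator<Var> iter = binding.vars() ; iter.hasNext() ; )
                {
                    Var var = iter.next() ;
                    Node value = binding.get(var) ;
                    out.write("?"+var.getName()+" ") ;
                    // FmtUtils makes a string copy; fine for a first cut.
                    out.write(FmtUtils.stringForNode(value)+" ") ;
                }
                out.write(".\n") ;
            } catch (IOException ex) { throw new RuntimeException(ex) ; }
        }

        public void flush()
        { try { out.flush() ; } catch (IOException ex) { throw new RuntimeException(ex) ; } }

        public void close() { flush() ; }
    }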
I did not define an org.openjena.riot.Lang enum entry for the deserializer (because it isn't an RDF language), but I was planning on putting the LangBindings class in the org.openjena.riot.lang package.

For determining when to spill bindings to disk, there are a few options (in order of least difficulty):

1) Store the binding objects in a list, and spill them to disk once the list size passes a threshold.

2) Start serializing bindings immediately into something like DeferredFileOutputStream [1], which retains the data in memory until it passes a memory threshold.

3) Do 1), but try to calculate the in-memory size of the bindings and use a memory threshold instead of a binding-count threshold.

I think 1) should be sufficient if we come up with a reasonable guess for the threshold. Option 2) gives much better control over memory use, but I think the cost of unnecessarily serializing/deserializing small queries may be too high.

-Stephen

[1] http://commons.apache.org/io/api-release/org/apache/commons/io/output/DeferredFileOutputStream.html

-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Monday, January 31, 2011 7:22 AM
To: [email protected]
Subject: Re: SPARQL 1.1 Update in ARQ 2.8.7

Hi Stephen,

We have resisted making Node serializable, although that has usually been requested for RPC purposes in conjunction with serializable graphs, and it is better to use RDF syntaxes for that. Automatic Java serialization should not, in the current design, end up pulling in a lot of stuff, but (long term, no plans) Node may become an interface so that things like Parliament can have Nodes carrying internal data around (no - Nodes will not be tied to storage layers; that would break use of inference and probably lots of other things). To repeat: there are no plans to make Nodes interfaces - an experiment has been done.

ObjectOutputStreams might be too complicated. In writing to an ObjectOutputStream there may be more than one Java object written per Node (e.g. for literals, the lexical form and the datatype), and the whole shared-reference mechanism of ObjectOutputStream may or may not be a win. It needs state during writing to do that, so this might be a loss when the objective is to have beyond-memory data structures. And ObjectOutputStream.writeUTF is limited to a 2-byte length. While 64K strings seems a lot, for a general mechanism it's a bit of a pain (Uniprot has 69Kb literals; Clerezza have been experimenting with 3Mb literals).

XML is not fast to parse - a lot of bytes, and the layering of the processing in order to reuse a standard parser can incur costs. StAX is better, and the streaming model is going to suit this use, but there seems to me to be little value in XML internally - it's not about transfer between applications. Instead, JSON or RIOT (which parse faster) can be used.

> However,
> having serializable binding objects has the potential benefit of being
> useful for other parts of query execution that could be memory bound
> (sort, join, distinct, group by, etc.).

I agree - a general mechanism for spill-to-disk iterators of bindings could be so useful that an implementation specific to this case is worthwhile. The functionality already in Java looks to be not that helpful, although using DataOutputStream as one implementation of a "BindingOutputStream" would be interesting to compare to doing it with the parser functionality. There are some building blocks already.
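The DataOutputStream version might be no more than this (a rough sketch, untested; FmtUtils does the node encoding here, and writeUTF carries the same 64K limit mentioned above, so a real version would want length-prefixed bytes instead):

    import java.io.DataOutputStream ;
    import java.io.IOException ;
    import java.util.Iterator ;

    import com.hp.hpl.jena.graph.Node ;
    import com.hp.hpl.jena.sparql.core.Var ;
    import com.hp.hpl.jena.sparql.engine.binding.Binding ;
    import com.hp.hpl.jena.sparql.util.FmtUtils ;

    // Sketch: each row is a pair count, then (var name, encoded value) pairs.
    public class BindingOutputStream
    {
        private final DataOutputStream out ;

        public BindingOutputStream(DataOutputStream out) { this.out = out ; }

        public void write(Binding binding) throws IOException
        {
            out.writeInt(binding.size()) ;            // pairs in this row
            for ( Iterator<Var> iter = binding.vars() ; iter.hasNext() ; )
            {
                Var var = iter.next() ;
                out.writeUTF(var.getName()) ;         // NB: 64K limit applies
                out.writeUTF(FmtUtils.stringForNode(binding.get(var))) ;
            }
        }
    }

Reading it back with DataInputStream is symmetric, which would make the comparison with the tokenizer route straightforward.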
The RIOT parser suite is built on top of a general tokenizer that understands Turtle-style tokens and a few extras. The extras were put in for exactly this sort of situation, for extensibility. The tokenizer is tuned for speed, so it natively recognizes IRIs and literals with language tags, for example.

Writing a binding could be done as one row of alternating var and value (because bindings may have different variables in them in different rows). Alternatively, a table with declared columns could be done; the possible columns can be calculated from the syntax of a query, although this isn't easily available currently. As well as the usual RDF tokens (IRIs, literals, bnodes), the tokenizer does variables and "symbols", where symbols are things that can be used to extend the language.

A simple run-length-encoding compression scheme would be low-cost: use "*" to mean "same as the row before in this position", and prefixes can be used to compress URIs:

?s <http://example/> ?p <http://example/p> ?o 123 .
* ?p <http://example/p1> ?o "hello" .

Further token replacement with common strings (e.g. "<http://") would also get the size down quite easily. That also compresses numerical data (the datatype is syntax, not an explicit declaration). The fact that the output is sort of "human readable" helps debugging :-)
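The writer side of that scheme is only a few lines (a sketch only - the real thing would sit on the RIOT token output rather than a raw writer, the class name is made up, and whether "*" tracks by variable or strictly by position is a detail to settle):

    import java.io.PrintWriter ;
    import java.util.Iterator ;

    import com.hp.hpl.jena.graph.Node ;
    import com.hp.hpl.jena.sparql.core.Var ;
    import com.hp.hpl.jena.sparql.engine.binding.Binding ;
    import com.hp.hpl.jena.sparql.util.FmtUtils ;

    // Sketch: emit "*" where a variable repeats its value from the previous row.
    public class RleBindingWriter
    {
        private final PrintWriter out ;
        private Binding previous = null ;

        public RleBindingWriter(PrintWriter out) { this.out = out ; }

        public void write(Binding binding)
        {
            for ( Iterator<Var> iter = binding.vars() ; iter.hasNext() ; )
            {
                Var var = iter.next() ;
                Node value = binding.get(var) ;
                if ( previous != null && value.equals(previous.get(var)) )
                    out.print("* ") ;     // same var+value as the row before
                else
                    out.print("?"+var.getName()+" "+FmtUtils.stringForNode(value)+" ") ;
            }
            out.println(".") ;
            previous = binding ;
        }
    }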
Aside: in working on RIOT, I have found that reading from gzip streams is slightly slower than working on the uncompressed data, despite the uncompressed form involving more I/O bytes. If it's across a network, I'm sure the reverse would be true. But compression is a lot more expensive than decompression for gzip. My guess is that gzip compression will not be a win.

The output side of write-node-to-stream is something I've been meaning to do better for a while. There is FmtUtils, which can turn RDF terms into Strings, and which really should have been RDF-term-to-output, one form of which is a wrapped StringBuilder or ByteArrayOutputStream. Unfortunately, FmtUtils does the job quite well, even if it makes a copy, and it has the advantage that it can provide the length of the output before the actual output, which is sometimes needed, or at least convenient (c.f. TDB NodecSSE).

There is some code in an experimental system to write streams of RIOT tokens. It may even do the binding/row stuff. I can't look it up very easily at the moment, because some SF services like "view SVN" are offline.

So something in the "class 1" would be very valuable. As for integration with TDB, which works in NodeIds, not Nodes - things like sort need the full node.

> - Does it matter that Bindings coming out of the deserializer
> will be flat and lose any notion of their original types?

No, it shouldn't. The parent mechanism is used because, in query processing, adding bindings is done by sharing the previous results, avoiding a copy and saving some space.

> - Should the Binding that comes out of the deserializer be
> immutable or should it properly implement .add() and .addAll()
> (for the SPARQL Update case it can definitely be immutable,
> but I'm not sure if it needs to be elsewhere in the query
> execution process)?

Immutable is probably fine. It's not possible to "set" a binding currently, only add one. Once assigned, it can't be changed. The general style is that some stage creates the binding or extends its input, finishes its work, and then the binding is not changed after that stage.

	Andy

On 27/01/11 23:35, Stephen Allen wrote:
> 1) Serialize the Binding objects as they are generated and before they
> are applied to the triple template(s). Two methods of doing so are:
>
> 1a) Create a Serializable Binding object and use Java's
> ObjectOutputStream. Here I could check to see if the Binding object
> implemented Serializable already and just write it out, or copy it into
> a new Serializable Binding object if it didn't. This would allow stores
> to serialize the binding object themselves, which could be of benefit
> to systems like TDB, which would store its internal NodeIds instead of
> Nodes (some mechanism of passing the serialized Binding object type and
> other important objects, such as the NodeTable reference, around the
> serialization gap would probably be needed). I would also have to make
> new Serializable Node objects to parallel the Node subclasses (or
> modify the existing ones to use Serializable instead of Object in the
> "label" field).
>
> 1b) Implement a custom serializer for Binding and Node_* objects. Could
> be binary or XML based. Maybe leveraging the
> com.hp.hpl.jena.sparql.resultset.XMLOutputResultSet class if we wanted
> to use XML.
>
> 2) Serialize the generated Triples after applying the Bindings to the
> insert/delete templates. This has the benefit of using a slightly
> modified N-Quads serializer/deserializer (changed to restore blank
> nodes back to their internal Jena IDs). A further optimization would be
> to wrap this in a compressed input/output stream.
>
> I'm not sure which approach would be better for space efficiency; I
> guess it would really depend on the specific query as to whether the
> list of bindings or the list of triples would be larger or smaller. As
> of now it seems like 2) would be slightly easier to implement, since I
> wouldn't have to create a serializer/deserializer. However, it has the
> drawback of being less general; it also forces the generated triples to
> be materialized to Nodes, and would mean that store implementations
> would not be able to leverage it if they wanted to generate triples of
> NodeIds when applying the templates. Also, it could be fragile in
> relying on internal blank node ids passing through the RDF writer and
> reader. 1a) does not look too difficult if I can make Node
> serializable, but then this change affects both Jena and ARQ. However,
> having serializable binding objects has the potential benefit of being
> useful for other parts of query execution that could be memory bound
> (sort, join, distinct, group by, etc.).
>
>
> I would like to tackle option 1a), but I have a few questions:
>
> - I want to make sure that there would be no major adverse effects from
> making the Node classes Serializable and the Node label field
> Serializable.
>
> - The Binding.getParent() method. What is this used for? I think I can
> ignore this and store just the results of .vars() and .get(var) for
> each variable, since these will retrieve any required info from the
> parents as necessary.
>
> - Does it matter that Bindings coming out of the deserializer will be
> flat and lose any notion of their original types?
>
> - Should the Binding that comes out of the deserializer be immutable or
> should it properly implement .add() and .addAll() (for the SPARQL
> Update case it can definitely be immutable, but I'm not sure if it
> needs to be elsewhere in the query execution process)?
