Hi Andy,

Yes, I can submit a patch for this to JIRA. Do you need a contributor agreement from me now that Jena is part of Apache?
I'm considering a couple of different designs; I'd appreciate any feedback you might have.

1) Serialize the Binding objects as they are generated, before they are applied to the triple template(s). Two ways of doing so:

1a) Create a Serializable Binding object and use Java's ObjectOutputStream. Here I could check whether the Binding object already implements Serializable and just write it out, or copy it into a new Serializable Binding object if it doesn't. This would allow stores to serialize the binding object themselves, which could benefit systems like TDB, which could store its internal NodeIds instead of Nodes (some mechanism for passing the serialized Binding object type and other important objects, such as the NodeTable reference, across the serialization gap would probably be needed). I would also have to create new Serializable Node objects to parallel the Node subclasses (or modify the existing ones to use Serializable instead of Object in the "label" field).

1b) Implement a custom serializer for Binding and Node_* objects. This could be binary or XML based, perhaps leveraging the com.hp.hpl.jena.sparql.resultset.XMLOutputResultSet class if we wanted to use XML.

2) Serialize the generated Triples after applying the Bindings to the insert/delete templates. This has the benefit of reusing a slightly modified N-Quads serializer/deserializer (changed to restore blank nodes to their internal Jena IDs). A further optimization would be to wrap this in a compressed input/output stream.

I'm not sure which approach would be more space efficient; it really depends on the specific query whether the list of bindings or the list of triples is larger. As of now, option 2) seems slightly easier to implement since I wouldn't have to create a serializer/deserializer.
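To make option 1a) concrete, here is a minimal sketch of the round-trip through Java serialization. SerializableBinding is a hypothetical stand-in (not an actual Jena/ARQ class), and it assumes a binding has already been flattened to variable-name/node-label pairs:

```java
import java.io.*;
import java.util.*;

// Sketch of option 1a: copy a binding into a serializable holder and
// round-trip it through Java's ObjectOutputStream/ObjectInputStream.
// SerializableBinding is a hypothetical stand-in, not an ARQ class.
public class BindingSerializationSketch {

    // Stand-in for a flattened binding: variable name -> node label.
    static class SerializableBinding implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, String> varToNode = new LinkedHashMap<>();
    }

    // Serialize a binding to a byte array.
    static byte[] write(SerializableBinding b) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(b);
        }
        return bytes.toByteArray();
    }

    // Deserialize a binding back from a byte array.
    static SerializableBinding read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SerializableBinding) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SerializableBinding b = new SerializableBinding();
        b.varToNode.put("s", "http://example.org/subject");
        SerializableBinding copy = read(write(b));
        System.out.println(copy.varToNode.get("s"));
    }
}
```

A real implementation would of course serialize actual Node values (hence the question about making the Node label field Serializable) rather than plain strings.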
However, option 2) has the drawback of being less general: it forces the generated triples to be materialized to Nodes, which means store implementations could not leverage it if they wanted to generate triples of NodeIds when applying the templates. It could also be fragile in relying on internal blank node IDs passing through the RDF writer and reader.

Option 1a) does not look too difficult if I can make Node serializable, but then this change affects both Jena and ARQ. On the other hand, serializable binding objects could also be useful for other parts of query execution that can be memory bound (sort, join, distinct, group by, etc.).

I would like to tackle option 1a), but I have a few questions:

- I want to make sure there would be no major adverse effects from making the Node classes Serializable and the Node label field Serializable.
- The Binding.getParent() method: what is this used for? I think I can ignore it and store just the results of .vars() and of .get(var) for each variable, since these will retrieve any required info from the parents as necessary.
- Does it matter that Bindings coming out of the deserializer will be flat and lose any notion of their original types?
- Should the Binding that comes out of the deserializer be immutable, or should it properly implement .add() and .addAll()? (For the SPARQL Update case it can definitely be immutable, but I'm not sure whether that holds elsewhere in the query execution process.)

Unrelated to the above, but also dealing with the SPARQL Update implementation: I'd like to add a method to either the GraphStore or DatasetGraph interface that creates and adds a named graph. This would be useful for the SPARQL Update CREATE operation so that a native graph could be created.
The current mechanism of creating a default in-memory Jena graph and adding it to the GraphStore works, but it seems a little ugly: extra work goes into creating an object that just gets iterated over and thrown away if the store creates its own graph object to replace it during the add call. Another benefit would be giving users of the graph store a standard way to create new graphs that are native to the GraphStore.

-Stephen

-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Wednesday, January 26, 2011 3:17 PM
To: [email protected]
Subject: Re: SPARQL 1.1 Update in ARQ 2.8.7

On 26/01/11 15:19, Stephen Allen wrote:
> Hi,
>
> I am working on updating Parliament to ARQ 2.8.7 from 2.8.5. I've noticed
> that there are now two parallel SPARQL/Update mechanisms [1]. I'm guessing
> the "submission" package refers to the SPARQL Update member submission [2]
> and "request" is the new support added for the SPARQL 1.1 Update working
> draft [3]?

Don't use "submission" - it's legacy, and it does not exist in the development version. At 2.8.7, SPARQL 1.1 Update is the update language, with some syntax support for the submission language in the SPARQL 1.1 Update parser; so far, this has proved adequate. This is what you get for syntaxARQ, and syntaxARQ is the default. At some point, it is likely that SPARQL 1.1 strict becomes the default and an app would need to ask for syntaxARQ, as it is with query.

> I'd like to implement the new mechanism for Parliament. Previously, I was
> able to subclass UpdateProcessorVisitor (now
> UpdateProcessorSubmissionVisitor) in order to provide my own implementation
> for certain methods. As an example, Parliament is implemented as a
> collection of triple stores, so it can safely read from one graph while
> writing to another one (and thus avoid buffering all statements in an
> ArrayList).
> Also ARQ currently stores all WHERE clause bindings in an
> ArrayList during an insert/delete operation, but I would like to make this
> more memory efficient for large updates by serializing bindings to disk in a
> temporary file (after it passes a threshold).

I'd like to do exactly that for ARQ - could you submit a patch to the Apache Jena JIRA for incorporation into the main code base?

> With an eye towards not copying a lot of ARQ code into my codebase, would it
> be possible to change the class access modifier of
> com.hp.hpl.jena.sparql.modify.UpdateEngineWorker to public instead of
> package-private and make some of the private methods protected instead (also
> com.hp.hpl.jena.sparql.modify.NodeTransformBNodesToVariables)?

Certainly - done in SourceForge SVN, and a new 2.8.8-SNAPSHOT is available with the changes. Let's identify which operations should be protected and which private - I made them all protected for now.

http://openjena.org/repo-dev/com/hp/hpl/jena/arq/

This includes a zip distribution as well as the usual Maven artifacts.

The current implementation is a bit "direct" in places - the buffering in ArrayList being a good example. It makes as few assumptions about the storage layer as possible, but clearly that generality comes at a potential cost. UpdateEngineWorker is a step towards an extension mechanism. I'd like to identify a set of "update ops" that can be used to build each of the SPARQL Update request types, so an implementation can add varying degrees of efficiency for the amount of work needed. If you have any insights here, I'd very much appreciate hearing them. (You have presumably found the registry UpdateEngineRegistry - it all parallels the query engine extension design.)

> Thanks,
> Stephen
>
> P.S. I note that the following SPARQL/Update functions are specified in your
> implementation/grammar: ADD, MOVE, COPY. However I don't see them in the
> latest working draft [3]. Presumably they are coming in the future?
They were only agreed at about the time of the last publication, and they are missing from the editors' working draft as well; I've just added a note to the SPARQL-WG wiki as work items that need to be done. Thanks for catching this.

The syntax rules are:

  ADD  SILENT? GraphOrDefault TO GraphOrDefault
  MOVE SILENT? GraphOrDefault TO GraphOrDefault
  COPY SILENT? GraphOrDefault TO GraphOrDefault

  GraphOrDefault ::= DEFAULT | GRAPH? IRIref

Given the separate

	Andy

> [1] "com.hp.hpl.jena.sparql.modify.request" and
> "com.hp.hpl.jena.sparql.modify.submission"
> [2] http://www.w3.org/Submission/SPARQL-Update/
> [3] http://www.w3.org/TR/sparql11-update/
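[Editor's note] The spill-to-disk buffering both messages discuss (keep bindings in memory until a threshold, then serialize the overflow to a temporary file) could be sketched roughly as below. SpillBuffer and its methods are illustrative names, not ARQ APIs, and a production version would stream spilled items back rather than rebuilding a list:

```java
import java.io.*;
import java.util.*;

// Sketch of threshold-based spill-to-disk buffering. Items are kept in an
// in-memory list until `threshold` is reached; further items are written to
// a temporary file via Java serialization. Illustrative only, not ARQ code.
public class SpillBuffer<T extends Serializable> implements Closeable {
    private final int threshold;
    private final List<T> memory = new ArrayList<>();
    private File spillFile;
    private ObjectOutputStream spillOut;
    private int spilledCount = 0;

    public SpillBuffer(int threshold) { this.threshold = threshold; }

    public void add(T item) throws IOException {
        if (memory.size() < threshold) {
            memory.add(item);            // still under the threshold
        } else {
            if (spillOut == null) {      // first overflow item: open temp file
                spillFile = File.createTempFile("spill", ".bin");
                spillFile.deleteOnExit();
                spillOut = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(spillFile)));
            }
            spillOut.writeObject(item);
            spilledCount++;
        }
    }

    // Replays items in insertion order: the in-memory head, then the spilled tail.
    @SuppressWarnings("unchecked")
    public Iterator<T> iterator() throws IOException {
        if (spillOut != null) spillOut.flush();
        List<T> all = new ArrayList<>(memory);
        if (spillFile != null) {
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)))) {
                for (int i = 0; i < spilledCount; i++) all.add((T) in.readObject());
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }
        return all.iterator();
    }

    @Override
    public void close() throws IOException {
        if (spillOut != null) spillOut.close();
    }

    public static void main(String[] args) throws Exception {
        try (SpillBuffer<String> buf = new SpillBuffer<>(2)) {
            buf.add("a"); buf.add("b"); buf.add("c"); // "c" spills to disk
            Iterator<String> it = buf.iterator();
            while (it.hasNext()) System.out.println(it.next());
        }
    }
}
```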
