Hi Andy,

Yes, I can submit a patch for this to JIRA. Do you need a contributor agreement from me now that Jena is part of Apache?
I'm considering a couple of different designs; I'd appreciate any feedback you might have.

1) Serialize the Binding objects as they are generated, before they are applied to the triple template(s). Two ways of doing so:

1a) Create a Serializable Binding object and use Java's ObjectOutputStream. Here I could check whether the Binding object already implements Serializable and just write it out, or copy it into a new Serializable Binding object if it doesn't. This would allow stores to serialize the binding object themselves, which could benefit systems like TDB, which could store its internal NodeIds instead of Nodes (some mechanism for passing the serialized Binding object type and other important objects, such as the NodeTable reference, across the serialization gap would probably be needed). I would also have to create new Serializable Node objects to parallel the Node subclasses (or modify the existing ones to use Serializable instead of Object in the "label" field).

1b) Implement a custom serializer for Binding and Node_* objects. This could be binary or XML based, perhaps leveraging the com.hp.hpl.jena.sparql.resultset.XMLOutputResultSet class if we wanted to use XML.

2) Serialize the generated Triples after applying the Bindings to the insert/delete templates. This has the benefit of reusing a slightly modified N-Quads serializer/deserializer (changed to restore blank nodes to their internal Jena IDs). A further optimization would be to wrap this in a compressed input/output stream.

I'm not sure which approach would be more space efficient; it really depends on the specific query whether the list of bindings or the list of triples is larger. As of now, option 2) seems slightly easier to implement since I wouldn't have to create a serializer/deserializer.
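To make option 1a) concrete, here is a minimal sketch of the round-trip through Java serialization. SerializableBinding is a hypothetical stand-in (not an actual Jena/ARQ class), and it assumes a binding has already been flattened to variable-name/node-label pairs:

```java
import java.io.*;
import java.util.*;

// Sketch of option 1a: copy a binding into a serializable holder and
// round-trip it through Java's ObjectOutputStream/ObjectInputStream.
// SerializableBinding is a hypothetical stand-in, not an ARQ class.
public class BindingSerializationSketch {

    // Stand-in for a flattened binding: variable name -> node label.
    static class SerializableBinding implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, String> varToNode = new LinkedHashMap<>();
    }

    // Serialize a binding to a byte array.
    static byte[] write(SerializableBinding b) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(b);
        }
        return bytes.toByteArray();
    }

    // Deserialize a binding back from a byte array.
    static SerializableBinding read(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (SerializableBinding) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SerializableBinding b = new SerializableBinding();
        b.varToNode.put("s", "http://example.org/subject");
        SerializableBinding copy = read(write(b));
        System.out.println(copy.varToNode.get("s"));
    }
}
```

A real implementation would of course serialize actual Node values (hence the question about making the Node label field Serializable) rather than plain strings.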
However, option 2) has the drawback of being less general: it forces the generated triples to be materialized to Nodes, which means store implementations could not leverage it if they wanted to generate triples of NodeIds when applying the templates. It could also be fragile in relying on internal blank node IDs passing through the RDF writer and reader.

Option 1a) does not look too difficult if I can make Node serializable, but then this change affects both Jena and ARQ. On the other hand, serializable binding objects could also be useful for other parts of query execution that can be memory bound (sort, join, distinct, group by, etc.).

I would like to tackle option 1a), but I have a few questions:

- I want to make sure there would be no major adverse effects from making the Node classes Serializable and the Node label field Serializable.
- The Binding.getParent() method: what is this used for? I think I can ignore it and store just the results of .vars() and of .get(var) for each variable, since these will retrieve any required info from the parents as necessary.
- Does it matter that Bindings coming out of the deserializer will be flat and lose any notion of their original types?
- Should the Binding that comes out of the deserializer be immutable, or should it properly implement .add() and .addAll()? (For the SPARQL Update case it can definitely be immutable, but I'm not sure whether that holds elsewhere in the query execution process.)

Unrelated to the above, but also dealing with the SPARQL Update implementation: I'd like to add a method to either the GraphStore or DatasetGraph interface that creates and adds a named graph. This would be useful for the SPARQL Update CREATE operation so that a native graph could be created.
The current mechanism of creating a default in-memory Jena graph and adding it to the GraphStore works, but it seems a little ugly: extra work goes into creating an object that just gets iterated over and thrown away if the store creates its own graph object to replace it during the add call. Another benefit would be giving users of the graph store a standard way to create new graphs that are native to the GraphStore.

-Stephen

-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Wednesday, January 26, 2011 3:17 PM
To: [email protected]
Subject: Re: SPARQL 1.1 Update in ARQ 2.8.7

On 26/01/11 15:19, Stephen Allen wrote:
> Hi,
>
> I am working on updating Parliament to ARQ 2.8.7 from 2.8.5. I've noticed
> that there are now two parallel SPARQL/Update mechanisms [1]. I'm guessing
> the "submission" package refers to the SPARQL Update member submission [2]
> and "request" is the new support added for the SPARQL 1.1 Update working
> draft [3]?

Don't use "submission" - it's legacy, and it does not exist in the development version. At 2.8.7, SPARQL 1.1 Update is the update language, with some syntax support for the submission language in the SPARQL 1.1 Update parser; so far, this has proved adequate. This is what you get for syntaxARQ, and syntaxARQ is the default. At some point, it is likely that SPARQL 1.1 strict becomes the default and an app would need to ask for syntaxARQ, as it is with query.

> I'd like to implement the new mechanism for Parliament. Previously, I was
> able to subclass UpdateProcessorVisitor (now
> UpdateProcessorSubmissionVisitor) in order to provide my own implementation
> for certain methods. As an example, Parliament is implemented as a
> collection of triple stores, so it can safely read from one graph while
> writing to another one (and thus avoid buffering all statements in an
> ArrayList).
> Also ARQ currently stores all WHERE clause bindings in an
> ArrayList during an insert/delete operation, but I would like to make this
> more memory efficient for large updates by serializing bindings to disk in a
> temporary file (after it passes a threshold).

I'd like to do exactly that for ARQ - could you submit a patch to the Apache Jena JIRA for incorporation into the main code base?

> With an eye towards not copying a lot of ARQ code into my codebase, would it
> be possible to change the class access modifier of
> com.hp.hpl.jena.sparql.modify.UpdateEngineWorker to public instead of
> package-private and make some of the private methods protected instead (also
> com.hp.hpl.jena.sparql.modify.NodeTransformBNodesToVariables)?

Certainly - done in SourceForge SVN, and a new 2.8.8-SNAPSHOT is available with the changes. Let's identify which operations should be protected and which private - I made them all protected for now.

http://openjena.org/repo-dev/com/hp/hpl/jena/arq/

This includes a zip distribution as well as the usual Maven artifacts.

The current implementation is a bit "direct" in places - the buffering in ArrayList being a good example. It makes as few assumptions about the storage layer as possible, but clearly that generality comes at a potential cost. UpdateEngineWorker is a step towards an extension mechanism. I'd like to identify a set of "update ops" that can be used to build each of the SPARQL Update request types, so an implementation can add varying degrees of efficiency for the amount of work needed. If you have any insights here, I'd very much appreciate hearing them. (You have presumably found the registry UpdateEngineRegistry - it all parallels the query engine extension design.)

> Thanks,
> Stephen
>
> P.S. I note that the following SPARQL/Update functions are specified in your
> implementation/grammar: ADD, MOVE, COPY. However I don't see them in the
> latest working draft [3]. Presumably they are coming in the future?
They were only agreed at about the time of the last publication, and they are missing from the editors' working draft as well; I've just added a note to the SPARQL-WG wiki as work items that need to be done. Thanks for catching this.

The syntax rules are:

  ADD  SILENT? GraphOrDefault TO GraphOrDefault
  MOVE SILENT? GraphOrDefault TO GraphOrDefault
  COPY SILENT? GraphOrDefault TO GraphOrDefault

  GraphOrDefault ::= DEFAULT | GRAPH? IRIref

Given the separate

	Andy

> [1] "com.hp.hpl.jena.sparql.modify.request" and
> "com.hp.hpl.jena.sparql.modify.submission"
> [2] http://www.w3.org/Submission/SPARQL-Update/
> [3] http://www.w3.org/TR/sparql11-update/
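[Editor's note] The spill-to-disk buffering both messages discuss (keep bindings in memory until a threshold, then serialize the overflow to a temporary file) could be sketched roughly as below. SpillBuffer and its methods are illustrative names, not ARQ APIs, and a production version would stream spilled items back rather than rebuilding a list:

```java
import java.io.*;
import java.util.*;

// Sketch of threshold-based spill-to-disk buffering. Items are kept in an
// in-memory list until `threshold` is reached; further items are written to
// a temporary file via Java serialization. Illustrative only, not ARQ code.
public class SpillBuffer<T extends Serializable> implements Closeable {
    private final int threshold;
    private final List<T> memory = new ArrayList<>();
    private File spillFile;
    private ObjectOutputStream spillOut;
    private int spilledCount = 0;

    public SpillBuffer(int threshold) { this.threshold = threshold; }

    public void add(T item) throws IOException {
        if (memory.size() < threshold) {
            memory.add(item);            // still under the threshold
        } else {
            if (spillOut == null) {      // first overflow item: open temp file
                spillFile = File.createTempFile("spill", ".bin");
                spillFile.deleteOnExit();
                spillOut = new ObjectOutputStream(
                    new BufferedOutputStream(new FileOutputStream(spillFile)));
            }
            spillOut.writeObject(item);
            spilledCount++;
        }
    }

    // Replays items in insertion order: the in-memory head, then the spilled tail.
    @SuppressWarnings("unchecked")
    public Iterator<T> iterator() throws IOException {
        if (spillOut != null) spillOut.flush();
        List<T> all = new ArrayList<>(memory);
        if (spillFile != null) {
            try (ObjectInputStream in = new ObjectInputStream(
                    new BufferedInputStream(new FileInputStream(spillFile)))) {
                for (int i = 0; i < spilledCount; i++) all.add((T) in.readObject());
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }
        return all.iterator();
    }

    @Override
    public void close() throws IOException {
        if (spillOut != null) spillOut.close();
    }

    public static void main(String[] args) throws Exception {
        try (SpillBuffer<String> buf = new SpillBuffer<>(2)) {
            buf.add("a"); buf.add("b"); buf.add("c"); // "c" spills to disk
            Iterator<String> it = buf.iterator();
            while (it.hasNext()) System.out.println(it.next());
        }
    }
}
```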
