On 31/10/16 13:41, Claude Warren wrote:
Andy,
This seems like a good approach but does not appear to be in the Jena code
base, which I suppose is the point of your comment about an approach to
developing work.
Does it make sense to create git clones that contain the new work? Or
perhaps branches?
Do you have a suggestion or direction you would like to see this go?
That's the discussion to have. The first item is "Community". This is
all new code? Who is involved? Just you so far?
A storage layer is not trivial - this is not an "extra" thing. It is a
module of its own, and if the community is significantly different,
maybe different mailing lists (e.g. Solr within the Lucene project),
maybe even a different project; it can be "straight to TLP" or
"incubated" - that depends on who is involved. There is a wide set of
possibilities.
If it is starting off, then the Jena git repo isn't a good place to have
the code. The lifecycles don't line up.
A branch that is completely separate is really a separate repo. Jena can
get another git repo.
What would be the release cycle?
The real issue is the work needed by the PMC for releases.
To get all options mentioned:
If this is a one-person effort for now, then starting a github repo and
creating the initial sketch/framework is an option. More focused. More
freedom to try things out and change directions.
Andy
Claude
On Fri, Oct 28, 2016 at 2:35 PM, Andy Seaborne <a...@apache.org> wrote:
Claude,
These may help:
I have been thinking about an interface that is more oriented to the
storage than the full DatasetGraph.
StorageRDF breaks down all the operations into those on the default graph
and those on named graphs. For just a graph, simply ignore the named graph
operations.
https://github.com/afs/AFS-Dev/blob/master/src/main/java/projects/dsg2/storage/StorageRDF.java
There is an adapter to the DatasetGraph hierarchy (which is needed for
SPARQL):
https://github.com/afs/AFS-Dev/blob/master/src/main/java/projects/dsg2/DatasetGraphStorage.java
If you want to use only existing classes, DatasetGraphTriplesQuads is the
place to start - it is used by TIM and TDB - and you can implement it
without needing quads/named graphs. Again, simply ignore the named graph
calls (throw UnsupportedOperationException for them).
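(For illustration only: a rough sketch of the "triples-only" pattern
described above - a class shaped like DatasetGraphTriplesQuads that
implements the default-graph operations and rejects the named-graph
ones. The class and method names here are placeholders, not the actual
Jena API.)

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for a triples-only storage class. The real Jena
// class (DatasetGraphTriplesQuads) has a richer API; this only shows the
// pattern: implement the default-graph calls, reject the named-graph
// (quad) calls with UnsupportedOperationException.
class TriplesOnlyStore {
    // A triple is represented here as a 3-element list of term strings.
    private final Set<List<String>> triples = new HashSet<>();

    // Default graph operations: fully supported.
    public void addTriple(String s, String p, String o) {
        triples.add(List.of(s, p, o));
    }
    public void deleteTriple(String s, String p, String o) {
        triples.remove(List.of(s, p, o));
    }
    public boolean containsTriple(String s, String p, String o) {
        return triples.contains(List.of(s, p, o));
    }

    // Named graph operations: not supported in a triples-only store.
    public void addQuad(String g, String s, String p, String o) {
        throw new UnsupportedOperationException("Named graphs not supported");
    }
    public void deleteQuad(String g, String s, String p, String o) {
        throw new UnsupportedOperationException("Named graphs not supported");
    }
}
```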
Going the graph route could lead to rework later on for any kind of
performance issue, because find(S,P,O) is so narrow and precludes union
default graph except by brute force. DatasetGraph works directly with
the SPARQL execution engine.
We still need to discuss how best to approach developing work - it should
not get sucked up by the release cycle.
Andy
On 26/10/16 19:21, Claude Warren wrote:
My plan is to start with a Graph implementation. We expect to write 3
tables: SPO, POS, OPS (I think). Currently we don't have an easy way to
handle find(ANY, ANY, ANY), so I suspect we will just start by
permitting a column scan on Cassandra.
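(Purely illustrative - the table names and the null-for-ANY convention
are assumptions, not the project's actual code.) One way to sketch how
those three permutation tables might be chosen at query time: given a
find(s, p, o) pattern, pick the table whose leading column is bound, and
fall back to a full scan when nothing is bound:

```java
// Illustrative index selection for a triple store with three permutation
// tables (SPO, POS, OPS). A null term stands for ANY/unbound; the table
// names are assumptions for illustration only.
class IndexChooser {
    /** Choose which table can answer find(s, p, o) with a key-prefix lookup. */
    static String choose(String s, String p, String o) {
        if (s != null) return "SPO";   // subject bound: prefix scan on SPO
        if (p != null) return "POS";   // predicate bound: prefix scan on POS
        if (o != null) return "OPS";   // object bound: prefix scan on OPS
        return "FULL_SCAN";            // find(ANY, ANY, ANY): full column scan
    }
}
```

Any remaining bound terms that are not part of the chosen prefix would
be filtered out of the scan results afterwards.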
I have not looked at DynamoDB but as I recall there are significant
differences under the hood.
I expect that we will move on to a custom model or query engine to get the
best performance but that is not what we are planning for the first cut.
I am still waiting for management approval to do this at work ....
sometimes it takes longer to get the paperwork done than it does to design
the thing.
Claude
On Mon, Oct 17, 2016 at 6:39 PM, Paul Houle <paul.ho...@ontology2.com>
wrote:
I like DynamoDB as a target for this sort of thing. There are many
tasks which are small-scale yet critical where it would otherwise be
hard to provide a distributed and reliable database. Put that together
with Lambda, which does the same for computation, and you are cooking
with gas.
I wrote a 1-1 translation of DynamoDB documents to RDF that I use
throughout an application; the code is DynamoDB idiomatic in every way,
just the application reads and writes (a constrained set of) RDF
documents.
Right now I dump the documents from the DynamoDB system into a triple
store when I want a panoptic view, but a distributed graph like that
would mean being able to run SPARQL queries against DynamoDB directly.
There are many products in the same family as Cassandra and DynamoDB and
it would be good to think through the math so we can approach them all
in a similar way.
--
Paul Houle
paul.ho...@ontology2.com
On Mon, Oct 17, 2016, at 12:31 PM, A. Soroka wrote:
Yep,
http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/SSWS/Ladwig-et-all-SSWS2011.pdf
indicates that they are indexing by subject. As someone who has
implemented LDP, I'd say that is definitely the approach that makes
sense there.
---
A. Soroka
The University of Virginia Library
On Oct 17, 2016, at 12:20 PM, Andy Seaborne <a...@apache.org> wrote:
IIRC It stores CBDs indexed by subject so it is the "other" model to
Rya. Better for LDP (??).
Andy
On 17/10/16 15:41, A. Soroka wrote:
There's also:
https://github.com/cumulusrdf/cumulusrdf
in a similar vein (RDF over Cassandra). Not sure what kind of
particular uses it expects to support.
---
A. Soroka
The University of Virginia Library
On Oct 17, 2016, at 7:02 AM, Andy Seaborne <a...@apache.org> wrote:
Hi Claude,
There is certainly interest from me.
What the best thing to do is depends on various factors. By putting it
in extras I presume you mean it gets added to the release? That is not
the only way forward.
An important aspect of Apache is "Community over code" - will there
be a community around this code? Is that community the same as, or does
it significantly overlap with, the Jena community?
There are various reasons for wanting RDF over a column store -
which use cases are the most important for this work?
They lead to different ways of using Cassandra. For example,
Rya (incubating) uses Accumulo tables as indexes, and partial scans of
the tables are streamed. Other systems try to use the columns for
properties, which is possibly more useful for LDP-style access than
SPARQL.
Andy
On 15/10/16 18:38, Claude Warren wrote:
Howdy,
We have a project at work that is implementing Jena Graph on Cassandra.
I am wondering if there is enough interest here to accept it as a
contribution. I was thinking that it might fit in the Extras category.
I cannot promise release of the code yet, as I have to present it to our
internal Intellectual Property group first.
Thoughts?
Claude