I just updated the readme at https://github.com/Claudenw/jena-on-cassandra/blob/master/README.md to cover this question.
Basically, I put the data into 4 tables (assuming that storage is cheap) and added 3 indexes to each of those. The primary index columns (g, s, p, and o) are always populated; the other 3 indexes are populated when appropriate. Deletes and inserts are done with separate threads since we are assuming eventual consistency anyway.

Claude

On Tue, Sep 5, 2017 at 3:40 PM, <aj...@apache.org> wrote:

> The requirements for distributed storage are actually that DRAS-TIC (see
> that grant description) be used, and DRAS-TIC is 100% based around
> Cassandra, so effectively, the requirement is that Cassandra be used, at
> least at core. So part of what I am wondering (if it's not obvious) is "If
> we're going to have a Cassandra cluster as part of this, how can we get as
> much mileage as possible out of it?"
>
> I know that Cassandra offers some ordering capabilities out-of-the-box,
> although I'm not familiar with them. Maybe they could be used to support
> merge join generally.
>
> CumulusRDF (as shown in that paper I forwarded) uses a structure in which
> they mostly leave column values empty. The information is stored entirely
> in the keys, and use is made of prefix lookup. Does your system do
> something like that, Claude? It sounds like you are storing tuple component
> in the column values.
>
> ajs6f
>
> Andy Seaborne wrote on 9/5/17 4:43 AM:
>
>> On Mon, Sep 4, 2017 at 12:10 PM, <aj...@apache.org> wrote:
>>>>
>>>> Little of both? :grin:
>>>>>
>>>>> Primarily I am interested because of a grant [1] in which the
>>>>> Smithsonian Institution (where I work) is participating in a
>>>>> supporting role (partly because I convinced us to). That work
>>>>> involves using Cassandra for distributed storage, and it will also
>>>>> involve a distributed LDP implementation (the Fedora API referred to
>>>>> in that grant description is really just a packaging of Memento [2]
>>>>> with LDP [3]), hence my interest in jena-on-cassandra.
>>
>> Turning this round - what are the requirements for the distributed
>> storage?
>>
>>>>> As I understand the join question, the usual move with Cassandra is to
>>>>> denormalize and store the joined data together, but that's obviously
>>>>> nontrivial in our situation, where we don't know the potential queries.
>>>>> Have you looked at an indexing solution such as was used by CumulusRDF
>>>>> [4]?
>>
>> (single graph example)
>>
>> If Cassandra has stored PSO and POS then parallel merge joins are
>> possible.
>>
>> Andy
>>
>>>>> ajs6f
>>>>>
>>>>> [1] https://www.imls.gov/grants/awarded/lg-71-17-0159-17
>>>>> [2] http://www.mementoweb.org/guide/quick-intro/
>>>>> [3] https://www.w3.org/TR/ldp/
>>>>> [4] http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/SSWS/Ladwig-et-all-SSWS2011.pdf
>>>>>
>>>>> Claude Warren wrote on 9/2/17 12:44 PM:
>>>>>
>>>>> are you looking to use jena-on-cassandra or do you have ideas? what
>>>>> leads you to ask about it?
>>>>>>
>>>>>> On Sat, Sep 2, 2017 at 1:21 PM, <aj...@apache.org> wrote:
>>>>>>
>>>>>> Hey, Claude--
>>>>>>>
>>>>>>> Just curious as to where
>>>>>>> https://github.com/Claudenw/jena-on-cassandra
>>>>>>> has ended up. Is that still work-in-progress?
>>>>>>>
>>>>>>> --
>>>>>>> ajs6f
>>>>
>>>> --
>>>> I like: Like Like - The likeliest place on the web
>>>> <http://like-like.xenei.com>
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren