I just updated the readme at
https://github.com/Claudenw/jena-on-cassandra/blob/master/README.md to
cover this question.

Basically, I put the data into 4 tables (assuming that storage is cheap)
and added 3 indexes to each of those.  The primary index columns (g, s, p,
and o) are always populated; the other 3 indexes are populated when
appropriate.
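
To make the idea concrete, here is a minimal sketch of how a lookup could pick one of the four tables from the bound parts of a quad pattern. The table names and key orders below (spog, posg, ospg, gspo) are assumptions for illustration only, not necessarily the layout described in the README; the only thing taken from above is that there is one table led by each of s, p, o, and g.

```python
from typing import Optional, Tuple

# Hypothetical tables, each named after its (assumed) key order.
TABLES = ("spog", "posg", "ospg", "gspo")


def choose_table(s: Optional[str] = None, p: Optional[str] = None,
                 o: Optional[str] = None, g: Optional[str] = None) -> Tuple[str, int]:
    """Pick the table whose key order starts with the longest run of
    bound components. Returns (table name, length of that run)."""
    bound = {"s": s, "p": p, "o": o, "g": g}
    best_table, best_len = TABLES[0], 0
    for table in TABLES:
        run = 0
        for component in table:
            if bound[component] is None:
                break  # first unbound component ends the usable key prefix
            run += 1
        if run > best_len:
            best_table, best_len = table, run
    return best_table, best_len
```

For example, a pattern with only the predicate bound would go to the p-led table, while a pattern with subject and object bound would go to the o-led table because its assumed key order continues with s.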

Deletes and inserts are done with separate threads since we are assuming
eventual consistency anyway.

Claude

On Tue, Sep 5, 2017 at 3:40 PM, <aj...@apache.org> wrote:

> The requirements for distributed storage are actually that DRAS-TIC (see
> that grant description) be used, and DRAS-TIC is 100% based around
> Cassandra, so effectively, the requirement is that Cassandra be used, at
> least at core. So part of what I am wondering (if it's not obvious) is "If
> we're going to have a Cassandra cluster as part of this, how can we get as
> much mileage as possible out of it?"
>
> I know that Cassandra offers some ordering capabilities out-of-the-box,
> although I'm not familiar with them. Maybe they could be used to support
> merge join generally.
>
> CumulusRDF (as shown in that paper I forwarded) uses a structure in which
> they mostly leave column values empty. The information is stored entirely
> in the keys, and use is made of prefix lookup. Does your system do
> something like that, Claude? It sounds like you are storing tuple component
> in the column values.
>
>
> ajs6f
>
> Andy Seaborne wrote on 9/5/17 4:43 AM:
>
>
>> On Mon, Sep 4, 2017 at 12:10 PM, <aj...@apache.org> wrote:
>>>>
>>>> Little of both? :grin:
>>>>>
>>>>> Primarily I am interested because of a grant [1] in which the
>>>>> Smithsonian
>>>>> Institution (where I work) is participating in a supporting role
>>>>> (partly
>>>>> because I convinced us to). That work involves using Cassandra for
>>>>> distributed storage, and it will also involve a distributed LDP
>>>>> implementation (the Fedora API referred to in that grant description is
>>>>> really just a packaging of Memento [2] with LDP [3]), hence my
>>>>> interest in
>>>>> jena-on-cassandra.
>>>>>
>>>>
>> Turning this round - what are the requirements for the distributed
>> storage?
>>
>> As I understand the join question, the usual move with Cassandra is to
>>>>> denormalize and store the joined data together, but that's obviously
>>>>> nontrivial in our situation, where we don't know the potential queries.
>>>>> Have you looked at an indexing solution such as was used by CumulusRDF
>>>>> [4]?
>>>>>
>>>>
>> (single graph example)
>>
>> If Cassandra has stored PSO and POS then parallel merge joins are
>> possible.
>>
>>     Andy
>>
>>
>>>>> ajs6f
>>>>>
>>>>> [1] https://www.imls.gov/grants/awarded/lg-71-17-0159-17
>>>>> [2] http://www.mementoweb.org/guide/quick-intro/
>>>>> [3] https://www.w3.org/TR/ldp/
>>>>> [4] http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/SSWS/Ladwig-et-all-SSWS2011.pdf
>>>>>
>>>>> Claude Warren wrote on 9/2/17 12:44 PM:
>>>>>
>>>>> are you looking to use jena-on-cassandra or do you have ideas?  what
>>>>> leads
>>>>>
>>>>>> you to ask about it?
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 2, 2017 at 1:21 PM, <aj...@apache.org> wrote:
>>>>>>
>>>>>> Hey, Claude--
>>>>>>
>>>>>>>
>>>>>>> Just curious as to where https://github.com/Claudenw/jena-on-cassandra
>>>>>>> has ended up. Is that still work-in-progress?
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> ajs6f
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> I like: Like Like - The likeliest place on the web
>>>> <http://like-like.xenei.com>
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>>>
>>>>
>>>
>>>
>>>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
