Re: Read efficiency question

2016-12-30 Thread Voytek Jarnot
Thank you Janne.  Yes, these are random-access (scatter) reads - I've
decided on option 1; having also considered (as you wrote) that it will
never make sense to look at ranges of key3.

On Fri, Dec 30, 2016 at 3:40 AM, Janne Jalkanen 
wrote:

> In practice, the performance you’re getting is likely to be impacted by
> your reading patterns.  If you do a lot of sequential reads where key1 and
> key2 stay the same, and only key3 varies, then you may be getting better
> performance out of the second option due to hitting the row and disk caches
> more often. If you are doing a lot of scatter reads, then you’re likely to
> get better performance out of the first option, because the reads will be
> distributed more evenly to multiple nodes.  It also depends on how large
> the rows you’re planning to use are, as this will directly impact things like
> compaction, which has an overall impact on the speed of the entire cluster.  For
> just a few values of key3, I doubt there would be much difference in
> performance, but if key3 has a cardinality of say, a million, you might be
> better off with option 1.
>
> As always the advice is - benchmark your intended use case - put a few
> hundred gigs of mock data to a cluster, trigger compactions and do perf
> tests for different kinds of read/write loads. :-)
>
> (Though if I didn’t know what my read pattern would be, I’d probably go
> for option 1 purely on a gut feeling if I was sure I would never need range
> queries on key3; shorter rows *usually* are a bit better for performance,
> compaction, etc.  Really wide rows can sometimes be a headache
> operationally.)
>
> May you have energy and success!
> /Janne
>
>
>
> On 28 Dec 2016, at 16:44, Manoj Khangaonkar  wrote:
>
> In the first case, the partitioning is based on key1,key2,key3.
>
> In the second case, partitioning is based on key1, key2. Additionally you
> have a clustering key, key3. This means within a partition you can do range
> queries on key3 efficiently. That is the difference.
>
> regards
>
> On Tue, Dec 27, 2016 at 7:42 AM, Voytek Jarnot 
> wrote:
>
>> Wondering if there's a difference when querying by primary key between
>> the two definitions below:
>>
>> primary key ((key1, key2, key3))
>> primary key ((key1, key2), key3)
>>
>> In terms of read speed/efficiency... I don't have much of a reason
>> otherwise to prefer one setup over the other, so would prefer the most
>> efficient for querying.
>>
>> Thanks.
>>
>
>
>
> --
> http://khangaonkar.blogspot.com/
>
>
>


Re: Read efficiency question

2016-12-30 Thread Janne Jalkanen
In practice, the performance you’re getting is likely to be impacted by
your reading patterns.  If you do a lot of sequential reads where key1 and
key2 stay the same, and only key3 varies, then you may be getting better
performance out of the second option due to hitting the row and disk caches
more often. If you are doing a lot of scatter reads, then you’re likely to
get better performance out of the first option, because the reads will be
distributed more evenly to multiple nodes.  It also depends on how large
the rows you’re planning to use are, as this will directly impact things like
compaction, which has an overall impact on the speed of the entire cluster.
For just a few values of key3, I doubt there would be much difference in
performance, but if key3 has a cardinality of say, a million, you might be
better off with option 1.

As always the advice is - benchmark your intended use case - put a few
hundred gigs of mock data to a cluster, trigger compactions and do perf
tests for different kinds of read/write loads. :-)

(Though if I didn’t know what my read pattern would be, I’d probably go
for option 1 purely on a gut feeling if I was sure I would never need range
queries on key3; shorter rows *usually* are a bit better for performance,
compaction, etc.  Really wide rows can sometimes be a headache
operationally.)

May you have energy and success!
/Janne



On 28 Dec 2016, at 16:44, Manoj Khangaonkar  wrote:

> In the first case, the partitioning is based on key1,key2,key3.
>
> In the second case, partitioning is based on key1, key2. Additionally you
> have a clustering key, key3. This means within a partition you can do range
> queries on key3 efficiently. That is the difference.
>
> regards
>
> On Tue, Dec 27, 2016 at 7:42 AM, Voytek Jarnot  wrote:
>
>> Wondering if there's a difference when querying by primary key between
>> the two definitions below:
>>
>> primary key ((key1, key2, key3))
>> primary key ((key1, key2), key3)
>>
>> In terms of read speed/efficiency... I don't have much of a reason
>> otherwise to prefer one setup over the other, so would prefer the most
>> efficient for querying.
>>
>> Thanks.
>
> --
> http://khangaonkar.blogspot.com/



Re: Read efficiency question

2016-12-28 Thread Manoj Khangaonkar
In the first case, the partitioning is based on key1,key2,key3.

In the second case, partitioning is based on key1, key2. Additionally you
have a clustering key, key3. This means within a partition you can do range
queries on key3 efficiently. That is the difference.
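
To make the difference concrete, here is a minimal Python sketch of the two
layouts. It is only an in-memory model under simplifying assumptions, not how
Cassandra stores data: the dictionaries option1 and option2 and the helper
functions are hypothetical names used for illustration. Option 1 gets one
dictionary entry (partition) per full key, while option 2 gets one entry per
(key1, key2) with its rows kept sorted by key3, which is what makes a range
scan on key3 cheap.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory model of the two layouts (not Cassandra's storage engine).

# Option 1: PRIMARY KEY ((key1, key2, key3))
# Every (key1, key2, key3) combination is its own partition.
option1 = {}

# Option 2: PRIMARY KEY ((key1, key2), key3)
# One partition per (key1, key2); rows inside it are kept sorted by key3.
option2 = {}


def insert(key1, key2, key3, value):
    """Write the same row into both models."""
    option1[(key1, key2, key3)] = value
    rows = option2.setdefault((key1, key2), [])
    pos = bisect_left(rows, (key3,))
    if pos < len(rows) and rows[pos][0] == key3:
        rows[pos] = (key3, value)  # same primary key: overwrite (upsert)
    else:
        rows.insert(pos, (key3, value))


def point_lookup_option1(key1, key2, key3):
    """Full primary key lookup: one hash lookup on the whole key."""
    return option1.get((key1, key2, key3))


def point_lookup_option2(key1, key2, key3):
    """Find the partition first, then search the sorted rows for key3."""
    rows = option2.get((key1, key2), [])
    pos = bisect_left(rows, (key3,))
    if pos < len(rows) and rows[pos][0] == key3:
        return rows[pos][1]
    return None


def range_query(key1, key2, lo, hi):
    """Rows with lo <= key3 <= hi inside one partition (option 2 only)."""
    rows = option2.get((key1, key2), [])
    keys = [k for k, _ in rows]
    return rows[bisect_left(keys, lo):bisect_right(keys, hi)]


for key3 in (30, 10, 20):
    insert("a", "b", key3, f"v{key3}")

print(len(option1), "partitions in option 1;", len(option2), "in option 2")
print(range_query("a", "b", 10, 20))
```

In this model both options answer a lookup by full primary key, but only
option 2 has anything to scan for a key3 range; with option 1 the same
question would need one separate lookup per key3 value.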

regards

On Tue, Dec 27, 2016 at 7:42 AM, Voytek Jarnot 
wrote:

> Wondering if there's a difference when querying by primary key between the
> two definitions below:
>
> primary key ((key1, key2, key3))
> primary key ((key1, key2), key3)
>
> In terms of read speed/efficiency... I don't have much of a reason
> otherwise to prefer one setup over the other, so would prefer the most
> efficient for querying.
>
> Thanks.
>



-- 
http://khangaonkar.blogspot.com/


Re: Read efficiency question

2016-12-27 Thread Oskar Kjellin
Yes sorry I missed the double parenthesis in the first case. 

I may be a bit off here, but I don't think the coordinator pinpoints the row,
just the node it needs to go to.
It's more a case of creating smaller partitions, which makes for a more even
load across the cluster, and the node will not have to read a whole lot of data
into memory just to GC it later on.

If you think of Cassandra as a hash map (which it kind of is), you want the key
to be as unique as possible so that you don't have to go to a bucket and filter
there, or create hot spots.
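
The hash map analogy can be sketched in a few lines of Python. This is a
rough illustration under loud assumptions: node_for and NUM_NODES are
hypothetical, and an MD5 hash taken modulo the node count is only a stand-in
for Cassandra's real partitioner and token ring. It counts where 1000 rows
that share key1 and key2 would land under each definition.

```python
import hashlib
from collections import Counter

NUM_NODES = 4  # hypothetical cluster size


def node_for(partition_key):
    """Map a partition key to a node index.

    MD5 modulo the node count is a simplification chosen for brevity;
    it is not Cassandra's actual partitioner or token ring.
    """
    digest = hashlib.md5(repr(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


# 1000 rows that share key1 and key2 and differ only in key3.
rows = [("k1", "k2", key3) for key3 in range(1000)]

# Option 1: all three columns form the partition key, so each row hashes separately.
load_option1 = Counter(node_for(row) for row in rows)

# Option 2: only (key1, key2) is hashed; every row lands in the same partition.
load_option2 = Counter(node_for(row[:2]) for row in rows)

print("option 1 rows per node:", dict(load_option1))
print("option 2 rows per node:", dict(load_option2))
```

Under these assumptions, option 1 scatters the rows over the nodes while
option 2 keeps all of them in one partition on one node, which is the
"bucket" (and potential hot spot) described above.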


> On 27 Dec 2016, at 17:12, Voytek Jarnot  wrote:
> 
> Thank you Oskar.  I think you may be missing the double parentheses in the 
> first example - difference is between partition key of (key1, key2, key3) and 
> (key1, key2).  With that in mind, I believe your answer would be that the 
> first example is more efficient?
> 
> Is this essentially a case of the coordinator node being able to exactly 
> pinpoint a row (first example) vs the coordinator node pinpointing the 
> partition and letting the partition-owning node refine down to the right row 
> using the clustering key (key3 in the second example)?
> 
>> On Tue, Dec 27, 2016 at 10:06 AM, Oskar Kjellin  
>> wrote:
>> The second one will be the most efficient.
>> How much depends on how unique key1 is.
>> 
>> In the first case everything for the same key1 will be on the same 
>> partition.  If it's not unique at all that will be very bad.
>> In the second case the combo of key1 and key2 will decide what partition.
>> 
>> If you don't ever have to find all key2 for a given key1 I don't see any 
>> reason to do case 1
>> 
>> 
>> > On 27 Dec 2016, at 16:42, Voytek Jarnot  wrote:
>> >
>> > Wondering if there's a difference when querying by primary key between the 
>> > two definitions below:
>> >
>> > primary key ((key1, key2, key3))
>> > primary key ((key1, key2), key3)
>> >
>> > In terms of read speed/efficiency... I don't have much of a reason 
>> > otherwise to prefer one setup over the other, so would prefer the most 
>> > efficient for querying.
>> >
>> > Thanks.
> 


Re: Read efficiency question

2016-12-27 Thread Voytek Jarnot
Thank you Oskar.  I think you may be missing the double parentheses in the
first example - difference is between partition key of (key1, key2, key3)
and (key1, key2).  With that in mind, I believe your answer would be that
the first example is more efficient?

Is this essentially a case of the coordinator node being able to exactly
pinpoint a row (first example) vs the coordinator node pinpointing the
partition and letting the partition-owning node refine down to the right
row using the clustering key (key3 in the second example)?

On Tue, Dec 27, 2016 at 10:06 AM, Oskar Kjellin 
wrote:

> The second one will be the most efficient.
> How much depends on how unique key1 is.
>
> In the first case everything for the same key1 will be on the same
> partition.  If it's not unique at all that will be very bad.
> In the second case the combo of key1 and key2 will decide what partition.
>
> If you don't ever have to find all key2 for a given key1 I don't see any
> reason to do case 1
>
>
> > On 27 Dec 2016, at 16:42, Voytek Jarnot  wrote:
> >
> > Wondering if there's a difference when querying by primary key between
> the two definitions below:
> >
> > primary key ((key1, key2, key3))
> > primary key ((key1, key2), key3)
> >
> > In terms of read speed/efficiency... I don't have much of a reason
> otherwise to prefer one setup over the other, so would prefer the most
> efficient for querying.
> >
> > Thanks.
>


Re: Read efficiency question

2016-12-27 Thread Oskar Kjellin
The second one will be the most efficient. 
How much depends on how unique key1 is. 

In the first case everything for the same key1 will be on the same partition.  
If it's not unique at all that will be very bad. 
In the second case the combo of key1 and key2 will decide what partition. 

If you don't ever have to find all key2 for a given key1 I don't see any reason 
to do case 1


> On 27 Dec 2016, at 16:42, Voytek Jarnot  wrote:
> 
> Wondering if there's a difference when querying by primary key between the 
> two definitions below:
> 
> primary key ((key1, key2, key3))
> primary key ((key1, key2), key3)
> 
> In terms of read speed/efficiency... I don't have much of a reason otherwise 
> to prefer one setup over the other, so would prefer the most efficient for 
> querying.
> 
> Thanks.


Read efficiency question

2016-12-27 Thread Voytek Jarnot
Wondering if there's a difference when querying by primary key between the
two definitions below:

primary key ((key1, key2, key3))
primary key ((key1, key2), key3)

In terms of read speed/efficiency... I don't have much of a reason
otherwise to prefer one setup over the other, so would prefer the most
efficient for querying.

Thanks.