Replicate On Write behavior
I'm curious... digging through the source, it looks like replicate on write triggers a read of the entire row, and not just the columns/supercolumns that are affected by the counter update. Is this the case? It would certainly explain why my inserts/sec decay over time and why the average insert latency increases over time. The strange thing is that I'm not seeing disk read IO increase over that same period, but that might be due to the OS buffer cache...

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...

Address    DC           Rack   Status  State   Load       Owns    Token
                                                                  136112946768375385385349842972707284580
10.0.0.57  datacenter1  rack1  Up      Normal  2.26 GB    20.00%  0
10.0.0.56  datacenter1  rack1  Up      Normal  2.47 GB    20.00%  34028236692093846346337460743176821145
10.0.0.55  datacenter1  rack1  Up      Normal  2.52 GB    20.00%  68056473384187692692674921486353642290
10.0.0.54  datacenter1  rack1  Up      Normal  950.97 MB  20.00%  102084710076281539039012382229530463435
10.0.0.72  datacenter1  rack1  Up      Normal  383.25 MB  20.00%  136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and the last node both have a count of 0. This is a clean cluster, and inserts/sec have decayed from 3k down to 2.5k over the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.
Re: Replicate On Write behavior
When Cassandra reads, the entire CF is always read together; only at the hand-over to the client does the pruning happen.

On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne wrote:
> I'm curious... digging through the source, it looks like replicate on write
> triggers a read of the entire row, and not just the columns/supercolumns
> that are affected by the counter update. Is this the case?
Re: Replicate On Write behavior
I'm not sure I understand the scalability of this approach. A given column family can be HUGE, with millions of rows and columns. In my cluster I have a single column family that accounts for 90 GB of load on each node. Not only that, but the column family is distributed over the entire ring.

Clearly I'm misunderstanding something.

Ian

On Thu, Sep 1, 2011 at 1:17 PM, Yang wrote:
> when Cassandra reads, the entire CF is always read together, only at the
> hand-over to client does the pruning happens
Re: Replicate On Write behavior
Yeah, I believe Yang has a typo in his post. A CF is not read in one go; a row is.

As for the scalability of having all the columns read at once, I don't believe it was ever meant to work that way. All the columns in a row are stored together, on the same set of machines. This means that if you have very large rows you can end up with an unbalanced cluster, but it also allows reads of several columns out of a row to be more efficient, since they are all on the same machine (no need to gather results from several machines), and they should read quickly since they sit together on disk.

----- Original Message -----
From: "Ian Danforth"
To: user@cassandra.apache.org
Sent: Thursday, September 1, 2011 4:35:33 PM
Subject: Re: Replicate On Write behavior

I'm not sure I understand the scalability of this approach. A given column family can be HUGE with millions of rows and columns. [...]
Re: Replicate On Write behavior
Sorry, I meant CF * row. If you look in the code, db.cf is basically just a set of columns.

On Sep 1, 2011 1:36 PM, "Ian Danforth" wrote:
> I'm not sure I understand the scalability of this approach. A given
> column family can be HUGE with millions of rows and columns.
Re: Replicate On Write behavior
On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne wrote:
> I'm curious... digging through the source, it looks like replicate on write
> triggers a read of the entire row, and not just the columns/supercolumns that
> are affected by the counter update. Is this the case?

It does not. It only reads the columns/supercolumns affected by the counter update. In the source, this happens in CounterMutation.java. If you look at addReadCommandFromColumnFamily you'll see that it does a query by name only for the column involved in the update (the update is basically the content of the columnFamily parameter there).

And Cassandra does *not* always read a full row. Never has, never will.

> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal?
> I'm using RandomPartitioner...
Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
Re: Replicate On Write behavior
That's interesting. I did an experiment wherein I added some entropy to the row name based on the time when the increment came in (e.g. row = row + "/" + (timestamp - (timestamp % 300))), and now not only is the load (in GB) on my cluster more balanced, but performance has not decayed: inserts/sec have stayed steady, with a relatively low average ms/insert. Each row is now significantly shorter as a result of this change.

On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote:
> It does not. It only reads the columns/supercolumns affected by the
> counter update.
> [...]
> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
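The row-key bucketing described in this message can be sketched as follows; the helper name is hypothetical, but the "/" separator and the 300-second window are taken from the post above.

```java
public class BucketedKey {
    // Hypothetical helper: append a 300-second time bucket to the row key,
    // so counter increments spread across many shorter rows instead of one
    // ever-growing row.
    static String bucketedKey(String row, long timestampSeconds) {
        // e.g. timestamp 1000 falls in the bucket starting at 900
        return row + "/" + (timestampSeconds - (timestampSeconds % 300));
    }

    public static void main(String[] args) {
        System.out.println(bucketedKey("counters", 1000)); // counters/900
    }
}
```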
Re: Replicate On Write behavior
That ticket explains a lot; looking forward to a resolution on it. (Sorry I don't have a patch to offer.)

Ian

On Fri, Sep 2, 2011 at 12:30 AM, Sylvain Lebresne wrote:
> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
Re: Replicate On Write behavior
Does it always pick the node with the lowest IP address? All of my hosts are in the same /24. The fourth node in the 5-node cluster has the lowest value in the 4th octet (54). I erased the cluster and rebuilt it from scratch as a 3-node cluster using the first 3 nodes, and now the ReplicateOnWrites are all going to the third node, which also has the lowest-valued IP address (55). That would explain why only 1 node gets writes in a 3-node cluster (RF=3), why 3 nodes get writes in a 5-node cluster, and why one of those 3 is taking 66% of the writes.

> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
>
> --
> Sylvain
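As a sanity check on the lowest-IP theory, here is a small simulation (an illustration of the hypothesis, not Cassandra code): assume each key's replicas are RF consecutive nodes on the ring, and the replicate-on-write leader is always the lowest-IP replica. With the ring from the nodetool output earlier in the thread (.57, .56, .55, .54, .72 in token order), only three nodes ever lead, and .54 leads 60% of the ranges, roughly matching the skew reported in this thread.

```java
import java.util.Map;
import java.util.TreeMap;

public class RowLeaderSim {
    // For each of n equal token ranges, the replicas are the owning node
    // plus the next rf-1 nodes clockwise; the hypothesized leader is the
    // replica with the lowest IP. Returns ip -> percent of ranges led.
    static Map<Integer, Integer> leaderShare(int[] ipsInRingOrder, int rf) {
        int n = ipsInRingOrder.length;
        Map<Integer, Integer> share = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            int leader = Integer.MAX_VALUE;
            for (int r = 0; r < rf; r++)
                leader = Math.min(leader, ipsInRingOrder[(i + r) % n]);
            share.merge(leader, 100 / n, Integer::sum);
        }
        return share;
    }

    public static void main(String[] args) {
        // Last octets in ring (token) order, from the nodetool output above.
        System.out.println(leaderShare(new int[]{57, 56, 55, 54, 72}, 3));
        // prints {54=60, 55=20, 56=20}: three nodes lead, .54 takes 60%,
        // and .57 / .72 never lead -- consistent with the tpstats observation.
    }
}
```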
Re: Replicate On Write behavior
It was exactly due to 2890, and the fact that the first replica is always the one with the lowest-value IP address. I patched Cassandra to pick a random node out of the replica set in StorageProxy.java findSuitableEndpoint:

    Random rng = new Random();
    return endpoints.get(rng.nextInt(endpoints.size())); // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the inserts/sec throughput.

Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite load of a counter insert): one node will get RF/n of the disk work, and two nodes will always get 0 disk work.

In a 3-node cluster, 1 node gets its disk hit really hard. You get the performance of a one-node cluster.
In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you the performance of a ~2-node cluster.
In a 10-node cluster, 1 node gets 30% of the disk work, giving you the performance of a ~3-node cluster.

I confirmed this behavior with 3-, 4-, and 5-node cluster sizes.

> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
>
> --
> Sylvain
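For reference, a self-contained sketch of that one-line change (a generic element type stands in for InetAddress; this illustrates the described patch, not the actual StorageProxy code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomReplica {
    // Pick a random replica instead of always the first one
    // (the original behavior was: return endpoints.get(0);).
    static <T> T findSuitableEndpoint(List<T> endpoints, Random rng) {
        return endpoints.get(rng.nextInt(endpoints.size()));
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("10.0.0.57", "10.0.0.56", "10.0.0.55");
        // Any one of the three replicas may be returned.
        System.out.println(findSuitableEndpoint(replicas, new Random()));
    }
}
```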
Re: Replicate On Write behavior
We'll solve #2890, and we should have done it sooner.

That being said, a quick question: how do you do your inserts from the clients? Are you evenly distributing the inserts among the nodes, or are you always hitting the same coordinator?

Because provided the nodes are correctly distributed on the ring, if you distribute the insert (increment) requests across the nodes (again, I'm talking about client requests), you "should" not see the behavior you observe.

--
Sylvain

On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne wrote:
> It was exactly due to 2890, and the fact that the first replica is always the
> one with the lowest value IP address. I patched cassandra to pick a random
> node out of the replica set in StorageProxy.java findSuitableEndpoint:
> [...]
> Now work load is evenly balanced among all 5 nodes and I'm getting 2.5x the
> inserts/sec throughput.
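The client-side distribution Sylvain suggests can be sketched generically as a round-robin coordinator picker (the host list is hypothetical; a real client such as Hector would typically be configured with multiple hosts instead of hand-rolling this):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinCoordinator {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinCoordinator(List<String> hosts) {
        this.hosts = hosts;
    }

    // Cycle through coordinators so no single node fields every request.
    String nextHost() {
        return hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
    }

    public static void main(String[] args) {
        RoundRobinCoordinator rr = new RoundRobinCoordinator(
                Arrays.asList("10.0.0.57", "10.0.0.56", "10.0.0.55"));
        for (int i = 0; i < 6; i++)
            System.out.println(rr.nextHost()); // cycles 57, 56, 55, 57, 56, 55
    }
}
```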
Re: Replicate On Write behavior
They are evenly distributed: 5 nodes * 40 connections each using hector, and I can confirm that all 200 were active when this happened (from hector's perspective, by graphing the hector JMX data). All 5 nodes saw roughly 40 connections, and all were receiving traffic over those connections (netstat + ntop + trafshow, etc.).

I can also confirm that I changed my insert strategy to break up the rows using composite row keys, which reduced the row lengths and gave me an almost perfectly even data distribution among the nodes. That was when I started to really dig into why the ROWs were still backing up on one node specifically, and why 2 nodes weren't seeing any. It was the 20%, 20%, 60% ROW distribution that really got me thinking, and when I took the 60% node out of the cluster, that ROW load jumped back to the node with the next-lowest IP address, and the 2 nodes that weren't seeing any ROWs *still* weren't seeing any. At that point I tore down the cluster and recreated it as a 3-node cluster several times, using various permutations of the 5 nodes available, and the ROW load was *always* on the node with the lowest IP address.

The theory might not be right, but it certainly represents the behavior I saw.

On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:
> That being said, a quick question: how do you do your inserts from the
> clients ? Are you evenly distributing the inserts among the nodes ? Or are
> you always hitting the same coordinator ?