Replicate On Write behavior
I'm curious... digging through the source, it looks like replicate on write triggers a read of the entire row, and not just the columns/supercolumns that are affected by the counter update. Is this the case? It would certainly explain why my inserts/sec decay over time and why the average insert latency increases over time. The strange thing is that I'm not seeing disk read IO increase over that same period, but that might be due to the OS buffer cache...

On another note, on a 5-node cluster, I'm only seeing 3 nodes with ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal? I'm using RandomPartitioner...

Address    DC           Rack   Status  State   Load       Owns    Token
                                                                  136112946768375385385349842972707284580
10.0.0.57  datacenter1  rack1  Up      Normal  2.26 GB    20.00%  0
10.0.0.56  datacenter1  rack1  Up      Normal  2.47 GB    20.00%  34028236692093846346337460743176821145
10.0.0.55  datacenter1  rack1  Up      Normal  2.52 GB    20.00%  68056473384187692692674921486353642290
10.0.0.54  datacenter1  rack1  Up      Normal  950.97 MB  20.00%  102084710076281539039012382229530463435
10.0.0.72  datacenter1  rack1  Up      Normal  383.25 MB  20.00%  136112946768375385385349842972707284580

The nodes with ReplicateOnWrites are the 3 in the middle. The first node and the last node both have a count of 0. This is a clean cluster, and inserts/sec have decayed from 3k down to 2.5k over the last 12 hours. The last time this test ran, it went all the way down to 500 inserts/sec before I killed it.
Re: Replicate On Write behavior
When Cassandra reads, the entire CF is always read together; only at the hand-over to the client does the pruning happen.

On Thu, Sep 1, 2011 at 11:52 AM, David Hawthorne wrote:
> I'm curious... digging through the source, it looks like replicate on write
> triggers a read of the entire row, and not just the columns/supercolumns
> that are affected by the counter update. Is this the case?
Re: Replicate On Write behavior
I'm not sure I understand the scalability of this approach. A given column family can be HUGE, with millions of rows and columns. In my cluster I have a single column family that accounts for 90 GB of load on each node. Not only that, but the column family is distributed over the entire ring.

Clearly I'm misunderstanding something.

Ian

On Thu, Sep 1, 2011 at 1:17 PM, Yang wrote:
> when Cassandra reads, the entire CF is always read together, only at the
> hand-over to client does the pruning happens
Re: Replicate On Write behavior
Yeah, I believe Yang has a typo in his post. A CF is not read in one go; a row is.

As for the scalability of having all the columns read at once, I don't believe it was ever meant to work that way. All the columns in a row are stored together, on the same set of machines. This means that if you have very large rows you can end up with an unbalanced cluster, but it also allows reads of several columns out of a row to be more efficient, since they are all on the same machine (no need to gather results from several machines), and they should read quickly since they sit together on disk.

----- Original Message -----
From: "Ian Danforth"
To: user@cassandra.apache.org
Sent: Thursday, September 1, 2011 4:35:33 PM
Subject: Re: Replicate On Write behavior

I'm not sure I understand the scalability of this approach. A given column family can be HUGE with millions of rows and columns. [...]
Re: Replicate On Write behavior
Sorry, I meant CF * row. If you look in the code, db.cf is basically just a set of columns.

On Sep 1, 2011 1:36 PM, "Ian Danforth" wrote:
> I'm not sure I understand the scalability of this approach. A given
> column family can be HUGE with millions of rows and columns.
Re: Replicate On Write behavior
On Thu, Sep 1, 2011 at 8:52 PM, David Hawthorne wrote:
> I'm curious... digging through the source, it looks like replicate on write
> triggers a read of the entire row, and not just the columns/supercolumns that
> are affected by the counter update. Is this the case?

It does not. It only reads the columns/supercolumns affected by the counter update. In the source, this happens in CounterMutation.java. If you look at addReadCommandFromColumnFamily you'll see that it does a query by name only for the column involved in the update (the update is basically the content of the columnFamily parameter there).

And Cassandra does *not* always read a full row. Never has, never will.

> On another note, on a 5-node cluster, I'm only seeing 3 nodes with
> ReplicateOnWrite Completed tasks in nodetool tpstats output. Is that normal?
> I'm using RandomPartitioner...
Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890. -- Sylvain
Re: Replicate On Write behavior
That's interesting. I did an experiment wherein I added some entropy to the row name based on the time when the increment came in (e.g. row = row + "/" + (timestamp - (timestamp % 300))), and now not only is the load (in GB) on my cluster more balanced, but performance has not decayed: inserts/sec have stayed steady, with a relatively low average ms/insert. Each row is now significantly shorter as a result of this change.

On Sep 2, 2011, at 12:30 AM, Sylvain Lebresne wrote:
> It does not. It only reads the columns/supercolumns affected by the
> counter update.
> [...]
> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
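The row-key bucketing described in this message can be sketched as follows; the helper name is hypothetical, but the "/" separator and the 300-second window are taken from the post above.

```java
public class BucketedKey {
    // Hypothetical helper: append a 300-second time bucket to the row key,
    // so counter increments spread across many shorter rows instead of one
    // ever-growing row.
    static String bucketedKey(String row, long timestampSeconds) {
        // e.g. timestamp 1000 falls in the bucket starting at 900
        return row + "/" + (timestampSeconds - (timestampSeconds % 300));
    }

    public static void main(String[] args) {
        System.out.println(bucketedKey("counters", 1000)); // counters/900
    }
}
```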
Re: Replicate On Write behavior
That ticket explains a lot; looking forward to a resolution on it. (Sorry I don't have a patch to offer.)

Ian

On Fri, Sep 2, 2011 at 12:30 AM, Sylvain Lebresne wrote:
> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
Re: Replicate On Write behavior
Does it always pick the node with the lowest IP address? All of my hosts are in the same /24. The fourth node in the 5-node cluster has the lowest value in the 4th octet (54). I erased the cluster and rebuilt it from scratch as a 3-node cluster using the first 3 nodes, and now the ReplicateOnWrites are all going to the third node, which also has the lowest-valued IP address (55). That would explain why only 1 node gets writes in a 3-node cluster (RF=3), why 3 nodes get writes in a 5-node cluster, and why one of those 3 is taking 66% of the writes.

> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
>
> --
> Sylvain
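As a sanity check on the lowest-IP theory, here is a small simulation (an illustration of the hypothesis, not Cassandra code): assume each key's replicas are RF consecutive nodes on the ring, and the replicate-on-write leader is always the lowest-IP replica. With the ring from the nodetool output earlier in the thread (.57, .56, .55, .54, .72 in token order), only three nodes ever lead, and .54 leads 60% of the ranges, roughly matching the skew reported in this thread.

```java
import java.util.Map;
import java.util.TreeMap;

public class RowLeaderSim {
    // For each of n equal token ranges, the replicas are the owning node
    // plus the next rf-1 nodes clockwise; the hypothesized leader is the
    // replica with the lowest IP. Returns ip -> percent of ranges led.
    static Map<Integer, Integer> leaderShare(int[] ipsInRingOrder, int rf) {
        int n = ipsInRingOrder.length;
        Map<Integer, Integer> share = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            int leader = Integer.MAX_VALUE;
            for (int r = 0; r < rf; r++)
                leader = Math.min(leader, ipsInRingOrder[(i + r) % n]);
            share.merge(leader, 100 / n, Integer::sum);
        }
        return share;
    }

    public static void main(String[] args) {
        // Last octets in ring (token) order, from the nodetool output above.
        System.out.println(leaderShare(new int[]{57, 56, 55, 54, 72}, 3));
        // prints {54=60, 55=20, 56=20}: three nodes lead, .54 takes 60%,
        // and .57 / .72 never lead -- consistent with the tpstats observation.
    }
}
```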
Re: Replicate On Write behavior
It was exactly due to 2890, and the fact that the first replica is always the one with the lowest-value IP address. I patched Cassandra to pick a random node out of the replica set in StorageProxy.java findSuitableEndpoint:

    Random rng = new Random();
    return endpoints.get(rng.nextInt(endpoints.size())); // instead of: return endpoints.get(0);

Now the workload is evenly balanced among all 5 nodes and I'm getting 2.5x the inserts/sec throughput.

Here's the behavior I saw ("disk work" refers to the ReplicateOnWrite load of a counter insert): one node will get RF/n of the disk work, and two nodes will always get 0 disk work.

In a 3-node cluster, 1 node gets its disk hit really hard. You get the performance of a one-node cluster.
In a 6-node cluster, 1 node gets hit with 50% of the disk work, giving you the performance of a ~2-node cluster.
In a 10-node cluster, 1 node gets 30% of the disk work, giving you the performance of a ~3-node cluster.

I confirmed this behavior with 3-, 4-, and 5-node cluster sizes.

> Could be due to https://issues.apache.org/jira//browse/CASSANDRA-2890.
>
> --
> Sylvain
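For reference, a self-contained sketch of that one-line change (a generic element type stands in for InetAddress; this illustrates the described patch, not the actual StorageProxy code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class RandomReplica {
    // Pick a random replica instead of always the first one
    // (the original behavior was: return endpoints.get(0);).
    static <T> T findSuitableEndpoint(List<T> endpoints, Random rng) {
        return endpoints.get(rng.nextInt(endpoints.size()));
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("10.0.0.57", "10.0.0.56", "10.0.0.55");
        // Any one of the three replicas may be returned.
        System.out.println(findSuitableEndpoint(replicas, new Random()));
    }
}
```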
Re: Replicate On Write behavior
We'll solve #2890, and we should have done it sooner.

That being said, a quick question: how do you do your inserts from the clients? Are you evenly distributing the inserts among the nodes, or are you always hitting the same coordinator?

Because provided the nodes are correctly distributed on the ring, if you distribute the insert (increment) requests across the nodes (again, I'm talking about client requests), you "should" not see the behavior you observe.

--
Sylvain

On Thu, Sep 8, 2011 at 9:48 PM, David Hawthorne wrote:
> It was exactly due to 2890, and the fact that the first replica is always the
> one with the lowest value IP address. I patched cassandra to pick a random
> node out of the replica set in StorageProxy.java findSuitableEndpoint:
> [...]
> Now work load is evenly balanced among all 5 nodes and I'm getting 2.5x the
> inserts/sec throughput.
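The client-side distribution Sylvain suggests can be sketched generically as a round-robin coordinator picker (the host list is hypothetical; a real client such as Hector would typically be configured with multiple hosts instead of hand-rolling this):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class RoundRobinCoordinator {
    private final List<String> hosts;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinCoordinator(List<String> hosts) {
        this.hosts = hosts;
    }

    // Cycle through coordinators so no single node fields every request.
    String nextHost() {
        return hosts.get(Math.floorMod(next.getAndIncrement(), hosts.size()));
    }

    public static void main(String[] args) {
        RoundRobinCoordinator rr = new RoundRobinCoordinator(
                Arrays.asList("10.0.0.57", "10.0.0.56", "10.0.0.55"));
        for (int i = 0; i < 6; i++)
            System.out.println(rr.nextHost()); // cycles 57, 56, 55, 57, 56, 55
    }
}
```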
Re: Replicate On Write behavior
They are evenly distributed: 5 nodes * 40 connections each using hector, and I can confirm that all 200 were active when this happened (from hector's perspective, by graphing the hector JMX data). All 5 nodes saw roughly 40 connections, and all were receiving traffic over those connections (netstat + ntop + trafshow, etc.).

I can also confirm that I changed my insert strategy to break up the rows using composite row keys, which reduced the row lengths and gave me an almost perfectly even data distribution among the nodes. That was when I started to really dig into why the ROWs were still backing up on one node specifically, and why 2 nodes weren't seeing any. It was the 20%, 20%, 60% ROW distribution that really got me thinking, and when I took the 60% node out of the cluster, that ROW load jumped back to the node with the next-lowest IP address, and the 2 nodes that weren't seeing any ROWs *still* weren't seeing any. At that point I tore down the cluster and recreated it as a 3-node cluster several times, using various permutations of the 5 nodes available, and the ROW load was *always* on the node with the lowest IP address.

The theory might not be right, but it certainly represents the behavior I saw.

On Sep 9, 2011, at 12:17 AM, Sylvain Lebresne wrote:
> That being said, a quick question: how do you do your inserts from the
> clients ? Are you evenly distributing the inserts among the nodes ? Or are
> you always hitting the same coordinator ?