Re: How to stop "nodetool repair" in 2.1.2?
Using JMX worked. Thanks a lot.

On Wed, Apr 15, 2015 at 3:57 PM, Robert Coli wrote:

> On Wed, Apr 15, 2015 at 3:30 PM, Benyi Wang wrote:
>
>> It didn't work. I ran the command on all nodes, but I still can see the
>> repair activities.
>
> Your input as an operator who wants a nodetool command to trivially stop
> repairs is welcome here:
>
> https://issues.apache.org/jira/browse/CASSANDRA-3486
>
> For now, your two options are:
>
> 1) restart all nodes participating in the repair
> 2) access the JMX endpoint forceTerminateAllRepairSessions on all nodes
>    participating in the repair
>
> =Rob
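For anyone finding this thread later: option 2 can also be scripted instead of done by hand. Below is a minimal sketch in plain Java that invokes the operation on each node; the host names are placeholders, and port 7199 assumes the default Cassandra JMX port with authentication disabled.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class StopRepairs {
        public static void main(String[] args) throws Exception {
            // Placeholder host list: every node participating in the repair.
            String[] hosts = {"node1", "node2", "node3"};
            for (String host : hosts) {
                // 7199 is the default Cassandra JMX port (assumes no JMX auth/SSL).
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                try {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    ObjectName storageService =
                            new ObjectName("org.apache.cassandra.db:type=StorageService");
                    // The operation takes no arguments.
                    mbs.invoke(storageService, "forceTerminateAllRepairSessions",
                            new Object[0], new String[0]);
                    System.out.println("Terminated repair sessions on " + host);
                } finally {
                    connector.close();
                }
            }
        }
    }

The same operation can also be invoked interactively from jconsole (or any generic JMX client) against the org.apache.cassandra.db:type=StorageService MBean.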
Re: Delete query range limitation
There's a ticket for range deletions in CQL here:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-6237

On Apr 15, 2015 6:27 PM, "Dan Kinder" wrote:

> I understand that range deletes are currently not supported
> (http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key).
>
> Since Cassandra now does have range tombstones, is there a reason why they
> can't be allowed? Is there a ticket for supporting this, or is it a
> deliberate design decision not to?
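Until CASSANDRA-6237 is implemented, the usual client-side workaround is to SELECT the clustering keys in the range and delete them one by one. A rough sketch with the DataStax Java driver (2.x); the keyspace, table, and column names are made up for illustration:

    import java.util.Date;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class RangeDeleteWorkaround {
        public static void main(String[] args) {
            // Hypothetical schema:
            //   CREATE TABLE ks.events (id int, ts timestamp, payload text,
            //                           PRIMARY KEY (id, ts));
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("ks");
            try {
                Date cutoff = new Date(System.currentTimeMillis() - 86400000L); // 1 day ago
                // Range predicates work on SELECT, just not (yet) on DELETE,
                // so read the clustering keys in the range first...
                ResultSet rs = session.execute(
                        "SELECT ts FROM events WHERE id = ? AND ts < ?", 42, cutoff);
                // ...then delete each row individually. Note that each delete still
                // writes its own tombstone, so this is no cheaper than a real range
                // tombstone would be.
                for (Row row : rs) {
                    session.execute("DELETE FROM events WHERE id = ? AND ts = ?",
                            42, row.getDate("ts"));
                }
            } finally {
                cluster.close();
            }
        }
    }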
Re: How to stop "nodetool repair" in 2.1.2?
On Wed, Apr 15, 2015 at 3:30 PM, Benyi Wang wrote:

> It didn't work. I ran the command on all nodes, but I still can see the
> repair activities.

Your input as an operator who wants a nodetool command to trivially stop
repairs is welcome here:

https://issues.apache.org/jira/browse/CASSANDRA-3486

For now, your two options are:

1) restart all nodes participating in the repair
2) access the JMX endpoint forceTerminateAllRepairSessions on all nodes
   participating in the repair

=Rob
Re: How to stop "nodetool repair" in 2.1.2?
It didn't work. I ran the command on all nodes, but I still can see the
repair activities.

On Wed, Apr 15, 2015 at 3:20 PM, Sebastian Estevez
<sebastian.este...@datastax.com> wrote:

> nodetool stop *VALIDATION*
>
> On Apr 15, 2015 5:16 PM, "Benyi Wang" wrote:
>
>> I ran "nodetool repair -- keyspace table" for a table, and it is still
>> running after 4 days. I know there is an issue with repair and vnodes:
>> https://issues.apache.org/jira/browse/CASSANDRA-5220. My question is how
>> I can kill this sequential repair?
>>
>> I killed the process from which I ran the repair command, but I can
>> still find the repair activities running on different nodes in OpsCenter.
>>
>> Is there a way I can stop the repair without restarting the nodes?
>>
>> Thanks.
Delete query range limitation
I understand that range deletes are currently not supported
(http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key).

Since Cassandra now does have range tombstones, is there a reason why they
can't be allowed? Is there a ticket for supporting this, or is it a
deliberate design decision not to?
Re: How to stop "nodetool repair" in 2.1.2?
nodetool stop *VALIDATION*

On Apr 15, 2015 5:16 PM, "Benyi Wang" wrote:

> I ran "nodetool repair -- keyspace table" for a table, and it is still
> running after 4 days. I know there is an issue with repair and vnodes:
> https://issues.apache.org/jira/browse/CASSANDRA-5220. My question is how
> I can kill this sequential repair?
>
> I killed the process from which I ran the repair command, but I can still
> find the repair activities running on different nodes in OpsCenter.
>
> Is there a way I can stop the repair without restarting the nodes?
>
> Thanks.
How to stop "nodetool repair" in 2.1.2?
I ran "nodetool repair -- keyspace table" for a table, and it is still running after 4 days. I knew there is an issue for repair with vnodes https://issues.apache.org/jira/browse/CASSANDRA-5220. My question is how I can kill this sequential repair? I killed the process which I ran the repair command. But I still can find the repair activities running on different nodes in OpsCenter. Is there a way I can stop the repair without restarting the nodes? Thanks.
Re: Keyspace Replication changes not synchronized after adding Datacenter
Hello,

No, that is not expected at all; you should not have to run the ALTER
statement in each DC. Yes, it indicates a larger problem, for sure.

Check that ports are open between all nodes, especially 7000, if I recall
correctly. We use a simple telnet check.

Paul

On 04/13/2015 10:22 AM, Thunder Stumpges wrote:

Hi guys,

We have recently added two datacenters to our existing 2.0.6 cluster. We
followed the process here pretty much exactly:
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html

We are using GossipingPropertyFileSnitch and NetworkTopologyStrategy across
the board. All property files are identical in each of the three
datacenters, and we use two nodes from each DC in the seed list.

However, when we came to step 7.a. we ran the ALTER KEYSPACE command on one
of the new datacenters (to add it as a replica). This change was reflected
on the new datacenter where it ran, as returned by DESCRIBE KEYSPACE.
However, the change was NOT propagated to either of the other two
datacenters. We effectively had to run the ALTER KEYSPACE command three
times, once in each datacenter.

Is this expected? I could find no documentation stating that this needed to
be done, nor any documentation around how the system keyspace was kept in
sync across datacenters in general.

If this is indicative of a larger problem with our installation, how would
we go about troubleshooting it?

Thanks in advance!
Thunder
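For what it's worth, the telnet check can be scripted when there are many node/port pairs to test. A small Java sketch; the node names are placeholders, and the port list assumes a default install (7000 inter-node storage, 9042 native protocol, 9160 Thrift):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) {
            // Placeholder node names: one or more nodes from each datacenter.
            String[] nodes = {"dc1-node1", "dc2-node1", "dc3-node1"};
            // 7000 = inter-node (storage), 9042 = CQL native, 9160 = Thrift.
            int[] ports = {7000, 9042, 9160};
            for (String node : nodes) {
                for (int port : ports) {
                    Socket socket = new Socket();
                    try {
                        socket.connect(new InetSocketAddress(node, port), 2000);
                        System.out.printf("%s:%d reachable%n", node, port);
                    } catch (IOException e) {
                        System.out.printf("%s:%d NOT reachable (%s)%n", node, port, e);
                    } finally {
                        try { socket.close(); } catch (IOException ignored) { }
                    }
                }
            }
        }
    }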
Re: One node misbehaving (lots of GC), ideas?
Hi Erik,

Forgetting for a while that it's only a single row: does this node store any
super-long rows? The first things that come to my mind after reading your
e-mail are unthrottled compaction (sounds like a possible issue, but it
would affect other nodes too) or very large rows. Or a mix of both?

Maybe this will be of interest to you, regarding investigating GC issues and
pinning them down further (if you haven't seen it yet):
http://aryanet.com/blog/cassandra-garbage-collector-tuning

M.

Kind regards,
Michał Michalski, michal.michal...@boxever.com

On 15 April 2015 at 13:15, Erik Forsberg wrote:

> Hi!
>
> We are having problems with one node (out of 56 in total) misbehaving.
> Symptoms are:
>
> * High number of full CMS old space collections during early morning
>   when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
>   thrift insertions.
> * Really long stop-the-world GC events (I've seen up to 50 seconds) for
>   both CMS and ParNew.
> * CPU usage higher during early morning hours compared to other nodes.
> * The large number of garbage collections *seems* to correspond to doing
>   a lot of compactions (SizeTiered for most of our CFs, Leveled for a few
>   small ones).
> * Node losing track of what other nodes are up and keeping that state
>   until restart (this I think is a bug caused by the GC behaviour, with
>   the stop-the-world pauses making the node not accept gossip connections
>   from other nodes).
>
> This is on 2.0.13 with vnodes (256 per node).
>
> All other nodes behave normally, with a few (2-3) full CMS old space
> collections in the same 3h period in which the troublesome node does some
> 30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the problem
> was even worse (it seems; this is a bit hard to debug as it happens
> *almost* every night).
>
> nodetool status shows that although we have a certain imbalance in the
> cluster, this node is neither the most nor the least loaded. I.e. we have
> between 1.6% and 2.1% in the "Owns" column, and the troublesome node
> reports 1.7%.
>
> All nodes are under puppet control, so configuration is the same
> everywhere.
>
> We're running NetworkTopologyStrategy with rack awareness, and here's a
> deviation from recommended settings - we have a slightly varying number
> of nodes in the racks:
>
> 15 cssa01
> 15 cssa02
> 13 cssa03
> 13 cssa04
>
> The affected node is in the cssa04 rack. Could this mean I have some kind
> of hotspot situation? Why would that show up as more GC work?
>
> I'm quite puzzled here, so I'm looking for hints on how to identify what
> is causing this.
>
> Regards,
> \EF
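If it helps with pinning it down, the JVM's per-collector GC counters are exposed over JMX, so the suspect node can be compared directly against a healthy one. A rough Java sketch; the host names are placeholders, and 7199 assumes the default Cassandra JMX port without authentication:

    import java.lang.management.GarbageCollectorMXBean;
    import java.util.Set;

    import javax.management.JMX;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class GcCompare {
        public static void main(String[] args) throws Exception {
            // Placeholder hosts: the suspect node plus a "healthy" one for comparison.
            String[] hosts = {"cssa04-suspect", "cssa01-healthy"};
            for (String host : hosts) {
                JMXServiceURL url = new JMXServiceURL(
                        "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
                JMXConnector connector = JMXConnectorFactory.connect(url);
                try {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    // One MBean per collector, e.g. ParNew and ConcurrentMarkSweep.
                    Set<ObjectName> names = mbs.queryNames(
                            new ObjectName("java.lang:type=GarbageCollector,name=*"), null);
                    for (ObjectName name : names) {
                        GarbageCollectorMXBean gc =
                                JMX.newMXBeanProxy(mbs, name, GarbageCollectorMXBean.class);
                        System.out.printf("%s %s: collections=%d, total time=%d ms%n",
                                host, gc.getName(), gc.getCollectionCount(),
                                gc.getCollectionTime());
                    }
                } finally {
                    connector.close();
                }
            }
        }
    }

Sampling these counters every few minutes overnight would show whether the extra collections line up with the compaction activity you mentioned.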
Re: Delete-only work loads crash Cassandra
I can readily reproduce the bug, and filed a JIRA ticket:
https://issues.apache.org/jira/browse/CASSANDRA-9194

I’m posting it here for posterity.

On Apr 13, 2015, at 11:59 AM, Robert Wille <rwi...@fold3.com> wrote:

Unfortunately, I’ve switched email systems and don’t have my emails from
that time period. I did not file a Jira, and I don’t remember who made the
patch for me or if he filed a Jira on my behalf. I vaguely recall seeing
the fix in the Cassandra change logs, but I just went and read them and I
don’t see it. I’m probably remembering wrong. My suspicion is that the
original patch did not make it into the main branch, and I have just always
had enough concurrent writing to keep Cassandra happy. Hopefully the author
of the patch will read this and be able to chime in.

This issue is very reproducible. I’ll try to come up with some time to
write a simple program that illustrates the problem and file a Jira.

Thanks

Robert

On Apr 13, 2015, at 10:39 AM, Philip Thompson <philip.thomp...@datastax.com> wrote:

Did the original patch make it into upstream? That's unclear. If so, what
was the JIRA #? Have you filed a JIRA for the new problem?

On Mon, Apr 13, 2015 at 12:21 PM, Robert Wille <rwi...@fold3.com> wrote:

Back in 2.0.4 or 2.0.5 I ran into a problem with delete-only workloads. If
I did lots of deletes and no upserts, Cassandra would report that the
memtable was 0 bytes because of an accounting error. The memtable would
never flush and Cassandra would eventually die. Someone was kind enough to
create a patch, which seemed to have fixed the problem, but last night it
reared its ugly head again. I’m now running 2.0.14.

I ran a cleanup process on my cluster (10 nodes, RF=3, CL=1). The workload
was pretty light, because this cleanup process is single-threaded and does
everything synchronously. It was performing 4 reads per second and about
3000 deletes per second. Over the course of many hours, heap slowly grew on
all nodes. CPU utilization also increased as GC consumed an ever-increasing
amount of time. Eventually a couple of nodes shed 3.5 GB of their 7.5 GB.
Other nodes weren’t so fortunate and started flapping due to 30-second GC
pauses.

The workaround is pretty simple. This cleanup process can simply write a
dummy record with a TTL periodically so that Cassandra can flush its
memtables and function properly. However, I think this probably ought to be
fixed. Delete-only workloads can’t be that rare. I can’t be the only one
that needs to go through and clean up their tables.

Robert
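For anyone hitting the same thing before a fix lands, here is a rough sketch of the dummy-write workaround using the DataStax Java driver (2.x); the keyspace, table, and column names are made up, and the TTL and interval values are arbitrary:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class DummyWriter {
        public static void main(String[] args) throws InterruptedException {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            // Hypothetical keyspace/table; any table the cleanup process can write to works.
            Session session = cluster.connect("my_keyspace");
            try {
                while (true) {
                    // A periodic upsert gives the memtable a non-zero accounted size so it
                    // can flush; the TTL makes the dummy row clean itself up afterwards.
                    session.execute("INSERT INTO heartbeat (id, note) "
                            + "VALUES ('dummy', 'keepalive') USING TTL 600");
                    Thread.sleep(60000); // once a minute, alongside the delete-only workload
                }
            } finally {
                cluster.close();
            }
        }
    }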
One node misbehaving (lots of GC), ideas?
Hi!

We are having problems with one node (out of 56 in total) misbehaving.
Symptoms are:

* High number of full CMS old space collections during early morning when
  we're doing bulkloads. Yes, bulkloads, not CQL, and only a few thrift
  insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
  both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of garbage collections *seems* to correspond to doing a
  lot of compactions (SizeTiered for most of our CFs, Leveled for a few
  small ones).
* Node losing track of what other nodes are up and keeping that state
  until restart (this I think is a bug caused by the GC behaviour, with the
  stop-the-world pauses making the node not accept gossip connections from
  other nodes).

This is on 2.0.13 with vnodes (256 per node).

All other nodes behave normally, with a few (2-3) full CMS old space
collections in the same 3h period in which the troublesome node does some
30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the problem
was even worse (it seems; this is a bit hard to debug as it happens
*almost* every night).

nodetool status shows that although we have a certain imbalance in the
cluster, this node is neither the most nor the least loaded. I.e. we have
between 1.6% and 2.1% in the "Owns" column, and the troublesome node
reports 1.7%.

All nodes are under puppet control, so configuration is the same
everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings - we have a slightly varying number of
nodes in the racks:

15 cssa01
15 cssa02
13 cssa03
13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some kind
of hotspot situation? Why would that show up as more GC work?

I'm quite puzzled here, so I'm looking for hints on how to identify what is
causing this.

Regards,
\EF