Compaction is backed up; that may be normal write load (because of the rack imbalance), or it may be a secondary index build. Hard to say for sure. nodetool compactionstats would help, if you're able to provide it. The jstack is probably not necessary: streaming is being marked as failed and the node is turning itself off. Not sure why streaming is marked as failing, though; anything on the sending sides?
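(For anyone following along, those checks are all stock nodetool/shell commands; something like the following on the joining node and on the nodes sending to it. The log path is the package-install default and may differ on your hosts.)

    nodetool compactionstats   # pending/active compactions; secondary index builds generally show up here too
    nodetool tpstats           # thread pool backlog; watch CompactionExecutor Active/Pending
    nodetool netstats          # streaming progress; run on both the sender and the receiver
    grep -i stream /var/log/cassandra/system.log | tail -n 50   # recent stream/session errors on the sending side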
From: Brian Spindler <brian.spind...@gmail.com>
Reply-To: <user@cassandra.apache.org>
Date: Saturday, August 12, 2017 at 6:34 PM
To: <user@cassandra.apache.org>
Subject: Re: Dropping down replication factor

Thanks for replying Jeff. Responses below.

On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:

> Answers inline
>
> --
> Jeff Jirsa
>
>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>> >
>> > Hi folks, hopefully a quick one:
>> >
>> > We are running a 12 node cluster (2.1.15) in AWS with Ec2Snitch. It's all in one region but spread across 3 availability zones. It was nicely balanced with 4 nodes in each.
>> >
>> > But with a couple of failures and subsequent provisions to the wrong az we now have a cluster with:
>> >
>> > 5 nodes in az A
>> > 5 nodes in az B
>> > 2 nodes in az C
>> >
>> > Not sure why, but when adding a third node in AZ C it fails to stream after getting all the way to completion and no apparent error in logs. I've looked at a couple of bugs referring to scrubbing and possible OOM bugs due to metadata writing at end of streaming (sorry don't have ticket handy). I'm worried I might not be able to do much with these since the disk space usage is high and they are under a lot of load given the small number of them for this rack.
>
> You'll definitely have higher load on az C instances with rf=3 in this ratio
>
> Streaming should still work - are you sure it's not busy doing something? Like building secondary index or similar? jstack thread dump would be useful, or at least nodetool tpstats
>

Only other thing might be a backup. We do incrementals x1hr and snapshots x24h; they are shipped to s3 then links are cleaned up.

The error I get on the node I'm trying to add to rack C is:

ERROR [main] 2017-08-12 23:54:51,546 CassandraDaemon.java:583 - Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:391) [apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:566) [apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
        at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
        at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
        at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:209) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:185) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:413) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:700) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:661) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:179) ~[apache-cassandra-2.1.15.jar:2.1.15]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_112]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_112]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_112]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
WARN  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582 Gossiper.java:1462 - No local state or state is in silent shutdown, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582 MessagingService.java:734 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.40.17.114] 2017-08-12 23:54:51,583 MessagingService.java:1020 - MessagingService has terminated the accept() thread

And I got this on this same node when it was bootstrapping; I ran 'nodetool netstats' just before it shut down:

Receiving 377 files, 161928296443 bytes total. Already received 377 files, 161928296443 bytes total

tpstats on the host that was streaming the data to this node:

Pool Name                    Active   Pending      Completed   Blocked  All time blocked
MutationStage                     1         1     4488289014         0                 0
ReadStage                         0         0       24486526         0                 0
RequestResponseStage              0         0     3038847374         0                 0
ReadRepairStage                   0         0        1601576         0                 0
CounterMutationStage              0         0          68403         0                 0
MiscStage                         0         0              0         0                 0
AntiEntropySessions               0         0              0         0                 0
HintedHandoff                     0         0             18         0                 0
GossipStage                       0         0        2786892         0                 0
CacheCleanupExecutor              0         0              0         0                 0
InternalResponseStage             0         0          61115         0                 0
CommitLogArchiver                 0         0              0         0                 0
CompactionExecutor                4        83         304167         0                 0
ValidationExecutor                0         0          78249         0                 0
MigrationStage                    0         0          94201         0                 0
AntiEntropyStage                  0         0         160505         0                 0
PendingRangeCalculator            0         0             30         0                 0
Sampler                           0         0              0         0                 0
MemtableFlushWriter               0         0          71270         0                 0
MemtablePostFlush                 0         0         175209         0                 0
MemtableReclaimMemory             0         0          81222         0                 0
Native-Transport-Requests         2         0     1983565628         0           9405444

Message type           Dropped
READ                       218
RANGE_SLICE                 15
_TRACE                       0
MUTATION               2949001
COUNTER_MUTATION             0
BINARY                       0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR               8571

I can get a jstack if needed.

>
>> > Rather than troubleshoot this further, what I was thinking about doing was:
>> > - drop the replication factor on our keyspace to two
>
> Repair before you do this, or you'll lose your consistency guarantees

Given the load on the 2 nodes in rack C I'm hoping a repair will succeed.
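(For concreteness, the repair-then-drop-RF step is all standard tooling; roughly the following, where "my_keyspace" and the data-center name are placeholders and NetworkTopologyStrategy is assumed.)

    # 1. Make the existing replicas consistent while RF is still 3 (run on every node, one at a time)
    nodetool repair -pr my_keyspace

    # 2. Drop the replication factor from 3 to 2
    cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 2};"

    # 3. Remove the data each node no longer owns
    nodetool cleanup my_keyspace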
>
>> > - hopefully this would reduce load on these two remaining nodes
>
> It should; rack awareness guarantees one replica per rack if rf==num racks, so right now those 2 C machines have 2.5x as much data as the others. This will drop that requirement and drop the load significantly
>
>> > - run repairs/cleanup across the cluster
>> > - then shoot these two nodes in the 'c' rack
>
> Why shoot the c instances? Why not drop RF and then add 2 more C instances, then increase RF back to 3, run repair, then decom the extra instances in a and b?
>

Fair point. I was considering staying at RF two but I think with your points below, I should reconsider.

>> > - run repairs/cleanup across the cluster
>> >
>> > Would this work with minimal/no disruption?
>
> The big risk of running rf=2 is that quorum==all - any gc pause or node restarting will make you lose HA or strong consistency guarantees.
>
>> > Should I update their "rack" beforehand or after?
>
> You can't change a node's rack once it's in the cluster; it SHOULD refuse to start if you do that
>

Got it.

>> > What else am I not thinking about?
>> >
>> > My main goal atm is to get back to where the cluster is in a clean consistent state that allows nodes to properly bootstrap.
>> >
>> > Thanks for your help in advance.
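(And if you end up going the route Jeff suggests - add two more C nodes, move back to RF 3, then retire the extras in A and B - the command side of that is roughly the following, again with placeholder keyspace/DC names.)

    # After the two new az-C nodes have finished bootstrapping:
    cqlsh -e "ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};"
    nodetool repair -pr my_keyspace   # on every node, so the restored third replica gets its data
    nodetool decommission             # on each surplus A/B node, one at a time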