I should also note that I see nodes lock up without that exception
appearing. The GossipStage buildup, however, does seem correlated with
gossip activity, e.g. when I restart a different node.
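
For anyone who wants to watch for the same backlog, here is a minimal
sketch in Python (not part of our tooling) that parses nodetool tpstats
text and lists the pools whose Pending count is at or above a threshold.
The names parse_tpstats and backed_up and the threshold of 1000 are made
up for illustration; the sample rows are copied from the output quoted
below.

```python
import re

# One pool row: a name followed by five integer columns
# (Active, Pending, Completed, Blocked, All time blocked).
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")

FIELDS = ("active", "pending", "completed", "blocked", "all_time_blocked")


def parse_tpstats(text):
    """Map each pool name to a dict of its five counters."""
    pools = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, blank line, or dropped-message row
        name, *counts = match.groups()
        pools[name] = dict(zip(FIELDS, map(int, counts)))
    return pools


def backed_up(pools, pending_threshold=1000):
    """Return sorted names of pools with Pending >= pending_threshold."""
    return sorted(
        name for name, stats in pools.items()
        if stats["pending"] >= pending_threshold
    )


# Sample rows copied from the tpstats output in this thread.
SAMPLE = """\
Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0         582103         0                 0
MutationStage                         32   4593678       55057393         0                 0
GossipStage                            1      2818         371487         0                 0
MigrationStage                         1      1545            586         0                 0

Message type           Dropped
READ                         0
"""

if __name__ == "__main__":
    print(backed_up(parse_tpstats(SAMPLE)))
```

In practice the text would come from running nodetool tpstats on each
node at intervals; comparing successive Pending values would show whether
a stage is draining or stuck.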

On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder <dkin...@turnitin.com> wrote:

> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
>         at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
>         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Does anyone have thoughts on this? It's pretty puzzling, and it causes
> nodes to lock up. Once it happens Cassandra can't even shut down; I have
> to kill -9. If I can't find a resolution I'm going to need to downgrade
> and restore from backup...
>
> The only issue I found that looked similar is
> https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
> be fixed by 3.10.
>
>
> $ nodetool tpstats
>
> Pool Name                         Active   Pending      Completed   Blocked  All time blocked
> ReadStage                              0         0         582103         0                 0
> MiscStage                              0         0              0         0                 0
> CompactionExecutor                    11        11           2868         0                 0
> MutationStage                         32   4593678       55057393         0                 0
> GossipStage                            1      2818         371487         0                 0
> RequestResponseStage                   0         0        4345522         0                 0
> ReadRepairStage                        0         0         151473         0                 0
> CounterMutationStage                   0         0              0         0                 0
> MemtableFlushWriter                    1        81             76         0                 0
> MemtablePostFlush                      1       382            139         0                 0
> ValidationExecutor                     0         0              0         0                 0
> ViewMutationStage                      0         0              0         0                 0
> CacheCleanupExecutor                   0         0              0         0                 0
> PerDiskMemtableFlushWriter_10          0         0             69         0                 0
> PerDiskMemtableFlushWriter_11          0         0             69         0                 0
> MemtableReclaimMemory                  0         0             81         0                 0
> PendingRangeCalculator                 0         0             32         0                 0
> SecondaryIndexManagement               0         0              0         0                 0
> HintsDispatcher                        0         0            596         0                 0
> PerDiskMemtableFlushWriter_1           0         0             69         0                 0
> Native-Transport-Requests             11         0        4547746         0                67
> PerDiskMemtableFlushWriter_2           0         0             69         0                 0
> MigrationStage                         1      1545            586         0                 0
> PerDiskMemtableFlushWriter_0           0         0             80         0                 0
> Sampler                                0         0              0         0                 0
> PerDiskMemtableFlushWriter_5           0         0             69         0                 0
> InternalResponseStage                  0         0          45432         0                 0
> PerDiskMemtableFlushWriter_6           0         0             69         0                 0
> PerDiskMemtableFlushWriter_3           0         0             69         0                 0
> PerDiskMemtableFlushWriter_4           0         0             69         0                 0
> PerDiskMemtableFlushWriter_9           0         0             69         0                 0
> AntiEntropyStage                       0         0              0         0                 0
> PerDiskMemtableFlushWriter_7           0         0             69         0                 0
> PerDiskMemtableFlushWriter_8           0         0             69         0                 0
>
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> HINT                         0
> MUTATION                     0
> COUNTER_MUTATION             0
> BATCH_STORE                  0
> BATCH_REMOVE                 0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
>
>
> -dan
>



-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
