Dan,

Do you see any major GC activity? We have been hit by the following memory
leak in our load test environment with 3.11.0:
https://issues.apache.org/jira/browse/CASSANDRA-13754

So, depending on the heap size and uptime, you might get into heap troubles.

Thomas

From: Dan Kinder [mailto:dkin...@turnitin.com]
Sent: Thursday, September 28, 2017 18:20
To: user@cassandra.apache.org
Subject:


Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the 
following. The cluster does function, for a while, but then some stages begin 
to back up and the node does not recover and does not drain the tasks, even 
under no load. This happens both to MutationStage and GossipStage.

I do see the following exception happen in the logs:



ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
        at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]



But it's hard to correlate precisely with things going bad. It is also very 
strange to me since I have both read_repair_chance and 
dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is confusing 
why ReadRepairStage would err.

Does anyone have thoughts on this? It's pretty puzzling, and it causes nodes 
to lock up. Once it happens Cassandra can't even shut down; I have to kill -9. 
If I can't find a resolution I'm going to need to downgrade and restore from 
backup...

The only issue I found that looked similar is 
https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to be 
fixed by 3.10.



$ nodetool tpstats

Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0         582103         0                 0
MiscStage                              0         0              0         0                 0
CompactionExecutor                    11        11           2868         0                 0
MutationStage                         32   4593678       55057393         0                 0
GossipStage                            1      2818         371487         0                 0
RequestResponseStage                   0         0        4345522         0                 0
ReadRepairStage                        0         0         151473         0                 0
CounterMutationStage                   0         0              0         0                 0
MemtableFlushWriter                    1        81             76         0                 0
MemtablePostFlush                      1       382            139         0                 0
ValidationExecutor                     0         0              0         0                 0
ViewMutationStage                      0         0              0         0                 0
CacheCleanupExecutor                   0         0              0         0                 0
PerDiskMemtableFlushWriter_10          0         0             69         0                 0
PerDiskMemtableFlushWriter_11          0         0             69         0                 0
MemtableReclaimMemory                  0         0             81         0                 0
PendingRangeCalculator                 0         0             32         0                 0
SecondaryIndexManagement               0         0              0         0                 0
HintsDispatcher                        0         0            596         0                 0
PerDiskMemtableFlushWriter_1           0         0             69         0                 0
Native-Transport-Requests             11         0        4547746         0                67
PerDiskMemtableFlushWriter_2           0         0             69         0                 0
MigrationStage                         1      1545            586         0                 0
PerDiskMemtableFlushWriter_0           0         0             80         0                 0
Sampler                                0         0              0         0                 0
PerDiskMemtableFlushWriter_5           0         0             69         0                 0
InternalResponseStage                  0         0          45432         0                 0
PerDiskMemtableFlushWriter_6           0         0             69         0                 0
PerDiskMemtableFlushWriter_3           0         0             69         0                 0
PerDiskMemtableFlushWriter_4           0         0             69         0                 0
PerDiskMemtableFlushWriter_9           0         0             69         0                 0
AntiEntropyStage                       0         0              0         0                 0
PerDiskMemtableFlushWriter_7           0         0             69         0                 0
PerDiskMemtableFlushWriter_8           0         0             69         0                 0

Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0
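
To pick the backed-up stages out of tpstats output like the above, a small script can flag any pool whose Pending count exceeds a threshold. This is only a rough sketch and not part of nodetool: the function name backed_up_pools and the threshold of 1000 are made up for illustration, and it assumes each pool row sits on a single line with six whitespace-separated columns. The sample rows are copied from the output above.

```python
import re

# A pool row: name, Active, Pending, Completed, Blocked, All time blocked.
# The header and the "Message type / Dropped" rows do not match this shape.
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")


def backed_up_pools(tpstats_text, pending_threshold=1000):
    """Return {pool_name: pending} for pools with Pending above the threshold.

    Hypothetical helper; the threshold is an arbitrary illustrative default.
    """
    flagged = {}
    for line in tpstats_text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # header, blank line, or dropped-message row
        pending = int(match.group(3))
        if pending > pending_threshold:
            flagged[match.group(1)] = pending
    return flagged


# Rows copied from the tpstats output in this message.
SAMPLE = """\
Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0         582103         0                 0
MutationStage                         32   4593678       55057393         0                 0
GossipStage                            1      2818         371487         0                 0
MemtablePostFlush                      1       382            139         0                 0
MigrationStage                         1      1545            586         0                 0
"""

if __name__ == "__main__":
    ranked = sorted(backed_up_pools(SAMPLE).items(), key=lambda kv: -kv[1])
    for pool, pending in ranked:
        print(f"{pool}: {pending} pending")
```

On the sample rows this flags MutationStage, GossipStage and MigrationStage; running it against periodic tpstats snapshots would show whether Pending keeps growing or ever drains.
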


-dan
