Thanks for the responses.

@Prem yes, this is after the entire cluster was upgraded to 3.11, but no, I
have not run upgradesstables yet.

@Thomas no I don't see any major GC going on.

@Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
bring it back (thankfully this cluster is not serving live traffic). The
nodes seemed okay for an hour or two, but I see the issue again, without me
bouncing any nodes. This time it's ReadStage that's building up, and the
exception I'm seeing in the logs is:

DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(6150926370328526396, 696a6374652e6f7267) (2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
        at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_71]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]
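
In case it helps anyone reading the trace: the second field of DecoratedKey(...) is the raw partition key printed as hex, so it can be decoded to see which partition hit the digest mismatch. A minimal sketch, assuming the key is a single UTF-8 text column (a composite or non-text key would need different handling); the helper name decode_partition_key is just made up for illustration:

```python
def decode_partition_key(hex_key: str) -> str:
    """Decode the hex field of a DecoratedKey.

    Assumption: the partition key is a single text column encoded as UTF-8.
    """
    return bytes.fromhex(hex_key).decode("utf-8")


# Key taken from the DigestMismatchException above.
print(decode_partition_key("696a6374652e6f7267"))  # prints: ijcte.org
```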


Do you think running upgradesstables would help? Or relocatesstables? I
had assumed neither is required for Cassandra to function and that both
are just optimizations.
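
For anyone who wants to watch for the same backlog, here is a rough sketch of how the tpstats output (like the one quoted further down) could be scanned for pools with a large Pending count. The function name backed_up_pools and the threshold of 1000 are arbitrary choices for illustration, not anything Cassandra defines, and the parsing assumes the usual column order of name, Active, Pending, Completed:

```python
def backed_up_pools(tpstats_text, pending_threshold=1000):
    """Return {pool_name: pending} for pools whose Pending count exceeds the threshold."""
    flagged = {}
    for line in tpstats_text.splitlines():
        # Drop mail quoting ("> ") in case the output was pasted from a quoted email.
        fields = line.lstrip("> ").split()
        # Pool rows start with a name followed by numeric Active, Pending, Completed.
        # Headers, wrapped continuation lines, and the dropped-message rows are skipped.
        if len(fields) < 4 or not all(f.isdigit() for f in fields[1:4]):
            continue
        name, pending = fields[0], int(fields[2])
        if pending > pending_threshold:
            flagged[name] = pending
    return flagged


# A few rows copied from the tpstats output quoted below.
sample = """\
Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0         582103         0                 0
MutationStage                         32   4593678       55057393         0                 0
GossipStage                            1      2818         371487         0                 0
MigrationStage                         1      1545            586         0                 0
"""

print(backed_up_pools(sample))
```

On those sample rows it flags MutationStage, GossipStage, and MigrationStage, which are the pools with Pending counts above 1000.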

On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Dan,
>
>
>
> do you see any major GC? We have been hit by the following memory leak in
> our loadtest environment with 3.11.0.
>
> https://issues.apache.org/jira/browse/CASSANDRA-13754
>
>
>
> So, depending on the heap size and uptime, you might get into heap
> troubles.
>
>
>
> Thomas
>
>
>
> *From:* Dan Kinder [mailto:dkin...@turnitin.com]
> *Sent:* Donnerstag, 28. September 2017 18:20
> *To:* user@cassandra.apache.org
> *Subject:*
>
>
>
> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
>         at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
>         at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
>         at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables, so it is
> confusing why ReadRepairStage would throw errors at all.
>
> Anyone have thoughts on this? It's pretty puzzling, and it causes nodes to
> lock up. Once it happens Cassandra can't even shut down cleanly; I have to
> kill -9. If I can't find a resolution I'm going to need to downgrade and
> restore from backup...
>
> The only issue I found that looked similar is
> https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
> be fixed as of 3.10.
>
>
>
> $ nodetool tpstats
>
> Pool Name                         Active   Pending      Completed   Blocked  All time blocked
> ReadStage                              0         0         582103         0                 0
> MiscStage                              0         0              0         0                 0
> CompactionExecutor                    11        11           2868         0                 0
> MutationStage                         32   4593678       55057393         0                 0
> GossipStage                            1      2818         371487         0                 0
> RequestResponseStage                   0         0        4345522         0                 0
> ReadRepairStage                        0         0         151473         0                 0
> CounterMutationStage                   0         0              0         0                 0
> MemtableFlushWriter                    1        81             76         0                 0
> MemtablePostFlush                      1       382            139         0                 0
> ValidationExecutor                     0         0              0         0                 0
> ViewMutationStage                      0         0              0         0                 0
> CacheCleanupExecutor                   0         0              0         0                 0
> PerDiskMemtableFlushWriter_10          0         0             69         0                 0
> PerDiskMemtableFlushWriter_11          0         0             69         0                 0
> MemtableReclaimMemory                  0         0             81         0                 0
> PendingRangeCalculator                 0         0             32         0                 0
> SecondaryIndexManagement               0         0              0         0                 0
> HintsDispatcher                        0         0            596         0                 0
> PerDiskMemtableFlushWriter_1           0         0             69         0                 0
> Native-Transport-Requests             11         0        4547746         0                67
> PerDiskMemtableFlushWriter_2           0         0             69         0                 0
> MigrationStage                         1      1545            586         0                 0
> PerDiskMemtableFlushWriter_0           0         0             80         0                 0
> Sampler                                0         0              0         0                 0
> PerDiskMemtableFlushWriter_5           0         0             69         0                 0
> InternalResponseStage                  0         0          45432         0                 0
> PerDiskMemtableFlushWriter_6           0         0             69         0                 0
> PerDiskMemtableFlushWriter_3           0         0             69         0                 0
> PerDiskMemtableFlushWriter_4           0         0             69         0                 0
> PerDiskMemtableFlushWriter_9           0         0             69         0                 0
> AntiEntropyStage                       0         0              0         0                 0
> PerDiskMemtableFlushWriter_7           0         0             69         0                 0
> PerDiskMemtableFlushWriter_8           0         0             69         0                 0
>
>
>
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> HINT                         0
> MUTATION                     0
> COUNTER_MUTATION             0
> BATCH_STORE                  0
> BATCH_REMOVE                 0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
>
>
>
> -dan
>



-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com
