Is it intentional that some of your nodes have varying numbers of tokens?
Your nodetool status output shows token counts ranging from 4 to 200, so
data ownership, and therefore load, is spread very unevenly across the cluster.
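If the uneven spread is unintentional, note that num_tokens is only read
when a node first bootstraps, so evening things out means replacing or
re-bootstrapping the odd nodes rather than editing a live config. For
reference, the setting in cassandra.yaml (the value below is purely
illustrative, not a recommendation for your cluster):

    # cassandra.yaml -- only read at first bootstrap; changing it on an
    # existing node won't take effect (the node must be replaced/rebuilt)
    num_tokens: 16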

It seems like some of your nodes are overloaded, potentially at least #RF
of them. If nodes are heavily overloaded, GC tuning generally won't help
much; you're best off starting by reducing load or increasing capacity.
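Before spending more time on GC, it's worth confirming which nodes are
actually hot. A quick first pass with standard nodetool commands, run on
each node (just a sketch, adjust to taste):

    nodetool tpstats           # pending/blocked tasks and dropped messages per stage
    nodetool compactionstats   # pending compaction backlog
    nodetool gcstats           # GC pause totals since the last invocation

Large pending counts on MutationStage/CounterMutationStage on the
200-token nodes would line up with the WriteTimeoutExceptions below.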

raft.so - Cassandra consulting, support, and managed services


On Tue, May 11, 2021 at 7:44 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi all - I'm getting the following error on RC1:
>
> WARN  [Messaging-EventLoop-3-23] 2021-05-10 17:29:12,431 NoSpamLogger.java:95 - /172.16.100.39:7000->/172.16.100.248:7000-URGENT_MESSAGES-e8d21588 dropping message of type FAILURE_RSP whose timeout expired before reaching the network
> ERROR [CounterMutationStage-62] 2021-05-10 17:29:12,431 AbstractLocalAwareExecutorService.java:166 - Uncaught exception on thread Thread[CounterMutationStage-62,5,main]
> java.lang.RuntimeException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2278)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
>         at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
>         at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
>         at org.apache.cassandra.db.CounterMutation.grabCounterLocks(CounterMutation.java:162)
>         at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:131)
>         at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:1678)
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2274)
>         ... 6 common frames omitted
>
> This happens under load.
>
> I'm also seeing a lot of these messages:
>
> WARN  [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:319 - Not marking nodes down due to local pause of 5785753812ns > 5000000000ns
> DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
> DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
> DEBUG [GossipTasks:1] 2021-05-10 17:30:20,969 FailureDetector.java:325 - Still not marking nodes down due to local pause
>
> The other messages are slow queries like:
> SELECT mediatype, origvalue FROM doc.origdoc WHERE uuid = DS_5_2021-05-08T06-53-41.442Z_Hi0ywdNE LIMIT 1>, time 1370 msec - slow timeout 500 msec
>
> I've tried switching to the G1 garbage collector (Java 11), and that did
> reduce these times (I was seeing over 5000 msec).  The above select
> statement is on a table where uuid is the primary key.
>
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  172.16.100.208  9.16 GiB   30      9.3%              2529b6ed-cdb2-43c2-bdd7-171cfe308bd3  rack1
> UN  172.16.100.249  60.69 GiB  200     62.9%             49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
> UN  172.16.100.36   61.16 GiB  200     62.9%             d9702f96-256e-45ae-8e12-69a42712be50  rack1
> UN  172.16.100.39   61.07 GiB  200     63.0%             93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
> UN  172.16.100.253  1.24 GiB   4       1.3%              a1a16910-9167-4174-b34b-eb859d36347e  rack1
> UN  172.16.100.248  60.35 GiB  200     62.9%             4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
> UN  172.16.100.37   37.18 GiB  120     37.7%             08a19658-40be-4e55-8709-812b3d4ac750  rack1
>
> nodetool tablestats doc.origdoc
> Total number of tables: 74
> ----------------
> Keyspace : doc
>         Read Count: 37511
>         Read Latency: 33.929465116899046 ms
>         Write Count: 4604965
>         Write Latency: 0.20405303102195133 ms
>         Pending Flushes: 0
>                 Table: origdoc
>                 SSTable count: 85
>                 Old SSTable count: 0
>                 Space used (live): 54635707180
>                 Space used (total): 54635707180
>                 Space used by snapshots (total): 0
>                 Off heap memory used (total): 258773554
>                 SSTable Compression Ratio: 0.33099344385825985
>                 Number of partitions (estimate): 114982637
>                 Memtable cell count: 0
>                 Memtable data size: 0
>                 Memtable off heap memory used: 0
>                 Memtable switch count: 0
>                 Local read count: 5749
>                 Local read latency: 240.422 ms
>                 Local write count: 0
>                 Local write latency: NaN ms
>                 Pending flushes: 0
>                 Percent repaired: 0.01
>                 Bloom filter false positives: 16
>                 Bloom filter false ratio: 0.00000
>                 Bloom filter space used: 141861208
>                 Bloom filter off heap memory used: 141860528
>                 Index summary off heap memory used: 44391250
>                 Compression metadata off heap memory used: 72521776
>                 Compacted partition minimum bytes: 259
>                 Compacted partition maximum bytes: 4768
>                 Compacted partition mean bytes: 1366
>                 Average live cells per slice (last five minutes): 1.0
>                 Maximum live cells per slice (last five minutes): 1
>                 Average tombstones per slice (last five minutes): 1.0
>                 Maximum tombstones per slice (last five minutes): 1
>                 Dropped Mutations: 0
>
> Things to check?  Things to try?
>
> Thanks!
>
> -Joe
>
>
