Have you tried increasing concurrent reads until you see more disk activity? If you've constantly got 32 active reads and high pending reads, it could just be dropping reads because the queues are saturated. That would be an artificial bottleneck at the C* process level.
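If it helps, here's a rough sketch of a poller for those thread-pool metrics over JMX (untested, and it assumes the default unauthenticated JMX endpoint on localhost:7199; the class name is just for illustration), so you can watch the read stage and the Native-Transport-Requests counter I mention below across one of these incidents:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TpStatsWatcher {
    public static void main(String[] args) throws Exception {
        // Default Cassandra JMX endpoint; adjust host/port (and add credentials) if yours differs.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName readActive = new ObjectName(
                "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=ActiveTasks");
            ObjectName readPending = new ObjectName(
                "org.apache.cassandra.metrics:type=ThreadPools,path=request,scope=ReadStage,name=PendingTasks");
            ObjectName ntrBlocked = new ObjectName(
                "org.apache.cassandra.metrics:type=ThreadPools,path=transport,scope=Native-Transport-Requests,name=TotalBlockedTasks");
            // ActiveTasks/PendingTasks are gauges (attribute "Value"); TotalBlockedTasks is a
            // counter (attribute "Count"). If ActiveTasks sits pinned at concurrent_reads
            // (32 by default) while PendingTasks stays in the thousands, the read stage
            // itself is the bottleneck rather than the disks.
            while (true) {
                System.out.printf("read active=%s pending=%s ntr blocked=%s%n",
                        mbs.getAttribute(readActive, "Value"),
                        mbs.getAttribute(readPending, "Value"),
                        mbs.getAttribute(ntrBlocked, "Count"));
                Thread.sleep(5000);
            }
        }
    }
}

nodetool tpstats reports the same numbers (Active/Pending/All time blocked per pool) if you'd rather just sample it from cron.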
Also, what does this metric show over time?

org.apache.cassandra.metrics.type=ThreadPools.path=transport.scope=Native-Transport-Requests.name=TotalBlockedTasks.Count

On Thu, Jun 27, 2019 at 1:52 AM Dmitry Simonov <dimmobor...@gmail.com> wrote:
> Hello!
>
> We've run into the following problem several times.
>
> The Cassandra cluster (5 nodes) becomes unresponsive for ~30 minutes:
> - all CPUs have 100% load (normally we have LA 5 on a 16-core machine)
> - Cassandra's thread count rises from 300 to 1300 - 2000; most of them
> are Thrift threads in the java.net.SocketInputStream.socketRead0(Native
> Method) method, while the count of other threads doesn't increase
> - some Read messages are dropped
> - read latency (p99.9) increases to 20-30 seconds
> - there are up to 32 active Read Tasks and up to 3k - 6k pending Read Tasks
>
> The problem starts synchronously on all nodes of the cluster.
> I cannot tie this problem to increased load from clients (the "read rate"
> doesn't increase during the problem).
> It also looks like there is no problem with the disks (I/O latencies are OK).
>
> Could anybody please give some advice on further troubleshooting?
>
> --
> Best Regards,
> Dmitry Simonov

--
www.vorstella.com
408 691 8402