[ https://issues.apache.org/jira/browse/CASSANDRA-11340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240140#comment-15240140 ]
Jeff Jirsa commented on CASSANDRA-11340: ---------------------------------------- Apologies for not being as diligent with followups as you deserve. I suspect it's not the size of the cluster, but the number (and rate) of clients reconnecting that triggers this. We're sitting at about 2k connected (tcp/9042 ESTABLISHED) clients per server. I do appreciate you trying to reproduce. I'm going to try again, as well. > Heavy read activity on system_auth tables can cause apparent livelock > --------------------------------------------------------------------- > > Key: CASSANDRA-11340 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11340 > Project: Cassandra > Issue Type: Bug > Reporter: Jeff Jirsa > Assignee: Aleksey Yeschenko > Attachments: mass_connect.py, prepare_mass_connect.py > > > Reproduced in at least 2.1.9. > It appears possible for queries against system_auth tables to trigger > speculative retry, which causes auth to block on traffic going off node. In > some cases, it appears possible for threads to become deadlocked, causing > load on the nodes to increase sharply. This happens even in clusters with RF > of system_auth == N, as all requests being served locally puts the bar for > 99% SR pretty low. > Incomplete stack trace below, but we haven't yet figured out what exactly is > blocking: > {code} > Thread 82291: (state = BLOCKED) > - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information > may be imprecise) > - java.util.concurrent.locks.LockSupport.parkNanos(long) @bci=11, line=338 > (Compiled frame) > - > org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(long) > @bci=28, line=307 (Compiled frame) > - org.apache.cassandra.utils.concurrent.SimpleCondition.await(long, > java.util.concurrent.TimeUnit) @bci=76, line=63 (Compiled frame) > - org.apache.cassandra.service.ReadCallback.await(long, > java.util.concurrent.TimeUnit) @bci=25, line=92 (Compiled frame) > - > org.apache.cassandra.service.AbstractReadExecutor$SpeculatingReadExecutor.maybeTryAdditionalReplicas() > @bci=39, line=281 (Compiled frame) > - org.apache.cassandra.service.StorageProxy.fetchRows(java.util.List, > org.apache.cassandra.db.ConsistencyLevel) @bci=175, line=1338 (Compiled frame) > - org.apache.cassandra.service.StorageProxy.readRegular(java.util.List, > org.apache.cassandra.db.ConsistencyLevel) @bci=9, line=1274 (Compiled frame) > - org.apache.cassandra.service.StorageProxy.read(java.util.List, > org.apache.cassandra.db.ConsistencyLevel, > org.apache.cassandra.service.ClientState) @bci=57, line=1199 (Compiled frame) > - > org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.pager.Pageable, > org.apache.cassandra.cql3.QueryOptions, int, long, > org.apache.cassandra.service.QueryState) @bci=35, line=272 (Compiled frame) > - > org.apache.cassandra.cql3.statements.SelectStatement.execute(org.apache.cassandra.service.QueryState, > org.apache.cassandra.cql3.QueryOptions) @bci=105, line=224 (Compiled frame) > - org.apache.cassandra.auth.Auth.selectUser(java.lang.String) @bci=27, > line=265 (Compiled frame) > - org.apache.cassandra.auth.Auth.isExistingUser(java.lang.String) @bci=1, > line=86 (Compiled frame) > - > org.apache.cassandra.service.ClientState.login(org.apache.cassandra.auth.AuthenticatedUser) > @bci=11, line=206 (Compiled frame) > - > org.apache.cassandra.transport.messages.AuthResponse.execute(org.apache.cassandra.service.QueryState) > @bci=58, line=82 (Compiled frame) > - > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, > org.apache.cassandra.transport.Message$Request) @bci=75, line=439 (Compiled > frame) > - > org.apache.cassandra.transport.Message$Dispatcher.channelRead0(io.netty.channel.ChannelHandlerContext, > java.lang.Object) @bci=6, line=335 (Compiled frame) > - > io.netty.channel.SimpleChannelInboundHandler.channelRead(io.netty.channel.ChannelHandlerContext, > java.lang.Object) @bci=17, line=105 (Compiled frame) > - > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(java.lang.Object) > @bci=9, line=333 (Compiled frame) > - > io.netty.channel.AbstractChannelHandlerContext.access$700(io.netty.channel.AbstractChannelHandlerContext, > java.lang.Object) @bci=2, line=32 (Compiled frame) > - io.netty.channel.AbstractChannelHandlerContext$8.run() @bci=8, line=324 > (Compiled frame) > - java.util.concurrent.Executors$RunnableAdapter.call() @bci=4, line=511 > (Compiled frame) > - > org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run() > @bci=5, line=164 (Compiled frame) > - org.apache.cassandra.concurrent.SEPWorker.run() @bci=87, line=105 > (Interpreted frame) > - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame) > {code} > In a cluster with many connected clients (potentially thousands), a > reconnection flood (for example, restarting all at once) is likely to trigger > this bug. However, it is unlikely to be seen in normal operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)