[ https://issues.apache.org/jira/browse/CASSANDRA-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14086659#comment-14086659 ]
Brandon Williams edited comment on CASSANDRA-7696 at 8/5/14 7:56 PM: --------------------------------------------------------------------- -I suspect this may somehow be related to CASSANDRA-6590- Nope, that would only affect 2.0. was (Author: brandon.williams): I suspect this may somehow be related to CASSANDRA-6590 > Fail to reconnect to other nodes after intermittent network failure (SSL + > auth enabled) > ---------------------------------------------------------------------------------------- > > Key: CASSANDRA-7696 > URL: https://issues.apache.org/jira/browse/CASSANDRA-7696 > Project: Cassandra > Issue Type: Bug > Components: Core > Environment: Cassandra 1.2.16, CentOS 6.5, Cluster spread to 3 > datacenters with 1 node each. SSL enabled between Cassandra nodes and also to > clients, authentization enabled. > Reporter: Jiri Horky > > We experienced a strange problem after intermittent network failure when > the affected node did not reconnect to the rest of the cluster but did > allow to autenticate users (which was not possible during the actual > network outage, see below). The cluster consists of 1 node in each of 3 > datacenters, it uses C* 1.2.16 with SSL enabled both to clients and > between C* nodes. The authentication is enabled as well. > The problem started around 2014-08-01 when Cassandra first noticed a > network problem: > INFO [GossipTasks:1] 2014-08-01 07:47:52,618 Gossiper.java (line 823) > InetAddress /A.B.D.20 is now DOWN > INFO [GossipTasks:1] 2014-08-01 07:47:55,619 Gossiper.java (line 823) > InetAddress mia10/A.B.C.20 is now DOWN > The network came up for a while: > INFO [GossipStage:1] 2014-08-01 07:51:29,380 Gossiper.java (line 809) > InetAddress /A.B.C.20 is now UP > INFO [HintedHandoff:1] 2014-08-01 07:51:29,381 > HintedHandOffManager.java (line 296) Started hinted handoff for host: > 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20 > INFO [GossipStage:1] 2014-08-01 07:51:29,381 Gossiper.java (line 809) > InetAddress /A.B.D.20 is now UP > INFO [HintedHandoff:2] 2014-08-01 07:51:29,385 > HintedHandOffManager.java (line 296) Started hinted handoff for host: > 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20 > But it failed to send hints: > INFO [HintedHandoff:1] 2014-08-01 07:51:39,389 > HintedHandOffManager.java (line 427) Timed out replaying hints to > /A.B.C.20; aborting (0 delivered) > INFO [HintedHandoff:2] 2014-08-01 07:51:39,390 > HintedHandOffManager.java (line 427) Timed out replaying hints to > /A.B.D.20; aborting (0 delivered) > Also, the log started to be "flooded" with failed autentication tries. > My understanding is that authentication data are read with QUORUM which > failed as the other two nodes were down: > ERROR [Native-Transport-Requests:446116] 2014-08-01 07:51:39,985 > QueryMessage.java (line 97) Unexpected error during query > com.google.common.util.concurrent.UncheckedExecutionException: > java.lang.RuntimeException: > org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed > out - received only 0 responses. > at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2258) > at com.google.common.cache.LocalCache.get(LocalCache.java:3990) > at > com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3994) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4878) > at > org.apache.cassandra.service.ClientState.authorize(ClientState.java:292) > at > org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:172) > at > org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:165) > at > org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:149) > at > org.apache.cassandra.cql3.statements.SelectStatement.checkAccess(SelectStatement.java:116) > at > org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:102) > at > org.apache.cassandra.cql3.QueryProcessor.process(QueryProcessor.java:113) > at > org.apache.cassandra.transport.messages.QueryMessage.execute(QueryMessage.java:87) > at > org.apache.cassandra.transport.Message$Dispatcher.messageReceived(Message.java:287) > at > org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) > at > org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) > at > org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) > at > org.jboss.netty.handler.execution.ChannelUpstreamEventRunnable.doRun(ChannelUpstreamEventRunnable.java:43) > at > org.jboss.netty.handler.execution.ChannelEventRunnable.run(ChannelEventRunnable.java:67) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > Caused by: java.lang.RuntimeException: > org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed > out - received only 0 responses. > at org.apache.cassandra.auth.Auth.selectUser(Auth.java:256) > at org.apache.cassandra.auth.Auth.isSuperuser(Auth.java:84) > at > org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:50) > at > org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:68) > at > org.apache.cassandra.service.ClientState$1.load(ClientState.java:278) > at > org.apache.cassandra.service.ClientState$1.load(ClientState.java:275) > at > com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3589) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2374) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337) > at > com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252) > ... 20 more > Caused by: org.apache.cassandra.exceptions.ReadTimeoutException: > Operation timed out - received only 0 responses. > at > org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105) > at > org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930) > at > org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815) > at > org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140) > at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245) > ... 29 more > These exception stopped to appear at 2014-08-01 08:59:17,48 (and did not > reapper after that), and cassandra seemed to be happy as it marked the > other two nodes as up for a while. But after a few tens of seconds, it > marked them as DOWN again and kept it in this state until restart three > days later, even the network connectivity was stable: > INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809) > InetAddress /A.B.C.20 is now UP > INFO [HintedHandoff:2] 2014-08-01 09:02:37,305 > HintedHandOffManager.java (line 296) Started hinted handoff for host: > 9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /A.B.C.20 > INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809) > InetAddress /A.B.D.20 is now UP > INFO [HintedHandoff:1] 2014-08-01 09:02:37,308 > HintedHandOffManager.java (line 296) Started hinted handoff for host: > 97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /A.B.D.20 > INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java > (line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968 > serialized/live bytes, 32 ops) > INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line > 398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live > bytes, 32 ops) > INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line > 443) Completed flushing; nothing needed to be retained. Commitlog > position was ReplayPosition(segmentId=1403712545417, position=20482758) > INFO [HintedHandoff:1] 2014-08-01 09:02:47,312 > HintedHandOffManager.java (line 427) Timed out replaying hints to > /A.B.D.20; aborting (0 delivered) > INFO [HintedHandoff:2] 2014-08-01 09:02:47,312 > HintedHandOffManager.java (line 427) Timed out replaying hints to > /A.B.C.20; aborting (0 delivered) > INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823) > InetAddress /A.B.D.20 is now DOWN > INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823) > InetAddress mia10/A.B.C.20 is now DOWN > After one day, the node started to discards hints it saved for the other > nodes. > WARN [OptionalTasks:1] 2014-08-02 09:09:10,277 > HintedHandoffMetrics.java (line 78) /A.B.D.20 has 9039 dropped > hints, because node is down past configured hint window. > WARN [OptionalTasks:1] 2014-08-02 09:09:10,290 > HintedHandoffMetrics.java (line 78) /A.B.C.20 has 8816 dropped > hints, because node is down past configured hint window. > The "nodetool status" marked the other two nodes as down. After restart, > it reconnected to the cluster without any problem. > What puzzles me is the fact that the authentization apparently started > to work after the network recovered but the exchange of data did not. -- This message was sent by Atlassian JIRA (v6.2#6252)