[ https://issues.apache.org/jira/browse/CASSANDRA-13058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15798508#comment-15798508 ]

Stefan Podkowinski commented on CASSANDRA-13058:
------------------------------------------------

This looks like a regression caused by CASSANDRA-12192 to me. The test seems to 
work fine without the commit. 

Take a look at this dtest log:

{{./run_dtests.py --vnodes false --nose-options -v hintedhandoff_test.py:TestHintedHandoff.hintedhandoff_decom_test}}

{noformat}
Node 4:
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:16,748 StorageService.java:1437 - DRAINING: starting drain process
DEBUG [StorageServiceShutdownHook] 2017-01-04 15:06:20,011 StorageService.java:1437 - DRAINED

Node 1:
INFO  [GossipStage:1] 2017-01-04 15:06:16,756 Gossiper.java:1036 - InetAddress /127.0.0.4 is now DOWN
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 OutboundTcpConnection.java:180 - Enqueuing socket close for /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,757 FailureDetector.java:325 - Forcing conviction of /127.0.0.4
DEBUG [GossipStage:1] 2017-01-04 15:06:16,758 Gossiper.java:370 - /127.0.0.4 marked as alive: false
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:16,758 OutboundTcpConnection.java:374 - Socket to /127.0.0.4 closed
DEBUG [MessagingService-Outgoing-/127.0.0.4-Small] 2017-01-04 15:06:16,759 OutboundTcpConnection.java:374 - Socket to /127.0.0.4 closed
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,883 OutboundTcpConnection.java:388 - Attempting to connect to /127.0.0.4
INFO  [HANDSHAKE-/127.0.0.4] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:510 - Handshaking version with /127.0.0.4
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:17,884 OutboundTcpConnection.java:482 - Done connecting to /127.0.0.4
DEBUG [Native-Transport-Requests-3] 2017-01-04 15:06:20,825 FailureDetector.java:252 - /127.0.0.4 is ALIVE
INFO  [RMI TCP Connection(2)-127.0.0.1] 2017-01-04 15:06:52,930 StorageService.java:1435 - LEAVING: streaming hints to other nodes
INFO  [HintsDispatcher:2] 2017-01-04 15:06:52,955 HintsDispatchExecutor.java:152 - Transferring all hints to 054006b5-412c-4d4e-a28a-7610574de79d
DEBUG [MessagingService-Outgoing-/127.0.0.4-Gossip] 2017-01-04 15:06:53,035 OutboundTcpConnection.java:495 - Unable to connect to /127.0.0.4
java.net.ConnectException: Connection refused
	at sun.nio.ch.Net.connect0(Native Method) ~[na:1.8.0_112]
	at sun.nio.ch.Net.connect(Net.java:454) ~[na:1.8.0_112]
	at sun.nio.ch.Net.connect(Net.java:446) ~[na:1.8.0_112]
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648) ~[na:1.8.0_112]
	at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:146) ~[main/:na]
	at org.apache.cassandra.net.OutboundTcpConnectionPool.newSocket(OutboundTcpConnectionPool.java:132) ~[main/:na]
	at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:397) [main/:na]
	at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:234) [main/:na]
{noformat}


I've added some additional log messages, so don't be confused if you can't find all of them in the code. What's happening here is that the gossiper on node1 marks node4 as down, which triggers {{OutboundTcpConnectionPool.reset()}} via {{MessagingService.convict()}}. This causes a reconnect shortly afterwards, since after CASSANDRA-12192 pending messages are now always retried from the backlog.
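
For reference, the conviction hook itself is trivial; paraphrased from memory (so line numbers won't match 3.X exactly), it boils down to:

{code}
// MessagingService (paraphrased, not the exact source): conviction just
// resets the connection pool for the endpoint. The sockets get closed, but
// since CASSANDRA-12192 the queued messages survive the reset and are
// retried once the OutboundTcpConnection threads reconnect - which is the
// reconnect visible in the node1 log above.
public void convict(InetAddress ep)
{
    logger.trace("Resetting pool for {}", ep);
    getConnectionPool(ep).reset();
}
{code}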

Unfortunately I can't tell which message is affected here. My assumption is 
that node4 keeps accepting messages for a short time even after the gossip 
change has propagated, and returns a response to node1, which flags node4 as 
alive again in the process. I'm not entirely sure, though, as the node-status 
handling code is not easy to follow with all its side effects and unclear 
state handling. :(
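
To make the suspected sequence explicit, here is a toy model of the race (all names are mine, nothing below is actual Cassandra code):

{code}
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the suspected race: a late response from the draining node4
// flips its status back to alive on node1, so the hint transfer during
// decommission targets a node that is in fact shutting down, and the
// connect is refused.
public class AliveFlipRace
{
    public static void main(String[] args)
    {
        Map<String, Boolean> alive = new LinkedHashMap<>();

        alive.put("/127.0.0.4", false); // gossip: /127.0.0.4 is now DOWN
        // ... a retried message from the backlog still reaches node4 before
        // it stops accepting connections, and its response is taken as a
        // liveness signal ...
        alive.put("/127.0.0.4", true);  // FailureDetector: /127.0.0.4 is ALIVE

        // decommission now happily picks node4 as a hint transfer target ...
        alive.forEach((ep, up) -> {
            if (up)
                System.out.println("Transferring all hints to " + ep);
        });
        // ... and fails with java.net.ConnectException: Connection refused
    }
}
{code}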

/cc [~thobbs]


> dtest failure in hintedhandoff_test.TestHintedHandoff.hintedhandoff_decom_test
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13058
>             Project: Cassandra
>          Issue Type: Test
>          Components: Testing
>            Reporter: Sean McCarthy
>            Priority: Blocker
>              Labels: dtest, test-failure
>             Fix For: 3.10
>
>         Attachments: node1.log, node1_debug.log, node1_gc.log, node2.log, 
> node2_debug.log, node2_gc.log, node3.log, node3_debug.log, node3_gc.log, 
> node4.log, node4_debug.log, node4_gc.log
>
>
> example failure:
> http://cassci.datastax.com/job/cassandra-3.X_novnode_dtest/16/testReport/hintedhandoff_test/TestHintedHandoff/hintedhandoff_decom_test/
> {code}
> Error Message
> Subprocess ['nodetool', '-h', 'localhost', '-p', '7100', ['decommission']] exited with non-zero status; exit status: 2; 
> stderr: error: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
> {code}{code}
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/hintedhandoff_test.py", line 167, in hintedhandoff_decom_test
>     node1.decommission()
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1314, in decommission
>     self.nodetool("decommission")
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 783, in nodetool
>     return handle_external_tool_process(p, ['nodetool', '-h', 'localhost', '-p', str(self.jmx_port), cmd.split()])
>   File "/usr/local/lib/python2.7/dist-packages/ccmlib/node.py", line 1993, in handle_external_tool_process
>     raise ToolError(cmd_args, rc, out, err)
> {code}{code}
> java.lang.RuntimeException: Error while decommissioning node: Failed to transfer all hints to 59f20b4f-0215-4e18-be1b-7e00f2901629
>       at org.apache.cassandra.service.StorageService.decommission(StorageService.java:3924)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
>       at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
>       at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
>       at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
>       at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
>       at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
>       at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
>       at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
>       at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466)
>       at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
>       at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307)
>       at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399)
>       at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:828)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:323)
>       at sun.rmi.transport.Transport$1.run(Transport.java:200)
>       at sun.rmi.transport.Transport$1.run(Transport.java:197)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
>       at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:568)
>       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:826)
>       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$241(TCPTransport.java:683)
>       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$284/1694175644.run(Unknown Source)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:682)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
