nodetool removetoken force causes an inconsistent state
-------------------------------------------------------

                 Key: CASSANDRA-3876
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3876
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 1.0.7, 1.1
            Reporter: Sam Overton


Steps to reproduce (tested on 1.0.7 and trunk):
* Create a cluster of 3 nodes
* Insert some data
* stop one of the nodes
* Call removetoken on the token of the stopped node
* Immediately after, do removetoken force 
  - this will cause the original removetoken to fail with an error after 30s 
since the generation changed for the leaving node, but this is a convenient way 
of simulating the case where a removetoken hangs at streaming since the cleanup 
logic at the end of StorageService.removeToken is never executed.
  - if you want a more realistic reproduction then get a removetoken to hang in 
streaming, then do removetoken force

Effects:
* "removetoken status" now throws an exception because 
StorageService.removingNode is not cleared, but the endpoint is no longer a 
member of the ring:

$ nodetool -h localhost removetoken status
{noformat}
Exception in thread "main" java.lang.AssertionError
        at 
org.apache.cassandra.locator.TokenMetadata.getToken(TokenMetadata.java:304)
        at 
org.apache.cassandra.service.StorageService.getRemovalStatus(StorageService.java:2369)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at 
com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:111)
        at 
com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:45)
        at 
com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:226)
        at 
com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:83)
        at 
com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:205)
        at 
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:683)
        at 
com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:672)
        at 
javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
        at 
javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:90)
        at 
javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1285)
        at 
javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1383)
        at 
javax.management.remote.rmi.RMIConnectionImpl.getAttribute(RMIConnectionImpl.java:619)
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:322)
        at sun.rmi.transport.Transport$1.run(Transport.java:177)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:173)
        at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:553)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:808)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:667)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
{noformat}

* truncate no longer works in the cli because the removed endpoint is not 
removed from Gossiper.unreachableEndpoints. 
The cli errors immediately with:
{noformat}
[default@ks1] truncate cf1;
null
UnavailableException()
        at 
org.apache.cassandra.thrift.Cassandra$truncate_result.read(Cassandra.java:20978)
        at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
        at 
org.apache.cassandra.thrift.Cassandra$Client.recv_truncate(Cassandra.java:942)
        at 
org.apache.cassandra.thrift.Cassandra$Client.truncate(Cassandra.java:929)
        at 
org.apache.cassandra.cli.CliClient.executeTruncate(CliClient.java:1417)
        at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:270)
        at 
org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219)
        at org.apache.cassandra.cli.CliMain.main(CliMain.java:346)
{noformat}

The logs show:
{noformat}
INFO [Thrift:11] 2012-02-08 11:55:50,135 StorageProxy.java (line 1172) Cannot 
perform truncate, some hosts are down
{noformat}

* there are probably other schema related things that fail for the same reason 
although this wasn't tested

Workaround:
* Restart the affected node.

Fix:
It looks like StorageService.forceRemoveCompletion is missing some cleanup 
logic which is present at the end of StorageService.removeToken. Adding this 
cleanup logic to forceRemoveCompletion fixes the above issues (see attached).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to