[ https://issues.apache.org/jira/browse/CASSANDRA-18572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730851#comment-17730851 ]
Stefan Miklosovic commented on CASSANDRA-18572: ----------------------------------------------- The proposed patch uses a proper NodeTool while calling "nodetoolResult" on an Instance instead of "mocked and fake" NodeTool which uses InternalNodeProbe. NodeTool is used when there is JMX Feature added into config and it fallbacks to the old way of doing things when it is not. Usage of a proper NodeTool with NodeProbe results in e.g. "nodetool info" not failing. There is a bunch of methods in InternalNodeProbe which throw UnsupportedOperationException because these methods in original NodeProbe are calling proper JMX which is missing in InternalNodeProbe. The patch works as such but there is one corner-case where it does not and it breaks 2 existing tests. I tried to debug this but I can not say I made any significant breakthrough there. Consider this corner-case example: {code:java} public class MyTest extends TestBaseImpl { @Test public void test() throws Throwable { try (Cluster cluster = Cluster.build(2).withConfig(c -> c.with(Feature.values())).start()) { ClusterUtils.stopUnchecked(cluster.get(2)); System.out.println(cluster.get(1).nodetoolResult("status").getStderr()); } } } {code} I create a cluster with 2 nodes and I stop the second one. "stopUnchecked" is just this: {code:java} public static void stopUnchecked(IInstance i) { Futures.getUnchecked(i.shutdown()); } {code} It just converts whatever shutdown throws to unchecked exception. Easy. So, while the second node is stopped, I want to call "nodetool status" and see the results. As I am using "Feature.values()", that includes JMX too, that means the proper JMX functionality introduced in this patch will be used as well. Under the hood, it will create JMXConnector via JMXUtil.getJmxConnector and that connector will be used upon connection in proper / non-mocked NodeProbe. If I was not using JMX feature, when I call nodetooResult("status") on Instance, I would use the old way of interacting with NodeTool, which is not connecting to it but it will just call services directly via InternalNodeProbe. Now the problem is, to my very surprise, that this does not work and the results are pretty interesting. When I print the output, it says this: {code:java} java.net.BindException: Address already in use (Bind failed) at java.net.PlainSocketImpl.socketBind(Native Method) at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387) at java.net.ServerSocket.bind(ServerSocket.java:390) at java.net.ServerSocket.<init>(ServerSocket.java:252) at javax.net.DefaultServerSocketFactory.createServerSocket(ServerSocketFactory.java:231) at org.apache.cassandra.distributed.impl.CollectingRMIServerSocketFactoryImpl.createServerSocket(CollectingRMIServerSocketFactoryImpl.java:51) at sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:670) at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335) at sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254) at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:412) at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147) at sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:237) at javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:143) at javax.management.remote.rmi.RMIJRMPServerImpl.makeClient(RMIJRMPServerImpl.java:209) at javax.management.remote.rmi.RMIServerImpl.doNewClient(RMIServerImpl.java:250) at javax.management.remote.rmi.RMIServerImpl.newClient(RMIServerImpl.java:199) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357) at sun.rmi.transport.Transport$1.run(Transport.java:200) at sun.rmi.transport.Transport$1.run(Transport.java:197) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:196) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {code} I put heavy logging there and what I found is that *it can not bind because it tries to bind to the address of the first, running node.* This does not make sense, I stopped the second node, so why does it even try to bind anything yet again if I have not touched JMX stuff of the first node at all? After further investigation, when I put a logging to CollectingRMIServerSocketFactoryImpl like this: {code:java} @Override public ServerSocket createServerSocket(int pPort) throws IOException { System.out.println("CURRENT THREAD " + Thread.currentThread().getName()); ServerSocket result = ServerSocketFactory.getDefault().createServerSocket(pPort, 0, bindAddress); try ... {code} I see three cases when it prints, two times when the cluster is being started and it creates JMX stuff, per node: {code:java} CURRENT THREAD node1_isolatedExecutor:1 ... CURRENT THREAD node2_isolatedExecutor:2{code} and the third time just before the error: {code:java} CURRENT THREAD RMI TCP Connection(8)-127.0.0.1 error: Address already in use (Bind failed) -- StackTrace -- {code} So, it seems to me that the initialization works just fine but when I shutdown a node in the middle of the test, it seems like for some reason when nodetool logic is being run on the first node to execute some command, it somehow tries to create a server socket yet again - as if it was never there, and this is happening in a different thread - thread of RMI TCP connection ... My primary suspect is that when the second node is shut down, IsolatedJMX is calling this at the end of 'stopJMX' where it clears 'localEndpoints': {code:java} private <K, V> void clearMapField(Class<?> clazz, Object instance, String mapName) throws IllegalAccessException, NoSuchFieldException { Field mapField = ReflectionUtils.getField(clazz, mapName); mapField.setAccessible(true); Map<K, V> map = (Map<K, V>) mapField.get(instance); // Because multiple instances can be shutting down at once, // synchronize on the map to avoid ConcurrentModificationException synchronized (map) { for (Iterator<Map.Entry<K, V>> it = map.entrySet().iterator(); it.hasNext(); ) { it.next(); it.remove(); } } } {code} The problem is that when the second node shuts down and this is called, I am afraid that it will clear all localEndpoints in there. If one debugs what is the content of localEndpoints map as the second node cleanup logic calls it, there are also TCPTransports for the first node too which gets wiped out as well. So the next call to nodetool from the first node will not see it there so the internal logic of RMI TCP stuff will evaluate it like it needs to be started again - but it fails to do so - because it was actually not "stopped", we just somehow cleared some map somewhere ... I tried to not clean TCPTransport in this method, further doing some crazy reflection but I was not sure I knew exactly what I was doing so I need some expertise here. > Instance.nodetoolResult should connect to JMX if there is such feature > enabled in its config > -------------------------------------------------------------------------------------------- > > Key: CASSANDRA-18572 > URL: https://issues.apache.org/jira/browse/CASSANDRA-18572 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest/java > Reporter: Stefan Miklosovic > Assignee: Stefan Miklosovic > Priority: Normal > Fix For: 5.x > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org