[jira] [Commented] (CASSANDRA-18572) Instance.nodetoolResult should connect to JMX if there is such feature enabled in its config

Stefan Miklosovic (Jira) Fri, 09 Jun 2023 00:40:04 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730851#comment-17730851
 ]


Stefan Miklosovic commented on CASSANDRA-18572:
-----------------------------------------------

The proposed patch uses a proper NodeTool while calling "nodetoolResult" on an 
Instance instead of "mocked and fake" NodeTool which uses InternalNodeProbe. 
NodeTool is used when there is JMX Feature added into config and it fallbacks 
to the old way of doing things when it is not.

Usage of a proper NodeTool with NodeProbe results in e.g. "nodetool info" not 
failing. There is a bunch of methods in InternalNodeProbe which throw 
UnsupportedOperationException because these methods in original NodeProbe are 
calling proper JMX which is missing in InternalNodeProbe.

The patch works as such but there is one corner-case where it does not and it 
breaks 2 existing tests. I tried to debug this but I can not say I made any 
significant breakthrough there. Consider this corner-case example:
{code:java}
public class MyTest extends TestBaseImpl
{
    @Test
    public void test() throws Throwable
    {
        try (Cluster cluster = Cluster.build(2).withConfig(c -> 
c.with(Feature.values())).start())
        {
            ClusterUtils.stopUnchecked(cluster.get(2));
            
System.out.println(cluster.get(1).nodetoolResult("status").getStderr());
        }
    }
} {code}
I create a cluster with 2 nodes and I stop the second one. "stopUnchecked" is 
just this:
{code:java}
public static void stopUnchecked(IInstance i)
{
    Futures.getUnchecked(i.shutdown());
}
 {code}
It just converts whatever shutdown throws to unchecked exception. Easy.

So, while the second node is stopped, I want to call "nodetool status" and see 
the results.

As I am using "Feature.values()", that includes JMX too, that means the proper 
JMX functionality introduced in this patch will be used as well. Under the 
hood, it will create JMXConnector via JMXUtil.getJmxConnector and that 
connector will be used upon connection in proper / non-mocked NodeProbe.

If I was not using JMX feature, when I call nodetooResult("status") on 
Instance, I would use the old way of interacting with NodeTool, which is not 
connecting to it but it will just call services directly via InternalNodeProbe.

Now the problem is, to my very surprise, that this does not work and the 
results are pretty interesting. When I print the output, it says this:
{code:java}
java.net.BindException: Address already in use (Bind failed)
        at java.net.PlainSocketImpl.socketBind(Native Method)
        at 
java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:387)
        at java.net.ServerSocket.bind(ServerSocket.java:390)
        at java.net.ServerSocket.<init>(ServerSocket.java:252)
        at 
javax.net.DefaultServerSocketFactory.createServerSocket(ServerSocketFactory.java:231)
        at 
org.apache.cassandra.distributed.impl.CollectingRMIServerSocketFactoryImpl.createServerSocket(CollectingRMIServerSocketFactoryImpl.java:51)
        at 
sun.rmi.transport.tcp.TCPEndpoint.newServerSocket(TCPEndpoint.java:670)
        at sun.rmi.transport.tcp.TCPTransport.listen(TCPTransport.java:335)
        at 
sun.rmi.transport.tcp.TCPTransport.exportObject(TCPTransport.java:254)
        at sun.rmi.transport.tcp.TCPEndpoint.exportObject(TCPEndpoint.java:412)
        at sun.rmi.transport.LiveRef.exportObject(LiveRef.java:147)
        at 
sun.rmi.server.UnicastServerRef.exportObject(UnicastServerRef.java:237)
        at 
javax.management.remote.rmi.RMIJRMPServerImpl.export(RMIJRMPServerImpl.java:143)
        at 
javax.management.remote.rmi.RMIJRMPServerImpl.makeClient(RMIJRMPServerImpl.java:209)
        at 
javax.management.remote.rmi.RMIServerImpl.doNewClient(RMIServerImpl.java:250)
        at 
javax.management.remote.rmi.RMIServerImpl.newClient(RMIServerImpl.java:199)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
        at sun.rmi.transport.Transport$1.run(Transport.java:200)
        at sun.rmi.transport.Transport$1.run(Transport.java:197)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
        at 
sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
        at java.security.AccessController.doPrivileged(Native Method)
        at 
sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750) {code}
I put heavy logging there and what I found is that *it can not bind because it 
tries to bind to the address of the first, running node.*

This does not make sense, I stopped the second node, so why does it even try to 
bind anything yet again if I have not touched JMX stuff of the first node at 
all?

After further investigation, when I put a logging to 
CollectingRMIServerSocketFactoryImpl  like this:
{code:java}
@Override
public ServerSocket createServerSocket(int pPort) throws IOException
{
    System.out.println("CURRENT THREAD " + Thread.currentThread().getName());
    ServerSocket result = 
ServerSocketFactory.getDefault().createServerSocket(pPort, 0, bindAddress);
    try ...
{code}
I see three cases when it prints, two times when the cluster is being started 
and it creates JMX stuff, per node:
{code:java}
CURRENT THREAD node1_isolatedExecutor:1 
...
CURRENT THREAD node2_isolatedExecutor:2{code}
and the third time just before the error:
{code:java}
CURRENT THREAD RMI TCP Connection(8)-127.0.0.1
error: Address already in use (Bind failed)
-- StackTrace -- {code}
So, it seems to me that the initialization works just fine but when I shutdown 
a node in the middle of the test, it seems like for some reason when nodetool 
logic is being run on the first node to execute some command, it somehow tries 
to create a server socket yet again - as if it was never there, and this is 
happening in a different thread - thread of RMI TCP connection ...

My primary suspect is that when the second node is shut down, IsolatedJMX is 
calling this at the end of 'stopJMX' where it clears 'localEndpoints':
{code:java}
private <K, V> void clearMapField(Class<?> clazz, Object instance, String 
mapName)
throws IllegalAccessException, NoSuchFieldException {
    Field mapField = ReflectionUtils.getField(clazz, mapName);
    mapField.setAccessible(true);
    Map<K, V> map = (Map<K, V>) mapField.get(instance);
    // Because multiple instances can be shutting down at once,
    // synchronize on the map to avoid ConcurrentModificationException
    synchronized (map)
    {
        for (Iterator<Map.Entry<K, V>> it = map.entrySet().iterator(); 
it.hasNext(); )
        {
            it.next();
            it.remove();
        }
    }
} {code}
The problem is that when the second node shuts down and this is called, I am 
afraid that it will clear all localEndpoints in there. If one debugs what is 
the content of localEndpoints map as the second node cleanup logic calls it, 
there are also TCPTransports for the first node too which gets wiped out as 
well. So the next call to nodetool from the first node will not see it there so 
the internal logic of RMI TCP stuff will evaluate it like it needs to be 
started again - but it fails to do so - because it was actually not "stopped", 
we just somehow cleared some map somewhere ...

I tried to not clean TCPTransport in this method, further doing some crazy 
reflection but I was not sure I knew exactly what I was doing so I need some 
expertise here.

> Instance.nodetoolResult should connect to JMX if there is such feature 
> enabled in its config
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18572
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18572
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Test/dtest/java
>            Reporter: Stefan Miklosovic
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 5.x
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Commented] (CASSANDRA-18572) Instance.nodetoolResult should connect to JMX if there is such feature enabled in its config

Reply via email to