[ 
https://issues.apache.org/jira/browse/CASSANDRA-18572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17731552#comment-17731552
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18572 at 6/12/23 10:54 AM:
-------------------------------------------------------------------------

I tried Doug's patch and it solves the issue I had, but when I run the whole 
test suite, the only test that fails is ForceRepairTest.

I think the problem is that the "mocked" NodeTool/NodeProbe, implemented by 
'InternalNodeProbe', does not actually close the MessagingService MBean in its 
close() method. That means that even though the second node in that test is 
shut down, one clearly sees in the logs that the first and third nodes are 
still attempting connections to the second one we "stopped".

However, re-using a proper NodeTool shows the same behavior. We are just 
closing the JMX connector, which should invalidate all MBeans etc., but the 
pieces do not play together and I am not sure what the difference is.

The workaround is to enumerate features instead of calling .values() in the 
config, like this: .with(Feature.NETWORK, Feature.GOSSIP, 
Feature.NATIVE_PROTOCOL), so we are not using JMX, which means it will fall 
back to the old way of doing things.
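
To make the workaround concrete, here is a minimal, self-contained sketch of 
why the explicit list avoids the JMX path. The nested Feature enum and the 
FeatureSelection class are hypothetical stand-ins written for this 
illustration; they are not the in-JVM dtest classes, and the real enum may 
contain constants that are not listed here.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical stand-in for illustration only; not the real dtest API.
class FeatureSelection {
    // Assumed subset of features: the three named in the comment plus JMX.
    enum Feature { NETWORK, GOSSIP, NATIVE_PROTOCOL, JMX }

    // Models the decision described in the ticket title: nodetool goes
    // through JMX only if the JMX feature is enabled in the config.
    static boolean usesJmx(Set<Feature> enabled) {
        return enabled.contains(Feature.JMX);
    }

    // Equivalent of passing Feature.values(): every feature, JMX included.
    static Set<Feature> allFeatures() {
        return EnumSet.copyOf(Arrays.asList(Feature.values()));
    }

    // The workaround: list features explicitly and leave JMX out.
    static Set<Feature> workaroundFeatures() {
        return EnumSet.of(Feature.NETWORK, Feature.GOSSIP, Feature.NATIVE_PROTOCOL);
    }

    public static void main(String[] args) {
        System.out.println("values(): nodetool via JMX = " + usesJmx(allFeatures()));
        System.out.println("explicit: nodetool via JMX = " + usesJmx(workaroundFeatures()));
    }
}
```

With the explicit list the JMX feature is simply absent, so the sketch takes 
the non-JMX branch, which corresponds to the "old way" mentioned above.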

It is worth noting that this happens only when one uses Feature.values() in 
the cluster configuration and stops a node in the middle of the test, and even 
then it does not happen every time. There is also FailedBootstrapTest, which 
uses the same logic (Feature.values() and stopping a node in the middle of the 
test) and is successful, so this problem is not present every time.

I would dedicate a separate ticket to this and just continue working on 
porting this ticket to older branches, back to 3.11.

The branch for trunk is here [https://github.com/apache/cassandra/pull/2394]

 

What I see in ForceRepairTest is that after stopping the second node, the first 
node is attempting to repair a cluster by doing this:

 
{code:java}
node1.nodetoolResult(ArrayUtils.addAll(new String[] {"repair", KEYSPACE}, args)).asserts().failure();
node1.nodetoolResult(ArrayUtils.addAll(new String[] {"repair", KEYSPACE, "--force"}, args)).asserts().success();
{code}
However, it does not even get past the first repair. What I see in the logs 
is this, repeatedly:
{code:java}
INFO  [node3_Messaging-EventLoop-3-2] node3 2023-06-12 12:52:27,968 NoSpamLogger.java:105 - /127.0.0.3:7012->/127.0.0.2:7012-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.2:7012
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:707)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)
{code}
So it looks like the third node is still trying to contact the second one via 
MessagingService, but it is not able to do that.
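
The "Connection refused" above is what any TCP client sees when it connects 
to a port whose listener has gone away. The sketch below reproduces only that 
symptom with plain java.net sockets on the loopback interface. It is an 
illustration under assumptions: the StoppedPeerProbe class and its methods 
are hypothetical names, and nothing here uses Cassandra or Netty code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper for illustration; unrelated to Cassandra's MessagingService.
class StoppedPeerProbe {
    // Opens a listener on an ephemeral loopback port and closes it again,
    // which simulates a node that was running and has been stopped.
    static int portOfStoppedListener() {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            return server.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if a connection attempt to the loopback port is refused.
    static boolean isRefused(int port) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 1000);
            return false;
        } catch (ConnectException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int port = portOfStoppedListener();
        System.out.println("connection to stopped listener refused = " + isRefused(port));
    }
}
```

A peer that keeps retrying such a connection logs the same exception 
repeatedly, which matches the pattern seen from node3 towards 127.0.0.2.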


> Instance.nodetoolResult should connect to JMX if there is such feature 
> enabled in its config
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18572
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18572
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Test/dtest/java
>            Reporter: Stefan Miklosovic
>            Assignee: Stefan Miklosovic
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: fix-jmx-issue-on-shutdown.patch
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>



