[ https://issues.apache.org/jira/browse/CASSANDRA-18560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741204#comment-17741204 ]
Brad Vernon commented on CASSANDRA-18560:
-----------------------------------------

We upgraded an existing instance from 4.1.1 to 4.1.2 and hit exactly the same issue on nodes that previously had no problems connecting across DCs using their assigned public IPs. Only outbound connections were affected, and which nodes failed to use the public IP was random. Downgrading to 4.1.1 restored normal operation.

This seems like a much larger bug that will definitely impact clusters that have both local private IPs and public IPs for cross-DC access.

Error message from one node, which should be connecting to 34.248.<redacted> but is instead using 10.34.37.10, the private IP of the host, which is only reachable within the local VPC:

{code:java}
WARN [Messaging-EventLoop-3-3] 2023-07-07 21:52:27,929 NoSpamLogger.java:108 - /3.114.<redacted>:7000->/34.248<redacted>:7000-URGENT_MESSAGES-[no-channel] dropping message of type ECHO_RSP whose timeout expired before reaching the network
INFO [Messaging-EventLoop-3-3] 2023-07-07 21:52:47,391 NoSpamLogger.java:105 - /3.114.<redacted>:7000->/34.248.<redacted>:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /10.34.37.10:7000
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
{code}

Nodetool status output, showing how random it is which cross-DC nodes get contacted via the private IP:

{code:java}
ubuntu@10.34.51.10(ap-northeast-1-cassandra-node0):~# ntool status
Datacenter: ap-northeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
UN  54.238.<redacted>   4.26 GiB  8       100.0%            4affb962-7bf0-42f7-9956-fdbec1c07e5f  1d
UN  52.196.<redacted>   3.71 GiB  8       100.0%            6857d4de-c497-440f-a2ff-c4d18907fa39  1c
UN  3.114.<redacted>    4.28 GiB  8       100.0%            d43d2fb3-27a0-4ecd-9887-741c9fc010da  1a

Datacenter: eu-west-1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
UN  54.229.<redacted>   4.06 GiB  8       100.0%            a8c866d3-bde0-453d-8892-dbe544b7e910  1a
UN  52.18.<redacted>    4.06 GiB  8       100.0%            4530631d-7e2c-455d-89ff-3ddd3e9c64b7  1b
DN  34.248.<redacted>   4.06 GiB  8       100.0%            26daf7cf-5f1a-4969-a7be-c58ff36e9176  1c

Datacenter: us-east-1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
DN  52.54.<redacted>    4.06 GiB  8       100.0%            a2edd4b1-d286-441d-a0b1-5d98b88ee2f2  1c
UN  34.203.<redacted>2  4.08 GiB  8       100.0%            5c64292f-df51-45f3-b3b6-ed325ea669ff  1a
UN  3.229.<redacted>    4.06 GiB  8       100.0%            53a6d308-25b6-4d87-8581-3cc3fd43c165  1b

Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
DN  44.233.<redacted>   4.26 GiB  8       100.0%            d53ab9bf-2606-4516-a689-7e19d053d857  2b
UN  54.200.<redacted>   4.26 GiB  8       100.0%            4ec7c54d-465c-489a-8aed-5ba38264cec8  2a
DN  52.27.<redacted>    4.26 GiB  8       100.0%            8ae55f1a-bf5a-4ce4-892b-4812773036fa  2c
{code}

> Incorrect IP used for gossip across DCs with prefer_local=true
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-18560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Brad Vernon
>            Assignee: Brandon Williams
>            Priority: Urgent
>             Fix For: 4.0.x, 4.1.x, 5.x
>
>
> After installing a new node running 4.0.10, we found that the new node
> attempted to connect to the private IPs of a random subset of nodes in
> remote DCs, which are only accessible via public IP for cross-DC
> communication.
> Only the new node's outbound connections were affected; inbound connections
> from pre-4.0.10 nodes were not. system.peers_v2 (below) showed
> preferred_ip and preferred_port as null; only peers in the 4.0.10 node's
> own DC have preferred_ip values, as expected.
> We believe the issue originated with
> https://issues.apache.org/jira/browse/CASSANDRA-16718
> Details on cluster:
> * All nodes have a public IP configured as well as a private IP
> * Listen/rpc addresses are configured with the private IP; broadcast is the public IP
> * prefer_local=true is enabled on all nodes
> The log that showed the connection failing:
> {code:java}
> INFO [Messaging-EventLoop-3-8] 2023-06-01 00:14:21,565 NoSpamLogger.java:92 - /99.81.<redacted>:7000->/44.208.<redacted>:7000-URGENT_MESSAGES-[no-channel] failed to connect
> io.netty.channel.ConnectTimeoutException: connection timed out: /10.26.5.11:7000
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576){code}
> The 99.81.<redacted> and 44.208.<redacted> instances can only reach each other via public IPs.
> gossipinfo output from the 4.0.10 node:
> {code:java}
> /44.208.<redacted>
>   generation:1661113358
>   heartbeat:25267691
>   LOAD:25267683:1.7882044268E10
>   SCHEMA:24692061:e98b918d-499f-3ccc-8dbe-5af31f685bda
>   DC:13:us-east-1
>   RACK:15:1a
>   RELEASE_VERSION:6:4.0.5
>   NET_VERSION:2:12
>   HOST_ID:3:9a41e668-060d-4cfe-bb1e-013f5116422d
>   RPC_READY:1407:true
>   INTERNAL_ADDRESS_AND_PORT:9:10.26.5.11:7000
>   NATIVE_ADDRESS_AND_PORT:4:44.208.<redacted>:9042
>   STATUS_WITH_PORT:1393:NORMAL,-2262036356854762881
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:1392:<hidden>
> {code}
> Peers output from the 4.0.10 node:
> {code:java}
>  peer              | peer_port | data_center | host_id                              | native_address    | native_port | preferred_ip | preferred_port | rack | release_version | schema_version                       | tokens
> -------------------+-----------+-------------+--------------------------------------+-------------------+-------------+--------------+----------------+------+-----------------+--------------------------------------+--------
>  44.208.<redacted> |      7000 | us-east-1   | 9a41e668-060d-4cfe-bb1e-013f5116422d | 44.208.<redacted> |        9042 |         null |           null | 1a   | 4.0.5           | e98b918d-499f-3ccc-8dbe-5af31f685bda | {'-2262036356854762881', '-4197710115038136897', '-7072386316096662315', '2085255826742630980', '249732489387853170', '4976300208126705818', '7187184456885833289', '8777189009399731927'}
> {code}
> As a temporary workaround, we used iptables to redirect outbound traffic
> destined for the private IPs to the corresponding public IPs, which
> resulted in successful outbound connections.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
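
For readers following the report, the behaviour it expects from prefer_local=true can be sketched roughly as below. This is only an illustration of the expected rule (use a peer's internal address when it is in the same DC, otherwise its broadcast address), not Cassandra's actual code. The `Peer` class, the `expected_connect_address` function, the DC name `dc-local`, and the 203.0.113.x addresses (documentation-range placeholders standing in for the redacted public IPs) are all assumptions made for the example; only 10.26.5.11 and us-east-1 come from the gossipinfo output in the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical view of one peer's gossip state (not a Cassandra type)."""
    broadcast_address: str           # public IP the peer advertises to the cluster
    internal_address: Optional[str]  # host part of INTERNAL_ADDRESS_AND_PORT, if gossiped
    datacenter: str


def expected_connect_address(local_dc: str, peer: Peer, prefer_local: bool = True) -> str:
    """Address the report expects an outbound connection to use.

    The internal (private) address is used only when prefer_local is on
    AND the peer is in the local DC; every other case uses the broadcast
    (public) address.
    """
    same_dc = peer.datacenter == local_dc
    if prefer_local and same_dc and peer.internal_address is not None:
        return peer.internal_address
    return peer.broadcast_address


# Placeholder public IPs; the real ones are redacted in the report.
remote_peer = Peer("203.0.113.10", "10.26.5.11", "us-east-1")
same_dc_peer = Peer("203.0.113.20", "10.0.0.20", "dc-local")

print(expected_connect_address("dc-local", remote_peer))   # public address expected
print(expected_connect_address("dc-local", same_dc_peer))  # private address expected
```

Under this rule the failing connection in the log, which timed out against /10.26.5.11:7000 for a peer in another DC, would have gone to the peer's public address instead.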