[jira] [Commented] (CASSANDRA-18560) Incorrect IP used for gossip across DCs with prefer_local=true

Brad Vernon (Jira) Fri, 07 Jul 2023 15:13:04 -0700


    [ https://issues.apache.org/jira/browse/CASSANDRA-18560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741204#comment-17741204 ]


Brad Vernon commented on CASSANDRA-18560:
-----------------------------------------

We upgraded an existing instance from 4.1.1 to 4.1.2 and hit the exact same 
issue on nodes that previously had no problems connecting across DCs using 
their assigned public IPs. Only outbound connections were affected, and which 
nodes were unable to use the public IP was random. Downgrading to 4.1.1 
restored normal operation.

This looks like a much larger bug, one that will definitely impact clusters 
that have both local private IPs and public IPs for cross-DC access.

Below is the error message from one node that should be connecting to 
34.248.<redacted> but is instead using 10.34.37.10, which is the private IP of 
the host and only reachable from within the local VPC.
{code:java}
WARN  [Messaging-EventLoop-3-3] 2023-07-07 21:52:27,929 NoSpamLogger.java:108 - /3.114.<redacted>:7000->/34.248<redacted>:7000-URGENT_MESSAGES-[no-channel] dropping message of type ECHO_RSP whose timeout expired before reaching the network
INFO  [Messaging-EventLoop-3-3] 2023-07-07 21:52:47,391 NoSpamLogger.java:105 - /3.114.<redacted>:7000->/34.248.<redacted>:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.ConnectTimeoutException: connection timed out: /10.34.37.10:7000
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
{code}
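
To find entries like the one above in bulk, a small script can compare the peer address in the connection ID with the address Netty actually tried to dial. This is a minimal Python sketch, not part of Cassandra. The function name and the sample addresses (documentation-range stand-ins for the redacted IPs) are assumptions, and it expects unwrapped log entries where the exception text follows, or is fused to, the "failed to connect" entry.

```python
import re

# Peer address as it appears in the connection id, e.g.
# "/198.51.100.5:7000->/203.0.113.7:7000-URGENT_MESSAGES-[no-channel]".
CONNECTION_ID = re.compile(r"->/(?P<peer>[^:\s]+):\d+-")
# Address the connection attempt actually timed out on.
DIALED = re.compile(
    r"ConnectTimeoutException: connection timed out: /(?P<dialed>[^:\s]+):\d+"
)


def find_address_mismatches(log_text):
    """Return (peer_address, dialed_address) pairs where the two differ."""
    mismatches = []
    current_peer = None
    for line in log_text.splitlines():
        conn = CONNECTION_ID.search(line)
        if conn:
            # Remember the most recent peer named in a connection id.
            current_peer = conn.group("peer")
        dialed = DIALED.search(line)
        if dialed and current_peer and dialed.group("dialed") != current_peer:
            mismatches.append((current_peer, dialed.group("dialed")))
    return mismatches


# Hypothetical sample: the public addresses are placeholders, not real nodes.
SAMPLE_LOG = (
    "INFO  [Messaging-EventLoop-3-3] 2023-07-07 21:52:47,391 NoSpamLogger.java:105 - "
    "/198.51.100.5:7000->/203.0.113.7:7000-URGENT_MESSAGES-[no-channel] failed to connect\n"
    "io.netty.channel.ConnectTimeoutException: connection timed out: /10.34.37.10:7000\n"
)

if __name__ == "__main__":
    for peer, dialed in find_address_mismatches(SAMPLE_LOG):
        print(f"expected {peer}, dialed {dialed}")
```

Any pair it reports is a connection where the node was addressed by one IP but dialed on another, which is the symptom described here.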
 

nodetool status output showing how random it is which cross-DC nodes end up 
being contacted on their private IP:
{code:java}
ubuntu@10.34.51.10(ap-northeast-1-cassandra-node0):~# ntool status
Datacenter: ap-northeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
UN  54.238.<redacted>   4.26 GiB  8       100.0%            4affb962-7bf0-42f7-9956-fdbec1c07e5f  1d
UN  52.196.<redacted>   3.71 GiB  8       100.0%            6857d4de-c497-440f-a2ff-c4d18907fa39  1c
UN  3.114.<redacted>    4.28 GiB  8       100.0%            d43d2fb3-27a0-4ecd-9887-741c9fc010da  1a

Datacenter: eu-west-1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
UN  54.229.<redacted>   4.06 GiB  8       100.0%            a8c866d3-bde0-453d-8892-dbe544b7e910  1a
UN  52.18.<redacted>    4.06 GiB  8       100.0%            4530631d-7e2c-455d-89ff-3ddd3e9c64b7  1b
DN  34.248.<redacted>   4.06 GiB  8       100.0%            26daf7cf-5f1a-4969-a7be-c58ff36e9176  1c

Datacenter: us-east-1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
DN  52.54.<redacted>    4.06 GiB  8       100.0%            a2edd4b1-d286-441d-a0b1-5d98b88ee2f2  1c
UN  34.203.<redacted>2  4.08 GiB  8       100.0%            5c64292f-df51-45f3-b3b6-ed325ea669ff  1a
UN  3.229.<redacted>    4.06 GiB  8       100.0%            53a6d308-25b6-4d87-8581-3cc3fd43c165  1b

Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address             Load      Tokens  Owns (effective)  Host ID                               Rack
DN  44.233.<redacted>   4.26 GiB  8       100.0%            d53ab9bf-2606-4516-a689-7e19d053d857  2b
UN  54.200.<redacted>   4.26 GiB  8       100.0%            4ec7c54d-465c-489a-8aed-5ba38264cec8  2a
DN  52.27.<redacted>    4.26 GiB  8       100.0%            8ae55f1a-bf5a-4ce4-892b-4812773036fa  2c
{code}
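
To compare which peers each node considers down, the status output can be summarized per datacenter. The following is a rough Python sketch under stated assumptions: the function name is hypothetical, the sample addresses and host IDs are placeholders, and each node row is assumed to be on a single line (not wrapped as in the mail archive).

```python
# Status/state codes whose first letter marks the node as Down
# (Down + Normal/Leaving/Joining/Moving).
DOWN_CODES = {"DN", "DL", "DJ", "DM"}


def down_nodes_by_datacenter(status_output):
    """Map each datacenter name to the addresses nodetool reports as down."""
    down = {}
    datacenter = None
    for line in status_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Datacenter:"):
            datacenter = stripped.split(":", 1)[1].strip()
            down.setdefault(datacenter, [])
            continue
        fields = stripped.split()
        # Node rows look like: "DN  <address>  <load> ...".
        if datacenter is not None and len(fields) >= 2 and fields[0] in DOWN_CODES:
            down[datacenter].append(fields[1])
    return down


# Hypothetical sample with placeholder addresses and host IDs.
SAMPLE_STATUS = """\
Datacenter: eu-west-1
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  203.0.113.1  4.06 GiB  8       100.0%            00000000-0000-0000-0000-000000000001  1a
DN  203.0.113.2  4.06 GiB  8       100.0%            00000000-0000-0000-0000-000000000002  1c
"""

if __name__ == "__main__":
    for dc, addresses in down_nodes_by_datacenter(SAMPLE_STATUS).items():
        print(dc, addresses)
```

Running this against the output of several nodes and comparing the results would show whether the set of unreachable peers differs from node to node.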

> Incorrect IP used for gossip across DCs with prefer_local=true
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-18560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Brad Vernon
>            Assignee: Brandon Williams
>            Priority: Urgent
>             Fix For: 4.0.x, 4.1.x, 5.x
>
>
> After installing a new node running 4.0.10, we experienced a situation where 
> the new node attempted to connect to the private IPs of a random number of 
> nodes in remote DCs, which are only accessible via public IP for cross-DC 
> communication.
> Only the new node's outbound connections were affected; inbound connections 
> from pre-4.0.10 nodes were not. system.peers_v2 (below) showed preferred_ip 
> and preferred_port as null; only peers in the 4.0.10 node's DC have 
> preferred_ip values, as expected.
> We believe the issue originated with 
> https://issues.apache.org/jira/browse/CASSANDRA-16718 
> Details on the cluster:
>  * All nodes have a public IP configured as well as a private IP
>  * Listen/rpc addresses are configured with the private IP; broadcast is the public IP
>  * prefer_local=true is enabled on all nodes
> The log that showed the connection failing:
> {code:java}
> INFO  [Messaging-EventLoop-3-8] 2023-06-01 00:14:21,565 NoSpamLogger.java:92 - /99.81.<redacted>:7000->/44.208.<redacted>:7000-URGENT_MESSAGES-[no-channel] failed to connect
> io.netty.channel.ConnectTimeoutException: connection timed out: /10.26.5.11:7000
>     at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
> {code}
> The 99.81.<redacted> and 44.208.<redacted> instances can only reach each other using public IPs.
> gossipinfo output from the 4.0.10 node:
> {code:java}
> /44.208.<redacted>
>   generation:1661113358
>   heartbeat:25267691
>   LOAD:25267683:1.7882044268E10
>   SCHEMA:24692061:e98b918d-499f-3ccc-8dbe-5af31f685bda
>   DC:13:us-east-1
>   RACK:15:1a
>   RELEASE_VERSION:6:4.0.5
>   NET_VERSION:2:12
>   HOST_ID:3:9a41e668-060d-4cfe-bb1e-013f5116422d
>   RPC_READY:1407:true
>   INTERNAL_ADDRESS_AND_PORT:9:10.26.5.11:7000
>   NATIVE_ADDRESS_AND_PORT:4:44.208.<redacted>:9042
>   STATUS_WITH_PORT:1393:NORMAL,-2262036356854762881
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:1392:<hidden> {code}
> Peers output from 4.0.10 node:
> {code:java}
>  peer              | peer_port | data_center | host_id                              | native_address    | native_port | preferred_ip | preferred_port | rack | release_version | schema_version                       | tokens
> -------------------+-----------+-------------+--------------------------------------+-------------------+-------------+--------------+----------------+------+-----------------+--------------------------------------+--------
>  44.208.<redacted> |      7000 |   us-east-1 | 9a41e668-060d-4cfe-bb1e-013f5116422d | 44.208.<redacted> |        9042 |         null |           null |   1a |           4.0.5 | e98b918d-499f-3ccc-8dbe-5af31f685bda | {'-2262036356854762881', '-4197710115038136897', '-7072386316096662315', '2085255826742630980', '249732489387853170', '4976300208126705818', '7187184456885833289', '8777189009399731927'}
> {code}
> As a temporary workaround, we used iptables to reroute outbound traffic 
> destined for the private IPs to the public IPs, which resulted in successful 
> outbound connections.
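
The behavior the report expects from prefer_local=true can be modeled in a few lines: use a peer's private (internal) address only when the peer is in the same DC, and its public (broadcast) address otherwise. This is a simplified Python model for illustration only, not Cassandra's implementation. The class, the function name, the local DC name and the 203.0.113.20 public address are assumptions; 10.26.5.11 and us-east-1 are taken from the gossipinfo output quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    broadcast_address: str  # public IP advertised to other DCs
    internal_address: str   # private IP, reachable only inside the peer's VPC
    datacenter: str


def expected_connect_address(local_datacenter, peer, prefer_local=True):
    """Address a node is expected to dial for a given peer.

    Simplified model: the internal address is preferred only for peers in
    the same datacenter; everything else goes through the broadcast address.
    """
    if prefer_local and peer.datacenter == local_datacenter:
        return peer.internal_address
    return peer.broadcast_address


# Hypothetical peer: the public address is a documentation-range placeholder.
remote_peer = Peer(
    broadcast_address="203.0.113.20",
    internal_address="10.26.5.11",
    datacenter="us-east-1",
)

if __name__ == "__main__":
    # A node in another DC should dial the public address.
    print(expected_connect_address("eu-west-1", remote_peer))
    # A node in the same DC may use the private address.
    print(expected_connect_address("us-east-1", remote_peer))
```

In this model, the failure described in the issue corresponds to the first call returning the internal address even though the peer is in a different DC.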



