[ 
https://issues.apache.org/jira/browse/CASSANDRA-18560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733083#comment-17733083
 ] 

Brandon Williams commented on CASSANDRA-18560:
----------------------------------------------

Ah, streaming, that makes sense.  Perhaps we should have removed the 
preferred_ip usage there as well, as I thought about on CASSANDRA-16718 but 
decided against... it seems that the compromise was a mistake.  I'm not sure we 
should take another swing at that for the next release, and perhaps should just 
revert CASSANDRA-16718.  I'm raising the priority of this ticket to hopefully 
block any releases without it while this is in progress.

> Incorrect IP used for gossip across DCs with prefer_local=true
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-18560
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18560
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Brad Vernon
>            Assignee: Brandon Williams
>            Priority: Urgent
>             Fix For: 4.0.x, 4.1.x, 5.x
>
>
> After installing a new node running 4.0.10, we saw the new node attempt to 
> connect to the private IP of a random number of nodes in remote DCs, which 
> are only reachable via public IP for cross-DC communication.
> Only the new node's outbound connections were affected; inbound connections 
> from pre-4.0.10 nodes were not.  system.peers_v2 (below) showed preferred_ip 
> and preferred_port as null for those nodes; only nodes in the 4.0.10 node's 
> own DC have preferred_ip values, as expected.
> We believe the issue originated with 
> https://issues.apache.org/jira/browse/CASSANDRA-16718 
> Details on cluster:
>  * All nodes have public IP configured as well as private IP
>  * Listen/rpc addresses are configured with the private IP; broadcast is the public IP
>  * prefer_local=true is enabled for all nodes
> The log that showed the connection failing:
> {code:java}
> INFO  [Messaging-EventLoop-3-8] 2023-06-01 00:14:21,565 NoSpamLogger.java:92 - /99.81.<redacted>:7000->/44.208.<redacted>:7000-URGENT_MESSAGES-[no-channel] failed to connect
> io.netty.channel.ConnectTimeoutException: connection timed out: /10.26.5.11:7000
>   at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
> {code}
> The 99.81.x and 44.208.x instances can reach each other only via public IPs.
> gossipinfo output from 4.0.10 node
> {code:java}
> /44.208.<redacted>
>   generation:1661113358
>   heartbeat:25267691
>   LOAD:25267683:1.7882044268E10
>   SCHEMA:24692061:e98b918d-499f-3ccc-8dbe-5af31f685bda
>   DC:13:us-east-1
>   RACK:15:1a
>   RELEASE_VERSION:6:4.0.5
>   NET_VERSION:2:12
>   HOST_ID:3:9a41e668-060d-4cfe-bb1e-013f5116422d
>   RPC_READY:1407:true
>   INTERNAL_ADDRESS_AND_PORT:9:10.26.5.11:7000
>   NATIVE_ADDRESS_AND_PORT:4:44.208.<redacted>:9042
>   STATUS_WITH_PORT:1393:NORMAL,-2262036356854762881
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:1392:<hidden> {code}
> Peers output from 4.0.10 node:
> {code:java}
>  peer              | peer_port | data_center | host_id                              | native_address    | native_port | preferred_ip | preferred_port | rack | release_version | schema_version                       | tokens
> -------------------+-----------+-------------+--------------------------------------+-------------------+-------------+--------------+----------------+------+-----------------+--------------------------------------+--------
>  44.208.<redacted> |      7000 |   us-east-1 | 9a41e668-060d-4cfe-bb1e-013f5116422d | 44.208.<redacted> |        9042 |         null |           null |   1a |           4.0.5 | e98b918d-499f-3ccc-8dbe-5af31f685bda | {'-2262036356854762881', '-4197710115038136897', '-7072386316096662315', '2085255826742630980', '249732489387853170', '4976300208126705818', '7187184456885833289', '8777189009399731927'}
> {code}
> As a temporary workaround, we used iptables to redirect outbound traffic 
> destined for the private IPs to the corresponding public IPs, which resulted 
> in successful outbound connections.
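
For readers following the report, the address selection it expects can be sketched roughly as below. This is an illustration only, not Cassandra's implementation: the Peer class, connect_address function, the "local-dc" name, and the 203.0.113.x addresses (documentation placeholders standing in for the redacted public IPs) are all assumptions. Only 10.26.5.11 and us-east-1 come from the report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Peer:
    """Hypothetical model of the peer fields this report relies on."""
    broadcast_address: str           # public IP, reachable across DCs
    internal_address: Optional[str]  # private IP (INTERNAL_ADDRESS_AND_PORT in gossip)
    datacenter: str


def connect_address(peer: Peer, local_dc: str, prefer_local: bool) -> str:
    """Pick the address to dial for an outbound internode connection.

    The private address is chosen only when prefer_local is enabled AND the
    peer is in the local DC; every other case uses the public address.
    """
    if prefer_local and peer.datacenter == local_dc and peer.internal_address:
        return peer.internal_address
    return peer.broadcast_address


# Placeholder public IPs; the real ones are redacted in the report.
remote_peer = Peer("203.0.113.10", "10.26.5.11", "us-east-1")
local_peer = Peer("203.0.113.20", "10.0.0.5", "local-dc")

print(connect_address(remote_peer, "local-dc", prefer_local=True))
print(connect_address(local_peer, "local-dc", prefer_local=True))
```

Under these assumptions the remote peer is dialed on its public address and the same-DC peer on its private one. The failure in the log corresponds to the first case returning 10.26.5.11 instead.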



