[jira] [Updated] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldo updated CASSANDRA-19178:
-
Resolution: (was: Invalid)
Status: Open  (was: Resolved)

> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services run version 3.11.16, using the official Cassandra 
> 3.11.16 image from Docker Hub. The first service is configured with just the 
> following environment variables:
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which, at startup, are used to modify {_}cassandra.yaml{_}. For instance, 
> the _cassandra.yaml_ for the first service contains the following (the rest 
> is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
> {{tasks.cassandra9}}).
> The cluster runs smoothly and all the nodes are able to rejoin the cluster 
> whatever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for 
> Docker Swarm to restart it, force-update a service to trigger a restart, 
> scale a service to 0 and back to 1, restart an entire server, or turn all 3 
> servers off and on again. I have never found an issue with this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the official Docker image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back up to 3.11.16 again. I 
> finally issued {{nodetool upgradesstables}} on all nodes, so my SSTables 
> now have the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same one I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working: the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}), but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant part of the log, related to the missing internode communication, 
> is attached as _cassandra7.log_.
> In the log of _cassandra9_ there is nothing after the above-mentioned step 4, 
> so only _cassandra7_ says anything in the logs.
> I tried multiple versions (4.0.11 but also 4.0.0) and the outcome is always 
> the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot 
> and finally perform step 4 using the official 3.11.16 version, node 7 
> restarts correctly and joins the cluster. I attached the relevant part of 
> the log (see {_}cassandra7.downgrade.log{_}), where you can see that nodes 7 
> and 9 can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously, I'm 
> using the untouched official Cassandra images, so my whole cluster, inside 
> the Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 steps above for the 
> _cassandra9_ and _cassandra8_ services as well, in the end the cluster works. 
> But this is not acceptable, because the cluster is unavailable until I finish 
> the full upgrade of all nodes: I need to perform a rolling upgrade, one node 
> after the other, where only 1 node is temporarily down and the other N-1 
> stay up.

[jira] [Updated] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-19178:
-
Resolution: Invalid
Status: Resolved  (was: Triage Needed)

I don't see any debug logs here; examining the one on the other side of the 
'Connection reset by peer' may reveal something.

bq. Any idea on how to further investigate the issue?

This Jira is for the development of Apache Cassandra and, as such, makes for a 
poor vehicle for support.  We recommend contacting the community via Slack or 
the mailing list instead: https://cassandra.apache.org/_/community.html  If in 
the end you discover a bug, then please come back and file an actionable report 
here.
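
For what it's worth, a sketch of how debug output on the receiving (3.11) side 
could be raised and captured; this assumes the stock logback configuration, 
where debug.log is written under /var/log/cassandra, and is not part of the 
original comment:
{code:java}
# On the peer node (cassandra9): raise internode messaging logging to DEBUG at runtime
nodetool setlogginglevel org.apache.cassandra.net DEBUG

# Then watch the debug log while cassandra7 retries its connections
tail -f /var/log/cassandra/debug.log
{code}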

> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services run version 3.11.16, using the official Cassandra 
> 3.11.16 image from Docker Hub. The first service is configured with just the 
> following environment variables:
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which, at startup, are used to modify {_}cassandra.yaml{_}. For instance, 
> the _cassandra.yaml_ for the first service contains the following (the rest 
> is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
> {{tasks.cassandra9}}).
> The cluster runs smoothly and all the nodes are able to rejoin the cluster 
> whatever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for 
> Docker Swarm to restart it, force-update a service to trigger a restart, 
> scale a service to 0 and back to 1, restart an entire server, or turn all 3 
> servers off and on again. I have never found an issue with this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the official Docker image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back up to 3.11.16 again. I 
> finally issued {{nodetool upgradesstables}} on all nodes, so my SSTables 
> now have the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same one I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working: the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}), but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant part of the log, related to the missing internode communication, 
> is attached as _cassandra7.log_.
> In the log of _cassandra9_ there is nothing after the above-mentioned step 4, 
> so only _cassandra7_ says anything in the logs.
> I tried multiple versions (4.0.11 but also 4.0.0) and the outcome is always 
> the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot 
> and finally perform step 4 using the official 3.11.16 version, node 7 
> restarts correctly and joins the cluster. I attached the relevant part of 
> the log (see {_}cassandra7.downgrade.log{_}), where you can see that nodes 7 
> and 9 can communicate.
> I suspect this could be related to port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic.

[jira] [Updated] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldo updated CASSANDRA-19178:
-
Description: 
I have a Docker swarm cluster with 3 distinct Cassandra services (named 
{_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
servers. The 3 services run version 3.11.16, using the official Cassandra 
3.11.16 image from Docker Hub. The first service is configured with just the 
following environment variables:
{code:java}
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
which, at startup, are used to modify {_}cassandra.yaml{_}. For instance, the 
_cassandra.yaml_ for the first service contains the following (the rest is the 
image default):
{code:java}
# grep tasks /etc/cassandra/cassandra.yaml
          - seeds: "tasks.cassandra7,tasks.cassandra9"
listen_address: tasks.cassandra7
broadcast_address: tasks.cassandra7
broadcast_rpc_address: tasks.cassandra7 {code}
Other services (8 and 9) have a similar configuration, obviously with a 
different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
{{tasks.cassandra9}}).
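
For context, a minimal sketch of how one of these services might be declared in 
a Docker stack file; the overlay network name, placement constraint and host 
name here are illustrative assumptions, not taken from the actual deployment:
{code:java}
version: "3.8"
services:
  cassandra7:
    image: cassandra:3.11.16
    environment:
      CASSANDRA_LISTEN_ADDRESS: "tasks.cassandra7"
      CASSANDRA_SEEDS: "tasks.cassandra7,tasks.cassandra9"
    networks:
      - cassandra-net                  # overlay network assumed; "tasks.cassandra7" resolves to the task IP
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == server7   # hypothetical constraint pinning the service to one server
networks:
  cassandra-net:
    driver: overlay
{code}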

The cluster runs smoothly and all the nodes are able to rejoin the cluster 
whatever event occurs, thanks to the Docker Swarm {{tasks.cassandraXXX}} 
"hostname": I can kill a Docker container and wait for Docker Swarm to restart 
it, force-update a service to trigger a restart, scale a service to 0 and back 
to 1, restart an entire server, or turn all 3 servers off and on again. I have 
never found an issue with this.
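
The events described above map onto ordinary Docker Swarm operations; a hedged 
sketch of the kind of commands involved, assuming the service names used in 
this report:
{code:java}
# Force Swarm to restart the task without changing the service definition
docker service update --force cassandra7

# Scale the service to 0 and back to 1
docker service scale cassandra7=0
docker service scale cassandra7=1

# Kill the running container and let Swarm reschedule it
docker kill $(docker ps -q -f name=cassandra7)
{code}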

I also just completed a full upgrade of the cluster from version 2.2.8 to 
3.11.16 (simply upgrading the official Docker image associated with the 
services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
server, to perform a full downgrade to 2.2.8 and back up to 3.11.16 again. I 
finally issued {{nodetool upgradesstables}} on all nodes, so my SSTables now 
have the {{me-*}} prefix.
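
For reference, a sketch of the per-node snapshot and SSTable upgrade steps 
mentioned above; the snapshot tag, keyspace and table names are illustrative:
{code:java}
# Before the upgrade: snapshot the node (tag name is arbitrary)
nodetool snapshot -t pre-upgrade

# After the upgrade: rewrite SSTables to the current on-disk format
nodetool upgradesstables

# Check the SSTable file prefix ("me-" for 3.11.x)
ls /var/lib/cassandra/data/my_keyspace/my_table-*/ | head
{code}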

 

The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
procedure I follow is very simple (a rough command equivalent is sketched 
after the list):
 # I start from the _cassandra7_ service (which is a seed node)
 # {{nodetool drain}}
 # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
 # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
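
In Docker Swarm terms, steps 2-4 correspond roughly to the following; the 
container lookup and log commands are a sketch, not necessarily the exact ones 
used:
{code:java}
# Steps 2-3: drain the node inside the running 3.11.16 container, then watch its log
docker exec $(docker ps -q -f name=cassandra7) nodetool drain
docker service logs --follow cassandra7    # wait for "DRAINING ... DRAINED"

# Step 4: point the service at the 4.1.3 image; Swarm restarts the task with it
docker service update --image cassandra:4.1.3 cassandra7
{code}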

The procedure is exactly the same one I followed for the upgrade 2.2.8 --> 
3.11.16, obviously with a different version at step 4. Unfortunately the 
upgrade 3.x --> 4.x is not working: the _cassandra7_ service restarts and 
attempts to communicate with the other seed node ({_}cassandra9{_}), but the 
log of _cassandra7_ shows the following:
{code:java}
INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
OutboundConnectionInitiator.java:390 - Failed to connect to peer 
tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
Connection reset by peer{code}
The relevant part of the log, related to the missing internode communication, 
is attached as _cassandra7.log_.

In the log of _cassandra9_ there is nothing after the above-mentioned step 4, 
so only _cassandra7_ says anything in the logs.

I tried multiple versions (4.0.11 but also 4.0.0) and the outcome is always 
the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot and 
finally perform step 4 using the official 3.11.16 version, node 7 restarts 
correctly and joins the cluster. I attached the relevant part of the log (see 
{_}cassandra7.downgrade.log{_}), where you can see that nodes 7 and 9 can 
communicate.

I suspect this could be related to port 7000 now (with Cassandra 4.x) 
supporting both encrypted and unencrypted traffic. As stated previously, I'm 
using the untouched official Cassandra images, so my whole cluster, inside the 
Docker Swarm, is not (and has never been) configured with encryption.
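
For reference, the relevant internode settings in a stock 4.x cassandra.yaml 
look roughly like this; the values shown are my reading of the shipped 
defaults, not taken from this cluster's configuration:
{code:java}
server_encryption_options:
    # outgoing internode connections are not encrypted
    internode_encryption: none
    # when optional is true (the effective default while internode_encryption is none),
    # the storage port 7000 accepts both encrypted and unencrypted connections
    # optional: true
    # the separate ssl_storage_port (7001) is only opened if this is enabled
    enable_legacy_ssl_storage_port: false
{code}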

I can also add the following: if I perform the 4 steps above for the 
_cassandra9_ and _cassandra8_ services as well, in the end the cluster works. 
But this is not acceptable, because the cluster is unavailable until I finish 
the full upgrade of all nodes: I need to perform a rolling upgrade, one node 
after the other, where only 1 node is temporarily down and the other N-1 stay 
up.

Any idea on how to further investigate the issue? Thanks

 

  was:
I have a Docker swarm cluster with 3 distinct Cassandra services (named 
{_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
servers. The 3 services run version 3.11.16, using the official Cassandra 
3.11.16 image from Docker Hub. The first service is configured with just the 
following environment variables:
{code:java}
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
which, at startup, are used to modify {_}cassandra.yaml{_}.