[jira] [Commented] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails due to IP change

2023-12-07 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794463#comment-17794463
 ] 

Aldo commented on CASSANDRA-19178:
--

{quote}One way out may be to add a new node to the cluster that knows about 
cassandra7 and cassandra9 that can "introduce" those nodes to each other once 
it knows about their correct addresses. It may not even need to complete 
bootstrapping for this to happen.
{quote}
Good to know, thanks. To be honest, my 3-node scenario is a test environment 
where I'm simulating the 3.x->4.x upgrade. The real production environment is a 
5-node cluster with RF=3. So, given that I can temporarily "accept" the quick 
downtime of 2 nodes out of 5, I can probably:
 # select a seed node X for upgrade: such a node will restart, receive a new IP, 
and stay out of the cluster until another node discovers its new IP and opens an 
inbound connection to it
 # select another node Y and just restart it (without upgrading) but give it the 
full list of all 5 nodes as seeds: this way Y will resolve the hostnames->IPs of 
all 5 nodes (including X), thus "introducing" X back into the cluster, according 
to your suggestion

I will repeat steps 1-2 for all 5 nodes until everything is upgraded to 4.x. 
Moreover, after the first iteration on the very first node, the following 
iterations can be simplified: I can probably use your other suggestion (nodetool 
reloadseeds) and perform step #2 by selecting a node Y that is already on 4.x and 
executing "nodetool reloadseeds".
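
For what it's worth, a quick way to double-check step #2 is to resolve the 
tasks.* service names from inside any container attached to the same overlay 
network. This is just a minimal sketch (the service names are the ones from this 
ticket; nothing Cassandra-specific is assumed):
{code:java}
// Minimal sketch: ask the Swarm DNS which IPs the tasks.* service names resolve to.
// Run it inside any container attached to the same overlay network.
import java.net.InetAddress;
import java.util.Arrays;

public class ResolveSeeds {
    public static void main(String[] args) {
        String[] names = args.length > 0
                ? args
                : new String[] { "tasks.cassandra7", "tasks.cassandra8", "tasks.cassandra9" };
        for (String name : names) {
            try {
                InetAddress[] addresses = InetAddress.getAllByName(name);
                System.out.println(name + " -> " + Arrays.toString(addresses));
            } catch (Exception e) {
                System.out.println(name + " -> unresolved (" + e.getMessage() + ")");
            }
        }
    }
}
{code}
Running it before and after restarting X should show the old IP disappearing and 
the new one being returned by the Swarm DNS.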

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have now the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working, the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}) but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> 

[jira] [Commented] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails due to IP change

2023-12-07 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794434#comment-17794434
 ] 

Aldo commented on CASSANDRA-19178:
--

{quote}Is the seed list on cassandra9 up to date with cassandra7?
{quote}
Is there a way to dynamically update the seed list on a live node? In my 
configuration I have:
 * cassandra7, just upgraded to 4.x but out of the cluster until it is able to 
properly communicate with the other peers
 * cassandra8, running 3.x and paired with cassandra9; it doesn't know the new IP 
of cassandra7
 * cassandra9, running 3.x and paired with cassandra8; it doesn't know the new IP 
of cassandra7

If I can trigger some kind of live seed-list refresh on cassandra9 or cassandra8, 
this will result in what you described: cassandra7 will learn the 8 & 9 messaging 
version when they communicate with it. But for that to happen, either 9 or 8 must 
be triggered to use the new cassandra7 IP. "Triggered" to me means a live 
trigger: it's unacceptable to restart another node (8 or 9) while 7 is already 
out of the cluster. Is it possible? Through JMX or something similar?
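
As far as I can tell, 3.11.x does not expose such an operation, while 4.x added 
{{nodetool reloadseeds}}, which appears to go through the StorageService MBean. 
Below is a minimal, hedged sketch of triggering that reload over plain JMX on an 
already-upgraded 4.x node; it assumes the operation is exposed as 
{{reloadSeeds}} on {{org.apache.cassandra.db:type=StorageService}} (as the 4.x 
nodetool command suggests) and that JMX is reachable, unauthenticated, on the 
default port 7199:
{code:java}
// Hedged sketch: trigger a seed-list reload over plain JMX on a 4.x node, which is
// roughly what "nodetool reloadseeds" does. Assumptions (not taken from this ticket):
// the operation is reloadSeeds() on org.apache.cassandra.db:type=StorageService and
// JMX is unauthenticated on the default port 7199. 3.11.x does not expose it.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReloadSeedsViaJmx {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "tasks.cassandra9"; // an already-4.x node
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            // Re-resolve the seed list from the configured seed provider (the tasks.* names).
            Object seeds = mbs.invoke(storageService, "reloadSeeds",
                    new Object[0], new String[0]);
            System.out.println("Seed list after reload: " + seeds);
        } finally {
            connector.close();
        }
    }
}
{code}
On a 3.x node the same invocation would simply fail with an unknown-operation 
error, which is consistent with having to wait for a 4.x peer (or a restart) to 
pick up the new IP.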

 

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have now the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working, the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}) but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant port of the log, related to the missing internode communication, 
> is attached in _cassandra7.log_
> In the log of _cassandra9_ there is nothing after the abovementioned step #4. 
> So only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is 
> always the same. Of course when I follow the 

[jira] [Commented] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails due to IP change

2023-12-07 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794431#comment-17794431
 ] 

Aldo commented on CASSANDRA-19178:
--

Unfortunately the answer is no: cassandra7 just restarted and got a brand new IP 
from Docker Swarm, so there is no way for cassandra9 to contact cassandra7 by 
itself. It is cassandra7 that, once restarted, must communicate with cassandra9. 
According to the code I studied in my last comment above, that should work. But 
instead, in my environment, the answer from cassandra9 is completely masked by 
the connection reset by peer.

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have now the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working, the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}) but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant port of the log, related to the missing internode communication, 
> is attached in _cassandra7.log_
> In the log of _cassandra9_ there is nothing after the abovementioned step #4. 
> So only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is 
> always the same. Of course when I follow the steps 1..3, then restore the 3.x 
> snapshot and finally perform the step #4 using the official 3.11.16 version 
> the node 7 restarts correctly and joins the cluster. I attached the relevant 
> part of the log (see {_}cassandra7.downgrade.log{_}) where you can see that 
> node 7 and 9 can communicate.
> I suspect this could be related to the port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously I'm 
> using the untouched official Cassandra images so all my cluster, inside 

[jira] [Commented] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails due to IP change

2023-12-07 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17794427#comment-17794427
 ] 

Aldo commented on CASSANDRA-19178:
--

I carefully read the code of _IncomingTcpConnection.java_ (trunk 3.11.16). The 
[receiveMessages|https://github.com/apache/cassandra/blob/681b6ca103d91d940a9fecb8cd812f58dd2490d0/src/java/org/apache/cassandra/net/IncomingTcpConnection.java#L142]
 method seems to do two things:
 # write and flush its current version (11)
 # throw an IOException

The IOException results in the socket being closed.

On the other side, the caller is busy in _OutboundConnectionInitiator.java_ 
(trunk 4.1.3). It *for sure* enters the 
[decode|https://github.com/apache/cassandra/blob/2a4cd36475de3eb47207cd88d2d472b876c6816d/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L263C27-L263C27]
 method and proceeds to [line 
267|https://github.com/apache/cassandra/blob/2a4cd36475de3eb47207cd88d2d472b876c6816d/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L267]
 where it *should* decode the message, discover version 11, print 
{{received second handshake message from peer}} as per [line 
273|https://github.com/apache/cassandra/blob/2a4cd36475de3eb47207cd88d2d472b876c6816d/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L273]
 and then re-contact the peer, this time with version 11. But according to my 
log snippet of cassandra7 above, the _OutboundConnectionInitiator.decode()_ 
method is instead unable to execute the code at line 267, which results in an 
exception being thrown and caught at [line 
363|https://github.com/apache/cassandra/blob/2a4cd36475de3eb47207cd88d2d472b876c6816d/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L363].
 From there the 
[exceptionCaught|https://github.com/apache/cassandra/blob/2a4cd36475de3eb47207cd88d2d472b876c6816d/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L368]
 method is invoked and we can see the exception log with {{Failed to connect 
to peer ... Connection reset by peer}}.

I wonder what is causing this behavior:
 # Is it good practice, in version 3.11.16, to write and flush the correct 
messaging version (11) and then abruptly close the socket?
 # How can the caller (4.1.3) be guaranteed to receive the few bytes indicating 
the correct messaging version? In my environment the socket abruptly closed by 
the other peer seems to be "winning" over those few response bytes (see the 
sketch after this list).
 # Is there something at the Netty level (some kind of system property) able to 
mitigate this kind of behavior, either on the 4.1.3 node or on the 3.11.16 node?
 # Is it possible that my environment (AWS servers, Docker Swarm) triggered 
something similar to what is documented at [line 
372|https://github.com/apache/cassandra/blob/2a4cd36475de3eb47207cd88d2d472b876c6816d/src/java/org/apache/cassandra/net/OutboundConnectionInitiator.java#L372]?
 The comment relates to {{SslClosedEngineException}} (which is not my case), 
but the reference to {{io.netty.channel.unix.Errors$NativeIoException: 
readAddress(..)}} matches my logs.
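
To make question #2 concrete, here is a standalone, hedged sketch (plain sockets, 
not the actual Cassandra or Netty code) of the race I think I'm seeing: a server 
that writes and flushes a small reply and then closes the socket while unread 
request bytes are still pending typically provokes a TCP RST, and depending on 
timing and OS the client can observe "Connection reset by peer" instead of the 
reply. Port 9999 and the byte counts are arbitrary.
{code:java}
// Standalone sketch of the race in question #2: the "server" reads only part of the
// request, writes and flushes a one-byte reply (think "max version = 11"), then closes
// the socket while unread request bytes are still buffered. On Linux that close
// typically sends a TCP RST, and the client may see "Connection reset by peer"
// instead of the reply, depending on timing.
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ResetRaceDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(9999);

        Thread peer = new Thread(() -> {
            try (Socket s = server.accept()) {
                s.getInputStream().read(new byte[4]);         // read only part of the request
                s.getOutputStream().write(new byte[] { 11 }); // write + flush the small reply
                s.getOutputStream().flush();
                // leaving the block closes the socket with unread bytes still buffered -> RST
            } catch (IOException ignored) {
            }
        });
        peer.start();

        try (Socket client = new Socket("127.0.0.1", 9999)) {
            client.getOutputStream().write(new byte[64]);     // the larger "v12 handshake" frame
            client.getOutputStream().flush();
            Thread.sleep(300);                                // give the RST time to arrive
            System.out.println("reply byte: " + client.getInputStream().read());
        } catch (IOException e) {
            System.out.println("client failed: " + e);        // often: Connection reset by peer
        } finally {
            server.close();
            peer.join();
        }
    }
}
{code}
Whether the reply byte or the reset "wins" is timing-dependent, which would 
explain why the handshake response from the 3.11.16 side never seems to reach 
the 4.1.3 side here.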

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force 

[jira] [Updated] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails due to IP change

2023-12-06 Thread Aldo (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldo updated CASSANDRA-19178:
-
Summary: Cluster upgrade 3.x -> 4.x fails due to IP change  (was: Cluster 
upgrade 3.x -> 4.x fails with no internode encryption)

> Cluster upgrade 3.x -> 4.x fails due to IP change
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have now the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working, the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}) but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant port of the log, related to the missing internode communication, 
> is attached in _cassandra7.log_
> In the log of _cassandra9_ there is nothing after the abovementioned step #4. 
> So only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is 
> always the same. Of course when I follow the steps 1..3, then restore the 3.x 
> snapshot and finally perform the step #4 using the official 3.11.16 version 
> the node 7 restarts correctly and joins the cluster. I attached the relevant 
> part of the log (see {_}cassandra7.downgrade.log{_}) where you can see that 
> node 7 and 9 can communicate.
> I suspect this could be related to the port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously I'm 
> using the untouched official Cassandra images so all my cluster, inside the 
> Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 above steps also for the 
> _cassandra9_ and _cassandra8_ services, in the end the cluster works. But 
> this is not acceptable, because the cluster is unavailable until I finish the 
> full upgrade of all nodes: 

[jira] [Comment Edited] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793944#comment-17793944
 ] 

Aldo edited comment on CASSANDRA-19178 at 12/6/23 11:06 PM:


I apologize in advance if reopening is not the correct procedure; please tell me 
if I need to open a new issue instead.
I think I've discovered the root cause of the issue, and I wonder whether it's a 
bug or a misconfiguration on my side.
 
Using {{nodetool setlogginglevel org.apache.cassandra TRACE}} on both the 4.x 
upgraded node (cassandra7) and on the running 3.x seed node (cassandra9) I was 
able to isolate the relevant logs:
 
On cassandra7:
 
 
{code:java}
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 
EndpointMessagingVersions.java:67 - Assuming current protocol version for 
tasks.cassandra9/10.0.2.92:7000 
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 
OutboundConnectionInitiator.java:131 - creating outbound bootstrap to peer: 
(tasks.cassandra9/10.0.2.92:7000, tasks.cassandra9/10.0.2.92:7000), framing: 
CRC, encryption: unencrypted, requestVersion: 12
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,411 
OutboundConnectionInitiator.java:236 - starting handshake with peer 
tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000), msg = 
Initiate(request: 12, min: 10, max: 12, type: URGENT_MESSAGES, framing: true, 
from: tasks.cassandra7/10.0.2.137:7000) 
INFO  [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,412 
OutboundConnectionInitiator.java:390 - Failed to connect to peer 
tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000) 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
Connection reset by peer  {code}
 
On cassandra9:
 
{code:java}
TRACE [ACCEPT-tasks.cassandra9/10.0.2.92] 2023-12-06 22:16:56,411 
MessagingService.java:1315 - Connection version 12 from /10.0.2.137
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 
IncomingTcpConnection.java:111 - IOException reading from socket; closing
java.io.IOException: Peer-used messaging version 12 is larger than max 
supported 11
        at 
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:153)
        at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:98)
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 
IncomingTcpConnection.java:125 - Closing socket 
Socket[addr=/10.0.2.137,port=45680,localport=7000] - isclosed: false {code}
 
So it seems there is a mismatch on this {_}messaging version{_}.

I'm trying to understand the behaviour of _EndpointMessagingVersions.java_ and 
_OutboundConnectionInitiator.java_ on the 4.1.x trunk, and it seems there are a 
few facts:
 # the internal map of _EndpointMessagingVersions_ on the node that has just 
restarted (cassandra7) for sure doesn't include information about the existing 
node (cassandra9). This is because, in my network configuration, cassandra7 (or 
more precisely the tasks.cassandra7 hostname) changed IP due to the restart. So 
cassandra9 (the running 3.x node) cannot send its messaging version (=11) to the 
new cassandra7 until the handshake completes.
 # therefore, inside _OutboundConnectionInitiator_, the messaging version for the 
cassandra7 --> cassandra9 handshake is assumed to be the current one (=12)
 # when the 3.x node (cassandra9) detects the messaging version mismatch, it 
throws an IOException and closes the connection
 # the 4.x node (cassandra7) just sees a connection reset by peer and seems 
unable to downgrade the messaging version and retry the handshake

I can again state that a similar upgrade path with different versions (2.2.8 --> 
3.11.16), on the exact same architecture, involving the same Docker Swarm 
services, the same IP-changing behaviour, etc., worked like a charm. So I think 
something changed in the source code and broke that behaviour when the upgrade 
is 3.11.16 --> 4.1.3. A simplified model of facts 1-2 above is sketched below.
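
Purely to illustrate facts 1-2 (this is a simplified stand-in, not the actual 
Cassandra classes), the behaviour boils down to a version map that falls back to 
the node's own current version when a peer is unknown, which is the situation 
right after the restart with a new IP:
{code:java}
// Simplified stand-in (not the actual Cassandra classes) for facts 1-2 above: a map of
// last-known peer messaging versions that falls back to the node's own current version
// when a peer is unknown ("Assuming current protocol version" in the cassandra7 TRACE log).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EndpointVersionsModel {
    static final int CURRENT_VERSION = 12; // 4.1.x
    static final int VERSION_3_11 = 11;    // 3.11.x

    // peer -> last known messaging version; empty right after a restart
    private final Map<String, Integer> versions = new ConcurrentHashMap<>();

    int versionFor(String peer) {
        // Unknown peer: assume our own current version for the first handshake attempt.
        return versions.getOrDefault(peer, CURRENT_VERSION);
    }

    void learn(String peer, int version) {
        versions.put(peer, version); // normally learned via gossip or a completed handshake
    }

    public static void main(String[] args) {
        EndpointVersionsModel model = new EndpointVersionsModel();
        System.out.println(model.versionFor("tasks.cassandra9")); // 12 -> triggers the mismatch
        model.learn("tasks.cassandra9", VERSION_3_11);
        System.out.println(model.versionFor("tasks.cassandra9")); // 11 -> handshake would succeed
    }
}
{code}
Once the entry for the peer is learned, the outbound connection would use 11 and 
the mismatch would disappear, which matches the fact that the cluster works once 
all nodes are on 4.x.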


was (Author: JIRAUSER303409):
I apologize in advance if reopening is not the correct behavior, please tell me 
if I need to open a new issue. 
I think I've discovered the source cause of the issue, and wonder if it's a bug 
or it's caused by a misconfiguration on my side.
 
Using {{nodetool setlogginglevel org.apache.cassandra TRACE}} on both the 4.x 
upgraded node (cassandra7) and on the running 3.x seed node (cassandra9) I was 
able to isolate the relevant logs:
 
On cassandra7:
 
 
{code:java}
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 
EndpointMessagingVersions.java:67 - Assuming current protocol version for 
tasks.cassandra9/10.0.2.92:7000 
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 
OutboundConnectionInitiator.java:131 - creating outbound bootstrap to peer: 
(tasks.cassandra9/10.0.2.92:7000, tasks.cassandra9/10.0.2.92:7000), framing: 
CRC, encryption: unencrypted, requestVersion: 12
TRACE 

[jira] [Updated] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldo updated CASSANDRA-19178:
-
Resolution: (was: Invalid)
Status: Open  (was: Resolved)

> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have now the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working, the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}) but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant port of the log, related to the missing internode communication, 
> is attached in _cassandra7.log_
> In the log of _cassandra9_ there is nothing after the abovementioned step #4. 
> So only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is 
> always the same. Of course when I follow the steps 1..3, then restore the 3.x 
> snapshot and finally perform the step #4 using the official 3.11.16 version 
> the node 7 restarts correctly and joins the cluster. I attached the relevant 
> part of the log (see {_}cassandra7.downgrade.log{_}) where you can see that 
> node 7 and 9 can communicate.
> I suspect this could be related to the port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously I'm 
> using the untouched official Cassandra images so all my cluster, inside the 
> Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 above steps also for the 
> _cassandra9_ and _cassandra8_ services, in the end the cluster works. But 
> this is not acceptable, because the cluster is unavailable until I finish the 
> full upgrade of all nodes: I need to perform a step-update, one 

[jira] [Commented] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793944#comment-17793944
 ] 

Aldo commented on CASSANDRA-19178:
--

I apologize in advance if reopening is not the correct procedure; please tell me 
if I need to open a new issue instead.
I think I've discovered the root cause of the issue, and I wonder whether it's a 
bug or a misconfiguration on my side.
 
Using {{nodetool setlogginglevel org.apache.cassandra TRACE}} on both the 4.x 
upgraded node (cassandra7) and on the running 3.x seed node (cassandra9) I was 
able to isolate the relevant logs:
 
On cassandra7:
 
 
{code:java}
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 
EndpointMessagingVersions.java:67 - Assuming current protocol version for 
tasks.cassandra9/10.0.2.92:7000 
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,410 
OutboundConnectionInitiator.java:131 - creating outbound bootstrap to peer: 
(tasks.cassandra9/10.0.2.92:7000, tasks.cassandra9/10.0.2.92:7000), framing: 
CRC, encryption: unencrypted, requestVersion: 12
TRACE [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,411 
OutboundConnectionInitiator.java:236 - starting handshake with peer 
tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000), msg = 
Initiate(request: 12, min: 10, max: 12, type: URGENT_MESSAGES, framing: true, 
from: tasks.cassandra7/10.0.2.137:7000) 
INFO  [Messaging-EventLoop-3-3] 2023-12-06 22:16:56,412 
OutboundConnectionInitiator.java:390 - Failed to connect to peer 
tasks.cassandra9/10.0.2.92:7000(tasks.cassandra9/10.0.2.92:7000) 
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
Connection reset by peer  {code}
 
On cassandra9:
 
{code:java}
TRACE [ACCEPT-tasks.cassandra9/10.0.2.92] 2023-12-06 22:16:56,411 
MessagingService.java:1315 - Connection version 12 from /10.0.2.137
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 
IncomingTcpConnection.java:111 - IOException reading from socket; closing
java.io.IOException: Peer-used messaging version 12 is larger than max 
supported 11
        at 
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:153)
        at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:98)
TRACE [MessagingService-Incoming-/10.0.2.137] 2023-12-06 22:16:56,412 
IncomingTcpConnection.java:125 - Closing socket 
Socket[addr=/10.0.2.137,port=45680,localport=7000] - isclosed: false {code}
 
So it seems there is a mismatch on this {_}messaging version{_}.
 

> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have 

[jira] [Commented] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793934#comment-17793934
 ] 

Aldo commented on CASSANDRA-19178:
--

Thanks, I moved the question to StackExchange 
[here|https://dba.stackexchange.com/questions/333799/cassandra-cluster-upgrade-3-x-4-x-fails-with-internode-encryption-none].

> Cluster upgrade 3.x -> 4.x fails with no internode encryption
> -
>
> Key: CASSANDRA-19178
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip
>Reporter: Aldo
>Priority: Normal
> Attachments: cassandra7.downgrade.log, cassandra7.log
>
>
> I have a Docker swarm cluster with 3 distinct Cassandra services (named 
> {_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
> servers. The 3 services are running the version 3.11.16, using the official 
> Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
> with the following environment variables
> {code:java}
> CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
> CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
> which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
> the _cassandra.yaml_ for the first service contains the following (and the 
> rest is the image default):
> {code:java}
> # grep tasks /etc/cassandra/cassandra.yaml
>           - seeds: "tasks.cassandra7,tasks.cassandra9"
> listen_address: tasks.cassandra7
> broadcast_address: tasks.cassandra7
> broadcast_rpc_address: tasks.cassandra7 {code}
> Other services (8 and 9) have a similar configuration, obviously with a 
> different {{CASSANDRA_LISTEN_ADDRESS }}(\{{{}tasks.cassandra8}} and 
> {{{}tasks.cassandra9{}}}).
> The cluster is running smoothly and all the nodes are perfectly able to 
> rejoin the cluster whichever event occurs, thanks to the Docker Swarm 
> {{tasks.cassandraXXX}} "hostname": i can kill a Docker container waiting for 
> Docker swarm to restart it, force update it in order to force a restart, 
> scale to 0 and then 1 the service, restart an entire server, turn off and 
> then turn on all the 3 servers. Never found an issue on this.
> I also just completed a full upgrade of the cluster from version 2.2.8 to 
> 3.11.16 (simply upgrading the Docker official image associated with the 
> services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
> server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
> finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables 
> have now the {{me-*}} prefix.
>  
> The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
> procedure that I follow is very simple:
>  # I start from the _cassandra7_ service (which is a seed node)
>  # {{nodetool drain}}
>  # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
>  # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version
> The procedure is exactly the same I followed for the upgrade 2.2.8 --> 
> 3.11.16, obviously with a different version at step 4. Unfortunately the 
> upgrade 3.x --> 4.x is not working, the _cassandra7_ service restarts and 
> attempts to communicate with the other seed node ({_}cassandra9{_}) but the 
> log of _cassandra7_ shows the following:
> {code:java}
> INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
> OutboundConnectionInitiator.java:390 - Failed to connect to peer 
> tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
> io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
> Connection reset by peer{code}
> The relevant port of the log, related to the missing internode communication, 
> is attached in _cassandra7.log_
> In the log of _cassandra9_ there is nothing after the abovementioned step #4. 
> So only _cassandra7_ is saying something in the logs.
> I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is 
> always the same. Of course when I follow the steps 1..3, then restore the 3.x 
> snapshot and finally perform the step #4 using the official 3.11.16 version 
> the node 7 restarts correctly and joins the cluster. I attached the relevant 
> part of the log (see {_}cassandra7.downgrade.log{_}) where you can see that 
> node 7 and 9 can communicate.
> I suspect this could be related to the port 7000 now (with Cassandra 4.x) 
> supporting both encrypted and unencrypted traffic. As stated previously I'm 
> using the untouched official Cassandra images so all my cluster, inside the 
> Docker Swarm, is not (and has never been) configured with encryption.
> I can also add the following: if I perform the 4 above steps also for the 
> _cassandra9_ and _cassandra8_ services, in the end the cluster works. But 
> this 

[jira] [Updated] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-19178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldo updated CASSANDRA-19178:
-
Description: 
I have a Docker Swarm cluster with 3 distinct Cassandra services (named 
{_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
servers. The 3 services are running version 3.11.16, using the official Cassandra 
3.11.16 image from Docker Hub. The first service is configured with just the 
following environment variables
{code:java}
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
the _cassandra.yaml_ for the first service contains the following (and the rest 
is the image default):
{code:java}
# grep tasks /etc/cassandra/cassandra.yaml
          - seeds: "tasks.cassandra7,tasks.cassandra9"
listen_address: tasks.cassandra7
broadcast_address: tasks.cassandra7
broadcast_rpc_address: tasks.cassandra7 {code}
Other services (8 and 9) have a similar configuration, obviously with a 
different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
{{tasks.cassandra9}}).

The cluster is running smoothly and all the nodes are perfectly able to rejoin 
the cluster whatever event occurs, thanks to the Docker Swarm 
{{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for 
Docker Swarm to restart it, force-update it in order to force a restart, scale 
the service to 0 and then back to 1, restart an entire server, or turn all 3 
servers off and then on again. I have never found an issue with this.

I also just completed a full upgrade of the cluster from version 2.2.8 to 
3.11.16 (simply upgrading the Docker official image associated with the 
services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables have 
now the {{me-*}} prefix.

 

The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
procedure that I follow is very simple:
 # I start from the _cassandra7_ service (which is a seed node)
 # {{nodetool drain}}
 # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
 # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version

The procedure is exactly the same I followed for the 2.2.8 --> 3.11.16 upgrade, 
obviously with a different version at step 4. Unfortunately the 3.x --> 4.x 
upgrade is not working: the _cassandra7_ service restarts and attempts to 
communicate with the other seed node ({_}cassandra9{_}), but the log of 
_cassandra7_ shows the following:
{code:java}
INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
OutboundConnectionInitiator.java:390 - Failed to connect to peer 
tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
Connection reset by peer{code}
The relevant part of the log, related to the missing internode communication, 
is attached as _cassandra7.log_.

In the log of _cassandra9_ there is nothing after the above-mentioned step #4, 
so only _cassandra7_ is saying something in the logs.

I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is always 
the same. Of course, when I follow steps 1-3, then restore the 3.x snapshot and 
finally perform step #4 using the official 3.11.16 image, node 7 restarts 
correctly and joins the cluster. I attached the relevant part of the log (see 
{_}cassandra7.downgrade.log{_}) where you can see that nodes 7 and 9 can 
communicate.

I suspect this could be related to port 7000 now (with Cassandra 4.x) supporting 
both encrypted and unencrypted traffic. As stated previously, I'm using the 
untouched official Cassandra images, so my whole cluster, inside the Docker 
Swarm, is not (and has never been) configured with encryption.

I can also add the following: if I perform the 4 steps above for the 
_cassandra9_ and _cassandra8_ services as well, in the end the cluster works. But 
this is not acceptable, because the cluster is unavailable until I finish the 
full upgrade of all nodes: I need to perform a rolling upgrade, one node after 
the other, where only 1 node is temporarily down and the other N-1 stay up.

Any idea on how to further investigate the issue? Thanks

 

  was:
I have a Docker swarm cluster with 3 distinct Cassandra services (named 
{_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
servers. The 3 services are running the version 3.11.16, using the official 
Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
with the following environment variables
{code:java}
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
which in turn, at startup, modifies 

[jira] [Created] (CASSANDRA-19178) Cluster upgrade 3.x -> 4.x fails with no internode encryption

2023-12-06 Thread Aldo (Jira)
Aldo created CASSANDRA-19178:


 Summary: Cluster upgrade 3.x -> 4.x fails with no internode 
encryption
 Key: CASSANDRA-19178
 URL: https://issues.apache.org/jira/browse/CASSANDRA-19178
 Project: Cassandra
  Issue Type: Bug
  Components: Cluster/Gossip
Reporter: Aldo
 Attachments: cassandra7.downgrade.log, cassandra7.log

I have a Docker swarm cluster with 3 distinct Cassandra services (named 
{_}cassandra7{_}, {_}cassandra8{_}, {_}cassandra9{_}) running on 3 different 
servers. The 3 services are running the version 3.11.16, using the official 
Cassandra image 3.11.16 on Docker Hub. The first service is configured just 
with the following environment variables
{code:java}
CASSANDRA_LISTEN_ADDRESS="tasks.cassandra7"
CASSANDRA_SEEDS="tasks.cassandra7,tasks.cassandra9" {code}
which in turn, at startup, modifies the {_}cassandra.yaml{_}. So for instance 
the _cassandra.yaml_ for the first service contains the following (and the rest 
is the image default):
{code:java}
# grep tasks /etc/cassandra/cassandra.yaml
          - seeds: "tasks.cassandra7,tasks.cassandra9"
listen_address: tasks.cassandra7
broadcast_address: tasks.cassandra7
broadcast_rpc_address: tasks.cassandra7 {code}
Other services (8 and 9) have a similar configuration, obviously with a 
different {{CASSANDRA_LISTEN_ADDRESS}} ({{tasks.cassandra8}} and 
{{tasks.cassandra9}}).

The cluster is running smoothly and all the nodes are perfectly able to rejoin 
the cluster whatever event occurs, thanks to the Docker Swarm 
{{tasks.cassandraXXX}} "hostname": I can kill a Docker container and wait for 
Docker Swarm to restart it, force-update it in order to force a restart, scale 
the service to 0 and then back to 1, restart an entire server, or turn all 3 
servers off and then on again. I have never found an issue with this.

I also just completed a full upgrade of the cluster from version 2.2.8 to 
3.11.16 (simply upgrading the Docker official image associated with the 
services) without issues. I was also able, thanks to a 2.2.8 snapshot on each 
server, to perform a full downgrade to 2.2.8 and back to 3.11.16 again. I 
finally issued a {{nodetool upgradesstables}} on all nodes, so my SSTables have 
now the {{me-*}} prefix.

 

The problem I'm facing right now is the upgrade from 3.11.16 to 4.x. The 
procedure that I follow is very simple:
 # I start from the _cassandra7_ service (which is a seed node)
 # {{nodetool drain}}
 # Wait for the {{DRAINING ... DRAINED}} messages to appear in the log
 # Upgrade the Docker image of _cassandra7_ to the official 4.1.3 version

The procedure is exactly the same I followed for the upgrade 2.2.8 --> 3.11.16, 
obviously with a different version at step 4. Unfortunately the upgrade 3.x --> 
4.x is not working, the _cassandra7_ service restarts and attempts to 
communicate with the other seed node ({_}cassandra9{_}) but the log of 
_cassandra7_ shows the following:
{code:java}
INFO  [Messaging-EventLoop-3-3] 2023-12-06 17:15:04,727 
OutboundConnectionInitiator.java:390 - Failed to connect to peer 
tasks.cassandra9/10.0.2.196:7000(tasks.cassandra9/10.0.2.196:7000)
io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: 
Connection reset by peer{code}
The relevant part of the log, related to the missing internode communication, 
is attached as _cassandra7.log_.

In the log of _cassandra9_ there is nothing after the abovementioned step #4. 
So only _cassandra7_ is saying something in the logs.

I tried with multiple versions (4.0.11 but also 4.0.0) but the outcome is 
always the same. Of course when I follow the steps 1..3, then restore the 3.x 
snapshot and finally perform the step #4 using the official 3.11.16 version the 
node 7 restarts correctly and joins the cluster. I attached the relevant part 
of the log (see {_}cassandra7.downgrade.log{_}) where you can see that node 7 
and 9 can communicate.

I suspect this could be related to the port 7000 now (with Cassandra 4.x) 
supporting both encrypted and unencrypted traffic. As stated previously I'm 
using the untouched official Cassandra images so all my cluster, inside the 
Docker Swarm, is not (and has never been) configured with encryption.

Any idea on how to further investigate the issue? Thanks



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (AIRFLOW-2755) k8s workers think DAGs are always in `/tmp/dags`

2018-07-16 Thread Aldo (JIRA)
Aldo created AIRFLOW-2755:
-

 Summary: k8s workers think DAGs are always in `/tmp/dags`
 Key: AIRFLOW-2755
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2755
 Project: Apache Airflow
  Issue Type: Bug
  Components: configuration, worker
Reporter: Aldo


We have Airflow configured to use the `KubernetesExecutor` and run tasks in 
newly created pods.

I tried to use the `PythonOperator` to import the python callable from a python 
module located in the DAGs directory as [that should be 
possible|https://github.com/apache/incubator-airflow/blob/c7a472ed6b0d8a4720f57ba1140c8cf665757167/airflow/__init__.py#L42].
 Airflow complained that the module was not found.

After a fair amount of digging we found that the issue was that the workers 
have the `AIRFLOW__CORE__DAGS_FOLDER` environment variable set to `/tmp/dags`, 
as [you can see from the 
code|https://github.com/apache/incubator-airflow/blob/master/airflow/contrib/kubernetes/worker_configuration.py#L84].

Unsetting that environment variable from within the task's pod and running the 
task manually worked as expected. I think this path should be configurable 
(I'll try to add a `kubernetes.worker_dags_folder` configuration option).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TOREE-399) Make Spark Kernel work on Windows

2017-04-06 Thread aldo (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958656#comment-15958656
 ] 

aldo commented on TOREE-399:


Hi Jakob,

I created a quick run.bat with hardcoded values

%SPARK_HOME%/bin/spark-submit --class org.apache.toree.Main 
C:\ProgramData\jupyter\kernels\apache_toree_scala\lib\toree-assembly-0.2.0.dev1-incubating-SNAPSHOT.jar

This gets past the previous error, but I'm still getting errors, I guess related 
to some Scala configuration (see below). Any idea?

Besides the error, with the goal of creating a Windows version of run.sh, it's 
not clear to me how the kernel.json variables are passed to run.bat and how I 
can refer to them in run.bat.
Any direction?


> Make Spark Kernel work on Windows
> -
>
> Key: TOREE-399
> URL: https://issues.apache.org/jira/browse/TOREE-399
> Project: TOREE
>  Issue Type: New Feature
> Environment: Windows 7/8/10
>Reporter: aldo
>
> After a successful install of the Spark Kernel the error: "Failed to run 
> command:" occurs when from jupyter we select a Scala Notebook.
> The error happens because the kernel.json runs 
> C:\\ProgramData\\jupyter\\kernels\\apache_toree_scala\\bin\\run.sh which is 
> bash shell script and hence cannot work on windows.
> Can you give me some direction to fix this, and I will implement it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (TOREE-399) Make Spark Kernel work on Windows

2017-04-06 Thread aldo (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15958656#comment-15958656
 ] 

aldo edited comment on TOREE-399 at 4/6/17 9:38 AM:


Hi Jakob,

I created a quick run.bat with hardcoded values

%SPARK_HOME%/bin/spark-submit --class org.apache.toree.Main 
C:\ProgramData\jupyter\kernels\apache_toree_scala\lib\toree-assembly-0.2.0.dev1-incubating-SNAPSHOT.jar

This gets past the previous error, but I'm still getting errors, I guess related 
to some Scala configuration (see below). Any idea?

Besides the error, with the goal of creating a Windows version of run.sh, it's 
not clear to me how the kernel.json variables are passed to run.bat and how I 
can refer to them in run.bat.
Any direction?





17/03/31 09:55:29 [WARN] o.a.h.u.NativeCodeLoader - Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
17/03/31 09:55:30 [INFO] o.a.t.b.l.StandardComponentInitialization$$anon$1 - 
Connecting to spark.master local[*] [init] error: error while loading Object, 
Missing dependency 'object scala in compiler mirror', required by C:\Program 
Files\Java\jdk1.8.0_121\jre\lib\rt.jar(java/lang/Object.class)

Failed to initialize compiler: object scala in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programmatically, settings.usejavacp.value = true.

Failed to initialize compiler: object scala in compiler mirror not found.
** Note that as of 2.8 scala does not assume use of the java classpath.
** For the old behavior pass -usejavacp to scala, or if using a Settings
** object programmatically, settings.usejavacp.value = true.
Exception in thread "main" java.lang.NullPointerException
at 
scala.reflect.internal.SymbolTable.exitingPhase(SymbolTable.scala:256)
at 
scala.tools.nsc.interpreter.IMain$Request.x$20$lzycompute(IMain.scala:896)
at scala.tools.nsc.interpreter.IMain$Request.x$20(IMain.scala:895)
at 
scala.tools.nsc.interpreter.IMain$Request.headerPreamble$lzycompute(IMain.scala:895)
at 
scala.tools.nsc.interpreter.IMain$Request.headerPreamble(IMain.scala:895)
at 
scala.tools.nsc.interpreter.IMain$Request$Wrapper.preamble(IMain.scala:918)
at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1337)
at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$$anonfun$apply$23.apply(IMain.scala:1336)
at scala.tools.nsc.util.package$.stringFromWriter(package.scala:64)
at 
scala.tools.nsc.interpreter.IMain$CodeAssembler$class.apply(IMain.scala:1336)
at 
scala.tools.nsc.interpreter.IMain$Request$Wrapper.apply(IMain.scala:908)
at 
scala.tools.nsc.interpreter.IMain$Request.compile$lzycompute(IMain.scala:1002)
at scala.tools.nsc.interpreter.IMain$Request.compile(IMain.scala:997)
at scala.tools.nsc.interpreter.IMain.compile(IMain.scala:579)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:567)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
at 
org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific$$anonfun$start$1.apply(ScalaInterpreterSpe
cific.scala:295)
at 
org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific$$anonfun$start$1.apply(ScalaInterpreterSpe
cific.scala:289)
at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
at 
org.apache.toree.kernel.interpreter.scala.ScalaInterpreterSpecific$class.start(ScalaInterpreterSpecific.scala
:289)
at 
org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.start(ScalaInterpreter.scala:44)
at 
org.apache.toree.kernel.interpreter.scala.ScalaInterpreter.init(ScalaInterpreter.scala:87)
at 
org.apache.toree.boot.layer.InterpreterManager$$anonfun$initializeInterpreters$1.apply(InterpreterManager.sca
la:35)



was (Author: alpajj):
Hi Jakob,

I created a quick run.bat with hardcoded values

%SPARK_HOME%/bin/spark-submit --class org.apache.toree.Main 
C:\ProgramData\jupyter\kernels\apache_toree_scala\lib\toree-assembly-0.2.0.dev1-incubating-SNAPSHOT.jar

This passes the previous error, but still getting errors. I guess related to 
some scala config see below. Any idea?

Besides the error, with the goal to create a windows version of the run.sh, is 
not clear to me how kernel.json var are passed to the run.bat and how can I 
refer to them in run.bat.
Any direction?


> Make Spark Kernel work on Windows
> -
>
> Key: TOREE-399
> URL: https://issues.apache.org/jira/browse/TOREE-399
> Project: TOREE
>  Issue Type: New Feature
> Environment: Windows 7/8/10
>Reporter: aldo
>
> After a successful install of the Spark Kernel the error: 

[jira] [Updated] (TIKA-2248) How to set up the content encoding

2017-01-20 Thread Aldo (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aldo updated TIKA-2248:
---
Priority: Trivial  (was: Major)

> How to set up the content encoding
> --
>
> Key: TIKA-2248
> URL: https://issues.apache.org/jira/browse/TIKA-2248
> Project: Tika
>  Issue Type: Wish
>Reporter: Aldo
>Priority: Trivial
>
> If I try to set up content encoding with
> Metadata metadata = new Metadata();
> metadata.add(Metadata.CONTENT_ENCODING, DATAFILE_CHARSET);
> String parsedString = tika.parseToString(inputStream, metadata);
> metadata CONTENT_ENCODING is ignored;
> How I can force Tika to use CONTENT_ENCODING setted in metadata?
> Thank you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-2248) How to set up the content encoding

2017-01-20 Thread Aldo (JIRA)
Aldo created TIKA-2248:
--

 Summary: How to set up the content encoding
 Key: TIKA-2248
 URL: https://issues.apache.org/jira/browse/TIKA-2248
 Project: Tika
  Issue Type: Wish
Reporter: Aldo


If I try to set up content encoding with

Metadata metadata = new Metadata();
metadata.add(Metadata.CONTENT_ENCODING, DATAFILE_CHARSET);
String parsedString = tika.parseToString(inputStream, metadata);

the metadata CONTENT_ENCODING is ignored.

How can I force Tika to use the CONTENT_ENCODING set in the metadata?
Thank you.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)