Hi Michalis,

It's been a while since I last removed a DC, but I see there is now a
protection to avoid accidentally leaving a DC without auth capability.
This was introduced in C* 4.1 through CASSANDRA-17478
(https://issues.apache.org/jira/browse/CASSANDRA-17478). The process of
dropping a data center might have been overlooked while doing this work.
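As a minimal sketch of what the check does, assuming a hypothetical
cluster with two data centers, dc1 (being removed) and dc2 (surviving):

    # dc1 is being removed, dc2 survives (hypothetical names)
    cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc2': 3};"
    # while dc1 still has live nodes, 4.1+ rejects this with:
    # ConfigurationException: Following datacenters have active nodes and
    # must be present in replication options for keyspace system_auth: [dc1]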
> It's never correct for an operator to remove a DC from system_auth
> replication settings while there are currently nodes up in that DC.

I believe this assertion is not correct. As Jon and Jeff mentioned, when
removing an entire DC we usually remove the replication *before*
decommissioning any node, for the reasons Jeff exposed. The existing
documentation is also clear about this:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
https://thelastpickle.com/blog/2019/02/26/data-center-switch.html

Michalis, the solution you suggest seems to be the (good/only?) way to
go, even though it looks like a workaround, not really "clean", and
something we need to improve. It was also mentioned here:
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890
It should work quickly, but only because this keyspace holds a fairly
small amount of data; it will still not be as fast as it should be (a
near no-op, as Jeff explained above). It also obliges you to use the
"--force" option, which could lead you to remove a node in another DC by
mistake; in a loaded cluster, or a 3-node cluster with RF = 3, this could
hurt. Having to operate with "nodetool decommission --force" cannot be
standard, but for now I can't think of anything better for you. Maybe
wait for someone else's confirmation, it's been a while since I operated
Cassandra :).

I think it would make sense to fix this somehow in Cassandra. Maybe we
should check that no other keyspace has an RF > 0 for this data center
instead of looking at active nodes, or that no client is connected to
the nodes, or add a manual flag somewhere, or something else? Even
though I understand the motivation to protect users against a wrongly
distributed system_auth keyspace, I think this protection should not be
kept with its current implementation. If that makes sense, I can create
a ticket for this problem.

C*heers,

Alain Rodriguez
casterix.fr <http://casterix.fr>

Le lun. 8 avr. 2024 à 16:26, Michalis Kotsiouros (EXT) via user
<user@cassandra.apache.org> a écrit :

> Hello Jon and Jeff,
>
> Thanks a lot for your replies. I completely get your points.
>
> Some more clarification about my issue: when trying to update the
> replication before the decommission, I get the following error message
> when I remove the replication for the system_auth keyspace:
>
> ConfigurationException: Following datacenters have active nodes and must
> be present in replication options for keyspace system_auth: [datacenter1]
>
> This error message does not appear for the rest of the application
> keyspaces. So, may I change the procedure to:
>
> 1. Make sure no clients are still writing to any nodes in the
>    datacenter.
> 2. Run a full repair with nodetool repair.
> 3. Change all keyspaces so they no longer reference the datacenter
>    being removed, apart from the system_auth keyspace.
> 4. Run nodetool decommission using the --force option on every node in
>    the datacenter being removed.
> 5. Change the system_auth keyspace so it no longer references the
>    datacenter being removed.
>
> BR
> MK
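For what it's worth, a rough sketch of those five quoted steps as
commands, under the same hypothetical dc1/dc2 naming as above, with ks1
standing in for each application keyspace:

    # step 2: full repair (per node, or via your usual repair tooling)
    nodetool repair --full

    # step 3: drop dc1 from every keyspace except system_auth
    cqlsh -e "ALTER KEYSPACE ks1 WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc2': 3};"

    # step 4: on every node in dc1
    nodetool decommission --force

    # step 5: with no live nodes left in dc1, this now passes the check
    cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc2': 3};"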
> *From:* Jeff Jirsa <jji...@gmail.com>
> *Sent:* April 08, 2024 17:19
> *To:* cassandra <user@cassandra.apache.org>
> *Cc:* Michalis Kotsiouros (EXT) <michalis.kotsiouros....@ericsson.com>
> *Subject:* Re: Datacenter decommissioning on Cassandra 4.1.4
>
> To Jon's point, if you remove from replication after step 1 or step 2
> (probably step 2 if your goal is to be strictly correct), the nodetool
> decommission phase becomes almost a no-op.
>
> If you use the order below, the last nodes to decommission will cause
> those surviving machines to run out of space (assuming you have more
> than a few nodes to start).
>
> On Apr 8, 2024, at 6:58 AM, Jon Haddad <j...@jonhaddad.com> wrote:
>
> You shouldn't decom an entire DC before removing it from replication.
>
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com <http://rustyrazorblade.com/>
>
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user
> <user@cassandra.apache.org> wrote:
>
> Hello community,
>
> In our deployments, we usually rebuild the Cassandra datacenters for
> maintenance or recovery operations. The procedure used since the days
> of Cassandra 3.x was the one documented in the DataStax documentation:
> Decommissioning a datacenter | Apache Cassandra 3.x (datastax.com)
> <https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsDecomissionDC.html>
>
> After upgrading to Cassandra 4.1.4, we have realized that there are
> some stricter rules that do not allow removing the replication while
> active Cassandra nodes still exist in a datacenter. This check makes
> the above-mentioned procedure obsolete.
>
> I am thinking of using the following as an alternative:
>
> 1. Make sure no clients are still writing to any nodes in the
>    datacenter.
> 2. Run a full repair with nodetool repair.
> 3. Run nodetool decommission using the --force option on every node in
>    the datacenter being removed.
> 4. Change all keyspaces so they no longer reference the datacenter
>    being removed.
>
> What is the procedure followed by other users? Do you see any risk in
> following the proposed procedure?
>
> BR
> MK
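For contrast with the workaround above, a sketch of the documented
pre-4.1 order that Jon and Jeff describe, where replication is dropped
first and each decommission becomes a near no-op (same hypothetical
dc1/dc2/ks1 names); on 4.1+ the system_auth ALTER is exactly the step
the new check rejects:

    # full repair first, to be strictly correct (Jeff's step 2)
    nodetool repair --full

    # then remove dc1 from the replication of every keyspace,
    # system_auth included (the ALTER that 4.1+ now blocks)
    cqlsh -e "ALTER KEYSPACE ks1 WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc2': 3};"
    cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc2': 3};"

    # finally decommission each dc1 node; it owns no replicas anymore,
    # so it streams almost nothing and no --force is needed
    nodetool decommission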