Hello,

> Also I could delete system_traces, which is empty anyway, but there are system_auth and system_distributed keyspaces too and they are not empty. Could I delete them safely too?
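As an aside, the current replication of the keyspaces in question can be checked before touching anything; a minimal sketch, assuming cqlsh access on a live node and the Cassandra 3.x schema tables:

```cql
-- Show how each system keyspace is currently replicated.
SELECT keyspace_name, replication
FROM system_schema.keyspaces
WHERE keyspace_name IN ('system_traces', 'system_auth', 'system_distributed');
```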
I would say no, not safely, as I am not sure about some of them, but maybe this would work. Here is what I know: 'system_traces' can be deleted. 'system_auth' should be OK to delete if you do not use auth (but I am not 100% sure), but 'system_distributed' looks like an important keyspace; I am really uncertain about this one. It was added in a 'recent' Cassandra version and I am not too sure what's in there, to be honest.

It seems quite a complex situation. A possible way out would be to do the 3-node replacement without streaming and then repair (making sure no client talks to or gets information from these nodes meanwhile), but I think 'replace' and 'auto_bootstrap: false' do not work well together. Another idea is that 'nodetool removenode' on the 3 nodes, 1 by 1, should also ensure consistency is preserved (if not forced). Then you could recreate the 3 nodes of this rack. I am afraid this operation might fail too, though, for the same reason of having ranges unavailable.

I am still thinking about it, but before going deeper, is this still an issue for you at the moment?

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Mon, 10 Sep 2018 at 13:43, onmstester onmstester <onmstes...@zoho.com> wrote:

> Thanks Alain,
> First, here is more detail about my cluster:
>
> - 10 racks + 3 nodes on each rack
> - nodetool status: shows 27 nodes UN and 3 nodes, all on a single rack, as DN
> - version 3.11.2
>
> *Option 1: (Change schema and) use replace method (preferred method)*
> * Did you try to have the replace going, without any former repairs, ignoring the fact 'system_traces' might be inconsistent? You probably don't care about this table, so if Cassandra allows it with some of the nodes down, going this way is probably relatively safe. I really do not see what you could lose that matters in this table.
> * Another option, if the first schema change was accepted, is to make the second one, to drop this table. You can always rebuild it in case you need it, I assume.
>
> I would really love to let the replace go on, but it stops with the error:
>
> java.lang.IllegalStateException: unable to find sufficient sources for streaming range in keyspace system_traces
>
> Also I could delete system_traces, which is empty anyway, but there are system_auth and system_distributed keyspaces too and they are not empty. Could I delete them safely too?
> If I could just somehow skip streaming the system keyspaces in the node-replace phase, option 1 would be great.
>
> P.S.: It's clear to me that I should use at least RF=3 in production, but I could not manage to acquire enough resources yet (I hope this will be fixed in the near future).
>
> Again, thank you for your time
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
> ---- On Mon, 10 Sep 2018 16:20:10 +0430 *Alain RODRIGUEZ <arodr...@gmail.com>* wrote ----
>
> Hello,
>
> I am sorry it took us (the community) more than a day to answer such a critical situation. That being said, my recommendation at this point would be for you to make sure about the impacts of whatever you try. Working on a broken cluster as an emergency might lead you to a second mistake, possibly more destructive than the first one. It has happened to me and to people around me, on many clusters. As general advice, move forward even more carefully in these situations.
>
> > Suddenly I lost all disks of cassandra-data on one of my racks
>
> With RF=2, I guess operations use LOCAL_ONE consistency, thus you should have all the data in the safe rack(s) with your configuration. You probably did not lose anything yet, and the service is only using the nodes that are up, which have the right data.
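The consistency-level arithmetic behind the paragraph above can be sketched quickly. Quorum is floor(RF/2) + 1, so with RF=2 a QUORUM operation needs both replicas and tolerates no replica down, while LOCAL_ONE keeps working as long as one replica per range survives; this is a generic sketch, not part of the original thread:

```shell
# Quorum size for a given replication factor: floor(RF/2) + 1.
# tolerates_down = how many replicas can be lost while quorum still succeeds.
for rf in 1 2 3 5; do
  echo "RF=$rf quorum=$(( rf / 2 + 1 )) tolerates_down=$(( rf - (rf / 2 + 1) ))"
done
```

Note how RF=2 gives quorum=2 with zero fault tolerance, while RF=3 gives the same quorum size but survives one replica down — the core of the RF=3 recommendation later in this thread.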
> > tried to replace the nodes with the same IP using this:
> > https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>
> As a side note, I would recommend you use 'replace_address_first_boot' instead of 'replace_address'. This does basically the same thing but will be ignored after the first bootstrap. A detail, but hey, it's there and somewhat safer; I would use this one.
>
> > java.lang.IllegalStateException: unable to find sufficient sources for streaming range in keyspace system_traces
>
> By default, non-user keyspaces use 'SimpleStrategy' and a small RF. Ideally, this should be changed in a production cluster, and you're seeing an example of why.
>
> > Now when I altered the system_traces keyspace strategy to NetworkTopologyStrategy and RF=2, running nodetool repair failed: Endpoint not alive /IP of dead node that I'm trying to replace.
>
> By changing the replication strategy you made the dead rack owner of part of the token ranges, thus repairs just can't work, as there will always be one of the nodes involved down while the whole rack is down. Repair won't work, but you probably do not need it! 'system_traces' is a temporary / debug keyspace. It's probably empty or holding irrelevant data.
>
> Here are some thoughts:
>
> * It would be awesome at this point for us (and for you, if you have not already) to see the status of the cluster:
> ** 'nodetool status'
> ** 'nodetool describecluster' --> This one will tell if the nodes (those that are up) agree on the schema. I have seen schema changes with nodes down inducing some issues.
> ** Cassandra version
> ** Number of racks (I assume #racks >= 2 in this email)
>
> *Option 1: (Change schema and) use replace method (preferred method)*
> * Did you try to have the replace going, without any former repairs, ignoring the fact 'system_traces' might be inconsistent?
> You probably don't care about this table, so if Cassandra allows it with some of the nodes down, going this way is probably relatively safe. I really do not see what you could lose that matters in this table.
> * Another option, if the first schema change was accepted, is to make the second one, to drop this table. You can always rebuild it in case you need it, I assume.
>
> *Option 2: Remove all the dead nodes* (try to avoid this option 2; if option 1 works, it is better).
>
> Please do not take and apply this as-is. It's a thought on how you could get rid of the issue, yet it's rather brutal and risky; I did not consider it deeply and have no clue about your architecture and context. Consider it carefully on your side.
>
> * You can also 'nodetool removenode' on each of the dead nodes. This will have nodes streaming around, and the rack isolation guarantee will no longer be valid. It's hard to reason about what would happen to the data and in terms of streaming.
> * Alternatively, if you don't have enough space, you can even '*force*' the 'nodetool removenode'. See the documentation. Forcing it will prevent streaming and remove the node (token range handover, but not the data). If that does not work, you can use the 'nodetool assassinate' command as well.
>
> When adding nodes back to the broken DC, the first nodes will probably take 100% of the ownership, which is often too much. You can consider adding back all the nodes with 'auto_bootstrap: false' before repairing them once they have their final token ownership, the same way we do when building a new data center.
>
> This option is not really clean, and it has some caveats that *you need to consider before starting*, as there are token range movements and nodes available that do not have the data. Yet this should work.
> I imagine it would work nicely with RF=3 and QUORUM; with RF=2 (if you have 2+ racks), I guess it should work as well, but you will have to pick one of availability or consistency while repairing the data.
>
> *Be aware that read requests hitting these nodes will not find data!* Plus, you are using *RF=2*. Thus, using a consistency of 2+ (TWO, QUORUM, ALL) for at least one of reads or writes is needed to preserve consistency while re-adding the nodes in this case. Otherwise, reads will not detect the mismatch with certainty and might show inconsistent data until the nodes are repaired.
>
> I must say that I really prefer odd values for the RF, starting with RF=3. Using RF=2 you will have to pick: consistency or availability. With a consistency of ONE everywhere, the service is available, with no single point of failure. Using anything bigger than this, for writes or reads, brings consistency but creates single points of failure (actually, any node becomes a point of failure). RF=3 with QUORUM for both writes and reads takes the best of the two worlds, somehow. The tradeoff with RF=3 and quorum reads is the latency increase and the resource usage.
>
> Maybe there is a better approach, I am not too sure, but I think I would try option 1 first in any case. It's less destructive and less risky, with no token range movements and no empty nodes available. I am not sure about the limitations you might face, though, and that's why I suggest a second option for you to consider if the first is not actionable.
>
> Let us know how it goes,
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Mon, 10 Sep 2018 at 09:09, onmstester onmstester <onmstes...@zoho.com> wrote:
>
> > Any idea?
> > ---- On Sun, 09 Sep 2018 11:23:17 +0430 *onmstester onmstester <onmstes...@zoho.com>* wrote ----
> >
> > Hi,
> >
> > Cluster spec:
> > 30 nodes
> > RF = 2
> > NetworkTopologyStrategy
> > GossipingPropertyFileSnitch + rack aware
> >
> > Suddenly I lost all disks of cassandra-data on one of my racks. After replacing the disks, I tried to replace the nodes with the same IPs using this:
> > https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
> >
> > Starting the to-be-replaced node fails with:
> > java.lang.IllegalStateException: unable to find sufficient sources for streaming range in keyspace system_traces
> >
> > The problem is that I did not change the default replication config for the system keyspaces. Now I have altered the system_traces keyspace strategy to NetworkTopologyStrategy and RF=2, but running nodetool repair failed: Endpoint not alive /IP of dead node that I'm trying to replace.
> >
> > What should I do now?
> > Can I just remove the previous nodes, change the dead nodes' IPs, and re-join them to the cluster?
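For reference, the schema change discussed throughout this thread — moving the system keyspaces to NetworkTopologyStrategy so their replicas are placed rack-aware — looks like the sketch below. The data-centre name 'dc1' and the replication factors are assumptions; adapt them to your topology (and note that a repair of these keyspaces is normally required afterwards, which is exactly the step that fails while a whole rack is down):

```cql
-- Sketch only: run against a live node, adapting 'dc1' and the RFs.
ALTER KEYSPACE system_traces
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};
ALTER KEYSPACE system_distributed
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
```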