Subrange repair of only the neighbors is sufficient.

Break the range covering the dead node into ~100 splits and repair those splits individually, in sequence. You don't have to repair the whole range all at once.
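
The splitting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `split_range` and `repair_commands` and the keyspace name `my_ks` are hypothetical, the range is assumed not to wrap around the ring (a wrapping range would need to be cut at the ring boundary first), and the start and end tokens are assumed to have been read from something like `nodetool ring` or `nodetool describering`. The generated commands use the `-st` / `-et` (start token / end token) options of `nodetool repair`.

```python
def split_range(start, end, splits=100):
    """Split the token range (start, end] into contiguous subranges.

    Assumes start < end, i.e. the range does not wrap around the ring.
    """
    if not start < end:
        raise ValueError("expected start < end; cut a wrapping range at the ring boundary first")
    # Never create more splits than there are tokens in the range.
    splits = min(splits, end - start)
    # Integer arithmetic keeps the boundaries exact for 64-bit tokens.
    bounds = [start + (end - start) * i // splits for i in range(splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def repair_commands(start, end, keyspace, splits=100):
    """Build one `nodetool repair` command per subrange, to be run in sequence."""
    return [
        f"nodetool repair -st {st} -et {et} {keyspace}"
        for st, et in split_range(start, end, splits)
    ]


if __name__ == "__main__":
    # Hypothetical token range and keyspace, for illustration only.
    for cmd in repair_commands(0, 1000, "my_ks", splits=4):
        print(cmd)
```

Running the commands one at a time keeps each repair session small, so a failed split can be retried on its own without restarting the whole range.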

-- 
Jeff Jirsa

> On Mar 22, 2018, at 8:08 PM, Peng Xiao <2535...@qq.com> wrote:
>
> Hi Anthony,
>
> There is a problem with replacing a dead node as per the blog: if the
> replacement process takes longer than max_hint_window_in_ms, we must run
> repair to make the replaced node consistent again, since it missed ongoing
> writes during bootstrapping. But for a large cluster, repair is a painful
> process.
>
> Thanks,
> Peng Xiao
>
>
> ------------------ Original Message ------------------
> From: "Anthony Grasso"<anthony.gra...@gmail.com>;
> Sent: Thursday, March 22, 2018, 7:13 PM
> To: "user"<user@cassandra.apache.org>;
> Subject: Re: replace dead node vs remove node
>
> Hi Peng,
>
> Depending on the hardware failure you can do one of two things:
>
> 1. If the disks are intact and uncorrupted you could just use the disks with
> the current data on them in the new node. Even if the IP address changes for
> the new node that is fine. In that case all you need to do is run repair on
> the new node. The repair will fix any writes the node missed while it was
> down. This process is similar to the scenario in this blog post:
> http://thelastpickle.com/blog/2018/02/21/replace-node-without-bootstrapping.html
>
> 2. If the disks are inaccessible or corrupted, then use the method as
> described in the blog post you linked to. The operation is similar to
> bootstrapping a new node. There is no need to perform any other remove or
> join operation on the failed or new nodes. As per the blog post, you
> definitely want to run repair on the new node as soon as it joins the
> cluster. In this case, the data on the failed node is effectively lost
> and replaced with data from other nodes in the cluster.
>
> Hope this helps.
>
> Regards,
> Anthony
>
>
>> On Thu, 22 Mar 2018 at 20:52, Peng Xiao <2535...@qq.com> wrote:
>> Dear All,
>>
>> When one node fails with hardware errors, it will be in DN status in the
>> cluster. Then if we are not able to handle this error within three hours
>> (the max hints window), we will lose data, right? We have to run repair to
>> keep the consistency.
>> And as per
>> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>> we can replace this dead node. Is it the same as bootstrapping a new node?
>> That means we don't need to remove the node and rejoin?
>> Could anyone please advise?
>>
>> Thanks,
>> Peng Xiao