Hi Jon,

Thanks for your input. I've already started working along those lines: I stopped all the nodes, moved the ring directory off one node, and brought that node up. Then I moved the ring directory on a second node (node2) and issued the join command from it. While those two were busy redistributing partitions, I started a third node (node3) and issued the join command before riak_kv was running on it (it takes some time to load the existing data).

Since then, handoffs are occurring only between node1 and node2. "member_status" says that node3 owns 0% of the ring and 0% is pending. We have a lot of data - each node serves around 200 million documents. The cluster is running Riak 1.1.2. Any suggestions?
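For reference, here is roughly the sequence I followed (a sketch, not verbatim - the ring path and hostnames are placeholders for our actual install, and wait-for-service is my assumption about the right way to confirm riak_kv is up before joining):

    # on each node being rebuilt; ring path depends on your install
    riak stop
    mv /var/lib/riak/ring /var/lib/riak/ring.bak
    riak start

    # assumption: confirm riak_kv is running before joining, since node3
    # was joined before riak_kv came up and now owns 0% of the ring
    riak-admin wait-for-service riak_kv riak@node3

    # run on the joining node (node2, node3, ...)
    riak-admin join riak@node1

    # watch progress
    riak-admin member_status
    riak-admin ring_status
    riak-admin transfers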
Cheers
Nitish

On May 2, 2012, at 5:31 PM, Jon Meredith wrote:

> Hi Nitish,
>
> If you rebuild the cluster with the same ring size, the data will eventually
> get back to the right place. While the rebuild is taking place you may see
> notfounds on gets until the data has been handed off to the newly assigned
> owner (as it will be secondary handoff, not primary ownership handoff, that
> gets the data back). If you don't have a lot of data stored in the cluster it
> shouldn't take too long.
>
> The process would be to stop all nodes, move the files out of the ring
> directory to a safe place, start all nodes, and rejoin. If you're using 1.1.x
> and you have capacity in your hardware, you may want to increase
> handoff_concurrency to something like 4 to permit more transfers to happen
> across the cluster.
>
> Jon.
>
> On Wed, May 2, 2012 at 9:05 AM, Nitish Sharma <[email protected]>
> wrote:
> Hi,
> We have a 12-node Riak cluster. Until now we were naming every new node
> riak@<ip_address>. We then decided to rename all the nodes to
> riak@<hostname>, which makes troubleshooting easier.
> After issuing the reip command to two nodes, we noticed in "status" that
> those two nodes were now appearing in the cluster under the old name as well
> as the new name. Other nodes were trying to hand off partitions to the "new"
> nodes, but apparently they were not able to. After this the whole cluster
> went down and completely stopped responding to any read/write requests.
> member_status displayed the old Riak names in "legacy" mode. Since this is
> our production cluster, we are desperately looking for some quick remedies.
> Issuing "force-remove" on the old names, restarting all the nodes, changing
> the Riak names back to the old ones - none of it helped.
> Currently we are hosting a limited amount of data. What's an elegant way to
> recover from this mess? Would shutting down all the nodes, deleting the ring
> directory, and forming the cluster again work?
>
> Cheers
> Nitish
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
> --
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> [email protected]
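PS: for anyone else following this thread, my reading of Jon's handoff_concurrency suggestion, as a sketch (I'm assuming it belongs in the riak_core section of app.config and that a node restart is needed to pick it up):

    %% app.config (riak_core section) - allow more simultaneous
    %% partition transfers per node
    {riak_core, [
        {handoff_concurrency, 4}
        %% ...other riak_core settings...
    ]}

Presumably the same value can also be set at runtime from riak attach with application:set_env(riak_core, handoff_concurrency, 4)., though I haven't verified that it takes effect without a restart.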
