Hi Changmao,

The state of the cluster can be determined by running 'riak-admin
member-status' and 'riak-admin ring-status'.
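For example, on any node in the cluster:

  riak-admin member-status   # ring ownership percentages and member states
  riak-admin ring-status     # ring readiness and unreachable nodes
  riak-admin transfers       # whether handoffs are actually making progress
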
If I understand the sequence of events, you:
1) Joined four new nodes to the cluster (which then crashed because they ran
out of disk space).
2) Removed them from the cluster via 'riak-admin cluster leave'. This is a
"planned remove" command, and it expects the nodes to gradually hand off
their partitions (to transfer ownership) before actually leaving. So this
is probably the main problem - the ring is stuck waiting for those nodes to
complete handoff.

3) Re-formatted those four nodes and re-installed Riak. Here is where it
gets tricky, though. Several questions for you:
- Did you attempt to re-join those 4 reinstalled nodes into the cluster?
What was the output of the 'cluster join' and 'cluster plan' commands?
- Did the IP addresses change after they were reformatted? If so, you
probably need to use 'riak-admin reip' at this point:
http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
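With riak stopped on the reformatted node, the usage is roughly:

  riak-admin reip riak@<old_ip> riak@<new_ip>

(the two node names being placeholders for that node's old and new
addresses).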

The 'failed because of enotconn' error message appears because the cluster
is trying to hand off partitions to 10.21.136.94 but cannot connect to it.
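A quick way to confirm is to check Erlang-level connectivity from one of the
healthy nodes, for example:

  riak attach
  (riak@10.21.136.81)1> net_adm:ping('riak@10.21.136.94').
  %% 'pong' means reachable; 'pang' means the connection is failing

(detach with Ctrl-D rather than q(), so the node keeps running).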

Anyway, here's what I recommend. If you can afford to lose the data, it's
probably easier to reformat and reinstall the whole cluster.
If not, you can 'force-remove' those four nodes, one by one (see
http://docs.basho.com/riak/latest/ops/running/cluster-admin/#force-remove ).
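That sequence looks roughly like this, one node at a time (node name
illustrative):

  riak-admin cluster force-remove riak@10.21.136.91
  riak-admin cluster plan
  riak-admin cluster commit

Note that force-remove reassigns the node's partitions immediately, without
waiting for handoff, so any replicas that lived only on that node are lost.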


On Thu, Aug 6, 2015 at 11:55 PM, changmao wang <wang.chang...@gmail.com>
wrote:

> Dmitri,
>
> Thanks for your quick reply.
> My questions are as follows:
> 1. What's the current status of the whole cluster? Is it rebalancing data?
> 2. There are many errors in one node's error log. How should I handle
> them?
> 2015-08-05 01:38:59.717 [error]
> <0.23000.298>@riak_core_handoff_sender:start_fold:262 ownership_transfer
> transfer of riak_kv_vnode from 'riak@10.21.136.81'
> 525227150915793236229449236757414210188850757632 to 'riak@10.21.136.94'
> 525227150915793236229449236757414210188850757632 failed because of enotconn
> 2015-08-05 01:38:59.718 [error]
> <0.195.0>@riak_core_handoff_manager:handle_info:289 An outbound handoff of
> partition riak_kv_vnode 525227150915793236229449236757414210188850757632
> was terminated for reason: {shutdown,{error,enotconn}}
>
> During the last 5 days, there have been no changes in the "riak-admin
> member-status" output.
> 3. How can I accelerate the data rebalancing?
>
>
> On Fri, Aug 7, 2015 at 6:41 AM, Dmitri Zagidulin <dzagidu...@basho.com>
> wrote:
>
>> Ok, I think I understand so far. So what's the question?
>>
>> On Thursday, August 6, 2015, Changmao.Wang <changmao.w...@datayes.com>
>> wrote:
>>
>>> Hi Riak users,
>>>
>>> Before adding the new nodes, the cluster had only five nodes. The member
>>> list is as below:
>>> 10.21.136.66, 10.21.136.71, 10.21.136.76, 10.21.136.81, 10.21.136.86.
>>> We did not set up an HTTP proxy for the cluster; only one node of the
>>> cluster provides the HTTP service, so the CPU load is always high on that
>>> node.
>>>
>>> After that, I added four nodes (10.21.136.[91-94]) to the cluster.
>>> During the ring/data rebalancing, each new node failed (riak stopped)
>>> because a disk became 100% full.
>>> I had given the "data_root" parameter in '/etc/riak/app.config' multiple
>>> disk paths. Each disk is only 580MB in size.
>>> As you know, the bitcask storage engine does not support multiple data
>>> paths: once one disk is 100% full, it cannot switch to the next idle
>>> disk, so the "riak" service goes down.
>>>
>>> After that, I removed the four newly added nodes, running commands like
>>> "riak-admin cluster leave riak@'10.21.136.91'" from the active nodes,
>>> and then stopped the "riak" service on the new nodes and reformatted
>>> them with LVM disk management (binding the 6 disks into one virtual disk
>>> group).
>>> I replaced the "data_root" parameter with a single folder and then
>>> started the "riak" service again. After that, the cluster began
>>> rebalancing the data again. That's the whole story.
>>>
>>>
>>> Amao
>>>
>>> ------------------------------
>>> *From: *"Dmitri Zagidulin" <dzagidu...@basho.com>
>>> *To: *"Changmao.Wang" <changmao.w...@datayes.com>
>>> *Sent: *Thursday, August 6, 2015 10:46:59 PM
>>> *Subject: *Re: why leaving riak cluster so slowly and how to accelerate
>>> the speed
>>>
>>> Hi Amao,
>>>
>>> Can you explain a bit more which steps you've taken, and what the
>>> problem is?
>>>
>>> Which nodes have been added, and which nodes are leaving the cluster?
>>>
>>> On Tue, Jul 28, 2015 at 11:03 PM, Changmao.Wang <
>>> changmao.w...@datayes.com> wrote:
>>>
>>>> Hi Riak user group,
>>>>
>>>>  I'm using riak and riak-cs 1.4.2. Last weekend, I added four nodes to a
>>>> cluster of 5 nodes. However, it failed when one of the disks became 100%
>>>> full. As you know, the bitcask storage engine cannot support multiple
>>>> folders.
>>>>
>>>> After that, I restarted "riak" and left the cluster with the commands
>>>> "riak-admin cluster leave" and "riak-admin cluster plan", followed by
>>>> the commit.
>>>> However, riak has been doing KV rebalancing ever since I submitted the
>>>> leave command. I guess it is still working through the join process.
>>>>
>>>> Could you show us how to accelerate the leave process? I have tuned the
>>>> "transfer-limit" parameter on all 9 nodes.
>>>>
>>>> Below is the output of some commands:
>>>> riak-admin member-status
>>>> ================================= Membership ==================================
>>>> Status     Ring    Pending    Node
>>>> -------------------------------------------------------------------------------
>>>> leaving     6.3%     10.9%    'riak@10.21.136.91'
>>>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>>>> leaving     6.3%     10.9%    'riak@10.21.136.93'
>>>> leaving     6.3%     10.9%    'riak@10.21.136.94'
>>>> valid      10.9%     10.9%    'riak@10.21.136.66'
>>>> valid      12.5%     10.9%    'riak@10.21.136.71'
>>>> valid      18.8%     10.9%    'riak@10.21.136.76'
>>>> valid      18.8%     12.5%    'riak@10.21.136.81'
>>>> valid      10.9%     10.9%    'riak@10.21.136.86'
>>>>
>>>>  riak-admin transfer_limit
>>>> =============================== Transfer Limit ================================
>>>> Limit        Node
>>>> -------------------------------------------------------------------------------
>>>>   200        'riak@10.21.136.66'
>>>>   200        'riak@10.21.136.71'
>>>>   100        'riak@10.21.136.76'
>>>>   100        'riak@10.21.136.81'
>>>>   200        'riak@10.21.136.86'
>>>>   500        'riak@10.21.136.91'
>>>>   500        'riak@10.21.136.92'
>>>>   500        'riak@10.21.136.93'
>>>>   500        'riak@10.21.136.94'
>>>>
>>>> Do you need any more details to diagnose the problem?
>>>>
>>>> Amao
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
> --
> Amao Wang
> Best & Regards
>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
