There are so many pending transfers on the production server. That difference between production and development is what concerns me.
Sent from my iPhone

> On Aug 15, 2015, at 3:00 AM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>
> Pending 0% just means no pending transfers; the cluster state is stable.
>
> If you've successfully tested the process on a test cluster, there's no
> reason why it'd be different in production.
>
>> On Friday, August 14, 2015, changmao wang <wang.chang...@gmail.com> wrote:
>> During the last three days, I set up a development Riak cluster with five
>> nodes and used "s3cmd" to upload 18 GB of test data (maybe twenty thousand
>> files). After that, I had one node leave the cluster, then shut it down and
>> marked it down, replaced its IP address, and joined it to the cluster again.
>> The whole process was successful. However, I'm not sure whether or not it
>> can be done in the production environment.
>>
>> I followed the doc below for the above steps:
>>
>> http://docs.basho.com/riak/latest/ops/running/nodes/renaming/
>>
>> After I ran "riak-admin cluster leave riak@'x.x.x.x'", "riak-admin cluster
>> plan", and "riak-admin cluster commit", and then checked member-status, the
>> main difference between leaving the cluster in production and in development
>> was as below:
>>
>> root@cluster-s3-dev-hd1:~# riak-admin member-status
>> ================================= Membership ==================================
>> Status     Ring    Pending    Node
>> -------------------------------------------------------------------------------
>> leaving    18.8%      0.0%    'riak@10.21.236.185'
>> valid      21.9%     25.0%    'riak@10.21.236.181'
>> valid      21.9%     25.0%    'riak@10.21.236.182'
>> valid      18.8%     25.0%    'riak@10.21.236.183'
>> valid      18.8%     25.0%    'riak@10.21.236.184'
>> -------------------------------------------------------------------------------
>>
>> Several minutes later, I checked the status again:
>>
>> root@cluster-s3-dev-hd1:~# riak-admin member-status
>> ================================= Membership ==================================
>> Status     Ring    Pending    Node
>> -------------------------------------------------------------------------------
>> leaving    12.5%      0.0%    'riak@10.21.236.185'
>> valid      21.9%     25.0%    'riak@10.21.236.181'
>> valid      28.1%     25.0%    'riak@10.21.236.182'
>> valid      18.8%     25.0%    'riak@10.21.236.183'
>> valid      18.8%     25.0%    'riak@10.21.236.184'
>> -------------------------------------------------------------------------------
>> Valid:4 / Leaving:1 / Exiting:0 / Joining:0 / Down:0
>>
>> After that, I shut Riak down with "riak stop" and marked the node down on
>> the active nodes.
>> My question is: what is the meaning of "Pending 0.0%"?
>>
>> On the production cluster, the status is as below:
>> root@cluster1-hd12:/root/scripts# riak-admin transfers
>> 'riak@10.21.136.94' waiting to handoff 5 partitions
>> 'riak@10.21.136.93' waiting to handoff 5 partitions
>> 'riak@10.21.136.92' waiting to handoff 5 partitions
>> 'riak@10.21.136.91' waiting to handoff 5 partitions
>> 'riak@10.21.136.86' waiting to handoff 5 partitions
>> 'riak@10.21.136.81' waiting to handoff 2 partitions
>> 'riak@10.21.136.76' waiting to handoff 3 partitions
>> 'riak@10.21.136.71' waiting to handoff 5 partitions
>> 'riak@10.21.136.66' waiting to handoff 5 partitions
>>
>> And there are active transfers. In the development environment, there were
>> no active transfers after I ran "riak-admin cluster commit".
>> Can I follow the same steps on the production cluster as in the development
>> environment?
>>
>>> On Wed, Aug 12, 2015 at 10:39 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>> Responses inline.
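For reference, the sequence I tested on the development cluster was roughly the following (a sketch rather than a verbatim transcript; the node names are the dev ones above, and the ring-file path is just the default for our Debian packages):

  # On any member of the dev cluster: stage and commit the leave.
  riak-admin cluster leave 'riak@10.21.236.185'
  riak-admin cluster plan
  riak-admin cluster commit

  # Watch ownership handoff drain; member-status should eventually show the
  # leaving node at 0% ring and the others at their target shares.
  riak-admin member-status
  riak-admin transfers

  # Once handoff is done: stop the old node, mark it down from an active node.
  riak stop                              # on the leaving node
  riak-admin down 'riak@10.21.236.185'   # on any remaining node

  # Rename the node (-name in vm.args), clear its old ring files
  # (default path /var/lib/riak/ring on our boxes), start it, and re-join.
  riak start                             # on the renamed node
  riak-admin cluster join 'riak@10.21.236.181'
  riak-admin cluster plan
  riak-admin cluster commit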
>>>
>>>> On Tue, Aug 11, 2015 at 12:53 PM, changmao wang <wang.chang...@gmail.com> wrote:
>>>> 1. About backing up the four new nodes and then using 'riak-admin
>>>> force-replace': what will the status of the newly added nodes be?
>>>> As you know, we want to replace one of the leaving nodes.
>>>
>>> I don't understand the question. Doing 'riak-admin force-replace' on one of
>>> the nodes that's leaving should overwrite the leave request and tell it to
>>> change its node id / IP address. (If that doesn't work, stop the leaving
>>> node and run a 'riak-admin reip' command instead.)
>>>
>>>> 2. What's the risk of 'riak-admin force-remove' on 'riak@10.21.136.91'
>>>> without a backup?
>>>> As you know, the node (riak@10.21.136.91) is currently a member of the
>>>> cluster and holds almost 2.5 TB of data, maybe 10 percent of the whole
>>>> cluster.
>>>
>>> The only reason I asked about a backup is because it sounded like you had
>>> cleared the disk on it. If it currently has the data, then it'll be fine.
>>> Force-remove just changes the IP address and doesn't delete the data or
>>> anything.
>>>
>>>> On Tue, Aug 11, 2015 at 7:32 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>>> 1. How to force leave "leaving" nodes without data loss?
>>>>
>>>> This depends on: did you back up the data directory of the 4 new nodes
>>>> before you reformatted them?
>>>> If you backed them up (and then restored the data directory once you
>>>> reformatted them), you can try:
>>>>
>>>> riak-admin force-replace 'riak@10.21.136.91' 'riak@<whatever your new IP
>>>> address is for that node>'
>>>> (same for the other 3)
>>>>
>>>> If you did not back up those nodes, the only thing you can do is force
>>>> them to leave and then join the new ones. So, for each of the 4:
>>>>
>>>> riak-admin force-remove 'riak@10.21.136.91' 'riak@10.21.136.66'
>>>> (same for the other 3)
>>>>
>>>> In either case, after force-replacing or force-removing, you have to join
>>>> the new nodes to the cluster before you commit:
>>>>
>>>> riak-admin join 'riak@<new node>' 'riak@10.21.136.66'
>>>> (same for the other 3)
>>>> and finally:
>>>> riak-admin cluster plan
>>>> riak-admin cluster commit
>>>>
>>>> As for the error: the reason you're seeing it is that the other nodes
>>>> can't contact the 4 that are supposed to be leaving (since you wiped
>>>> them). The amount of time that has passed doesn't matter; the cluster
>>>> will wait for those nodes to leave indefinitely unless you force-remove
>>>> or force-replace them.
>>>>
>>>>> On Tue, Aug 11, 2015 at 1:32 AM, changmao wang <wang.chang...@gmail.com> wrote:
>>>>> Hi Dmitri,
>>>>>
>>>>> For your question:
>>>>> 3) Re-formatted those four nodes and re-installed Riak. Here is where it
>>>>> gets tricky, though. Several questions for you:
>>>>> - Did you attempt to re-join those 4 reinstalled nodes into the cluster?
>>>>> What was the output of the cluster join and cluster plan commands?
>>>>> - Did the IP address change after they were reformatted? If so, you
>>>>> probably need to use something like 'reip' at this point:
>>>>> http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
>>>>>
>>>>> I did NOT try to re-join those 4 reinstalled nodes into the cluster.
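If I'm reading the advice above correctly, the recovery would go roughly like this (my own paraphrase, not yet run against our cluster; I've written the staged 'riak-admin cluster ...' forms, and 'riak@NEW.IP' stands in for whatever address a reinstalled node now has):

  # Option A: the reinstalled node still has (or has restored) its old data.
  # Point the ring at its new name/address instead of letting it leave.
  riak-admin cluster force-replace 'riak@10.21.136.91' 'riak@NEW.IP'

  # Option B: the data on the reinstalled node is gone.
  # Drop the old name from the ring entirely, then join the node as a new one.
  riak-admin cluster force-remove 'riak@10.21.136.91'
  riak-admin cluster join 'riak@10.21.136.66'   # run on the reinstalled node

  # Either way, repeat for the other three nodes, review the plan, then commit.
  riak-admin cluster plan
  riak-admin cluster commit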
>>>>> As you know, member-status shows they're "leaving", as below:
>>>>> riak-admin member-status
>>>>> ================================= Membership ==================================
>>>>> Status     Ring    Pending    Node
>>>>> -------------------------------------------------------------------------------
>>>>> leaving    10.9%     10.9%    'riak@10.21.136.91'
>>>>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>>>>> leaving     7.8%     10.9%    'riak@10.21.136.93'
>>>>> leaving     7.8%     10.9%    'riak@10.21.136.94'
>>>>> valid      10.9%     10.9%    'riak@10.21.136.66'
>>>>> valid      10.9%     10.9%    'riak@10.21.136.71'
>>>>> valid      14.1%     10.9%    'riak@10.21.136.76'
>>>>> valid      17.2%     12.5%    'riak@10.21.136.81'
>>>>> valid      10.9%     10.9%    'riak@10.21.136.86'
>>>>> -------------------------------------------------------------------------------
>>>>> Valid:5 / Leaving:4 / Exiting:0 / Joining:0 / Down:0
>>>>>
>>>>> Two weeks have elapsed and 'riak-admin member-status' still shows the same
>>>>> result. I don't know at which step the ring is supposed to hand off.
>>>>>
>>>>> I did not change the IP addresses of the four newly added nodes.
>>>>>
>>>>> My questions:
>>>>>
>>>>> 1. How can I force the "leaving" nodes to leave without data loss?
>>>>> 2. I have found some errors related to partition handoff in
>>>>> /etc/riak/log/errors. Details are as below:
>>>>>
>>>>> 2015-07-30 16:04:33.643 [error]
>>>>> <0.12872.15>@riak_core_handoff_sender:start_fold:262 ownership_transfer
>>>>> transfer of riak_kv_vnode from 'riak@10.21.136.76'
>>>>> 45671926166590716193865151022383844364247891968 to 'riak@10.21.136.93'
>>>>> 45671926166590716193865151022383844364247891968 failed because of enotconn
>>>>> 2015-07-30 16:04:33.643 [error]
>>>>> <0.197.0>@riak_core_handoff_manager:handle_info:289 An outbound handoff
>>>>> of partition riak_kv_vnode
>>>>> 45671926166590716193865151022383844364247891968 was terminated for
>>>>> reason: {shutdown,{error,enotconn}}
>>>>>
>>>>> I searched for it with Google and found related articles, but there is no
>>>>> solution there:
>>>>> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2014-October/016052.html
>>>>>
>>>>>> On Mon, Aug 10, 2015 at 10:09 PM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>>>>> Hi Changmao,
>>>>>>
>>>>>> The state of the cluster can be determined by running 'riak-admin
>>>>>> member-status' and 'riak-admin ring-status'.
>>>>>> If I understand the sequence of events, you:
>>>>>> 1) Joined four new nodes to the cluster (which crashed due to not
>>>>>> enough disk space).
>>>>>> 2) Removed them from the cluster via 'riak-admin cluster leave'. This
>>>>>> is a "planned remove" command and expects the nodes to gradually hand
>>>>>> off their partitions (to transfer ownership) before actually leaving.
>>>>>> So this is probably the main problem - the ring is stuck waiting for
>>>>>> those nodes to properly hand off.
>>>>>> 3) Re-formatted those four nodes and re-installed Riak. Here is where it
>>>>>> gets tricky, though. Several questions for you:
>>>>>> - Did you attempt to re-join those 4 reinstalled nodes into the cluster?
>>>>>> What was the output of the cluster join and cluster plan commands?
>>>>>> - Did the IP address change after they were reformatted? If so, you
>>>>>> probably need to use something like 'reip' at this point:
>>>>>> http://docs.basho.com/riak/latest/ops/running/tools/riak-admin/#reip
>>>>>>
>>>>>> The 'failed because of enotconn' error message is happening because the
>>>>>> cluster is waiting to hand off partitions to .94 but cannot connect to
>>>>>> it.
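A few quick checks should show whether the wiped nodes are reachable at all (a sketch; it assumes the default handoff port 8099 - adjust if handoff_port is changed in app.config):

  # Overall ring and transfer state, run from any healthy node.
  riak-admin member-status
  riak-admin ring-status
  riak-admin transfers

  # Is the stuck target node alive and reachable at all?
  riak ping                       # on 10.21.136.94 itself
  ping -c 3 10.21.136.94          # basic network reachability from a sender

  # enotconn on handoff usually means the sender can't hold a TCP connection
  # to the receiver's handoff port (8099 by default; handoff_port in app.config).
  nc -zv 10.21.136.94 8099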
>>>>>>
>>>>>> Anyway, here's what I recommend. If you can lose the data, it's
>>>>>> probably easier to format and reinstall the whole cluster.
>>>>>> If not, you can 'force-remove' those four nodes, one by one (see
>>>>>> http://docs.basho.com/riak/latest/ops/running/cluster-admin/#force-remove ).
>>>>>>
>>>>>>> On Thu, Aug 6, 2015 at 11:55 PM, changmao wang <wang.chang...@gmail.com> wrote:
>>>>>>> Dmitri,
>>>>>>>
>>>>>>> Thanks for your quick reply.
>>>>>>> My questions are as below:
>>>>>>> 1. What's the current status of the whole cluster? Is it rebalancing
>>>>>>> data?
>>>>>>> 2. There are so many errors in one node's error log. How should I
>>>>>>> handle them?
>>>>>>> 2015-08-05 01:38:59.717 [error]
>>>>>>> <0.23000.298>@riak_core_handoff_sender:start_fold:262
>>>>>>> ownership_transfer transfer of riak_kv_vnode from 'riak@10.21.136.81'
>>>>>>> 525227150915793236229449236757414210188850757632 to 'riak@10.21.136.94'
>>>>>>> 525227150915793236229449236757414210188850757632 failed because of
>>>>>>> enotconn
>>>>>>> 2015-08-05 01:38:59.718 [error]
>>>>>>> <0.195.0>@riak_core_handoff_manager:handle_info:289 An outbound handoff
>>>>>>> of partition riak_kv_vnode
>>>>>>> 525227150915793236229449236757414210188850757632 was terminated for
>>>>>>> reason: {shutdown,{error,enotconn}}
>>>>>>>
>>>>>>> During the last 5 days there have been no changes in the "riak-admin
>>>>>>> member-status" output.
>>>>>>> 3. How can I accelerate the data rebalance?
>>>>>>>
>>>>>>>> On Fri, Aug 7, 2015 at 6:41 AM, Dmitri Zagidulin <dzagidu...@basho.com> wrote:
>>>>>>>> OK, I think I understand so far. So what's the question?
>>>>>>>>
>>>>>>>>> On Thursday, August 6, 2015, Changmao.Wang <changmao.w...@datayes.com> wrote:
>>>>>>>>> Hi Riak users,
>>>>>>>>>
>>>>>>>>> Before adding the new nodes, the cluster had only five nodes. The
>>>>>>>>> members were:
>>>>>>>>> 10.21.136.66, 10.21.136.71, 10.21.136.76, 10.21.136.81, 10.21.136.86.
>>>>>>>>> We did not set up an HTTP proxy for the cluster; only one node of the
>>>>>>>>> cluster provides the HTTP service, so the CPU load is always high on
>>>>>>>>> that node.
>>>>>>>>>
>>>>>>>>> After that, I added four nodes (10.21.136.[91-94]) to the cluster.
>>>>>>>>> During the ring/data rebalance, each new node failed (riak stopped)
>>>>>>>>> because a disk reached 100% full.
>>>>>>>>> I had passed a multi-disk path to the "data_root" parameter in
>>>>>>>>> '/etc/riak/app.config', and each disk is only 580 MB in size.
>>>>>>>>> As you know, the bitcask storage engine does not support multiple
>>>>>>>>> data paths; once one of the disks is 100% full, it cannot switch to
>>>>>>>>> the next idle disk, so the "riak" service goes down.
>>>>>>>>>
>>>>>>>>> After that, I removed the four newly added nodes from the active
>>>>>>>>> nodes with "riak-admin cluster leave riak@'10.21.136.91'", then
>>>>>>>>> stopped the "riak" service on the other new nodes and reformatted
>>>>>>>>> those new nodes with LVM (binding the 6 disks into one volume group).
>>>>>>>>> I replaced the "data_root" parameter with a single folder and started
>>>>>>>>> the "riak" service again. After that, the cluster began the data
>>>>>>>>> rebalance again.
>>>>>>>>> That's the whole story.
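Before the reformatted nodes take ownership again, a rough capacity check like the one below might help avoid another disk-full crash (a sketch; it assumes the bitcask data_root now sits under the LVM-backed /var/lib/riak - substitute the actual path from app.config):

  # Free space on the volume that backs bitcask's data_root.
  df -h /var/lib/riak

  # Current size of the bitcask data already on this node.
  du -sh /var/lib/riak/bitcask

  # Rough sanity check: each node needs comfortably more free space than
  # (total cluster data / number of nodes), plus headroom for handoff and
  # bitcask merges, since merges temporarily need extra space.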
>>>>>>>>>
>>>>>>>>> Amao
>>>>>>>>>
>>>>>>>>> From: "Dmitri Zagidulin" <dzagidu...@basho.com>
>>>>>>>>> To: "Changmao.Wang" <changmao.w...@datayes.com>
>>>>>>>>> Sent: Thursday, August 6, 2015 10:46:59 PM
>>>>>>>>> Subject: Re: why leaving riak cluster so slowly and how to accelerate the speed
>>>>>>>>>
>>>>>>>>> Hi Amao,
>>>>>>>>>
>>>>>>>>> Can you explain a bit more which steps you've taken, and what the
>>>>>>>>> problem is?
>>>>>>>>>
>>>>>>>>> Which nodes have been added, and which nodes are leaving the cluster?
>>>>>>>>>
>>>>>>>>>> On Tue, Jul 28, 2015 at 11:03 PM, Changmao.Wang <changmao.w...@datayes.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Riak user group,
>>>>>>>>>>
>>>>>>>>>> I'm using Riak and Riak CS 1.4.2. Last weekend, I added four nodes
>>>>>>>>>> to a cluster of 5 nodes. However, it failed when one of the disks
>>>>>>>>>> reached 100% full.
>>>>>>>>>> As you know, the bitcask storage engine cannot support multiple
>>>>>>>>>> data folders.
>>>>>>>>>>
>>>>>>>>>> After that, I restarted "riak" and had the new nodes leave the
>>>>>>>>>> cluster with "riak-admin cluster leave" and "riak-admin cluster
>>>>>>>>>> plan", and then the commit.
>>>>>>>>>> However, Riak has been rebalancing KV data ever since I submitted
>>>>>>>>>> the leave command. I guess it is still working through the earlier
>>>>>>>>>> cluster join.
>>>>>>>>>>
>>>>>>>>>> Could you show us how to accelerate the leaving process? I have
>>>>>>>>>> already tuned the "transfer-limit" parameter on all 9 nodes.
>>>>>>>>>>
>>>>>>>>>> Below is some command output:
>>>>>>>>>> riak-admin member-status
>>>>>>>>>> ================================= Membership ==================================
>>>>>>>>>> Status     Ring    Pending    Node
>>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.91'
>>>>>>>>>> leaving     9.4%     10.9%    'riak@10.21.136.92'
>>>>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.93'
>>>>>>>>>> leaving     6.3%     10.9%    'riak@10.21.136.94'
>>>>>>>>>> valid      10.9%     10.9%    'riak@10.21.136.66'
>>>>>>>>>> valid      12.5%     10.9%    'riak@10.21.136.71'
>>>>>>>>>> valid      18.8%     10.9%    'riak@10.21.136.76'
>>>>>>>>>> valid      18.8%     12.5%    'riak@10.21.136.81'
>>>>>>>>>> valid      10.9%     10.9%    'riak@10.21.136.86'
>>>>>>>>>>
>>>>>>>>>> riak-admin transfer_limit
>>>>>>>>>> =============================== Transfer Limit ================================
>>>>>>>>>> Limit    Node
>>>>>>>>>> -------------------------------------------------------------------------------
>>>>>>>>>>   200    'riak@10.21.136.66'
>>>>>>>>>>   200    'riak@10.21.136.71'
>>>>>>>>>>   100    'riak@10.21.136.76'
>>>>>>>>>>   100    'riak@10.21.136.81'
>>>>>>>>>>   200    'riak@10.21.136.86'
>>>>>>>>>>   500    'riak@10.21.136.91'
>>>>>>>>>>   500    'riak@10.21.136.92'
>>>>>>>>>>   500    'riak@10.21.136.93'
>>>>>>>>>>   500    'riak@10.21.136.94'
>>>>>>>>>>
>>>>>>>>>> Any more details needed for diagnosing the problem?
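For anyone following along, raising the handoff concurrency and watching the transfers looks roughly like this (a sketch; the limit values are only examples, and if I remember correctly the runtime change does not survive a restart - the persistent setting is handoff_concurrency under riak_core in app.config):

  # Show the current per-node handoff concurrency limits.
  riak-admin transfer-limit

  # Raise the limit on a single node (example value; higher means more
  # concurrent handoffs and more disk/network load on that node).
  riak-admin transfer-limit 'riak@10.21.136.91' 8

  # Or set the same limit cluster-wide.
  riak-admin transfer-limit 8

  # Then watch progress; active transfers and the per-node queues should move.
  riak-admin transfers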
>>>>>>>>>>
>>>>>>>>>> Amao
>>>>>>>
>>>>>>> --
>>>>>>> Amao Wang
>>>>>>> Best & Regards
>>>>>
>>>>> --
>>>>> Amao Wang
>>>>> Best & Regards
>>>
>>> --
>>> Amao Wang
>>> Best & Regards
>>
>> --
>> Amao Wang
>> Best & Regards
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com