Re: Stalled handoffs on a prod cluster after server crash
Hi Ivaylo,

Is there anything useful in console.log on any (or all) of the nodes? If so, throw it in a gist and we'll take a look at it.

Mark

On Tue, Dec 10, 2013 at 1:13 PM, Jeppe Toustrup wrote:
> Take a look at this thread from November where I experienced a
> similar problem:
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-November/014027.html
>
> The following mails in the thread mention things you can try to correct
> the problem, and what I ended up doing with the help of Basho
> employees.
>
> --
> Jeppe Fihl Toustrup
> Operations Engineer
> Falcon Social

___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
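A minimal sketch of the log collection Mark asks for, one file per node. The `/var/log/riak/console.log` path is the usual Debian package default, but that location and the host names here are assumptions; it only prints the commands to run (dry run).

```shell
#!/bin/sh
# Sketch: gather the tail of each node's console.log so it can be pasted
# into a gist. Log path is the assumed Debian package default.
collect_log_cmd() {
  # $1 = node host name; prints the command to run for that host (dry run)
  echo "ssh $1 'tail -n 200 /var/log/riak/console.log' > console-$1.log"
}

# Placeholder hosts matching the redacted addresses in the thread:
for host in aaa.aaa.aaa.aaa bbb.bbb.bbb.bbb ccc.ccc.ccc.ccc; do
  collect_log_cmd "$host"
done
```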
Re: Stalled handoffs on a prod cluster after server crash
Take a look at this thread from November where I experienced a similar problem:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-November/014027.html

The following mails in the thread mention things you can try to correct the problem, and what I ended up doing with the help of Basho employees.

--
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social

On 10 December 2013 22:03, Ivaylo Panitchkov wrote:
Re: Stalled handoffs on a prod cluster after server crash
Hello,

Below is the transfers info:

~# riak-admin transfers
Attempting to restart script through sudo -u riak
'r...@ccc.ccc.ccc.ccc' waiting to handoff 7 partitions
'r...@bbb.bbb.bbb.bbb' waiting to handoff 7 partitions
'r...@aaa.aaa.aaa.aaa' waiting to handoff 5 partitions

~# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring       Pending    Node
-------------------------------------------------------------------------------
valid      45.3%      34.4%      'r...@aaa.aaa.aaa.aaa'
valid      26.6%      32.8%      'r...@bbb.bbb.bbb.bbb'
valid      28.1%      32.8%      'r...@ccc.ccc.ccc.ccc'
-------------------------------------------------------------------------------

It's been stuck with all those handoffs for a few days now. riak-admin ring_status gives me the same info as the one I mentioned when I opened the case. I noticed AAA.AAA.AAA.AAA experiences more load than the other servers, as it's responsible for almost half of the data. Is it safe to add another machine to the cluster in order to relieve AAA.AAA.AAA.AAA even while the handoff issue is not yet resolved?

Thanks,
Ivaylo

On Tue, Dec 10, 2013 at 3:04 PM, Jeppe Toustrup wrote:
Re: Stalled handoffs on a prod cluster after server crash
What does "riak-admin transfers" tell you? Are there any transfers in progress?

You can try setting the number of allowed transfers per host to 0 and then back to 2 (the default), or whatever you want, in order to restart any transfers which may be in progress. You can do that with the "riak-admin transfer-limit" command.

--
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social

On 9 December 2013 15:48, Ivaylo Panitchkov wrote:
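The reset trick above can be sketched as a small script. Note that `riak-admin transfer-limit` may not exist on older releases (such as the 1.1.4 in this thread), so this assumes a version that ships the subcommand; the node names are the thread's redacted placeholders, and the script only prints the commands (dry run).

```shell
#!/bin/sh
# Sketch: cycle the per-node handoff limit down to 0 and back to the
# default of 2 to kick stalled transfers. Prints the command pair for
# each node (dry run) rather than executing it.
cycle_transfer_limit() {
  # $1 = node name
  echo "riak-admin transfer-limit $1 0"   # pause all handoffs on the node
  echo "riak-admin transfer-limit $1 2"   # restore the default limit
}

for node in 'riak@aaa.aaa.aaa.aaa' 'riak@bbb.bbb.bbb.bbb' 'riak@ccc.ccc.ccc.ccc'; do
  cycle_transfer_limit "$node"
done
```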
Re: Stalled handoffs on a prod cluster after server crash
I had something like that once, but with version 1.2 or 1.3. A rolling restart helped in my case.

/s

On Mon, 9 Dec 2013 09:48:12 -0500, Ivaylo Panitchkov wrote:

--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: +49-(0)30-8109-7173
Fax: +49-(0)30-8109-7131
Mail: seffenb...@team.mobile.de
Web: www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany

Managing Director: Malte Krüger
Commercial register no.: 18517 P, Amtsgericht Potsdam
Registered office: Kleinmachnow
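Simon's rolling-restart suggestion, sketched one node at a time. `riak-admin wait-for-service` keeps you from moving on before riak_kv is back on the restarted node; the node names are the thread's placeholders, and the script only prints the per-node command sequence (dry run).

```shell
#!/bin/sh
# Sketch: rolling restart, one node at a time. Run the printed commands
# on each node in turn, waiting for riak_kv to come back before moving
# to the next node.
rolling_restart_cmds() {
  # $1 = node name; prints the command sequence for that node (dry run)
  echo "riak stop"
  echo "riak start"
  echo "riak-admin wait-for-service riak_kv $1"   # block until KV is up again
}

for node in 'riak@aaa.aaa.aaa.aaa' 'riak@bbb.bbb.bbb.bbb' 'riak@ccc.ccc.ccc.ccc'; do
  echo "# on $node:"
  rolling_restart_cmds "$node"
done
```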
Stalled handoffs on a prod cluster after server crash
Hello,

We have a prod cluster of four machines running Riak (1.1.4 2012-06-19) on Debian x86_64. Two days ago one of the servers went down because of a hardware failure. I force-removed the machine in question to re-balance the cluster before adding the new machine. Since then the cluster has been operating properly, but I noticed some handoffs are now stalled. I had a similar situation a while ago that was solved by simply forcing the handoffs, but this time the same approach didn't work. Any ideas, solutions or just hints are greatly appreciated.

Below are the cluster statuses. I replaced the IP addresses for security reasons.

~# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring       Pending    Node
-------------------------------------------------------------------------------
valid      45.3%      34.4%      'r...@aaa.aaa.aaa.aaa'
valid      26.6%      32.8%      'r...@bbb.bbb.bbb.bbb'
valid      28.1%      32.8%      'r...@ccc.ccc.ccc.ccc'
-------------------------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0

~# riak-admin ring_status
Attempting to restart script through sudo -u riak
================================== Claimant ===================================
Claimant:  'r...@aaa.aaa.aaa.aaa'
Status:     up
Ring Ready: true

============================== Ownership Handoff ==============================
Owner:      r...@aaa.aaa.aaa.aaa
Next Owner: r...@bbb.bbb.bbb.bbb

Index: 22835963083295358096932575511191922182123945984
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 570899077082383952423314387779798054553098649600
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 1118962191081472546749696200048404186924073353216
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 1392993748081016843912887106182707253109560705024
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

-------------------------------------------------------------------------------
Owner:      r...@aaa.aaa.aaa.aaa
Next Owner: r...@ccc.ccc.ccc.ccc

Index: 114179815416476790484662877555959610910619729920
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 662242929415565384811044689824565743281594433536
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 1210306043414653979137426502093171875652569137152
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

-------------------------------------------------------------------------------

============================== Unreachable Nodes ==============================
All nodes are up and reachable

Thanks in advance,
Ivaylo

--
Ivaylo Panitchkov
Software developer
Hibernum Creations Inc.

This email is confidential and may also be legally privileged. If you have received this email in error, please notify us immediately by reply email and then delete this message from your system. Please do not copy it or use it for any purpose or disclose its content.
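For reference, the "forcing the handoffs" step Ivaylo mentions is usually done from the node's Erlang console. A sketch of the steps, assuming riak_core's `riak_core_vnode_manager:force_handoffs/0` is available on this release (it can vary by version); the script only prints the steps (dry run).

```shell
#!/bin/sh
# Sketch: force pending handoffs from the Riak console. `riak attach`
# drops into the running node's Erlang shell; detach with Ctrl-D so the
# node keeps running.
force_handoffs_steps() {
  # Prints the steps to perform on one node (dry run)
  echo "riak attach"
  echo "riak_core_vnode_manager:force_handoffs()."
  echo "# detach with Ctrl-D (do not Ctrl-C, which can kill the node)"
}

force_handoffs_steps
```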