Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Mark Phillips
Hi Ivaylo,

Is there anything useful in console.log of any (or all) the nodes? If
so, throw it in a gist and we'll take a look at it.

Mark

On Tue, Dec 10, 2013 at 1:13 PM, Jeppe Toustrup  wrote:
> Try to take a look at this thread from November where I experienced a
> similar problem:
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-November/014027.html
>
> The following mails in the thread mention things you can try to correct
> the problem, and what I ended up doing with the help of Basho
> employees.
>
> --
> Jeppe Fihl Toustrup
> Operations Engineer
> Falcon Social
>
> On 10 December 2013 22:03, Ivaylo Panitchkov  wrote:
>> Hello,
>> Below is the transfers info:
>>
>> ~# riak-admin transfers
>>
>> Attempting to restart script through sudo -u riak
>> 'r...@ccc.ccc.ccc.ccc' waiting to handoff 7 partitions
>> 'r...@bbb.bbb.bbb.bbb' waiting to handoff 7 partitions
>> 'r...@aaa.aaa.aaa.aaa' waiting to handoff 5 partitions
>>
>>
>> ~# riak-admin member_status
>> Attempting to restart script through sudo -u riak
>> ================================= Membership ==================================
>> Status     Ring    Pending    Node
>> -------------------------------------------------------------------------------
>> valid      45.3%     34.4%    'r...@aaa.aaa.aaa.aaa'
>> valid      26.6%     32.8%    'r...@bbb.bbb.bbb.bbb'
>> valid      28.1%     32.8%    'r...@ccc.ccc.ccc.ccc'
>> -------------------------------------------------------------------------------
>>
>> It's stuck with all those handoffs for a few days now.
>> riak-admin ring_status gives me the same info as the one I mentioned when I
>> opened the case.
>> I noticed AAA.AAA.AAA.AAA experiences more load than the other servers, as
>> it's responsible for almost half of the data.
>> Is it safe to add another machine to the cluster in order to relieve
>> AAA.AAA.AAA.AAA even though the issue with handoffs is not yet resolved?
>>
>> Thanks,
>> Ivaylo
>

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Jeppe Toustrup
Try to take a look at this thread from November where I experienced a
similar problem:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2013-November/014027.html

The following mails in the thread mention things you can try to correct
the problem, and what I ended up doing with the help of Basho
employees.

-- 
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social

On 10 December 2013 22:03, Ivaylo Panitchkov  wrote:
> Hello,
> Below is the transfers info:
>
> ~# riak-admin transfers
>
> Attempting to restart script through sudo -u riak
> 'r...@ccc.ccc.ccc.ccc' waiting to handoff 7 partitions
> 'r...@bbb.bbb.bbb.bbb' waiting to handoff 7 partitions
> 'r...@aaa.aaa.aaa.aaa' waiting to handoff 5 partitions
>
>
> ~# riak-admin member_status
> Attempting to restart script through sudo -u riak
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> valid      45.3%     34.4%    'r...@aaa.aaa.aaa.aaa'
> valid      26.6%     32.8%    'r...@bbb.bbb.bbb.bbb'
> valid      28.1%     32.8%    'r...@ccc.ccc.ccc.ccc'
> -------------------------------------------------------------------------------
>
> It's stuck with all those handoffs for a few days now.
> riak-admin ring_status gives me the same info as the one I mentioned when I
> opened the case.
> I noticed AAA.AAA.AAA.AAA experiences more load than the other servers, as
> it's responsible for almost half of the data.
> Is it safe to add another machine to the cluster in order to relieve
> AAA.AAA.AAA.AAA even though the issue with handoffs is not yet resolved?
>
> Thanks,
> Ivaylo



Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Ivaylo Panitchkov
Hello,
Below is the transfers info:

~# riak-admin transfers
Attempting to restart script through sudo -u riak
'r...@ccc.ccc.ccc.ccc' waiting to handoff 7 partitions
'r...@bbb.bbb.bbb.bbb' waiting to handoff 7 partitions
'r...@aaa.aaa.aaa.aaa' waiting to handoff 5 partitions

~# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      45.3%     34.4%    'r...@aaa.aaa.aaa.aaa'
valid      26.6%     32.8%    'r...@bbb.bbb.bbb.bbb'
valid      28.1%     32.8%    'r...@ccc.ccc.ccc.ccc'
-------------------------------------------------------------------------------

It's stuck with all those handoffs for a few days now.
riak-admin ring_status gives me the same info as the one I mentioned when I
opened the case.
I noticed AAA.AAA.AAA.AAA experiences more load than the other servers, as
it's responsible for almost half of the data.
Is it safe to add another machine to the cluster in order to relieve
AAA.AAA.AAA.AAA even though the issue with handoffs is not yet resolved?

Thanks,
Ivaylo



On Tue, Dec 10, 2013 at 3:04 PM, Jeppe Toustrup wrote:

> What does "riak-admin transfers" tell you? Are there any transfers in
> progress?
> You can try to set the number of allowed transfers per host to 0 and
> then back to 2 (the default), or whatever you want, in order to restart
> any transfers which may be in progress. You can do that with the
> "riak-admin transfer-limit " command.
>
> --
> Jeppe Fihl Toustrup
> Operations Engineer
> Falcon Social
>
>


Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Jeppe Toustrup
What does "riak-admin transfers" tell you? Are there any transfers in progress?
You can try to set the number of allowed transfers per host to 0 and
then back to 2 (the default), or whatever you want, in order to restart
any transfers which may be in progress. You can do that with the
"riak-admin transfer-limit " command.

-- 
Jeppe Fihl Toustrup
Operations Engineer
Falcon Social

On 9 December 2013 15:48, Ivaylo Panitchkov  wrote:
>
>
> Hello,
>
> We have a prod cluster of four machines running riak (1.1.4 2012-06-19) 
> Debian x86_64.
> Two days ago one of the servers went down because of a hardware failure.
> I force-removed the machine in question to re-balance the cluster before 
> adding the new machine.
> Since then the cluster is operating properly, but I noticed some handoffs are 
> stalled now.
> I had a similar situation a while ago that was solved by simply forcing the
> handoffs, but this time the same approach didn't work.
> Any ideas, solutions, or just hints are greatly appreciated.



Re: Stalled handoffs on a prod cluster after server crash

2013-12-10 Thread Simon Effenberg
I had something like that once, but with version 1.2 or 1.3; a rolling
restart helped in my case.

/s

On Mon, 9 Dec 2013 09:48:12 -0500
Ivaylo Panitchkov  wrote:

> Hello,
> 
> We have a prod cluster of four machines running riak (1.1.4 2012-06-19)
> Debian x86_64.
> Two days ago one of the servers went down because of a hardware failure.
> I force-removed the machine in question to re-balance the cluster before
> adding the new machine.
> Since then the cluster is operating properly, but I noticed some handoffs
> are stalled now.
> I had a similar situation a while ago that was solved by simply forcing the
> handoffs, but this time the same approach didn't work.
> Any ideas, solutions, or just hints are greatly appreciated.
> Below are cluster statuses. Replaced the IP addresses for security reasons.
> 
> 
> 
> ~# riak-admin member_status
> Attempting to restart script through sudo -u riak
> ================================= Membership ==================================
> Status     Ring    Pending    Node
> -------------------------------------------------------------------------------
> valid      45.3%     34.4%    'r...@aaa.aaa.aaa.aaa'
> valid      26.6%     32.8%    'r...@bbb.bbb.bbb.bbb'
> valid      28.1%     32.8%    'r...@ccc.ccc.ccc.ccc'
> -------------------------------------------------------------------------------
> Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
> 
> 
> 
> ~# riak-admin ring_status
> Attempting to restart script through sudo -u riak
> ================================== Claimant ===================================
> Claimant:  'r...@aaa.aaa.aaa.aaa'
> Status: up
> Ring Ready: true
> 
> ============================== Ownership Handoff ==============================
> Owner:  r...@aaa.aaa.aaa.aaa
> Next Owner: r...@bbb.bbb.bbb.bbb
> 
> Index: 22835963083295358096932575511191922182123945984
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> Index: 570899077082383952423314387779798054553098649600
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> Index: 1118962191081472546749696200048404186924073353216
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> Index: 1392993748081016843912887106182707253109560705024
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> ---
> Owner:  r...@aaa.aaa.aaa.aaa
> Next Owner: r...@ccc.ccc.ccc.ccc
> 
> Index: 114179815416476790484662877555959610910619729920
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> Index: 662242929415565384811044689824565743281594433536
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> Index: 1210306043414653979137426502093171875652569137152
>   Waiting on: [riak_kv_vnode]
>   Complete:   [riak_pipe_vnode]
> 
> ---
> 
> ============================== Unreachable Nodes ==============================
> All nodes are up and reachable
> 
> 
> 
> Thanks in advance,
> Ivaylo
> 
> 
> 
> -- 
> Ivaylo Panitchkov
> Software developer
> Hibernum Creations Inc.
> 
> This email is confidential and may also be legally privileged. If you have
> received this email in error, please notify us immediately by reply email
> and then delete this message from your system. Please do not copy it or use
> it for any purpose or disclose its content.


-- 
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: + 49-(0)30-8109 - 7173
Fax: + 49-(0)30-8109 - 7131

Mail: seffenb...@team.mobile.de
Web:www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany


Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow 



Stalled handoffs on a prod cluster after server crash

2013-12-09 Thread Ivaylo Panitchkov
Hello,

We have a prod cluster of four machines running riak (1.1.4 2012-06-19)
Debian x86_64.
Two days ago one of the servers went down because of a hardware failure.
I force-removed the machine in question to re-balance the cluster before
adding the new machine.
Since then the cluster is operating properly, but I noticed some handoffs
are stalled now.
I had a similar situation a while ago that was solved by simply forcing the
handoffs, but this time the same approach didn't work.
Any ideas, solutions, or just hints are greatly appreciated.
Below are cluster statuses. Replaced the IP addresses for security reasons.



~# riak-admin member_status
Attempting to restart script through sudo -u riak
================================= Membership ==================================
Status     Ring    Pending    Node
-------------------------------------------------------------------------------
valid      45.3%     34.4%    'r...@aaa.aaa.aaa.aaa'
valid      26.6%     32.8%    'r...@bbb.bbb.bbb.bbb'
valid      28.1%     32.8%    'r...@ccc.ccc.ccc.ccc'
-------------------------------------------------------------------------------
Valid:3 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
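
As a side note, those Ring/Pending percentages line up with the default ring
size of 64. The following sketch checks the arithmetic; the per-node partition
counts (29/17/18 moving toward 22/21/21) are inferred from the rounded
percentages, not read from the cluster:

```python
# Sanity check (a sketch, not Riak output): the Ring/Pending percentages in
# the member_status output above match a 64-partition ring, with ownership
# moving from 29/17/18 partitions per node toward a 22/21/21 target.
RING_SIZE = 64

owned   = {"aaa": 29, "bbb": 17, "ccc": 18}   # current (Ring column, inferred)
pending = {"aaa": 22, "bbb": 21, "ccc": 21}   # target  (Pending column, inferred)
shown   = {"aaa": (45.3, 34.4), "bbb": (26.6, 32.8), "ccc": (28.1, 32.8)}

assert sum(owned.values()) == RING_SIZE
assert sum(pending.values()) == RING_SIZE

for node, (ring_pct, pending_pct) in shown.items():
    # each displayed percentage should be within rounding error of n/64
    assert abs(100.0 * owned[node] / RING_SIZE - ring_pct) < 0.06
    assert abs(100.0 * pending[node] / RING_SIZE - pending_pct) < 0.06

print("percentages consistent with ring_size=64")
```

The pending 22/21/21 split is the even ownership the cluster is trying to
reach, which would explain why AAA carries extra load until the handoffs
complete.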



~# riak-admin ring_status
Attempting to restart script through sudo -u riak
================================== Claimant ===================================
Claimant:  'r...@aaa.aaa.aaa.aaa'
Status: up
Ring Ready: true

============================== Ownership Handoff ==============================
Owner:  r...@aaa.aaa.aaa.aaa
Next Owner: r...@bbb.bbb.bbb.bbb

Index: 22835963083295358096932575511191922182123945984
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 570899077082383952423314387779798054553098649600
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 1118962191081472546749696200048404186924073353216
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 1392993748081016843912887106182707253109560705024
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

---
Owner:  r...@aaa.aaa.aaa.aaa
Next Owner: r...@ccc.ccc.ccc.ccc

Index: 114179815416476790484662877555959610910619729920
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 662242929415565384811044689824565743281594433536
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

Index: 1210306043414653979137426502093171875652569137152
  Waiting on: [riak_kv_vnode]
  Complete:   [riak_pipe_vnode]

---

============================== Unreachable Nodes ==============================
All nodes are up and reachable
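
For what it's worth, the long indices in the ownership-handoff listing above
are positions on Riak's 2^160 hash ring. Assuming the default ring size of 64
(consistent with the member_status percentages), each index divides evenly by
the per-partition span, so the stuck partitions can be decoded with a quick
back-of-the-envelope script (illustrative only, not a Riak tool):

```python
# Decode ring_status partition indices into ring positions, assuming the
# default 64-partition ring. Each partition spans 2^160 / 64 of the keyspace,
# so index // span gives its position on the ring.
RING_SIZE = 64
SPAN = 2 ** 160 // RING_SIZE

stuck = [
    # waiting to hand off to r...@bbb.bbb.bbb.bbb
    22835963083295358096932575511191922182123945984,
    570899077082383952423314387779798054553098649600,
    1118962191081472546749696200048404186924073353216,
    1392993748081016843912887106182707253109560705024,
    # waiting to hand off to r...@ccc.ccc.ccc.ccc
    114179815416476790484662877555959610910619729920,
    662242929415565384811044689824565743281594433536,
    1210306043414653979137426502093171875652569137152,
]

# every index should sit exactly on a partition boundary
assert all(i % SPAN == 0 for i in stuck)

positions = [i // SPAN for i in stuck]
print(positions)  # [1, 25, 49, 61, 5, 29, 53]
```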



Thanks in advance,
Ivaylo



-- 
Ivaylo Panitchkov
Software developer
Hibernum Creations Inc.

This email is confidential and may also be legally privileged. If you have
received this email in error, please notify us immediately by reply email
and then delete this message from your system. Please do not copy it or use
it for any purpose or disclose its content.