Re: hit suicide timeout message after upgrade to 0.56

2013-01-08 Thread Gregory Farnum
I'm confused. Isn't the HeartbeatMap all about local thread heartbeating (so, not pings with other OSDs)? I would assume the upgrade and restart just caused a bunch of work and the CPUs got overloaded.
-Greg

On Thu, Jan 3, 2013 at 8:52 AM, Sage Weil s...@inktank.com wrote: Hi Wido, On Thu, 3

Re: hit suicide timeout message after upgrade to 0.56

2013-01-08 Thread Sage Weil
On Tue, 8 Jan 2013, Gregory Farnum wrote:
I'm confused. Isn't the HeartbeatMap all about local thread heartbeating (so, not pings with other OSDs)? I would assume the upgrade and restart just caused a bunch of work and the CPUs got overloaded.

It is. In #3714's case, the OSD was down for a
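The distinction Greg and Sage are drawing (a HeartbeatMap watches threads inside one OSD process, rather than pinging other OSDs) can be illustrated with a minimal Python sketch. This is not Ceph's code: the class and method names (`ThreadHeartbeatMap`, `register_thread`, `touch`, `check`, `SuicideTimeout`) and the grace values are hypothetical, and raising an exception stands in for the process shutting itself down. A worker that stalls, for example because the process is too busy to schedule it, stops calling `touch()` and first crosses the soft grace, then the suicide grace.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


class SuicideTimeout(Exception):
    """Raised when a thread has been silent longer than its suicide grace."""


@dataclass
class _Worker:
    grace: float          # seconds of silence before the thread is reported unhealthy
    suicide_grace: float  # seconds of silence before the process should give up
    last_touch: float     # clock value of the most recent touch()


class ThreadHeartbeatMap:
    """Hypothetical sketch of a local (in-process) thread heartbeat table.

    Nothing here talks to other daemons; it only tracks whether threads
    in this process are still making progress.
    """

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._lock = threading.Lock()  # touch() and check() run on different threads
        self._workers: Dict[str, _Worker] = {}

    def register_thread(self, name: str, grace: float, suicide_grace: float) -> None:
        with self._lock:
            self._workers[name] = _Worker(grace, suicide_grace, self._clock())

    def touch(self, name: str) -> None:
        """Called by the worker itself each time it makes progress."""
        with self._lock:
            self._workers[name].last_touch = self._clock()

    def check(self) -> List[str]:
        """Return threads past their grace; raise if any is past its suicide grace."""
        now = self._clock()
        unhealthy: List[str] = []
        with self._lock:
            for name, w in self._workers.items():
                silent_for = now - w.last_touch
                if silent_for > w.suicide_grace:
                    raise SuicideTimeout(f"{name}: hit suicide timeout after {silent_for:.0f}s")
                if silent_for > w.grace:
                    unhealthy.append(name)
        return unhealthy


if __name__ == "__main__":
    fake_now = [0.0]  # manual clock so the example is deterministic
    hb = ThreadHeartbeatMap(clock=lambda: fake_now[0])
    hb.register_thread("op_worker", grace=5.0, suicide_grace=50.0)

    fake_now[0] = 7.0
    print(hb.check())  # ['op_worker']: past grace, not yet fatal
    hb.touch("op_worker")
    print(hb.check())  # []: the thread reported progress again
    fake_now[0] = 100.0
    try:
        hb.check()
    except SuicideTimeout as exc:
        print(exc)     # op_worker: hit suicide timeout after 93s
```

The injectable clock is only there to make the behaviour reproducible; in a real daemon the check would run periodically against wall-clock time, and the point of the sketch is that a thread starved of CPU trips these timeouts without any network peer being involved.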

hit suicide timeout message after upgrade to 0.56

2013-01-03 Thread Wido den Hollander
Hi, I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and found out this morning that I had 23 OSDs still up and in. Investigating some logs I found these messages: * -8 2013-01-02 21:13:40.528936

Re: hit suicide timeout message after upgrade to 0.56

2013-01-03 Thread Sage Weil
Hi Wido,

On Thu, 3 Jan 2013, Wido den Hollander wrote: Hi, I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and found out this morning that I had 23 OSDs still up and in. Investigating some logs I found these messages:

This sounds quite a bit like #3714. You might give

Re: hit suicide timeout message after upgrade to 0.56

2013-01-03 Thread Wido den Hollander
Hi, On 01/03/2013 05:52 PM, Sage Weil wrote: Hi Wido, On Thu, 3 Jan 2013, Wido den Hollander wrote: Hi, I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and found out this morning that I had 23 OSDs still up and in. Investigating some logs I found these messages:

Re: hit suicide timeout message after upgrade to 0.56

2013-01-03 Thread Sage Weil
On Thu, 3 Jan 2013, Wido den Hollander wrote: Hi, On 01/03/2013 05:52 PM, Sage Weil wrote: Hi Wido, On Thu, 3 Jan 2013, Wido den Hollander wrote: Hi, I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and found out this morning that I had 23 OSDs

Re: hit suicide timeout message after upgrade to 0.56

2013-01-03 Thread Wido den Hollander
On 01/03/2013 10:05 PM, Sage Weil wrote: On Thu, 3 Jan 2013, Wido den Hollander wrote: Hi, On 01/03/2013 05:52 PM, Sage Weil wrote: Hi Wido, On Thu, 3 Jan 2013, Wido den Hollander wrote: Hi, I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and found out this