Re: [Pacemaker] Pacemaker/corosync freeze

2014-09-04 Thread Sreenivasa
Hi Attila,

Did you try compiling libqb 0.17.0? Did that solve your issue?
I have the same issue. Please let me know if you have already solved it.

Thanks
Sreenivasa 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Attila Megyeri
Hi Andrew,


> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 18, 2014 11:40 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 18 Mar 2014, at 6:03 pm, Attila Megyeri 
> wrote:
> 
> > Hello,
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 18, 2014 2:43 AM
> >> To: Attila Megyeri
> >> Cc: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 13 Mar 2014, at 11:44 pm, Attila Megyeri wrote:
> >>
> >>> Hello,
> >>>
> >>>> -Original Message-
> >>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >>>> Sent: Thursday, March 13, 2014 10:03 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>> ...
> >>>>
> >>>> Attila,
> >>>> what seems to be interesting is
> >>>>
> >>>> Configuration ERRORs found during PE processing.  Please run
> >>>> "crm_verify -L" to identify issues.
> >>>>
> >>>> I'm unsure how much is this problem but I'm really not pacemaker
> >>>> expert.
> >>>
> >>> Perhaps Andrew could comment on that. Any idea?
> >>
> >> Did you run the command?  What did it say?
> >
> > Yes, all was fine. This is why I found it strange.
> 
> If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then
> I should be able to figure out what it was complaining about.
> (You can also run:
> crm_verify --xml-file /var/lib/pacemaker/pengine/pe-error-7.bz2 -V )

The file is still there, and the crm_verify check is successful (error 0) with 
no output. The file is full of confidential data, but if you think you can find 
something useful in it, I can send it in a direct mail.

thanks!







Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Andrew Beekhof

On 18 Mar 2014, at 6:03 pm, Attila Megyeri  wrote:

> Hello,
> 
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 18, 2014 2:43 AM
>> To: Attila Megyeri
>> Cc: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>> 
>> 
>> On 13 Mar 2014, at 11:44 pm, Attila Megyeri 
>> wrote:
>> 
>>> Hello,
>>> 
>>>> -Original Message-
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>> 
>>>> ...
>>>> 
>>>> Attila,
>>>> what seems to be interesting is
>>>> 
>>>> Configuration ERRORs found during PE processing.  Please run
>>>> "crm_verify -L" to identify issues.
>>>> 
>>>> I'm unsure how much is this problem but I'm really not pacemaker expert.
>>> 
>>> Perhaps Andrew could comment on that. Any idea?
>> 
>> Did you run the command?  What did it say?
> 
> Yes, all was fine. This is why I found it strange.

If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then I 
should be able to figure out what it was complaining about.
(You can also run: crm_verify --xml-file 
/var/lib/pacemaker/pengine/pe-error-7.bz2 -V )
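
The check suggested above can be scripted so that the exit code and any output are captured together, which helps when the tool reports success but the log still complains. This is only a sketch: the helper names are made up for illustration, `--xml-file` and `-V` are the options quoted in this message, and the idea that repeating `-V` increases verbosity is an assumption about crm_verify that should be confirmed against its help output.

```python
import subprocess


def crm_verify_command(pe_file, verbosity=1):
    """Build the crm_verify argument list for a saved policy-engine file."""
    # --xml-file and -V come from the message above; repeating -V to get
    # more detail is an assumption, not something confirmed in the thread.
    return ["crm_verify", "--xml-file", pe_file] + ["-V"] * verbosity


def run_crm_verify(pe_file, verbosity=1):
    """Run crm_verify and return (exit_code, combined stdout and stderr)."""
    result = subprocess.run(
        crm_verify_command(pe_file, verbosity),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is a finding here, not a script error
    )
    return result.returncode, result.stdout + result.stderr
```

Calling `run_crm_verify("/var/lib/pacemaker/pengine/pe-error-7.bz2")` on the affected node would return the exit code and whatever the tool printed, which can then be pasted into a reply.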




Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Attila Megyeri
Hello,

> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 18, 2014 2:43 AM
> To: Attila Megyeri
> Cc: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 13 Mar 2014, at 11:44 pm, Attila Megyeri 
> wrote:
> 
> > Hello,
> >
> >> -Original Message-
> >> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >> Sent: Thursday, March 13, 2014 10:03 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> ...
> >>
> >>>>>>
> >>>>>> Also can you please try to set debug: on in corosync.conf and
> >>>>>> paste full corosync.log then?
> >>>>>
> >>>>> I set debug to on, and did a few restarts but could not reproduce
> >>>>> the issue
> >>>> yet - will post the logs as soon as I manage to reproduce.
> >>>>>
> >>>>
> >>>> Perfect.
> >>>>
> >>>> Another option you can try to set is netmtu (1200 is usually safe).
> >>>
> >>> Finally I was able to reproduce the issue.
> >>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> >>> (not
> >> when node was up again).
> >>>
> >>> The corosync log with debug on is available at:
> >>> http://pastebin.com/kTpDqqtm
> >>>
> >>>
> >>> To be honest, I had to wait much longer for this reproduction as
> >>> before,
> >> even though there was no change in the corosync configuration - just
> >> potentially some system updates. But anyway, the issue is
> >> unfortunately still there.
> >>> Previously, when this issue came, cpu was at 100% on all nodes -
> >>> this time
> >> only on ctmgr, which was the DC...
> >>>
> >>> I hope you can find some useful details in the log.
> >>>
> >>
> >> Attila,
> >> what seems to be interesting is
> >>
> >> Configuration ERRORs found during PE processing.  Please run
> >> "crm_verify -L" to identify issues.
> >>
> >> I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Perhaps Andrew could comment on that. Any idea?
> 
> Did you run the command?  What did it say?

Yes, all was fine. This is why I found it strange.





Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-17 Thread Andrew Beekhof

On 13 Mar 2014, at 11:44 pm, Attila Megyeri  wrote:

> Hello,
> 
>> -Original Message-
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Thursday, March 13, 2014 10:03 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>> 
>> ...
>> 
>>>>>> 
>>>>>> Also can you please try to set debug: on in corosync.conf and paste
>>>>>> full corosync.log then?
>>>>> 
>>>>> I set debug to on, and did a few restarts but could not reproduce
>>>>> the issue
>>>> yet - will post the logs as soon as I manage to reproduce.
>>>>> 
>>>> 
>>>> Perfect.
>>>> 
>>>> Another option you can try to set is netmtu (1200 is usually safe).
>>> 
>>> Finally I was able to reproduce the issue.
>>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
>> when node was up again).
>>> 
>>> The corosync log with debug on is available at:
>>> http://pastebin.com/kTpDqqtm
>>> 
>>> 
>>> To be honest, I had to wait much longer for this reproduction as before,
>> even though there was no change in the corosync configuration - just
>> potentially some system updates. But anyway, the issue is unfortunately still
>> there.
>>> Previously, when this issue came, cpu was at 100% on all nodes - this time
>> only on ctmgr, which was the DC...
>>> 
>>> I hope you can find some useful details in the log.
>>> 
>> 
>> Attila,
>> what seems to be interesting is
>> 
>> Configuration ERRORs found during PE processing.  Please run "crm_verify -L"
>> to identify issues.
>> 
>> I'm unsure how much is this problem but I'm really not pacemaker expert.
> 
> Perhaps Andrew could comment on that. Any idea?

Did you run the command?  What did it say?





Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-17 Thread Attila Megyeri
Hi David, Jan,

For the time being, corosync 2.3.3 looks stable with libqb 0.17.0, both 
built from source.
Thank you very much for the guidance!

Attila
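
For anyone checking their own nodes against the versions mentioned in this thread (libqb 0.17.0 and corosync 2.3.3), a comparison along these lines may help. It is a minimal sketch: the function names are illustrative, the installed versions are assumed to be gathered by hand (for example from the package manager), and the "0.16.0" input below is a made-up example rather than a version reported by anyone here.

```python
def parse_version(text):
    """Turn a dotted version such as '2.3.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum versions named in this thread.
MINIMUM = {"libqb": "0.17.0", "corosync": "2.3.3"}


def outdated(installed):
    """Return the names whose installed version is below the minimum."""
    return sorted(
        name
        for name, version in installed.items()
        if name in MINIMUM
        and parse_version(version) < parse_version(MINIMUM[name])
    )


# Hypothetical input: libqb is older than 0.17.0, corosync is new enough.
print(outdated({"libqb": "0.16.0", "corosync": "2.3.3"}))
```

Comparing tuples of integers rather than raw strings avoids the usual trap where "0.9.0" sorts after "0.17.0".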

> -Original Message-
> From: David Vossel [mailto:dvos...@redhat.com]
> Sent: Thursday, March 13, 2014 9:22 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> 
> 
> 
> - Original Message -
> > From: "Jan Friesse" 
> > To: "The Pacemaker cluster resource manager"
> > 
> > Sent: Thursday, March 13, 2014 4:03:28 AM
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>>
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue
> > >> yet - will post the logs as soon as I manage to reproduce.
> > >>>
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > (not when node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > >
> > > To be honest, I had to wait much longer for this reproduction as
> > > before, even though there was no change in the corosync
> > > configuration - just potentially some system updates. But anyway,
> > > the issue is unfortunately still there.
> > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> > >
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing.  Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Anyway, I have theory what may happening and it looks like related
> > with IPC (and probably not related to network). But to make sure we
> > will not try fixing already fixed bug, can you please build:
> > - New libqb (0.17.0). There are plenty of fixes in IPC
> > - Corosync 2.3.3 (already plenty IPC fixes)
> 
> yes, there was a libqb/corosync interoperation problem that showed these
> same symptoms last year. Updating to the latest corosync and libqb will likely
> resolve this.
> 
> > - And maybe also newer pacemaker
> >
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> >   Honza
> >
> > > Thanks,
> > > Attila
> > >
> > >
> > >
> > >>
> > >> Regards,
> > >>   Honza
> > >>
> > >>>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
> >


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Jan Friesse
Attila Megyeri napsal(a):
> Hi Honza,
> 
> What I also found in the log related to the freeze at 12:22:26:
> 
> 
> Corosync main process was not scheduled for  ... Can It be the general 
> cause of the issue?
> 

I don't think it causes the issue you are hitting, BUT keep in mind that
if corosync is not scheduled for a long time, the node will probably be fenced
by another node. So increasing the timeout is vital.

Honza
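
To judge how large a timeout increase might be needed, it can help to pull every scheduling-pause warning out of the log and look at the longest one. The following is a rough sketch, not a corosync tool: the function names are invented for illustration, and the pattern is written against the single warning line quoted in this thread, including its oddly formatted "4000." threshold.

```python
import re

# Matches the scheduling-pause warning as it appears in the quoted log.
# The threshold there reads "4000." so the fraction digits are optional.
PAUSE_RE = re.compile(
    r"not scheduled for (\d+(?:\.\d*)?) ms \(threshold is (\d+(?:\.\d*)?) ms\)"
)


def find_scheduling_pauses(log_text):
    """Return (pause_ms, threshold_ms) for each warning found in log_text."""
    # Archived log lines are hard-wrapped, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [(float(p), float(t)) for p, t in PAUSE_RE.findall(flat)]


def longest_pause_ms(log_text):
    """Return the longest pause in milliseconds, or 0.0 if none was logged."""
    return max((p for p, _ in find_scheduling_pauses(log_text)), default=0.0)


sample = """Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was
not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout
increase."""
print(find_scheduling_pauses(sample))
```

The longest pause only gives a lower bound for the discussion; what value to actually configure is a judgment call for the cluster in question.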

> 
> 
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
> [10.9.1.5]:58597->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
> [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
> [10.9.1.5]:47943->[10.9.1.3]:161
> Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
> [10.9.1.5]:59647->[10.9.1.3]:161
> 
> 
> Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was 
> not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token 
> timeout increase.
> 
> 
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
> OPERATIONAL state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
> new configuration.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
> 2(The token was lost in the OPERATIONAL state.).
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token 
> because I am the rep.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high 
> seq received 6a8c
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
> ring 7dc
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
> Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
> 
> 
> 
> 
> Regards,
> Attila
> 
>> -Original Message-
>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>> Sent: Thursday, March 13, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>>> -Original Message-
>>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>>> Sent: Thursday, March 13, 2014 1:45 PM
>>> To: The Pacemaker cluster resource manager; Andrew Beekhof
>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>
>>> Hello,
>>>
>>>> -Original Message-
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Thursday, March 13, 2014 10:03 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> ...
>>>>
>>>>>>>>
>>>>>>>> Also can you please try to set debug: on in corosync.conf and
>>>>>>>> paste full corosync.log then?
>>>>>>>
>>>>>>> I set debug to on, and did a few restarts but could not
>>>>>>> reproduce the issue
>>>>>> yet - will post the logs as soon as I manage to reproduce.
>>>>>>>
>>>>>>
>>>>>> Perfect.
>>>>>>
>>>>>> Another option you can try to set is netmtu (1200 is usually safe).
>>>>>
>>>>> Finally I was able to reproduce the issue.
>>>>> I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
>>>>> (not
>>>> when node was up again).
>>>>>
>>>>> The corosync log with debug on is available at:
>>>>> http://pastebin.com/kTpDqqtm
>>>>>
>>>>>
>>>>> To be honest, I had to wait much longer for this reproduction as
>>>>> before,
>>>> even though there was no change in the corosync configuration - just
>>>> potentially some system updates. But anyway, the issue is
>>>> unfortunately still there.
>>>>> Previously, when this issue

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Attila Megyeri
Hello David,


> -Original Message-
> From: David Vossel [mailto:dvos...@redhat.com]
> Sent: Thursday, March 13, 2014 9:22 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> 
> 
> 
> - Original Message -
> > From: "Jan Friesse" 
> > To: "The Pacemaker cluster resource manager"
> > 
> > Sent: Thursday, March 13, 2014 4:03:28 AM
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>>
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue
> > >> yet - will post the logs as soon as I manage to reproduce.
> > >>>
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > (not when node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > >
> > > To be honest, I had to wait much longer for this reproduction as
> > > before, even though there was no change in the corosync
> > > configuration - just potentially some system updates. But anyway,
> > > the issue is unfortunately still there.
> > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> > >
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing.  Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much is this problem but I'm really not pacemaker expert.
> >
> > Anyway, I have theory what may happening and it looks like related
> > with IPC (and probably not related to network). But to make sure we
> > will not try fixing already fixed bug, can you please build:
> > - New libqb (0.17.0). There are plenty of fixes in IPC
> > - Corosync 2.3.3 (already plenty IPC fixes)
> 
> yes, there was a libqb/corosync interoperation problem that showed these
> same symptoms last year. Updating to the latest corosync and libqb will likely
> resolve this.

I have upgraded all nodes to these versions and we are testing. So far no issues.
Thank you very much for your help.

Regards,
Attila





> 
> > - And maybe also newer pacemaker
> >
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> >   Honza
> >
> > > Thanks,
> > > Attila
> > >
> > >
> > >
> > >>
> > >> Regards,
> > >>   Honza
> > >>
> > >>>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
> >


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread David Vossel




- Original Message -
> From: "Jan Friesse" 
> To: "The Pacemaker cluster resource manager" 
> Sent: Thursday, March 13, 2014 4:03:28 AM
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> ...
> 
> >>>>
> >>>> Also can you please try to set debug: on in corosync.conf and paste
> >>>> full corosync.log then?
> >>>
> >>> I set debug to on, and did a few restarts but could not reproduce the
> >>> issue
> >> yet - will post the logs as soon as I manage to reproduce.
> >>>
> >>
> >> Perfect.
> >>
> >> Another option you can try to set is netmtu (1200 is usually safe).
> > 
> > Finally I was able to reproduce the issue.
> > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not
> > when node was up again).
> > 
> > The corosync log with debug on is available at:
> > http://pastebin.com/kTpDqqtm
> > 
> > 
> > To be honest, I had to wait much longer for this reproduction as before,
> > even though there was no change in the corosync configuration - just
> > potentially some system updates. But anyway, the issue is unfortunately
> > still there.
> > Previously, when this issue came, cpu was at 100% on all nodes - this time
> > only on ctmgr, which was the DC...
> > 
> > I hope you can find some useful details in the log.
> > 
> 
> Attila,
> what seems to be interesting is
> 
> Configuration ERRORs found during PE processing.  Please run "crm_verify
> -L" to identify issues.
> 
> I'm unsure how much is this problem but I'm really not pacemaker expert.
> 
> Anyway, I have theory what may happening and it looks like related with
> IPC (and probably not related to network). But to make sure we will not
> try fixing already fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of fixes in IPC
> - Corosync 2.3.3 (already plenty IPC fixes)

yes, there was a libqb/corosync interoperation problem that showed these same 
symptoms last year. Updating to the latest corosync and libqb will likely 
resolve this.

> - And maybe also newer pacemaker
> 
> I know you were not very happy using hand-compiled sources, but please
> give them at least a try.
> 
> Thanks,
>   Honza
> 
> > Thanks,
> > Attila
> > 
> > 
> > 
> >>
> >> Regards,
> >>   Honza
> >>
> >>>
> >>> There are also a few things that might or might not be related:
> >>>
> >>> 1) Whenever I want to edit the configuration with "crm configure edit",
> 
> ...
> 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri
Hi Honza,

What I also found in the log related to the freeze at 12:22:26:


"Corosync main process was not scheduled for ..." Could it be the general 
cause of the issue?



Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:58597->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:47943->[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:59647->[10.9.1.3]:161


Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was not 
scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout 
increase.


Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token because 
I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high seq 
received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:




Regards,
Attila

> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, March 13, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> > -Original Message-
> > From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > Sent: Thursday, March 13, 2014 1:45 PM
> > To: The Pacemaker cluster resource manager; Andrew Beekhof
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > Hello,
> >
> > > -Original Message-
> > > From: Jan Friesse [mailto:jfrie...@redhat.com]
> > > Sent: Thursday, March 13, 2014 10:03 AM
> > > To: The Pacemaker cluster resource manager
> > > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> > >
> > > ...
> > >
> > > >>>>
> > > >>>> Also can you please try to set debug: on in corosync.conf and
> > > >>>> paste full corosync.log then?
> > > >>>
> > > >>> I set debug to on, and did a few restarts but could not
> > > >>> reproduce the issue
> > > >> yet - will post the logs as soon as I manage to reproduce.
> > > >>>
> > > >>
> > > >> Perfect.
> > > >>
> > > >> Another option you can try to set is netmtu (1200 is usually safe).
> > > >
> > > > Finally I was able to reproduce the issue.
> > > > I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
> > > > (not
> > > when node was up again).
> > > >
> > > > The corosync log with debug on is available at:
> > > > http://pastebin.com/kTpDqqtm
> > > >
> > > >
> > > > To be honest, I had to wait much longer for this reproduction as
> > > > before,
> > > even though there was no change in the corosync configuration - just
> > > potentially some system updates. But anyway, the issue is
> > > unfortunately still there.
> > > > Previously, when this issue came, cpu was at 100% on all nodes -
> > > > this time
> > > only on ctmgr, which was the DC...
> > > >
> > > > I hope you can find some useful details in the log.
> > > >
> > >
> > > Attila,
> > > what seems to be interesting is
> > >
> > > Configuration ERRORs found during PE processing.  Please run
> > > "crm_verify -L" to identify issues.
> > >
> > > I'm unsure how much is this problem but I'm really not pacemaker
&

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri

> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Thursday, March 13, 2014 1:45 PM
> To: The Pacemaker cluster resource manager; Andrew Beekhof
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> Hello,
> 
> > -Original Message-
> > From: Jan Friesse [mailto:jfrie...@redhat.com]
> > Sent: Thursday, March 13, 2014 10:03 AM
> > To: The Pacemaker cluster resource manager
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> > ...
> >
> > >>>>
> > >>>> Also can you please try to set debug: on in corosync.conf and
> > >>>> paste full corosync.log then?
> > >>>
> > >>> I set debug to on, and did a few restarts but could not reproduce
> > >>> the issue yet - will post the logs as soon as I manage to
> > >>> reproduce.
> > >>>
> > >>
> > >> Perfect.
> > >>
> > >> Another option you can try to set is netmtu (1200 is usually safe).
> > >
> > > Finally I was able to reproduce the issue.
> > > I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
> > > (not when the node was up again).
> > >
> > > The corosync log with debug on is available at:
> > > http://pastebin.com/kTpDqqtm
> > >
> > >
> > > To be honest, I had to wait much longer for this reproduction than
> > > before, even though there was no change in the corosync configuration -
> > > just potentially some system updates. But anyway, the issue is
> > > unfortunately still there.
> > > Previously, when this issue came, CPU was at 100% on all nodes -
> > > this time only on ctmgr, which was the DC...
> > >
> > > I hope you can find some useful details in the log.
> > >
> >
> > Attila,
> > what seems to be interesting is
> >
> > Configuration ERRORs found during PE processing.  Please run
> > "crm_verify -L" to identify issues.
> >
> > I'm unsure how much of a problem this is, but I'm really not a
> > pacemaker expert.
> 
> Perhaps Andrew could comment on that. Any idea?
> 
> 
> >
> > Anyway, I have a theory about what may be happening, and it looks
> > related to IPC (and probably not to the network). But to make sure we
> > do not try to fix an already-fixed bug, can you please build:
> > - New libqb (0.17.0). There are plenty of IPC fixes in it
> > - Corosync 2.3.3 (it already has plenty of IPC fixes)
> > - And maybe also a newer pacemaker
> >
> 
> I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from
> the Ubuntu package.
> I am currently building libqb 0.17.0 and will update you on the results.
> 
> In the meantime we had another freeze, which did not seem to be related to
> any restarts, but brought all corosync processes to 100% CPU.
> Please check out the corosync.log; perhaps it has a different cause:
> http://pastebin.com/WMwzv0Rr
> 
> 
> In the meantime I will install the new libqb and send logs if we have further
> issues.
> 
> Thank you very much for your help!
> 
> Regards,
> Attila
> 

One more question:

If I install libqb 0.17.0 from source, do I need to rebuild corosync as well, 
or will it be fine if it was built against libqb 0.16.0?

BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can 
see if it makes a difference. If I see crashes on the outdated ones, but not on 
the new ones, we are fine. :)

Thanks,

Attila
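
One possible way to see which libqb a given corosync binary resolves at run time is to look at its `ldd` output on each host. The Python sketch below only parses such text; the sample lines and the /usr/local/lib path are hypothetical, and the result does not by itself say whether a rebuild is needed.

```python
import re

# Matches ldd-style lines such as "libqb.so.0 => /usr/lib/libqb.so.0 (0x...)".
LDD_LINE = re.compile(r"^\s*(?P<soname>\S+)\s+=>\s+(?P<target>\S+)")


def resolved_library(ldd_output, name_prefix):
    """Return the path a library resolves to in ldd-style text, or None.

    None is returned both when the library is not listed and when ldd
    reported it as "not found".
    """
    for line in ldd_output.splitlines():
        match = LDD_LINE.match(line)
        if match and match.group("soname").startswith(name_prefix):
            target = match.group("target")
            return target if target.startswith("/") else None
    return None


# Hypothetical sample; real text would come from running ldd on the
# corosync binary of each host.
SAMPLE_LDD = (
    "\tlibqb.so.0 => /usr/local/lib/libqb.so.0 (0x00007f0000000000)\n"
    "\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000200000)\n"
)

print(resolved_library(SAMPLE_LDD, "libqb.so"))
```

Comparing the reported path across the hosts that have the new libqb and those that do not would show whether the hand-built library is the one actually being loaded.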







> 
> 
> > I know you were not very happy using hand-compiled sources, but please
> > give them at least a try.
> >
> > Thanks,
> >   Honza
> >
> > > Thanks,
> > > Attila
> > >
> > >
> > >
> > >>
> > >> Regards,
> > >>   Honza
> > >>
> > >>>
> > >>> There are also a few things that might or might not be related:
> > >>>
> > >>> 1) Whenever I want to edit the configuration with "crm configure
> > >>> edit",
> >
> > ...
> >


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri
Hello,

> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Thursday, March 13, 2014 10:03 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> ...
> 
> >>>>
> >>>> Also can you please try to set debug: on in corosync.conf and paste
> >>>> full corosync.log then?
> >>>
> >>> I set debug to on, and did a few restarts but could not reproduce
> >>> the issue yet - will post the logs as soon as I manage to reproduce.
> >>>
> >>
> >> Perfect.
> >>
> >> Another option you can try to set is netmtu (1200 is usually safe).
> >
> > Finally I was able to reproduce the issue.
> > I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
> > (not when the node was up again).
> >
> > The corosync log with debug on is available at:
> > http://pastebin.com/kTpDqqtm
> >
> >
> > To be honest, I had to wait much longer for this reproduction than before,
> > even though there was no change in the corosync configuration - just
> > potentially some system updates. But anyway, the issue is unfortunately
> > still there.
> > Previously, when this issue came, CPU was at 100% on all nodes - this time
> > only on ctmgr, which was the DC...
> >
> > I hope you can find some useful details in the log.
> >
> 
> Attila,
> what seems to be interesting is
> 
> Configuration ERRORs found during PE processing.  Please run "crm_verify -L"
> to identify issues.
> 
> I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.

Perhaps Andrew could comment on that. Any idea?


> 
> Anyway, I have a theory about what may be happening, and it looks related to
> IPC (and probably not to the network). But to make sure we do not try to fix
> an already-fixed bug, can you please build:
> - New libqb (0.17.0). There are plenty of IPC fixes in it
> - Corosync 2.3.3 (it already has plenty of IPC fixes)
> - And maybe also a newer pacemaker
> 

I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from the 
Ubuntu package.
I am currently building libqb 0.17.0 and will update you on the results.

In the meantime we had another freeze, which did not seem to be related to any 
restarts, but brought all corosync processes to 100% CPU.
Please check out the corosync.log; perhaps it has a different cause: 
http://pastebin.com/WMwzv0Rr 


In the meantime I will install the new libqb and send logs if we have further 
issues.

Thank you very much for your help!

Regards,
Attila
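
When comparing logs from several freezes, it may help to tally the crm_cs_flush retry lines per node and daemon. The Python sketch below assumes the line layout of the excerpts quoted elsewhere in this thread; the function name and the regular expression are illustrative, not part of pacemaker or corosync.

```python
import re
from collections import Counter

# Pattern for the crm_cs_flush lines quoted in this thread; whitespace
# between fields is matched loosely because the padding varies.
FLUSH_LINE = re.compile(
    r"^\w{3} \d{2} \d{2}:\d{2}:\d{2} \[\d+\] "
    r"(?P<node>\S+)\s+(?P<daemon>\w+):\s+info:\s+crm_cs_flush:\s+"
    r"Sent (?P<sent>\d+) CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\): (?P<reason>.+)$"
)


def stalled_flushes(lines):
    """Count lines where nothing was sent, keyed by (node, daemon)."""
    counts = Counter()
    for line in lines:
        match = FLUSH_LINE.match(line.strip())
        if match and int(match.group("sent")) == 0:
            counts[(match.group("node"), match.group("daemon"))] += 1
    return counts


# The four ctsip1 lines from the original report, plus one unrelated line.
SAMPLE_LOG = [
    "Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)",
    "Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)",
    "a line in some other format",
]

for key, count in stalled_flushes(SAMPLE_LOG).items():
    print(key, count)
```

Running the same tally over the log of each node would show whether the retries are confined to one daemon or spread over the whole cluster.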



> I know you were not very happy using hand-compiled sources, but please
> give them at least a try.
> 
> Thanks,
>   Honza
> 
> > Thanks,
> > Attila
> >
> >
> >
> >>
> >> Regards,
> >>   Honza
> >>
> >>>
> >>> There are also a few things that might or might not be related:
> >>>
> >>> 1) Whenever I want to edit the configuration with "crm configure
> >>> edit",
> 
> ...
> 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Jan Friesse
...


>>>> Also can you please try to set debug: on in corosync.conf and paste
>>>> full corosync.log then?
>>>
>>> I set debug to on, and did a few restarts but could not reproduce the
>>> issue yet - will post the logs as soon as I manage to reproduce.
>>>
>>
>> Perfect.
>>
>> Another option you can try to set is netmtu (1200 is usually safe).
> 
> Finally I was able to reproduce the issue.
> I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately (not
> when the node was up again).
> 
> The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm
> 
> 
> To be honest, I had to wait much longer for this reproduction than before,
> even though there was no change in the corosync configuration - just
> potentially some system updates. But anyway, the issue is unfortunately
> still there.
> Previously, when this issue came, CPU was at 100% on all nodes - this time 
> only on ctmgr, which was the DC...
> 
> I hope you can find some useful details in the log.
> 

Attila,
what seems to be interesting is

Configuration ERRORs found during PE processing.  Please run
"crm_verify -L" to identify issues.

I'm unsure how much of a problem this is, but I'm really not a pacemaker
expert.

Anyway, I have a theory about what may be happening, and it looks related
to IPC (and probably not to the network). But to make sure we do not try
to fix an already-fixed bug, can you please build:
- New libqb (0.17.0). There are plenty of IPC fixes in it
- Corosync 2.3.3 (it already has plenty of IPC fixes)
- And maybe also a newer pacemaker

I know you were not very happy using hand-compiled sources, but please
give them at least a try.

Thanks,
  Honza
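
Before reproducing again, it can be worth confirming that the two corosync.conf options suggested above (debug: on and netmtu) are really set on every node. The Python sketch below is a minimal reader for corosync.conf-style text, not corosync's own parser; the sample fragment and the function name are hypothetical.

```python
def parse_corosync_conf(text):
    """Flatten corosync.conf-style text into a {"section.key": value} dict.

    Simplification: a repeated section (for example several interface
    blocks) overwrites earlier keys with the same dotted name.
    """
    settings = {}
    sections = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.endswith("{"):
            sections.append(line[:-1].strip())  # entering a section
        elif line == "}":
            if sections:
                sections.pop()  # leaving the current section
        elif ":" in line:
            key, value = line.split(":", 1)
            settings[".".join(sections + [key.strip()])] = value.strip()
    return settings


# Hypothetical fragment using the two options suggested above.
SAMPLE_CONF = """
totem {
    version: 2
    transport: udpu
    netmtu: 1200
}

logging {
    to_logfile: yes
    debug: on
}
"""

conf = parse_corosync_conf(SAMPLE_CONF)
print(conf.get("logging.debug"), conf.get("totem.netmtu"))
```

A missing key comes back as None from `conf.get`, which makes it easy to spot a node where the option was never added.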

> Thanks,
> Attila
> 
> 
> 
>>
>> Regards,
>>   Honza
>>
>>>
>>> There are also a few things that might or might not be related:
>>>
>>> 1) Whenever I want to edit the configuration with "crm configure edit",

...



Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri


> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Wednesday, March 12, 2014 4:31 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> Attila Megyeri wrote:
> >> -Original Message-
> >> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >> Sent: Wednesday, March 12, 2014 2:27 PM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> Attila Megyeri wrote:
> >>> Hello Jan,
> >>>
> >>> Thank you very much for your help so far.
> >>>
> >>>> -Original Message-
> >>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >>>> Sent: Wednesday, March 12, 2014 9:51 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>> Attila Megyeri wrote:
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>>>> Sent: Tuesday, March 11, 2014 10:27 PM
> >>>>>> To: The Pacemaker cluster resource manager
> >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>
> >>>>>>
> >>>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >>>>>> 
> >>>>>> wrote:
> >>>>>>
> >>>>>>>>
> >>>>>>>> -Original Message-
> >>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
> >>>>>>>> To: The Pacemaker cluster resource manager
> >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
> >>>>>>>> 
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Thanks for the quick response!
> >>>>>>>>>
> >>>>>>>>>> -Original Message-
> >>>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>>>>>>>> To: The Pacemaker cluster resource manager
> >>>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>>>>>>>> 
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hello,
> >>>>>>>>>>>
> >>>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>>>>>>>> From time to time, something unexpected happens and suddenly
> >>>>>>>>>>> the crm_mon output remains static.
> >>>>>>>>>>> When I check the CPU usage, I see that one of the cores uses
> >>>>>>>>>>> 100% CPU, but I cannot actually match it to either corosync
> >>>>>>>>>>> or one of the pacemaker processes.
> >>>>>>>>>>>
> >>>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>>>>>>>> I have to manually go to each node, stop pacemaker, restart
> >>>>>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
> >>>>>>>>>>> does not work in most cases; usually a kill -9 is needed.
> >>>>>>>>>>>
> >>>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>>>>>>>
> >>>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode
> >>>>>>>>>>> passive.

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
>> -Original Message-
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Wednesday, March 12, 2014 2:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri wrote:
>>> Hello Jan,
>>>
>>> Thank you very much for your help so far.
>>>
>>>> -Original Message-
>>>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>>>> Sent: Wednesday, March 12, 2014 9:51 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>> Attila Megyeri wrote:
>>>>>
>>>>>> -----Original Message-
>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>> Sent: Tuesday, March 11, 2014 10:27 PM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
>>>>>> 
>>>>>> wrote:
>>>>>>
>>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>>>>>>>> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the quick response!
>>>>>>>>>
>>>>>>>>>> -Original Message-
>>>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>>>>>> From time to time, something unexpected happens and suddenly
>>>>>>>>>>> the crm_mon output remains static.
>>>>>>>>>>> When I check the CPU usage, I see that one of the cores uses
>>>>>>>>>>> 100% CPU, but I cannot actually match it to either corosync or
>>>>>>>>>>> one of the pacemaker processes.
>>>>>>>>>>>
>>>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
>>>>>>>>>>> does not work in most cases; usually a kill -9 is needed.
>>>>>>>>>>>
>>>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>>>>>
>>>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>>>>>
>>>>>>>>>>> Logs are usually flooded with CPG-related messages, such as:
>>>>>>>>>>>
>>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Wednesday, March 12, 2014 2:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>
> Attila Megyeri wrote:
> > Hello Jan,
> >
> > Thank you very much for your help so far.
> >
> >> -Original Message-
> >> From: Jan Friesse [mailto:jfrie...@redhat.com]
> >> Sent: Wednesday, March 12, 2014 9:51 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >> Attila Megyeri wrote:
> >>>
> >>>> -Original Message-
> >>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>> Sent: Tuesday, March 11, 2014 10:27 PM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >>>> 
> >>>> wrote:
> >>>>
> >>>>>>
> >>>>>> -Original Message-
> >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
> >>>>>> To: The Pacemaker cluster resource manager
> >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>
> >>>>>>
> >>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
> >>>>>> 
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Thanks for the quick response!
> >>>>>>>
> >>>>>>>> -Original Message-
> >>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>>>>>> To: The Pacemaker cluster resource manager
> >>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>>>>>> 
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>>>>>> From time to time, something unexpected happens and suddenly
> >>>>>>>>> the crm_mon output remains static.
> >>>>>>>>> When I check the CPU usage, I see that one of the cores uses
> >>>>>>>>> 100% CPU, but I cannot actually match it to either corosync or
> >>>>>>>>> one of the pacemaker processes.
> >>>>>>>>>
> >>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>>>>>> I have to manually go to each node, stop pacemaker, restart
> >>>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
> >>>>>>>>> does not work in most cases; usually a kill -9 is needed.
> >>>>>>>>>
> >>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>>>>>
> >>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
> >>>>>>>>>
> >>>>>>>>> Logs are usually flooded with CPG-related messages, such as:
> >>>>>>>>>
> >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
> Hello Jan,
> 
> Thank you very much for your help so far.
> 
>> -Original Message-
>> From: Jan Friesse [mailto:jfrie...@redhat.com]
>> Sent: Wednesday, March 12, 2014 9:51 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>> Attila Megyeri wrote:
>>>
>>>> -Original Message-
>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>> Sent: Tuesday, March 11, 2014 10:27 PM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>>
>>>> On 12 Mar 2014, at 1:54 am, Attila Megyeri
>>>> 
>>>> wrote:
>>>>
>>>>>>
>>>>>> -Original Message-
>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
>>>>>> 
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks for the quick response!
>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>>>> To: The Pacemaker cluster resource manager
>>>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>>>> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>>>> From time to time, something unexpected happens and suddenly
>>>>>>>>> the crm_mon output remains static.
>>>>>>>>> When I check the CPU usage, I see that one of the cores uses
>>>>>>>>> 100% CPU, but I cannot actually match it to either corosync or
>>>>>>>>> one of the pacemaker processes.
>>>>>>>>>
>>>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
>>>>>>>>> does not work in most cases; usually a kill -9 is needed.
>>>>>>>>>
>>>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>>>
>>>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>>>
>>>>>>>>> Logs are usually flooded with CPG-related messages, such as:
>>>>>>>>>
>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>>>
>>>>>>>>> OR
>>>>>>>>>
>>>>>>>>> Mar 06 17:46:24 [1341] ctdb1cib: info: crm_c

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
Hello Jan,

Thank you very much for your help so far.

> -Original Message-
> From: Jan Friesse [mailto:jfrie...@redhat.com]
> Sent: Wednesday, March 12, 2014 9:51 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> Attila Megyeri wrote:
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 10:27 PM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 12 Mar 2014, at 1:54 am, Attila Megyeri
> >> 
> >> wrote:
> >>
> >>>>
> >>>> -Original Message-
> >>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>> Sent: Tuesday, March 11, 2014 12:48 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri
> >>>> 
> >>>> wrote:
> >>>>
> >>>>> Thanks for the quick response!
> >>>>>
> >>>>>> -Original Message-
> >>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>>>> To: The Pacemaker cluster resource manager
> >>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>>>
> >>>>>>
> >>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>>>> 
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>>>> From time to time, something unexpected happens and suddenly the
> >>>>>>> crm_mon output remains static.
> >>>>>>> When I check the CPU usage, I see that one of the cores uses
> >>>>>>> 100% CPU, but I cannot actually match it to either corosync or
> >>>>>>> one of the pacemaker processes.
> >>>>>>>
> >>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>>>> I have to manually go to each node, stop pacemaker, restart
> >>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync
> >>>>>>> does not work in most cases; usually a kill -9 is needed.
> >>>>>>>
> >>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>>>
> >>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
> >>>>>>>
> >>>>>>> Logs are usually flooded with CPG-related messages, such as:
> >>>>>>>
> >>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>>>
> >>>>>>> OR
> >>>>>>>
> >>>>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
> 
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 11, 2014 10:27 PM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>
>>
>> On 12 Mar 2014, at 1:54 am, Attila Megyeri 
>> wrote:
>>
>>>>
>>>> -Original Message-
>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>> Sent: Tuesday, March 11, 2014 12:48 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>
>>>>
>>>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri 
>>>> wrote:
>>>>
>>>>> Thanks for the quick response!
>>>>>
>>>>>> -Original Message-
>>>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>>>> To: The Pacemaker cluster resource manager
>>>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>>>>
>>>>>>
>>>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
>>>>>> 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>>>> From time to time, something unexpected happens and suddenly the
>>>>>>> crm_mon output remains static.
>>>>>>> When I check the CPU usage, I see that one of the cores uses 100%
>>>>>>> CPU, but I cannot actually match it to either corosync or one of the
>>>>>>> pacemaker processes.
>>>>>>>
>>>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>>>> corosync, then start pacemaker. Stopping pacemaker and corosync does
>>>>>>> not work in most cases; usually a kill -9 is needed.
>>>>>>>
>>>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>>>>
>>>>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
>>>>>>>
>>>>>>> Logs are usually flooded with CPG-related messages, such as:
>>>>>>>
>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>>>
>>>>>>> OR
>>>>>>>
>>>>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>>>
>>>>>> That is usually a symptom of corosync getting into a horribly
>>>>>> confused state.
>>>>>> Version? Distro? Have you checked for an update?
>>>>>> Odd that the user of all that CPU isn't showing up though.
>>>>>>
>>>>>>>
>>>>>
>>>>> As I wrote I use Ubuntu trusty, the exact package versions are:
>>>>

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri

> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 11, 2014 10:27 PM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 12 Mar 2014, at 1:54 am, Attila Megyeri 
> wrote:
> 
> >>
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Tuesday, March 11, 2014 12:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:54 pm, Attila Megyeri 
> >> wrote:
> >>
> >>> Thanks for the quick response!
> >>>
> >>>> -Original Message-
> >>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >>>> Sent: Friday, March 07, 2014 3:48 AM
> >>>> To: The Pacemaker cluster resource manager
> >>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>>>
> >>>>
> >>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri
> >>>> 
> >>>> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We have a strange issue with Corosync/Pacemaker.
> >>>>> From time to time, something unexpected happens and suddenly the
> >>>>> crm_mon output remains static.
> >>>>> When I check the CPU usage, I see that one of the cores uses 100%
> >>>>> CPU, but I cannot actually match it to either corosync or one of
> >>>>> the pacemaker processes.
> >>>>>
> >>>>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>>>> I have to manually go to each node, stop pacemaker, restart
> >>>>> corosync, then start pacemaker. Stopping pacemaker and corosync does
> >>>>> not work in most cases; usually a kill -9 is needed.
> >>>>>
> >>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>>>
> >>>>> Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
> >>>>>
> >>>>> Logs are usually flooded with CPG-related messages, such as:
> >>>>>
> >>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>>>
> >>>>> OR
> >>>>>
> >>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>> Mar 06 17:46:24 [1341] ctdb1   cib: info: crm_cs_flush: Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>>>
> >>>> That is usually a symptom of corosync getting into a horribly
> >>>> confused state.
> >>>> Version? Distro? Have you checked for an update?
> >>>> Odd that the user of all that CPU isn't showing up though.
> >>>>
> >>>>>
> >>>
> >>> As I wrote I use Ubuntu trusty, the exact package versions are:
> >>>
> >>> corosync 2.3.0-1ubuntu5
> >>> pacemaker 1.1.10+git20130802-1ubuntu2
> >>
> >> Ah sorry, I seem to have missed that part.
> >>
> >>>
> >>> There are no updates available. The only option is to install from
> >>> sources,
> >>

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-11 Thread Andrew Beekhof

On 12 Mar 2014, at 1:54 am, Attila Megyeri  wrote:

>> 
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Tuesday, March 11, 2014 12:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>> 
>> 
>> On 7 Mar 2014, at 5:54 pm, Attila Megyeri 
>> wrote:
>> 
>>> Thanks for the quick response!
>>> 
>>>> -Original Message-
>>>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>>>> Sent: Friday, March 07, 2014 3:48 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>>>> 
>>>> 
>>>> On 7 Mar 2014, at 5:31 am, Attila Megyeri 
>>>> wrote:
>>>> 
>>>>> Hello,
>>>>> 
>>>>> We have a strange issue with Corosync/Pacemaker.
>>>>> From time to time, something unexpected happens and suddenly the
>>>> crm_mon output remains static.
>>>>> When I check the cpu usage, I see that one of the cores uses 100%
>>>>> cpu, but
>>>> cannot actually match it to either the corosync or one of the
>>>> pacemaker processes.
>>>>> 
>>>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>>>> I have to manually go to each node, stop pacemaker, restart
>>>>> corosync, then
>>>> start pacemaker. Stopping pacemaker and corosync does not work in most
>>>> of the cases, usually a kill -9 is needed.
>>>>> 
>>>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>>>> 
>>>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
>>>>> 
>>>>> Logs are usually flooded with CPG related messages, such as:
>>>>> 
>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>>>
>>>>> OR
>>>>>
>>>>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>>> 
>>>> That is usually a symptom of corosync getting into a horribly confused
>> state.
>>>> Version? Distro? Have you checked for an update?
>>>> Odd that the user of all that CPU isn't showing up though.
>>>> 
>>>>> 
>>> 
>>> As I wrote I use Ubuntu trusty, the exact package versions are:
>>> 
>>> corosync 2.3.0-1ubuntu5
>>> pacemaker 1.1.10+git20130802-1ubuntu2
>> 
>> Ah sorry, I seem to have missed that part.
>> 
>>> 
>>> There are no updates available. The only option is to install from sources,
>> but that would be very difficult to maintain and I'm not sure I would get 
>> rid of
>> this issue.
>>> 
>>> What do you recommend?
>> 
>> The same thing as Lars, or switch to a distro that stays current with 
>> upstream
>> (git shows 5 newer releases for that branch since it was released 3 years
>> ago).
>> If you do build from source, it's probably best to go with v1.4.6
> 
> Hm, I am a bit confused here. We are using 2.3.0,

I swapped the 2 for a 1 somehow. A bit distracted, sorry.

> which was released approx. a year ago (you mention 3 years) and you recommend 
> 1.4.6, which is a rather old version.
> Could you please clarify a bit? :)
> Lars recommends 2.3.3 git 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-11 Thread Attila Megyeri

> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Tuesday, March 11, 2014 12:48 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 7 Mar 2014, at 5:54 pm, Attila Megyeri 
> wrote:
> 
> > Thanks for the quick response!
> >
> >> -Original Message-
> >> From: Andrew Beekhof [mailto:and...@beekhof.net]
> >> Sent: Friday, March 07, 2014 3:48 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >>
> >>
> >> On 7 Mar 2014, at 5:31 am, Attila Megyeri 
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> We have a strange issue with Corosync/Pacemaker.
> >>> From time to time, something unexpected happens and suddenly the
> >> crm_mon output remains static.
> >>> When I check the cpu usage, I see that one of the cores uses 100%
> >>> cpu, but
> >> cannot actually match it to either the corosync or one of the
> >> pacemaker processes.
> >>>
> >>> In such a case, this high CPU usage is happening on all 7 nodes.
> >>> I have to manually go to each node, stop pacemaker, restart
> >>> corosync, then
> >> start pacemaker. Stopping pacemaker and corosync does not work in most
> >> of the cases, usually a kill -9 is needed.
> >>>
> >>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >>>
> >>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
> >>>
> >>> Logs are usually flooded with CPG related messages, such as:
> >>>
> >>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >>>
> >>> OR
> >>>
> >>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >>
> >> That is usually a symptom of corosync getting into a horribly confused
> state.
> >> Version? Distro? Have you checked for an update?
> >> Odd that the user of all that CPU isn't showing up though.
> >>
> >>>
> >
> > As I wrote I use Ubuntu trusty, the exact package versions are:
> >
> > corosync 2.3.0-1ubuntu5
> > pacemaker 1.1.10+git20130802-1ubuntu2
> 
> Ah sorry, I seem to have missed that part.
> 
> >
> > There are no updates available. The only option is to install from sources,
> but that would be very difficult to maintain and I'm not sure I would get rid 
> of
> this issue.
> >
> > What do you recommend?
> 
> The same thing as Lars, or switch to a distro that stays current with upstream
> (git shows 5 newer releases for that branch since it was released 3 years
> ago).
> If you do build from source, it's probably best to go with v1.4.6

Hm, I am a bit confused here. We are using 2.3.0, which was released approx. a 
year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old 
version.
Could you please clarify a bit? :)
Lars recommends 2.3.3 git tree.

I might end up trying both, but just want to make sure I am not 
misunderstanding something badly.

Thank you!








> 
> >
> >
> >>>
> >>> HTOP shows something like this (sorted by TIME+ descending):
> >>>
> >>>   1  [100.0%]     Tasks: 59, 4 thr; 2 running
> >>>   2  [| 0.7%]     Load average

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-10 Thread Andrew Beekhof

On 7 Mar 2014, at 5:54 pm, Attila Megyeri  wrote:

> Thanks for the quick response!
> 
>> -Original Message-
>> From: Andrew Beekhof [mailto:and...@beekhof.net]
>> Sent: Friday, March 07, 2014 3:48 AM
>> To: The Pacemaker cluster resource manager
>> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
>> 
>> 
>> On 7 Mar 2014, at 5:31 am, Attila Megyeri 
>> wrote:
>> 
>>> Hello,
>>> 
>>> We have a strange issue with Corosync/Pacemaker.
>>> From time to time, something unexpected happens and suddenly the
>> crm_mon output remains static.
>>> When I check the cpu usage, I see that one of the cores uses 100% cpu, but
>> cannot actually match it to either the corosync or one of the pacemaker
>> processes.
>>> 
>>> In such a case, this high CPU usage is happening on all 7 nodes.
>>> I have to manually go to each node, stop pacemaker, restart corosync, then
>> start pacemaker. Stopping pacemaker and corosync does not work in most of
>> the cases, usually a kill -9 is needed.
>>> 
>>> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>>> 
>>> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
>>> 
>>> Logs are usually flooded with CPG related messages, such as:
>>> 
>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>>>
>>> OR
>>>
>>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>>> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
>> 
>> That is usually a symptom of corosync getting into a horribly confused state.
>> Version? Distro? Have you checked for an update?
>> Odd that the user of all that CPU isn't showing up though.
>> 
>>> 
> 
> As I wrote I use Ubuntu trusty, the exact package versions are:
> 
> corosync 2.3.0-1ubuntu5
> pacemaker 1.1.10+git20130802-1ubuntu2

Ah sorry, I seem to have missed that part.

> 
> There are no updates available. The only option is to install from sources, 
> but that would be very difficult to maintain and I'm not sure I would get rid 
> of this issue.
> 
> What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream 
(git shows 5 newer releases for that branch since it was released 3 years ago).
If you do build from source, it's probably best to go with v1.4.6

> 
> 
>>> 
>>> HTOP shows something like this (sorted by TIME+ descending):
>>>
>>>   1  [100.0%]     Tasks: 59, 4 thr; 2 running
>>>   2  [| 0.7%]     Load average: 1.00 0.99 1.02
>>>   Mem[ 165/994MB] Uptime: 1 day, 10:22:03
>>>   Swp[   0/509MB]
>>>
>>>   PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
>>>   921 root        20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
>>>  1277 snmp        20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>>>  1311 hacluster   20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
>>>  1312 root        20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
>>>  1611 root        -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
>>>  1316 hacluster   20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
>>>  1313 root        20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
>

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-09 Thread Lars Marowsky-Bree
On 2014-03-07T09:08:41, Attila Megyeri  wrote:

> One more thing to add. I did an apt-get upgrade on one of the nodes, and then 
> restarted the node. It resulted in this state on all other nodes again...

2.3.0 is not the most recent corosync version. 2.3.3 (and possibly the
git tree) contain quite a number of important fixes.
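
As a rough illustration of the version comparison being discussed, here is a minimal Python sketch. It assumes Debian/Ubuntu-style package version strings like the ones quoted in this thread (`2.3.0-1ubuntu5`); the helper names `upstream_version` and `is_older` are made up for this example and are not part of corosync or any packaging tool.

```python
import re


def upstream_version(package_version):
    """Return the leading dotted upstream version as a tuple of ints.

    Assumes a Debian-style string such as '2.3.0-1ubuntu5'; an optional
    'epoch:' prefix is skipped and everything after the dotted number
    (for example '+git...' or '-1ubuntu5') is ignored.
    """
    match = re.match(r"(?:\d+:)?(\d+(?:\.\d+)*)", package_version)
    if match is None:
        raise ValueError(f"no upstream version found in {package_version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_older(installed, recommended):
    """True if the installed upstream version sorts before the recommended one."""
    return upstream_version(installed) < upstream_version(recommended)


if __name__ == "__main__":
    # Package version reported in this thread versus the release suggested here.
    print(is_older("2.3.0-1ubuntu5", "2.3.3"))
```

This only compares upstream release numbers; it says nothing about fixes a distribution may have backported into its own package revision.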

I'd suggest asking Ubuntu for an update, or submitting one yourself;
community distributions welcome contributors ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-07 Thread Attila Megyeri
One more thing to add. I did an apt-get upgrade on one of the nodes, and then 
restarted the node. It resulted in this state on all other nodes again...

> -Original Message-
> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> Sent: Friday, March 07, 2014 7:54 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> Thanks for the quick response!
> 
> > -Original Message-
> > From: Andrew Beekhof [mailto:and...@beekhof.net]
> > Sent: Friday, March 07, 2014 3:48 AM
> > To: The Pacemaker cluster resource manager
> > Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> >
> >
> > On 7 Mar 2014, at 5:31 am, Attila Megyeri 
> > wrote:
> >
> > > Hello,
> > >
> > > We have a strange issue with Corosync/Pacemaker.
> > > From time to time, something unexpected happens and suddenly the
> > crm_mon output remains static.
> > > When I check the cpu usage, I see that one of the cores uses 100%
> > > cpu, but
> > cannot actually match it to either the corosync or one of the
> > pacemaker processes.
> > >
> > > In such a case, this high CPU usage is happening on all 7 nodes.
> > > I have to manually go to each node, stop pacemaker, restart
> > > corosync, then
> > start pacemaker. Stopping pacemaker and corosync does not work in most
> > of the cases, usually a kill -9 is needed.
> > >
> > > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> > >
> > > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
> > >
> > > Logs are usually flooded with CPG related messages, such as:
> > >
> > > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > >
> > > OR
> > >
> > > Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> > > Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> > > Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> >
> > That is usually a symptom of corosync getting into a horribly confused 
> > state.
> > Version? Distro? Have you checked for an update?
> > Odd that the user of all that CPU isn't showing up though.
> >
> > >
> 
> As I wrote I use Ubuntu trusty, the exact package versions are:
> 
> corosync 2.3.0-1ubuntu5
> pacemaker 1.1.10+git20130802-1ubuntu2
> 
> There are no updates available. The only option is to install from sources, 
> but
> that would be very difficult to maintain and I'm not sure I would get rid of 
> this
> issue.
> 
> What do you recommend?
> 
> 
> > >
> > > HTOP shows something like this (sorted by TIME+ descending):
> > >
> > >   1  [100.0%]     Tasks: 59, 4 thr; 2 running
> > >   2  [| 0.7%]     Load average: 1.00 0.99 1.02
> > >   Mem[ 165/994MB] Uptime: 1 day, 10:22:03
> > >   Swp[   0/509MB]
> > >
> > >   PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
> > >   921 root        20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
> > >  1277 snmp        20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
> > >  1311 hacluster   20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
> > >  1312 root        20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
> > >  1611 root        -2   0  4408  2356  2000 S  0.0  0.2  0:24.15

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Attila Megyeri
Thanks for the quick response!

> -Original Message-
> From: Andrew Beekhof [mailto:and...@beekhof.net]
> Sent: Friday, March 07, 2014 3:48 AM
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Pacemaker/corosync freeze
> 
> 
> On 7 Mar 2014, at 5:31 am, Attila Megyeri 
> wrote:
> 
> > Hello,
> >
> > We have a strange issue with Corosync/Pacemaker.
> > From time to time, something unexpected happens and suddenly the
> crm_mon output remains static.
> > When I check the cpu usage, I see that one of the cores uses 100% cpu, but
> cannot actually match it to either the corosync or one of the pacemaker
> processes.
> >
> > In such a case, this high CPU usage is happening on all 7 nodes.
> > I have to manually go to each node, stop pacemaker, restart corosync, then
> start pacemaker. Stopping pacemaker and corosync does not work in most of
> the cases, usually a kill -9 is needed.
> >
> > Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
> >
> > Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
> >
> > Logs are usually flooded with CPG related messages, such as:
> >
> > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> > Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> >
> > OR
> >
> > Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> > Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> > Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> 
> That is usually a symptom of corosync getting into a horribly confused state.
> Version? Distro? Have you checked for an update?
> Odd that the user of all that CPU isn't showing up though.
> 
> >

As I wrote I use Ubuntu trusty, the exact package versions are:

corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

There are no updates available. The only option is to install from sources, but 
that would be very difficult to maintain and I'm not sure I would get rid of 
this issue.

What do you recommend?


> >
> > HTOP shows something like this (sorted by TIME+ descending):
> >
> >   1  [100.0%]     Tasks: 59, 4 thr; 2 running
> >   2  [| 0.7%]     Load average: 1.00 0.99 1.02
> >   Mem[ 165/994MB] Uptime: 1 day, 10:22:03
> >   Swp[   0/509MB]
> >
> >   PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
> >   921 root        20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
> >  1277 snmp        20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
> >  1311 hacluster   20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
> >  1312 root        20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
> >  1611 root        -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
> >  1316 hacluster   20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
> >  1313 root        20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
> >  1314 hacluster   20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
> >  1309 root        20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
> >  1250 root        20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
> >  1315 hacluster   20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
> >  1252 root        20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
> >  1835 ntp         20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
> >   899 root        20   0 19168   700   488 S  0.0  0.1  0:09.75

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Andrew Beekhof

On 7 Mar 2014, at 5:31 am, Attila Megyeri  wrote:

> Hello,
>  
> We have a strange issue with Corosync/Pacemaker.
> From time to time, something unexpected happens and suddenly the crm_mon 
> output remains static.
> When I check the cpu usage, I see that one of the cores uses 100% cpu, but 
> cannot actually match it to either the corosync or one of the pacemaker 
> processes.
>  
> In such a case, this high CPU usage is happening on all 7 nodes.
> I have to manually go to each node, stop pacemaker, restart corosync, then 
> start pacemaker. Stopping pacemaker and corosync does not work in most of the 
> cases, usually a kill -9 is needed.
>  
> Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
>  
> Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.
>  
> Logs are usually flooded with CPG related messages, such as:
>  
> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
> Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
>
> OR
>
> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
> Mar 06 17:46:24 [1341] ctdb1    cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (

That is usually a symptom of corosync getting into a horribly confused state.  
Version? Distro? Have you checked for an update?
Odd that the user of all that CPU isn't showing up though.
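
For anyone trying to spot this state from the logs, here is a minimal Python sketch that counts repeated "Sent 0 CPG messages" lines per sender. The regular expression is inferred only from the sample lines quoted in this thread, and the function name `stalled_senders` and its `threshold` parameter are made up for this example; it is not a Pacemaker tool and other log layouts may need a different pattern.

```python
import re
from collections import Counter

# Field layout inferred from the crm_cs_flush samples quoted in this thread.
FLUSH_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<node>\S+)\s+(?P<daemon>\w+):\s+info:\s+crm_cs_flush:\s+"
    r"Sent (?P<sent>\d+) CPG messages\s+"
    r"\((?P<remaining>\d+) remaining, last=(?P<last>\d+)\)"
)


def stalled_senders(lines, threshold=3):
    """Map (node, daemon, last) to the number of zero-progress flush attempts.

    A flush counts as zero-progress when nothing was sent while messages were
    still queued. Only senders seen at least `threshold` times are returned.
    """
    counts = Counter()
    for line in lines:
        match = FLUSH_RE.search(line)
        if match and int(match["sent"]) == 0 and int(match["remaining"]) > 0:
            counts[(match["node"], match["daemon"], int(match["last"]))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    sample = [
        "Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   "
        "Sent 0 CPG messages  (1 remaining, last=8): Try again (6)",
    ] * 4
    print(stalled_senders(sample))
```

A `last=` value that never advances while the count keeps growing matches the pattern shown in the quoted logs; the threshold of 3 is an arbitrary choice for the sketch.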

>  
>  
> HTOP shows something like this (sorted by TIME+ descending):
>
>   1  [100.0%]     Tasks: 59, 4 thr; 2 running
>   2  [| 0.7%]     Load average: 1.00 0.99 1.02
>   Mem[ 165/994MB] Uptime: 1 day, 10:22:03
>   Swp[   0/509MB]
>
>   PID USER       PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
>   921 root        20   0  188M 49220 33856 R  0.0  4.8  3h33:58 /usr/sbin/corosync
>  1277 snmp        20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd -Lsd -Lf /dev/null -u snmp -g snm
>  1311 hacluster   20   0  109M 16160  9640 S  0.0  1.6  1:12.71 /usr/lib/pacemaker/cib
>  1312 root        20   0  104M  7484  3780 S  0.0  0.7  0:38.06 /usr/lib/pacemaker/stonithd
>  1611 root        -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
>  1316 hacluster   20   0  122M  9756  5924 S  0.0  1.0  0:22.62 /usr/lib/pacemaker/crmd
>  1313 root        20   0 81784  3800  2876 S  0.0  0.4  0:18.64 /usr/lib/pacemaker/lrmd
>  1314 hacluster   20   0 96616  4132  2604 S  0.0  0.4  0:16.01 /usr/lib/pacemaker/attrd
>  1309 root        20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
>  1250 root        20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read process
>  1315 hacluster   20   0 73892  2652  1952 S  0.0  0.3  0:13.25 /usr/lib/pacemaker/pengine
>  1252 root        20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write process
>  1835 ntp         20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:112
>   899 root        20   0 19168   700   488 S  0.0  0.1  0:09.75 /usr/sbin/irqbalance
>  1642 root        20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c /etc/monit/monitrc
>  4374 kamailio    20   0  291M  7272  2188 S  0.0  0.7  0:02.77 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>  3079 root         0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a -w /var/log/atop/atop_20140306 6
>   445 syslog      20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
>  4373 kamailio    20   0  291M  7492  2396 S  0.0  0.7  0:01.03 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>     1 root        20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
>   453 syslog      20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
>   451 syslog      20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
>  4379 kamailio    20   0  291M  6224  1132 S  0.0  0.6  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>  4380 kamailio    20   0  291M  8516  3084 S  0.0  0.8  0:00.38 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
>  4381 kamailio    20   0  291M  8252  2828 S  0.0  0.8  0:00.37 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
> 23315 root        20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
> 4367 kamailio   20   0