On Mon, 14 Apr 2014 14:40:43 +1000 Andrew Beekhof <and...@beekhof.net> wrote:
>
> On 11 Apr 2014, at 10:54 pm, Marco Felettigh <ma...@nucleus.it> wrote:
>
> > On Fri, 11 Apr 2014 17:17:57 +1000
> > Andrew Beekhof <and...@beekhof.net> wrote:
> >
> >>
> >> On 8 Apr 2014, at 8:37 pm, ma...@nucleus.it wrote:
> >>
> >>> On Tue, 8 Apr 2014 10:49:16 +1000
> >>> Andrew Beekhof <and...@beekhof.net> wrote:
> >>>
> >>>>
> >>>> On 7 Apr 2014, at 8:46 pm, ma...@nucleus.it wrote:
> >>>>
> >>>>> Hi,
> >>>>> in a production environment with 2 nodes ( nodeA , nodeB ) we
> >>>>> had a hardware failure, so we restarted nodeB.
> >>>>> After nodeB came back up we restarted corosync/pacemaker
> >>>>> on it, but for 2 days now the corosync/pacemaker stuff has
> >>>>> been looping.
> >>>>>
> >>>>> crm_mon NodeA:
> >>>>>
> >>>>> Stack: openais
> >>>>> Current DC: nodeA - partition with quorum
> >>>>> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> >>>>> 2 Nodes configured, 2 expected votes
> >>>>> 17 Resources configured.
> >>>>> ============
> >>>>>
> >>>>> Online: [ nodeA ]
> >>>>> OFFLINE: [ nodeB ]
> >>>>>
> >>>>>
> >>>>> crm_mon NodeB:
> >>>>>
> >>>>> Stack: openais
> >>>>> Current DC: NONE
> >>>>> 2 Nodes configured, 2 expected votes
> >>>>> 17 Resources configured.
> >>>>> ============
> >>>>>
> >>>>> OFFLINE: [ nodeA nodeB ]
> >>>>>
> >>>>> The loop on nodeB reports:
> >>>>> crmd: [7149]: debug: do_election_count_vote: Election 3 (owner: nodeA) lost: vote from nodeA (Age)
> >>>>>
> >>>>> Investigating, I found this message on nodeA:
> >>>>> cib: [28755]: ERROR: send_ais_message: Not connected to AIS
> >>>>>
> >>>>> This message now repeats for every operation.
> >>>>> Is it a corosync problem or a cib/pacemaker one?
> >>>>> Any suggestions on what happened?
> >>>>
> >>>> For some reason the cib can't connect to corosync anymore.
> >>>> No software got upgraded recently?
> >>>>
> >>>> Are there any logs from corosync?
> >>>> Which distro is this?
> >>>>
> >>>>> And why did the start of a cluster node crash the DC stuff? :(
> >>>>>
> >>>>>
> >>>>> Bye Marco
> >>>>>
> >>>>> _______________________________________________
> >>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> >>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>>
> >>>>> Project Home: http://www.clusterlabs.org
> >>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>>> Bugs: http://bugs.clusterlabs.org
> >>>>
> >>>
> >>> Hi,
> >>> the distro is openSUSE 11.1 and there are no updates, also
> >>> because the distro is out of maintenance.
> >>
> >> A good reason to be using SLES (or RHEL/CentOS).
> >
> > Better Gentoo ;)
> >
> >>
> >>> We are planning an upgrade, but the interesting thing is to figure
> >>> out the reason for the problem.
> >>> The log is attached, thanks for the support.
> >>
> >> There's nothing obvious in the logs. Just that as far as pacemaker
> >> could tell, corosync suddenly went away. Was the corosync process
> >> still running?
> >>
> >
> > Yes, corosync was still running.
>
> Stopping pacemaker and restarting it didn't help?
>

In the end we restarted both servers and then started the
corosync/pacemaker stuff.

Thanks for the support
Marco