Re: [Pacemaker] Node fails to rejoin cluster
We are experiencing the same issue. Did the build from the latest source resolve it? Thanks for letting us know.

Ales

On Thu, Feb 7, 2013 at 10:05 PM, Tal Yalon wrote:
> Thanks Andrew for all your help, will do!
> On Feb 8, 2013 3:00 AM, "Andrew Beekhof" wrote:
>
>> On Thu, Feb 7, 2013 at 7:06 PM, Tal Yalon wrote:
>> > Thanks for replying Andrew.
>> >
>> > Here's the other node's log (the one that fenced the non-responsive node) -
>> > please let me know if there's any other information that may help. It's a
>> > bit long, but it captures the moment node-1 finds out that node-2 is
>> > non-responsive, then fences it and then gets stuck in an endless election
>> > loop.
>>
>> This log:
>>
>> Feb 6 01:40:59 node-1 crmd[22715]: info: join_make_offer: Peer process on node-2 is not active (yet?): 0001 2
>>
>> Suggests it's a bug that got fixed recently. Keep an eye out for
>> 1.1.9 in the next week or so (or you could try building from source if
>> you're in a hurry).

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] Node fails to rejoin cluster
On Thu, Feb 14, 2013 at 9:34 PM, Proskurin Kirill wrote:
> On 02/08/2013 04:59 AM, Andrew Beekhof wrote:
>
>> Suggests it's a bug that got fixed recently. Keep an eye out for
>> 1.1.9 in the next week or so (or you could try building from source if
>> you're in a hurry).
>
> Will 1.1.9 be CentOS 5.x friendly?

Yep
Re: [Pacemaker] Node fails to rejoin cluster
On 02/08/2013 04:59 AM, Andrew Beekhof wrote:
> Suggests it's a bug that got fixed recently. Keep an eye out for
> 1.1.9 in the next week or so (or you could try building from source if
> you're in a hurry).

Will 1.1.9 be CentOS 5.x friendly?

--
Best regards,
Proskurin Kirill
Re: [Pacemaker] Node fails to rejoin cluster
Thanks Andrew for all your help, will do!

On Feb 8, 2013 3:00 AM, "Andrew Beekhof" wrote:
> On Thu, Feb 7, 2013 at 7:06 PM, Tal Yalon wrote:
> > Thanks for replying Andrew.
> >
> > Here's the other node's log (the one that fenced the non-responsive node) -
> > please let me know if there's any other information that may help. It's a
> > bit long, but it captures the moment node-1 finds out that node-2 is
> > non-responsive, then fences it and then gets stuck in an endless election
> > loop.
>
> This log:
>
> Feb 6 01:40:59 node-1 crmd[22715]: info: join_make_offer: Peer process on node-2 is not active (yet?): 0001 2
>
> Suggests it's a bug that got fixed recently. Keep an eye out for
> 1.1.9 in the next week or so (or you could try building from source if
> you're in a hurry).
Re: [Pacemaker] Node fails to rejoin cluster
On Thu, Feb 7, 2013 at 7:06 PM, Tal Yalon wrote:
> Thanks for replying Andrew.
>
> Here's the other node's log (the one that fenced the non-responsive node) -
> please let me know if there's any other information that may help. It's a
> bit long, but it captures the moment node-1 finds out that node-2 is
> non-responsive, then fences it and then gets stuck in an endless election
> loop.

This log:

Feb 6 01:40:59 node-1 crmd[22715]: info: join_make_offer: Peer process on node-2 is not active (yet?): 0001 2

Suggests it's a bug that got fixed recently. Keep an eye out for 1.1.9 in the next week or so (or you could try building from source if you're in a hurry).
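
For anyone checking whether their own logs show the same symptom, here is a minimal Python sketch (not part of Pacemaker) that counts how often the join_make_offer message quoted above is logged for each peer. The regular expression, the function name inactive_peers and the sample list are illustrative assumptions; both sample lines are copied from logs posted in this thread.

```python
import re

# Matches the crmd message quoted in this thread. The pattern is an
# illustrative assumption, not something shipped with Pacemaker.
JOIN_OFFER_RE = re.compile(
    r"crmd\[\d+\]:\s+info:\s+join_make_offer:\s+"
    r"Peer\s+process\s+on\s+(\S+)\s+is\s+not\s+active"
)


def inactive_peers(lines):
    """Return a dict mapping peer name to the number of matching log lines."""
    counts = {}
    for line in lines:
        match = JOIN_OFFER_RE.search(line)
        if match:
            peer = match.group(1)
            counts[peer] = counts.get(peer, 0) + 1
    return counts


# Two lines taken from the logs in this thread: one matches, one does not.
sample = [
    "Feb 6 01:40:59 node-1 crmd[22715]: info: join_make_offer: "
    "Peer process on node-2 is not active (yet?): 0001 2",
    "Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Starting mainloop",
]

print(inactive_peers(sample))
```

To scan a real log, pass an open file object instead of the sample list; a count that keeps growing for the same peer would be consistent with the endless election loop described above, though it does not by itself confirm the bug.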
Re: [Pacemaker] Node fails to rejoin cluster
On Wed, Feb 6, 2013 at 9:11 PM, Tal Yalon wrote:
> Hi all,
>
> I have a 2-node cluster, where node-2 got fenced and now after reboot tries
> to rejoin the cluster but fails and gets stuck in a loop for hours and never
> joins back.
>
> After another reboot it managed to join, and there was no time difference
> between the nodes.
>
> Below is corosync/pacemaker log of node-2 (the one that was stuck in the
> loop).

Unfortunately we need the other one.

> Any help would be appreciated, since I have no clue as to what
> happened.
>
> Thanks,
> Tal
>
> Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
> Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
> Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
> Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] Initializing transport (UDP/IP Unicast).
> Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] The network interface [9.151.142.20] is now up.
> Feb 6 01:39:32 node-2 corosync[27428]: [pcmk ] Logging: Initialized pcmk_startup
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync configuration service
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync profile loading service
> Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
> Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
> Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 6 01:39:32 node-2 corosync[27428]: [CPG ] chosen downlist: sender r(0) ip(9.151.142.20) ; members(old:0 left:0)
> Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 6 01:39:37 node-2 pacemakerd[27466]: notice: main: Starting Pacemaker 1.1.7-6.el6 (Build: 148fccfd5985c5590cc601123c6c16e966b85d14): generated-manpages agent-manpages ascii-docs publican-docs ncurses trace-logging libqb corosync-plugin cman
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Maximum core file size is: 18446744073709551615
> Feb 6 01:39:37 node-2 pacemakerd[27466]: notice: update_node_processes: 0xb31fe0 Node 2 now known as node-2, was:
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27470 for process cib
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27471 for process stonith-ng
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27472 for process lrmd
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27473 for process attrd
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27474 for process pengine
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27475 for process crmd
> Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Starting mainloop
> Feb 6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added signal handler for signal 15
> Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
> Feb 6 01:39:37 node-2 stonith-ng[27471]: info: get_cluster_type: Cluster type is: 'openais'
> Feb 6 01:39:37 node-2 stonith-ng[27471]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
> Feb 6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
> Feb 6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_classic: AIS connection established
> Feb 6 01:39:37 node-2 stonith-ng[27471]: info: get_ais_nodeid: Server details: id=2 uname=node-2 cname=pcmk
> Feb 6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
>
[Pacemaker] Node fails to rejoin cluster
Hi all,

I have a 2-node cluster, where node-2 got fenced and now after reboot tries to rejoin the cluster but fails and gets stuck in a loop for hours and never joins back.

After another reboot it managed to join, and there was no time difference between the nodes.

Below is corosync/pacemaker log of node-2 (the one that was stuck in the loop). Any help would be appreciated, since I have no clue as to what happened.

Thanks,
Tal

Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Successfully read main configuration file '/etc/corosync/corosync.conf'.
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] Initializing transport (UDP/IP Unicast).
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] The network interface [9.151.142.20] is now up.
Feb 6 01:39:32 node-2 corosync[27428]: [pcmk ] Logging: Initialized pcmk_startup
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: Pacemaker Cluster Manager 1.1.6
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync configuration service
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync profile loading service
Feb 6 01:39:32 node-2 corosync[27428]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
Feb 6 01:39:32 node-2 corosync[27428]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Feb 6 01:39:32 node-2 corosync[27428]: [CPG ] chosen downlist: sender r(0) ip(9.151.142.20) ; members(old:0 left:0)
Feb 6 01:39:32 node-2 corosync[27428]: [MAIN ] Completed service synchronization, ready to provide service.
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Feb 6 01:39:37 node-2 pacemakerd[27466]: notice: main: Starting Pacemaker 1.1.7-6.el6 (Build: 148fccfd5985c5590cc601123c6c16e966b85d14): generated-manpages agent-manpages ascii-docs publican-docs ncurses trace-logging libqb corosync-plugin cman
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Maximum core file size is: 18446744073709551615
Feb 6 01:39:37 node-2 pacemakerd[27466]: notice: update_node_processes: 0xb31fe0 Node 2 now known as node-2, was:
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27470 for process cib
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27471 for process stonith-ng
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27472 for process lrmd
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27473 for process attrd
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27474 for process pengine
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: start_child: Forked child 27475 for process crmd
Feb 6 01:39:37 node-2 pacemakerd[27466]: info: main: Starting mainloop
Feb 6 01:39:37 node-2 lrmd: [27472]: info: G_main_add_SignalHandler: Added signal handler for signal 15
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_log_init_worker: Changed active directory to /var/lib/heartbeat/cores/root
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: get_cluster_type: Cluster type is: 'openais'
Feb 6 01:39:37 node-2 stonith-ng[27471]: notice: crm_cluster_connect: Connecting to cluster infrastructure: classic openais (with plugin)
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_classic: Creating connection to our Corosync plugin
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_classic: AIS connection established
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: get_ais_nodeid: Server details: id=2 uname=node-2 cname=pcmk
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: init_ais_connection_once: Connection to 'classic openais (with plugin)': established
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_new_peer: Node node-2 now has id: 2
Feb 6 01:39:37 node-2 stonith-ng[27471]: info: crm_new_peer: Node 2 is now known as node-2
Feb 6 01:39:37 node-2 crmd[27475]: info: crm_log_init_worker: Changed active