Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot
Hi,

Your configuration is straightforward, nothing out of the ordinary. Make sure
that when your other box comes up from offline, syslog-ng is started before
corosync, because it appears that when you kill all the processes and restart,
by that time syslog-ng has started and everything comes up properly.

Your resource will migrate back because there is no reason for it to stick
there, i.e. no resource-stickiness. You might want to look into resource
stickiness, which may mean enhancing your config a little more than what you
have now. The configuration manual explains it very nicely.

There is a tool called ptest you can use to get the scores which determine
the stickiness; e.g. you can experiment with different resource-stickiness
values and then do "ptest -sL" to look at the scores. You will have to go a
bit deeper than your vanilla config to understand it, and also read the
manual.

Thanks
-Shravan

On Thu, Dec 23, 2010 at 6:12 PM, Daniel Bareiro wrote:
> On Wednesday, 22 December 2010 08:29:02 -0500,
> Shravan Mishra wrote:
>
>> Hi,
>
> Hi, Shravan.
>
>> What's happening is that corosync is forking but the exec is not
>> happening.
>
> And do you think that what is shown in the logs is consistent with what
> is shown using ps?
>
>> I used to see this problem in my case when syslog-ng process was not
>> running.
>>
>> Try checking that and starting it and then start corosync.
>
> Now I see that if I do a shutdown of the node that has the resource
> (failover-ip), then this does not migrate to another node. By doing the
> test I made sure Pacemaker + Corosync are functioning correctly on both
> nodes before doing a shutdown of Atlantis.
>
> Before making a shutdown of Atlantis:
>
> ---
> daedalus:~# crm_mon --one-shot
>
> Last updated: Thu Dec 23 19:24:09 2010
> Stack: openais
> Current DC: atlantis - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
>
> Online: [ atlantis daedalus ]
>
> failover-ip    (ocf::heartbeat:IPaddr):    Started atlantis
> ---
>
> After doing a shutdown of Atlantis:
>
> ---
> daedalus:~# crm_mon --one-shot
>
> Last updated: Thu Dec 23 19:25:44 2010
> Stack: openais
> Current DC: daedalus - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
> ---
>
> Here I'm using a configuration like the one presented in the wiki [1].
>
> I am also noting that after the Atlantis launch, corosync makes the fork
> without exec (as we assume from what I showed in the previous mail) and
> only now is when the resource migrates to Daedalus:
>
> ---
> daedalus:~# crm_mon --one-shot
>
> Last updated: Thu Dec 23 19:49:11 2010
> Stack: openais
> Current DC: daedalus - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
>
> failover-ip    (ocf::heartbeat:IPaddr):    Started daedalus
> ---
>
> ---
> atlantis:~# crm_mon --one-shot
>
> Connection to cluster failed: connection failed
> ---
>
> I tried doing a "corosync stop", but the processes are not closed:
>
> atlantis:~# ps auxf
> [...]
> root      1564  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
> root      1565  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
> root      1566  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
> root      1567  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
> root      1568  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
> root      1569  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
>
> The only way I found to correctly start corosync is doing a "pkill -9
> corosync" and "corosync start":
>
> atlantis:~# ps auxf
> [...]
> root      2120  0.2  1.9 134288  5060 ?    Ssl  19:59   0:00 /usr/sbin/corosync
> root      2128  0.0  4.5  76028 11600 ?    SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
> 105       2129  0.1  2.0  79104  5120 ?    S    19:59   0:00  \_ /usr/lib/heartbeat/cib
> root      2130  0.0  0.8  71580  2108 ?    S    19:59   0:00  \_ /usr/lib/he
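[Editorial note added in this archive: a minimal sketch of the stickiness/ptest workflow Shravan describes above. The resource-stickiness value of 100 and the use of the crm shell are illustrative assumptions, not taken from this thread; these commands are config fragments run against a live Pacemaker 1.0 cluster.]

```shell
# Sketch only: set a cluster-wide default stickiness so a running resource
# prefers to stay put when its original node comes back (100 is an example).
crm configure rsc_defaults resource-stickiness=100

# Inspect the placement scores the policy engine computes from the live CIB
# (-L = use live cluster state, -s = show allocation scores); the node with
# the highest score for a resource is where it will run.
ptest -sL
```

With stickiness 0 (the default), the score for the original node wins again as soon as it rejoins, which is why the resource migrates back.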
Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot
On Wednesday, 22 December 2010 08:29:02 -0500, Shravan Mishra wrote:

> Hi,

Hi, Shravan.

> What's happening is that corosync is forking but the exec is not
> happening.

And do you think that what is shown in the logs is consistent with what is
shown using ps?

> I used to see this problem in my case when syslog-ng process was not
> running.
>
> Try checking that and starting it and then start corosync.

Now I see that if I do a shutdown of the node that has the resource
(failover-ip), then this does not migrate to another node. By doing the test
I made sure Pacemaker + Corosync are functioning correctly on both nodes
before doing a shutdown of Atlantis.

Before making a shutdown of Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:24:09 2010
Stack: openais
Current DC: atlantis - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.

Online: [ atlantis daedalus ]

failover-ip    (ocf::heartbeat:IPaddr):    Started atlantis
---

After doing a shutdown of Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:25:44 2010
Stack: openais
Current DC: daedalus - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.

Online: [ daedalus ]
OFFLINE: [ atlantis ]
---

Here I'm using a configuration like the one presented in the wiki [1].

I am also noting that after the Atlantis launch, corosync makes the fork
without exec (as we assume from what I showed in the previous mail) and only
now is when the resource migrates to Daedalus:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:49:11 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.
Online: [ daedalus ]
OFFLINE: [ atlantis ]

failover-ip    (ocf::heartbeat:IPaddr):    Started daedalus
---

---
atlantis:~# crm_mon --one-shot

Connection to cluster failed: connection failed
---

I tried doing a "corosync stop", but the processes are not closed:

atlantis:~# ps auxf
[...]
root      1564  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
root      1565  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
root      1566  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
root      1567  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
root      1568  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync
root      1569  0.0  1.2 168144  3240 ?    S    19:38   0:00 /usr/sbin/corosync

The only way I found to correctly start corosync is doing a "pkill -9
corosync" and "corosync start":

atlantis:~# ps auxf
[...]
root      2120  0.2  1.9 134288  5060 ?    Ssl  19:59   0:00 /usr/sbin/corosync
root      2128  0.0  4.5  76028 11600 ?    SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
105       2129  0.1  2.0  79104  5120 ?    S    19:59   0:00  \_ /usr/lib/heartbeat/cib
root      2130  0.0  0.8  71580  2108 ?    S    19:59   0:00  \_ /usr/lib/heartbeat/lrmd
105       2131  0.0  1.3  79968  3340 ?    S    19:59   0:00  \_ /usr/lib/heartbeat/attrd
105       2132  0.0  1.1  80332  2892 ?    S    19:59   0:00  \_ /usr/lib/heartbeat/pengine
105       2133  0.0  1.4  86216  3764 ?    S    19:59   0:00  \_ /usr/lib/heartbeat/crmd

After this, the resource automatically migrates back to Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 20:03:18 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.

Online: [ atlantis daedalus ]

failover-ip    (ocf::heartbeat:IPaddr):    Started atlantis
---

Any idea how to fix this problem with Corosync? Why, when I do a shutdown of
Atlantis, does the resource not migrate to Daedalus?

Thanks for your reply.

Regards,
Daniel

[1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
--
Daniel Bareiro - GNU/Linux registered user #188
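[Editorial note added in this archive: the "partition WITHOUT quorum" line above is a likely clue. In a two-node cluster, the surviving node loses quorum when its peer shuts down, and with Pacemaker 1.0's default no-quorum-policy=stop it will not run resources until quorum returns. The thread itself does not confirm this diagnosis; the following is a hedged sketch of the commonly documented two-node workaround.]

```shell
# Two-node clusters can never regain quorum after losing a node, so tell
# Pacemaker to keep running resources even without quorum (the standard
# two-node setting; run against the live cluster via the crm shell):
crm configure property no-quorum-policy=ignore

# Confirm the property is now in the configuration:
crm configure show | grep no-quorum-policy
```

This would explain why failover-ip only started on daedalus once atlantis came back and the partition regained quorum.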
Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot
Hi,

What's happening is that corosync is forking but the exec is not happening.
I used to see this problem in my case when the syslog-ng process was not
running.

Try checking that and starting it, and then start corosync.

Sincerely,
Shravan

On Wed, Dec 22, 2010 at 4:43 AM, Daniel Bareiro wrote:
> Hi all!
>
> I hope this is the right group to discuss my problem.
>
> I'm beginning to test HA clusters with Debian GNU/Linux and for that I
> decided to try Pacemaker + Corosync with Debian Lenny amd64 following
> this [1] howto.
>
> Both packages were installed from the Backports repositories. But I am
> observing that if after configuration I reboot a node, it fails to join
> the cluster after the boot.
>
> This is what I see in /var/log/daemon.log:
>
> --
> Dec 19 17:13:13 atlantis corosync[1508]: [pcmk ] WARN: route_ais_message:
> Sending message to local.crmd failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]: [pcmk ] WARN: route_ais_message:
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]: [pcmk ] WARN: route_ais_message:
> Sending message to local.attrd failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]: [pcmk ] WARN: route_ais_message:
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:14 atlantis corosync[1508]: [pcmk ] WARN: route_ais_message:
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:14 atlantis corosync[1508]: [pcmk ] WARN: route_ais_message:
> Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:21 atlantis corosync[1508]: [TOTEM ] A processor failed,
> forming new configuration.
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] notice: pcmk_peer_update:
> Transitional membership event on ring 72: memb=1, new=0, lost=1
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] info: pcmk_peer_update:
> memb: atlantis 335544586
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] info: pcmk_peer_update:
> lost: daedalus 369099018
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] notice: pcmk_peer_update:
> Stable membership event on ring 72: memb=1, new=0, lost=0
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] info: pcmk_peer_update:
> MEMB: atlantis 335544586
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] info:
> ais_mark_unseen_peer_dead: Node daedalus was not seen in the previous
> transition
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] info: update_member: Node
> 369099018/daedalus is now: lost
> Dec 19 17:13:25 atlantis corosync[1508]: [pcmk ] info:
> send_member_notification: Sending membership update 72 to 0 children
> Dec 19 17:13:25 atlantis corosync[1508]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Dec 19 17:13:25 atlantis corosync[1508]: [MAIN ] Completed service
> synchronization, ready to provide service.
> --
>
> # ps auxf
> [...]
> root      1508  0.1  1.9 182624  4880 ?    Ssl  15:52   0:22 /usr/sbin/corosync
> root      1539  0.0  1.2 168144  3240 ?    S    15:52   0:00  \_ /usr/sbin/corosync
> root      1540  0.0  1.2 168144  3240 ?    S    15:52   0:00  \_ /usr/sbin/corosync
> root      1541  0.0  1.2 168144  3240 ?    S    15:52   0:00  \_ /usr/sbin/corosync
> root      1542  0.0  1.2 168144  3240 ?    S    15:52   0:00  \_ /usr/sbin/corosync
> root      1543  0.0  1.2 168144  3240 ?    S    15:52   0:00  \_ /usr/sbin/corosync
> root      1544  0.0  1.2 168144  3240 ?    S    15:52   0:00  \_ /usr/sbin/corosync
>
> From what I see in the howto, the output should be something like this:
>
> root     29980  0.0  0.8  44304  3808 ?    Ssl  20:55   0:00 /usr/sbin/corosync
> root     29986  0.0  2.4  10812 10812 ?
> SLs  20:55   0:00  \_ /usr/lib/heartbeat/stonithd
> 102      29987  0.0  0.8  13012  3804 ?    S    20:55   0:00  \_ /usr/lib/heartbeat/cib
> root     29988  0.0  0.4   5444  1800 ?    S    20:55   0:00  \_ /usr/lib/heartbeat/lrmd
> 102      29989  0.0  0.5  12364  2368 ?    S    20:55   0:00  \_ /usr/lib/heartbeat/attrd
> 102      29990  0.0  0.5   8604  2304 ?    S    20:55   0:00  \_ /usr/lib/heartbeat/pengine
> 102      29991  0.0  0.6  12648  3080 ?    S    20:55   0:00  \_ /usr/lib/heartbeat/crmd
>
> I also tried compiling Pacemaker using these [2] steps, but I get the
> same result.
>
> Thanks in advance for your reply.
>
> Regards,
> Daniel
>
> [1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
> [2] http://www.clusterlabs.org/wiki/Install#Building_from_Source
> --
> Daniel Bareiro - GNU/Linux registered user #188.598
> Proudly running Debian GNU/Linux with uptime:
> 06:39:43 up 70 days, 7:06, 10 users, load average: 0.27, 0.16, 0.10
>
> -BEGIN PGP SI
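[Editorial note added in this archive: a hedged sketch of how to act on the syslog-ng-before-corosync advice on Debian Lenny's sysv-rc init system. The runlevel (2) and sequence number (25) below are illustrative assumptions, not values from this thread.]

```shell
# On sysv-rc, start order within a runlevel follows the numeric prefix of
# the S* symlinks: syslog-ng's number should be lower than corosync's.
ls -1 /etc/rc2.d/ | grep -Ei 'syslog|corosync'

# If corosync is sequenced before syslog-ng, one way to fix it is to
# re-register corosync with a later sequence number (25 is an example;
# pick any number higher than syslog-ng's):
#   update-rc.d -f corosync remove
#   update-rc.d corosync defaults 25
```

With the order corrected, corosync should find syslog-ng already running at boot, matching the behavior Daniel sees after a manual restart.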