Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot

2010-12-23 Thread Daniel Bareiro
On Wednesday, 22 December 2010 08:29:02 -0500,
Shravan Mishra wrote:

 Hi,

Hi, Shravan.

 What's happening is that corosync is forking but the exec is not
 happening.

And do you think that what is shown in the logs is consistent with what
is shown using ps?

 I used to see this problem in my case when syslog-ng process was not
 running.
 
 Try checking that and starting it and then start corosync.

Now I see that if I shut down the node that holds the resource
(failover-ip), the resource does not migrate to the other node. Before
running the test I made sure that Pacemaker + Corosync were working
correctly on both nodes.

Before shutting down Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:24:09 2010
Stack: openais
Current DC: atlantis - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ atlantis daedalus ]

 failover-ip (ocf::heartbeat:IPaddr): Started atlantis
---

After shutting down Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:25:44 2010
Stack: openais
Current DC: daedalus - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ daedalus ]
OFFLINE: [ atlantis ]
---
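
One visible difference between the two outputs above is the "Current DC"
line: after the shutdown, daedalus reports "partition WITHOUT quorum".
With 2 expected votes, a single surviving node holds only one of them.
Pacemaker's no-quorum-policy property defaults to "stop", so this may be
why failover-ip is not started on daedalus. That is an assumption to check
against the actual configuration, not a confirmed diagnosis. A minimal
sketch for pulling the quorum state out of the crm_mon output, assuming a
POSIX shell (the function name quorum_state is made up for illustration):

```shell
# quorum_state: read `crm_mon --one-shot` output on stdin and print the
# word found between "partition" and "quorum" on the "Current DC" line,
# i.e. "with" or "WITHOUT". Prints nothing if no such line is present.
# (Hypothetical helper, not part of Pacemaker.)
quorum_state() {
    sed -n 's/^Current DC: .* - partition \([A-Za-z]*\) quorum.*/\1/p'
}

# Possible usage on a cluster node:
#   crm_mon --one-shot | quorum_state
```

If the quorum state does turn out to be the cause, two-node setups
commonly set "crm configure property no-quorum-policy=ignore"; whether
that is appropriate here depends on how fencing is configured.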

Here I'm using a configuration like the one presented in the wiki [1].

I also notice that after Atlantis boots, corosync forks without doing the
exec (as we assumed from what I showed in the previous mail), and only
then does the resource migrate to Daedalus:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:49:11 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ daedalus ]
OFFLINE: [ atlantis ]

 failover-ip (ocf::heartbeat:IPaddr): Started daedalus
---


---
atlantis:~# crm_mon --one-shot

Connection to cluster failed: connection failed
---

I tried stopping corosync, but the processes do not exit:

atlantis:~# ps auxf
[...]
root      1564  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1565  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1566  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1567  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1568  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1569  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync


The only way I found to get corosync to start correctly is to run
pkill -9 corosync and then start corosync again:


atlantis:~# ps auxf
[...]
root      2120  0.2  1.9 134288  5060 ?        Ssl  19:59   0:00 /usr/sbin/corosync
root      2128  0.0  4.5  76028 11600 ?        SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
105       2129  0.1  2.0  79104  5120 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/cib
root      2130  0.0  0.8  71580  2108 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/lrmd
105       2131  0.0  1.3  79968  3340 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/attrd
105       2132  0.0  1.1  80332  2892 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/pengine
105       2133  0.0  1.4  86216  3764 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/crmd
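
The two ps listings above differ in what runs alongside corosync: in the
broken state every extra process is still /usr/sbin/corosync, while in the
working state the children are the Pacemaker daemons under
/usr/lib/heartbeat. A minimal sketch for checking this from a script,
assuming a POSIX shell and the daemon paths shown in this thread (the
function name missing_daemons is made up for illustration):

```shell
# missing_daemons: read full command lines (one per line) on stdin and
# print the name of each expected Pacemaker child daemon that does not
# appear. The daemon list and the /usr/lib/heartbeat prefix are copied
# from the ps output in this thread and may differ on other systems.
# (Hypothetical helper, not part of Pacemaker or corosync.)
missing_daemons() {
    procs=$(cat)
    for d in stonithd cib lrmd attrd pengine crmd; do
        # Match the path followed by a space or end of line, so that
        # e.g. "cib" does not match a longer name such as "cibadmin".
        printf '%s\n' "$procs" |
            grep -Eq "/usr/lib/heartbeat/$d( |\$)" ||
            printf '%s\n' "$d"
    done
}

# Possible usage on a cluster node:
#   ps -e -o args= | missing_daemons
```

Empty output means all six daemons were found; in the fork-without-exec
state described here, all six names would be printed.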


After this, the resource automatically migrates back to Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 20:03:18 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ atlantis daedalus ]

 failover-ip (ocf::heartbeat:IPaddr): Started atlantis
---


Any idea how to fix this problem with Corosync?

Why does the resource not migrate to Daedalus when I shut down
Atlantis?



Thanks for your reply.

Regards,
Daniel

[1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
-- 
Daniel Bareiro - GNU/Linux registered user #188.598
Proudly 

Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot

2010-12-22 Thread Shravan Mishra
Hi,

What's happening is that corosync is forking but the exec is not happening.

I used to see this problem in my case when syslog-ng process was not running.

Try checking that and starting it and then start corosync.
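
The check suggested above can be scripted. A minimal sketch, assuming a
POSIX shell and that the logger shows up in the process list under the
name syslog-ng (the function name proc_listed is made up for
illustration):

```shell
# proc_listed NAME: read one command name per line on stdin and succeed
# (exit status 0) only if NAME is exactly one of them.
# (Hypothetical helper, not part of corosync or syslog-ng.)
proc_listed() {
    grep -qxF -- "$1"
}

# Possible usage before starting corosync:
#   ps -e -o comm= | proc_listed syslog-ng ||
#       echo "syslog-ng is not running" >&2
```

The exact-line match keeps a process with a longer name from being
mistaken for syslog-ng itself.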

Sincerely
Shravan

On Wed, Dec 22, 2010 at 4:43 AM, Daniel Bareiro daniel-lis...@gmx.net wrote:
 Hi all!

 I hope this is the right group to discuss my problem.

 I'm beginning to test HA clusters with Debian GNU/Linux and for that I
 decided to try Pacemaker + Corosync with Debian Lenny amd64 following
 this [1] howto.

 Both packages were installed from the Backports repositories. However,
 if I reboot a node after configuring it, the node fails to rejoin the
 cluster after booting.

 This is what I see in /var/log/daemon.log:

 --
 Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
 Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
 Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.attrd failed: unknown (rc=-2)
 Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
 Dec 19 17:13:14 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
 Dec 19 17:13:14 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
 Dec 19 17:13:21 atlantis corosync[1508]:   [TOTEM ] A processor failed, forming new configuration.
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 72: memb=1, new=0, lost=1
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: memb: atlantis 335544586
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: lost: daedalus 369099018
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 72: memb=1, new=0, lost=0
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: MEMB: atlantis 335544586
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node daedalus was not seen in the previous transition
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: update_member: Node 369099018/daedalus is now: lost
 Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: send_member_notification: Sending membership update 72 to 0 children
 Dec 19 17:13:25 atlantis corosync[1508]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
 Dec 19 17:13:25 atlantis corosync[1508]:   [MAIN  ] Completed service synchronization, ready to provide service.
 --
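
The route_ais_message warnings in the log above name the local daemons
that corosync could not reach (crmd, cib and attrd). A minimal sketch for
tallying them per daemon, assuming a POSIX shell and the message wording
shown in this log (the function name count_failed_targets is made up for
illustration):

```shell
# count_failed_targets: read log text on stdin and print one line per
# daemon named in a "Sending message to local.<name> failed" warning,
# in the form "<name> <count>", sorted by name.
# (Hypothetical helper; the pattern is taken from the log in this thread.)
count_failed_targets() {
    sed -n 's/.*Sending message to local\.\([a-z]*\) failed.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible usage on the affected node:
#   count_failed_targets < /var/log/daemon.log
```

Because the pattern only looks at the "Sending message to ..." part, it
works whether or not the log lines have been wrapped by a mail client.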

 # ps auxf
 [...]
 root      1508  0.1  1.9 182624  4880 ?        Ssl  15:52   0:22 /usr/sbin/corosync
 root      1539  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
 root      1540  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
 root      1541  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
 root      1542  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
 root      1543  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
 root      1544  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync


 From what I see in the howto, the output should be something like this:


 root     29980  0.0  0.8  44304  3808 ?        Ssl  20:55   0:00 /usr/sbin/corosync
 root     29986  0.0  2.4  10812 10812 ?        SLs  20:55   0:00  \_ /usr/lib/heartbeat/stonithd
 102      29987  0.0  0.8  13012  3804 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/cib
 root     29988  0.0  0.4   5444  1800 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/lrmd
 102      29989  0.0  0.5  12364  2368 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/attrd
 102      29990  0.0  0.5   8604  2304 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/pengine
 102      29991  0.0  0.6  12648  3080 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/crmd



 I also tried compiling Pacemaker using these [2] steps, but I get the
 same result.



 Thanks in advance for your reply.

 Regards,
 Daniel

 [1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
 [2] http://www.clusterlabs.org/wiki/Install#Building_from_Source
 --
 Daniel Bareiro - GNU/Linux registered user #188.598
 Proudly running Debian GNU/Linux with uptime:
 06:39:43 up 70 days,  7:06, 10 users,  load average: 0.27, 0.16, 0.10
