Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot

2010-12-24 Thread Shravan Mishra
Hi,

Your configuration is straightforward, nothing out of the ordinary.

Make sure that when your other box comes back up, syslog-ng is
started before corosync. When you kill all the processes and restart
corosync by hand, syslog-ng is already running by then, which appears to
be why everything comes up properly.
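
As a rough illustration of that ordering, here is a minimal shell sketch
that waits for syslog-ng to appear before starting corosync. The helper
name wait_for_process, the 30-second timeout and the /etc/init.d/corosync
path are assumptions for illustration; adjust them to your init setup.

```sh
#!/bin/sh
# wait_for_process NAME TIMEOUT
# Hypothetical helper: returns 0 as soon as a process named exactly NAME
# exists, or 1 if none has appeared after TIMEOUT seconds.
wait_for_process() {
    name=$1
    timeout=$2
    elapsed=0
    while ! pgrep -x "$name" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Usage (the init script path is an assumption for a Debian-style system):
#
#   if wait_for_process syslog-ng 30; then
#       /etc/init.d/corosync start
#   else
#       echo "syslog-ng is not running; not starting corosync" >&2
#   fi
```

A cleaner long-term fix is to make the boot order itself guarantee this,
but the check above is enough to confirm whether the ordering is the cause.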

Your resource migrates back because nothing makes it stick to the node
it is running on, i.e. no resource-stickiness is set.

You might want to look into how to set resource stickiness, which may mean
enhancing your config a little beyond what you have now. The configuration
manual explains it very nicely.

There is a tool called ptest that you can use to get the scores that
determine where a resource is placed. For example, you can experiment with
different resource-stickiness values and then run

ptest -sL

to look at the scores.
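
A minimal sketch of such an experiment, assuming the crm shell is
installed and that a cluster-wide default is what you want. The value 100
is arbitrary and only for illustration, and note that the first command
changes the live configuration:

```sh
# Set a default stickiness for all resources (100 is an arbitrary example value).
crm configure rsc_defaults resource-stickiness=100

# Show the allocation scores computed from the live cluster state.
ptest -sL
```

Repeat with other values and compare the scores ptest reports for
failover-ip on each node.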

You will have to go a bit deeper than your vanilla config to understand
this, and also read the manual.
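
On your other question (why failover-ip did not move when Atlantis was
shut down): the crm_mon output from Daedalus at that point says
"partition WITHOUT quorum". With two nodes, the survivor alone does not
have quorum, and Pacemaker's default no-quorum-policy is to stop resources
in a partition without quorum. A setting commonly used for two-node
clusters is sketched below. Whether it suits your setup is an assumption
you should check against the manual, since ignoring quorum without fencing
carries a split-brain risk:

```sh
# Keep running resources even when this partition has no quorum.
# Commonly used for two-node clusters; check the manual before relying on it.
crm configure property no-quorum-policy=ignore
```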


Thanks
-Shravan


On Thu, Dec 23, 2010 at 6:12 PM, Daniel Bareiro wrote:
> On Wednesday, 22 December 2010 08:29:02 -0500,
> Shravan Mishra wrote:
>
>> Hi,
>
> Hi, Shravan.
>
>> What's happening is that corosync is forking but the exec is not
>> happening.
>
> And do you think that what is shown in the logs is consistent with what
> is shown using ps?
>
>> I used to see this problem in my case when syslog-ng process was not
>> running.
>>
>> Try checking that and starting it and then start corosync.
>
> Now I see that if I do a shutdown of the node that has the resource
> (failover-ip), then this does not migrate to another node. By doing the
> test I made sure Pacemaker + Corosync are functioning correctly on both
> nodes before doing a shutdown of Atlantis.
>
> Before making a shutdown of Atlantis:
>
> ---
> daedalus:~# crm_mon --one-shot
> 
> Last updated: Thu Dec 23 19:24:09 2010
> Stack: openais
> Current DC: atlantis - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
>
> Online: [ atlantis daedalus ]
>
>  failover-ip (ocf::heartbeat:IPaddr): Started atlantis
> ---
>
> After doing a shutdown of Atlantis:
>
> ---
> daedalus:~# crm_mon --one-shot
> 
> Last updated: Thu Dec 23 19:25:44 2010
> Stack: openais
> Current DC: daedalus - partition WITHOUT quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
> ---
>
> Here I'm using a configuration like the one presented in the wiki [1].
>
> I also notice that after Atlantis boots, corosync forks without exec
> (as we assumed from what I showed in the previous mail), and only
> then does the resource migrate to Daedalus:
>
> ---
> daedalus:~# crm_mon --one-shot
> 
> Last updated: Thu Dec 23 19:49:11 2010
> Stack: openais
> Current DC: daedalus - partition with quorum
> Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
> 2 Nodes configured, 2 expected votes
> 1 Resources configured.
> 
>
> Online: [ daedalus ]
> OFFLINE: [ atlantis ]
>
>  failover-ip (ocf::heartbeat:IPaddr): Started daedalus
> ---
>
>
> ---
> atlantis:~# crm_mon --one-shot
>
> Connection to cluster failed: connection failed
> ---
>
> I tried doing a "corosync stop", but the processes are not closed:
>
> atlantis:~# ps auxf
> [...]
> root      1564  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1565  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1566  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1567  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1568  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
> root      1569  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
>
>
> The only way I found to correctly start corosync is doing a "pkill -9
> corosync" and "corosync start":
>
>
> atlantis:~# ps auxf
> [...]
> root      2120  0.2  1.9 134288  5060 ?        Ssl  19:59   0:00 /usr/sbin/corosync
> root      2128  0.0  4.5  76028 11600 ?        SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
> 105       2129  0.1  2.0  79104  5120 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/cib
> root      2130  0.0  0.8  71580  2108 ?        S    19:59   0:00  \_ /usr/lib/he

Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot

2010-12-23 Thread Daniel Bareiro
On Wednesday, 22 December 2010 08:29:02 -0500,
Shravan Mishra wrote:

> Hi,

Hi, Shravan.

> What's happening is that corosync is forking but the exec is not
> happening.

And do you think that what is shown in the logs is consistent with what
is shown using ps?

> I used to see this problem in my case when syslog-ng process was not
> running.
> 
> Try checking that and starting it and then start corosync.

Now I see that if I do a shutdown of the node that has the resource
(failover-ip), then this does not migrate to another node. By doing the
test I made sure Pacemaker + Corosync are functioning correctly on both
nodes before doing a shutdown of Atlantis.

Before making a shutdown of Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:24:09 2010
Stack: openais
Current DC: atlantis - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ atlantis daedalus ]

 failover-ip (ocf::heartbeat:IPaddr): Started atlantis
---

After doing a shutdown of Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:25:44 2010
Stack: openais
Current DC: daedalus - partition WITHOUT quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ daedalus ]
OFFLINE: [ atlantis ]
---

Here I'm using a configuration like the one presented in the wiki [1].

I also notice that after Atlantis boots, corosync forks without exec
(as we assumed from what I showed in the previous mail), and only
then does the resource migrate to Daedalus:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 19:49:11 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ daedalus ]
OFFLINE: [ atlantis ]

 failover-ip (ocf::heartbeat:IPaddr): Started daedalus
---


---
atlantis:~# crm_mon --one-shot

Connection to cluster failed: connection failed
---

I tried doing a "corosync stop", but the processes are not closed:

atlantis:~# ps auxf
[...]
root      1564  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1565  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1566  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1567  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1568  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync
root      1569  0.0  1.2 168144  3240 ?        S    19:38   0:00 /usr/sbin/corosync


The only way I found to correctly start corosync is doing a "pkill -9
corosync" and "corosync start":


atlantis:~# ps auxf
[...]
root      2120  0.2  1.9 134288  5060 ?        Ssl  19:59   0:00 /usr/sbin/corosync
root      2128  0.0  4.5  76028 11600 ?        SLs  19:59   0:00  \_ /usr/lib/heartbeat/stonithd
105       2129  0.1  2.0  79104  5120 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/cib
root      2130  0.0  0.8  71580  2108 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/lrmd
105       2131  0.0  1.3  79968  3340 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/attrd
105       2132  0.0  1.1  80332  2892 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/pengine
105       2133  0.0  1.4  86216  3764 ?        S    19:59   0:00  \_ /usr/lib/heartbeat/crmd


After this, the resource automatically migrates back to Atlantis:

---
daedalus:~# crm_mon --one-shot

Last updated: Thu Dec 23 20:03:18 2010
Stack: openais
Current DC: daedalus - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
1 Resources configured.


Online: [ atlantis daedalus ]

 failover-ip (ocf::heartbeat:IPaddr): Started atlantis
---


Any idea how to fix this problem with Corosync?

Why does the resource not migrate to Daedalus when I shut down
Atlantis?



Thanks for your reply.

Regards,
Daniel

[1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
-- 
Daniel Bareiro - GNU/Linux registered user #188

Re: [Pacemaker] Problems with Pacemaker + Corosync after reboot

2010-12-22 Thread Shravan Mishra
Hi,

What's happening is that corosync is forking but the exec is not happening.

I used to see this problem in my case when syslog-ng process was not running.

Try checking that and starting it and then start corosync.
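
To check for the "forked but no exec" state without reading the whole ps
listing, a small filter along these lines may help. It is only a sketch:
the function name find_unexeced_children is made up for illustration, and
it assumes that a corosync process whose parent is also corosync is one of
the children that never exec'd a Pacemaker daemon.

```sh
#!/bin/sh
# find_unexeced_children
# Hypothetical helper: reads "PID PPID COMMAND" lines on stdin and prints the
# PID of every corosync process whose parent is also a corosync process.
find_unexeced_children() {
    awk '
        { comm[$1] = $3; ppid[$1] = $2; order[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) {
                p = order[i]
                if (comm[p] == "corosync" && comm[ppid[p]] == "corosync")
                    print p
            }
        }
    '
}

# Usage:
#
#   ps -eo pid=,ppid=,comm= | find_unexeced_children
```

A healthy node should print nothing once startup has finished, because the
children show up as stonithd, cib, lrmd and so on. PIDs that keep being
listed point at the problem above.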

Sincerely
Shravan

On Wed, Dec 22, 2010 at 4:43 AM, Daniel Bareiro  wrote:
> Hi all!
>
> I hope this is the right group to discuss my problem.
>
> I'm beginning to test HA clusters with Debian GNU/Linux and for that I
> decided to try Pacemaker + Corosync with Debian Lenny amd64 following
> this [1] howto.
>
> Both packages were installed from the Backports repositories. But I am
> observing that if after configuration I reboot a node, it fails to join
> to the cluster after the boot.
>
> This is what I see in /var/log/daemon.log:
>
> --
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.crmd failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.attrd failed: unknown (rc=-2)
> Dec 19 17:13:13 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:14 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:14 atlantis corosync[1508]:   [pcmk  ] WARN: route_ais_message: Sending message to local.cib failed: unknown (rc=-2)
> Dec 19 17:13:21 atlantis corosync[1508]:   [TOTEM ] A processor failed, forming new configuration.
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 72: memb=1, new=0, lost=1
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: memb: atlantis 335544586
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: lost: daedalus 369099018
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 72: memb=1, new=0, lost=0
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: pcmk_peer_update: MEMB: atlantis 335544586
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node daedalus was not seen in the previous transition
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: update_member: Node 369099018/daedalus is now: lost
> Dec 19 17:13:25 atlantis corosync[1508]:   [pcmk  ] info: send_member_notification: Sending membership update 72 to 0 children
> Dec 19 17:13:25 atlantis corosync[1508]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Dec 19 17:13:25 atlantis corosync[1508]:   [MAIN  ] Completed service synchronization, ready to provide service.
> --
>
> # ps auxf
> [...]
> root      1508  0.1  1.9 182624  4880 ?        Ssl  15:52   0:22 /usr/sbin/corosync
> root      1539  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
> root      1540  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
> root      1541  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
> root      1542  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
> root      1543  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
> root      1544  0.0  1.2 168144  3240 ?        S    15:52   0:00  \_ /usr/sbin/corosync
>
>
> From what I see in the howto, the output should be something like this:
>
>
> root     29980  0.0  0.8  44304  3808 ?        Ssl  20:55   0:00 /usr/sbin/corosync
> root     29986  0.0  2.4  10812 10812 ?        SLs  20:55   0:00  \_ /usr/lib/heartbeat/stonithd
> 102      29987  0.0  0.8  13012  3804 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/cib
> root     29988  0.0  0.4   5444  1800 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/lrmd
> 102      29989  0.0  0.5  12364  2368 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/attrd
> 102      29990  0.0  0.5   8604  2304 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/pengine
> 102      29991  0.0  0.6  12648  3080 ?        S    20:55   0:00  \_ /usr/lib/heartbeat/crmd
>
>
>
> I also tried compiling Pacemaker using these [2] steps, but I get the
> same result.
>
>
>
> Thanks in advance for your reply.
>
> Regards,
> Daniel
>
> [1] http://www.clusterlabs.org/wiki/Debian_Lenny_HowTo
> [2] http://www.clusterlabs.org/wiki/Install#Building_from_Source
> --
> Daniel Bareiro - GNU/Linux registered user #188.598
> Proudly running Debian GNU/Linux with uptime:
> 06:39:43 up 70 days,  7:06, 10 users,  load average: 0.27, 0.16, 0.10
>