On 20/09/2013, at 8:19 AM, Lists <li...@benjamindsmith.com> wrote:

> On 09/18/2013 06:49 PM, Andrew Beekhof wrote:
>> On 19/09/2013, at 8:25 AM, David Lang <da...@lang.hm> wrote:
>> 
>>> What's the best way to see what it's getting stuck doing?
>> Log files.
>> 
>>> Is there a good way to tell if this is a pacemaker or corosync problem (so 
>>> I can drop one of the lists from the thread)?
>> Not without further information
>> 
> 
> We've had the same problem here, trying to get an HA dns/named service working.
> It works great for a day or so, then seizes up: simple commands like
> `crm_standby -v true` time out after 120 seconds, etc. We're testing for
> release, and keep running into issues like this. At first we suspected
> firewall issues, but even after confirmed operation and several hand-offs of
> HA services back and forth, it still dies within a day or so.
> 
> We're on CentOS 6/64 with yum packages augmented from 
> http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/
> with exclude=pacemaker* corosync*
> 
> To make the log files available, I've snipped out a time period during
> which it becomes unresponsive. The excerpt is visible at
> http://hal.schoolpathways.com/details/
> 
> I don't know the exact moment,


I do.

It is right when you start seeing messages like:
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: send_ais_text:    Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)

Eventually that escalates to:
Sep 19 00:59:39 [9004] nomad.schoolpathways.com       crmd:    error: send_ais_text:    Sending message 94 via cpg: FAILED (rc=6): Try again: Success (0)

From this we can infer that corosync has gotten horribly confused and, as a 
consequence, pacemaker can't talk to its peers anymore.
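
For anyone who wants to locate the same moment in their own logs, here is a minimal Python sketch that scans for the two send_ais_text messages quoted above and reports the first timestamp of each. The function name find_cpg_trouble and the regular expressions are illustrative assumptions based only on the two log lines shown here; other Pacemaker versions may word these messages differently.

```python
import re

# Patterns modelled on the two crmd log lines quoted above (an assumption:
# other versions may format these messages differently).
RESEND = re.compile(
    r"send_ais_text:\s+Peer overloaded or membership in flux: "
    r"Re-sending message \(Attempt (\d+) of (\d+)\)"
)
FAILED = re.compile(
    r"send_ais_text:\s+Sending message (\d+) via cpg: FAILED \(rc=(-?\d+)\)"
)
# Syslog-style timestamp at the start of a line, e.g. "Sep 19 00:56:09".
STAMP = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)")


def find_cpg_trouble(lines):
    """Return (first_resend_stamp, first_failure_stamp); either may be None."""
    first_resend = None
    first_failure = None
    for line in lines:
        stamp_match = STAMP.match(line)
        if not stamp_match:
            continue  # skip continuation lines or anything without a timestamp
        stamp = stamp_match.group(1)
        if first_resend is None and RESEND.search(line):
            first_resend = stamp
        if first_failure is None and FAILED.search(line):
            first_failure = stamp
        if first_resend is not None and first_failure is not None:
            break  # both moments found, no need to read further
    return first_resend, first_failure
```

To use it, pass it an open log file (any iterable of lines works); the first value is roughly where the cluster starts to wedge, the second is where retries give up.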

> this is a test cluster and not being monitored by a netmon. Any other details 
> I could provide that would be useful/helpful?

Shortly before this, Corosync claims:

Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: pcmk_cpg_membership:      Left[2.0] crmd.1
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: crm_update_peer_proc:     pcmk_cpg_membership: Node bender.schoolpathways.com[1] - corosync-cpg is now offline
Sep 19 00:56:09 [9004] nomad.schoolpathways.com       crmd:     info: peer_update_callback:     Client bender.schoolpathways.com/peer now has status [offline] (DC=true)

Is this true, i.e. did bender really drop out of the membership at that point?
If not, perhaps some timeouts need to be adjusted.  A switch to udpu (instead 
of multicast) may also be helpful.
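
As a rough illustration of both suggestions, a totem section along these lines could be tried in corosync.conf. This is only a sketch: it assumes a corosync 1.x release recent enough to support udpu, the addresses are placeholders from the documentation range rather than this cluster's real ones, and the token value is an arbitrary example, not a tuned recommendation.

```
totem {
    version: 2

    # Token timeout in milliseconds. Example value only; raise it if
    # nodes are being declared lost while they are in fact still alive.
    token: 5000

    # Unicast UDP instead of multicast.
    transport: udpu

    interface {
        ringnumber: 0
        # Placeholder network and port; substitute your own.
        bindnetaddr: 192.0.2.0
        mcastport: 5405

        # With udpu every cluster node is listed explicitly.
        member {
            memberaddr: 192.0.2.10
        }
        member {
            memberaddr: 192.0.2.11
        }
    }
}
```

The same file would need to go on every node, with corosync restarted afterwards.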

> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
