On 20/09/2013, at 8:19 AM, Lists <li...@benjamindsmith.com> wrote:

> On 09/18/2013 06:49 PM, Andrew Beekhof wrote:
>> On 19/09/2013, at 8:25 AM, David Lang <da...@lang.hm> wrote:
>>
>>> What's the best way to see what it's getting stuck doing?
>> Log files.
>>
>>> Is there a good way to tell if this is a pacemaker or corosync problem (so
>>> I can drop one of the lists from the thread)?
>> Not without further information
>>
>
> We've had the same problem here, trying to get HA dns/named service working.
> Works great for a day or so, then seizes up, simple commands like
> `crm_standby -v true` timeout after 120 seconds, etc. We're testing for
> release, and keep running into issues like this. At first we suspected
> firewall issues, but even after confirmed operation and several hand-offs of
> HA services back and forth, it still dies within a day or so.
>
> We're on CentOS 6/64 with yum packages augmented from
> http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/
> with exclude=pacemaker* corosync*
>
> In order to make the log files visible, I've snipped out a time period during
> which it becomes unresponsive visible at
> http://hal.schoolpathways.com/details/
>
> I don't know the exact moment,
I do. It is right when you start seeing messages like:

Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: send_ais_text: Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)

Eventually that escalates to:

Sep 19 00:59:39 [9004] nomad.schoolpathways.com crmd: error: send_ais_text: Sending message 94 via cpg: FAILED (rc=6): Try again: Success (0)

From this we can infer that corosync has gotten horribly confused and, as a consequence, pacemaker can't talk to its peers anymore.

> this is a test cluster and not being monitored by a netmon. Any other details
> I could provide that would be useful/helpful?

Shortly before this, Corosync claims:

Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: pcmk_cpg_membership: Left[2.0] crmd.1
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node bender.schoolpathways.com[1] - corosync-cpg is now offline
Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: peer_update_callback: Client bender.schoolpathways.com/peer now has status [offline] (DC=true)

Is this true? If not, perhaps some timeouts need to be adjusted. A switch to udpu (instead of multicast) may also be helpful.

> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
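
For anyone wanting to find that moment in their own logs without reading them by eye, here is a rough sketch in Python. It only looks for the two send_ais_text messages quoted above, so the patterns are assumptions based on those two lines and may need adjusting for other Pacemaker versions. The function name find_cpg_trouble is made up for this example and is not part of any Pacemaker tooling.

```python
import re

# Patterns are derived from the two log lines quoted in this thread; other
# Pacemaker versions may word these messages differently.
RESEND = re.compile(r"send_ais_text: .*Re-sending message \(Attempt (\d+) of (\d+)\)")
FAILED = re.compile(r"send_ais_text: Sending message \d+ via cpg: FAILED")
# Syslog-style timestamp at the start of a line, e.g. "Sep 19 00:56:09".
STAMP = re.compile(r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2})")


def find_cpg_trouble(lines):
    """Return the timestamps of the first re-send and the first cpg failure.

    Each element of the returned pair is None if no matching line was seen.
    """
    first_resend = None
    first_failure = None
    for line in lines:
        stamp = STAMP.match(line)
        if stamp is None:
            continue  # skip lines without a leading timestamp
        if first_resend is None and RESEND.search(line):
            first_resend = stamp.group(1)
        if first_failure is None and FAILED.search(line):
            first_failure = stamp.group(1)
    return first_resend, first_failure


# The sample lines below are copied from the log excerpts in this thread.
SAMPLE = [
    "Sep 19 00:47:07 corosync [TOTEM ] A processor joined or left the "
    "membership and a new membership was formed.",
    "Sep 19 00:56:09 [9004] nomad.schoolpathways.com crmd: info: send_ais_text: "
    "Peer overloaded or membership in flux: Re-sending message (Attempt 1 of 20)",
    "Sep 19 00:59:39 [9004] nomad.schoolpathways.com crmd: error: send_ais_text: "
    "Sending message 94 via cpg: FAILED (rc=6): Try again: Success (0)",
]

print(find_cpg_trouble(SAMPLE))
```

Since an open file yields lines when iterated, the same function can be given an open log file instead of the sample list. The first timestamp marks where the re-sends begin; the corosync membership messages just before it are the place to start reading.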