On 05/24/2017 08:04 AM, Attila Megyeri wrote:
> Hi Klaus,
>
> Thank you for your response.
> I tried many things, but no luck.
>
> We have many pacemaker clusters with 99% identical configurations, package
> versions, and only this one causes issues. (BTW we use unicast for corosync,
> but this is the same for our other clusters as well.)
> I checked all connection settings between the nodes (to confirm there are no
> firewall issues), increased the number of cores on each node, but still - as
> long as a monitor operation is pending for a resource, no other operation is
> executed.
>
> e.g. resource A is being monitored, and timeout is 90 seconds, until this
> check times out I cannot do a cleanup or start/stop on any other resource.

Do you have any constraints configured? If B depends on A, you probably
want at least an ordering constraint. Then the cluster would stop B
before stopping A, and would not try to start B until A is up again.

Throttling based on load wasn't added until Pacemaker 1.1.11, so the
only limit on parallel execution in 1.1.10 was batch-limit, which
defaulted to 30 at the time.

I'd investigate by figuring out which node was DC at the time and
checking its pacemaker log (preferably with PCMK_debug=crmd turned on).
You can see each run of the policy engine and what decisions were made,
ending with a message like "saving inputs in
/var/lib/pacemaker/pengine/pe-input-4940.bz2". You can run crm_simulate
on that file to get more information about the decision-making process.

"crm_simulate -Sx $FILE -D transition.dot" will create a dot graph of
the transition showing dependencies. You can convert the graph to an svg
with "dot transition.dot -Tsvg > transition.svg" and then look at that
file in any SVG viewer (including most browsers).

> Two more interesting things:
> - cluster recheck is set to 2 minutes, and even though the resources are
> running properly, the fail counters are not reduced and crm_mon lists the
> resources in the failed actions section forever, or until I manually do a
> resource cleanup.
> - If I execute a crm resource cleanup RES_name from another node, sometimes
> it simply does not clean up the failed state. If I execute this from the node
> where the resource IS actually running, the resource is removed from the
> failed actions.
>
>
> What do you recommend, how could I start troubleshooting these issues? As I
> said, this setup works fine in several other systems, but here I am
> really, really stuck.
>
>
> thanks!
>
> Attila
>
>> -----Original Message-----
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: Wednesday, May 10, 2017 2:04 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
>>
>> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
>>>
>>> Actually I found some more details:
>>>
>>> there are two resources: A and B
>>>
>>> resource B depends on resource A (when the RA monitors B, it will fail
>>> if A is not running properly)
>>>
>>> If I stop resource A, the next monitor operation of "B" will fail.
>>> Interestingly, this check happens immediately after A is stopped.
>>>
>>> B is configured to restart if monitor fails. Start timeout is rather
>>> long, 180 seconds. So pacemaker tries to restart B, and waits.
>>>
>>> If I want to start "A", nothing happens until the start operation of
>>> "B" fails - typically several minutes.
>>>
>>> Is this the right behavior?
>>>
>>> It appears that pacemaker is blocked until resource B is being
>>> started, and I cannot really start its dependency...
>>>
>>> Shouldn't it be possible to start a resource while another resource is
>>> also starting?
>>>
>>
>> As long as resources don't depend on each other parallel starting should
>> work/happen.
>>
>> The number of parallel actions executed is derived from the number of
>> cores and when load is detected some kind of throttling kicks in (in fact
>> reduction of the operations executed in parallel with the aim to reduce
>> the load induced by pacemaker). When throttling kicks in you should get
>> log messages (there is in fact a parallel discussion going on ...).
>> No idea if throttling might be a reason here but maybe worth considering
>> at least.
>>
>> Another reason why certain things happen with quite some delay I've
>> observed is that obviously some situations are just resolved when the
>> cluster-recheck-interval triggers a pengine run in addition to those
>> triggered by changes.
>> You might easily verify this by changing the cluster-recheck-interval.
>>
>> Regards,
>> Klaus
>>
>>>
>>> Thanks,
>>>
>>> Attila
>>>
>>> From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
>>> Sent: Tuesday, May 9, 2017 9:53 PM
>>> To: users@clusterlabs.org; kgail...@redhat.com
>>> Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond
>>>
>>> Hi Ken, all,
>>>
>>> We ran into an issue very similar to the one described in
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
>>> Pacemaker occasionally takes minutes to respond
>>>
>>> But in our case we are not using fencing/stonith at all.
>>>
>>> Many times when I want to start/stop/cleanup a resource, it takes tens
>>> of seconds (or even minutes) till the command gets executed. The logs
>>> show nothing in that period, the redundant rings show no fault.
>>>
>>> Could this be the same issue?
>>>
>>> Any hints on how to troubleshoot this?
>>>
>>> It is pacemaker 1.1.10, corosync 2.3.3
>>>
>>> Cheers,
>>>
>>> Attila
>>>
>>> _______________________________________________
>>> Users mailing list: Users@clusterlabs.org
>>> http://lists.clusterlabs.org/mailman/listinfo/users
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> --
>> Klaus Wenninger
>>
>> Senior Software Engineer, EMEA ENG Openstack Infrastructure
>>
>> Red Hat
>>
>> kwenn...@redhat.com
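
For reference, the investigation steps suggested in the reply (replaying a
saved policy-engine input with crm_simulate and rendering the transition graph
with dot) could be wrapped in a small shell helper along these lines. This is
only a sketch: the function names svg_name and render_transition and the
default dot file name are made up for illustration, and it assumes
crm_simulate and Graphviz dot are installed on the node where it is run.

```shell
#!/bin/sh
# Sketch only: function names and defaults are illustrative assumptions.

# Derive an SVG file name from a saved policy-engine input file,
# e.g. /var/lib/pacemaker/pengine/pe-input-4940.bz2 -> pe-input-4940.svg
svg_name() {
    base=$(basename "$1")
    printf '%s.svg\n' "${base%.bz2}"
}

# Replay a saved transition and render its dependency graph as SVG.
# $1: pe-input file, $2: optional name for the intermediate dot file.
render_transition() {
    input=$1
    dot_file=${2:-transition.dot}
    svg_file=$(svg_name "$input")
    # -S simulates the transition, -x reads the saved input,
    # -D saves the transition graph in dot format.
    crm_simulate -Sx "$input" -D "$dot_file" || return 1
    # Convert the dot graph to SVG for viewing in a browser.
    dot "$dot_file" -Tsvg > "$svg_file" || return 1
    printf '%s\n' "$svg_file"
}

# Example, using a file name taken from the DC's pacemaker log:
# render_transition /var/lib/pacemaker/pengine/pe-input-4940.bz2
```

Running this against the input file named in the log message closest to the
delay should show which actions the cluster considered dependent on each other.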