Hi Klaus,

Thank you for your response. I tried many things, but no luck.

We have many pacemaker clusters with 99% identical configurations and package versions, and only this one causes issues. (BTW we use unicast for corosync, but this is the same for our other clusters as well.)

I checked all connection settings between the nodes (to confirm there are no firewall issues) and increased the number of cores on each node, but still: as long as a monitor operation is pending for a resource, no other operation is executed. E.g. resource A is being monitored with a timeout of 90 seconds; until this check times out, I cannot do a cleanup or start/stop on any other resource.

Two more interesting things:

- Cluster recheck is set to 2 minutes, and even though the resources are running properly, the fail counters are not reduced and crm_mon lists the resources in the failed actions section forever, or until I manually do a resource cleanup.
- If I execute a crm resource cleanup RES_name from another node, sometimes it simply does not clean up the failed state. If I execute it from the node where the resource IS actually running, the resource is removed from the failed actions.

What do you recommend? How could I start troubleshooting these issues?

As I said, this setup works fine in several other systems, but here I am really, really stuck.

Thanks!
Attila

> -----Original Message-----
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Wednesday, May 10, 2017 2:04 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
>
> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> > Actually I found some more details:
> >
> > there are two resources: A and B
> >
> > resource B depends on resource A (when the RA monitors B, it will fail
> > if A is not running properly)
> >
> > If I stop resource A, the next monitor operation of "B" will fail.
> > Interestingly, this check happens immediately after A is stopped.
> >
> > B is configured to restart if monitor fails.
> > Start timeout is rather long, 180 seconds. So pacemaker tries to
> > restart B, and waits.
> >
> > If I want to start "A", nothing happens until the start operation of
> > "B" fails - typically several minutes.
> >
> > Is this the right behavior?
> >
> > It appears that pacemaker is blocked while resource B is being
> > started, and I cannot really start its dependency...
> >
> > Shouldn't it be possible to start a resource while another resource is
> > also starting?
>
> As long as resources don't depend on each other, parallel starting should
> work/happen.
>
> The number of parallel actions executed is derived from the number of
> cores, and when load is detected some kind of throttling kicks in (in fact
> a reduction of the operations executed in parallel, with the aim to reduce
> the load induced by pacemaker). When throttling kicks in you should get log
> messages (there is in fact a parallel discussion going on ...).
> No idea if throttling might be a reason here but maybe worth considering
> at least.
>
> Another reason I've observed for certain things happening with quite some
> delay is that some situations are only resolved when the
> cluster-recheck-interval triggers a pengine run, in addition to those
> triggered by changes.
> You might easily verify this by changing the cluster-recheck-interval.
>
> Regards,
> Klaus
>
> > Thanks,
> >
> > Attila
> >
> > From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > Sent: Tuesday, May 9, 2017 9:53 PM
> > To: users@clusterlabs.org; kgail...@redhat.com
> > Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond
> >
> > Hi Ken, all,
> >
> > We ran into an issue very similar to the one described in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
> > Pacemaker occasionally takes minutes to respond
> >
> > But in our case we are not using fencing/stonith at all.
> >
> > Many times when I want to start/stop/cleanup a resource, it takes tens
> > of seconds (or even minutes) till the command gets executed. The logs
> > show nothing in that period, the redundant rings show no fault.
> >
> > Could this be the same issue?
> >
> > Any hints on how to troubleshoot this?
> >
> > It is pacemaker 1.1.10, corosync 2.3.3
> >
> > Cheers,
> >
> > Attila
> >
> > _______________________________________________
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
> --
> Klaus Wenninger
>
> Senior Software Engineer, EMEA ENG Openstack Infrastructure
>
> Red Hat
>
> kwenn...@redhat.com