On 10/04/2012 12:18 PM, James Harper wrote:
>> Hi,
>>
>> On Wed, Oct 03, 2012 at 10:07:06PM +0000, James Harper wrote:
>>> It seems like everytime I modify a resource, things start timing out. Just
>>> now I changed the location of where a ping resource could run and this
>>> happened:
>>> Oct 4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the
>>> operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in
>>> operation list for 22000 ms (longer than 10000 ms)
>>
>> That's interesting. Normally such a change should result in just a few
>> operations. Did you take a look at the transition which resulted from this
>> change?
>
> I don't think I know how to do that. All I changed though was a ping resource
> that wasn't (yet) in use so I can't see that too much could have changed.
>
>>> Another oddity is that the resource for p_lvm_iscsi is defined as:
>>>
>>> primitive p_lvm_iscsi ocf:heartbeat:LVM \
>>> params volgrpname="vg-drbd" \
>>> op start interval="0" timeout="30s" \
>>> op stop interval="0" timeout="30s" \
>>> op monitor interval="10s" timeout="30s"
>>>
>>> so I don't know where the timeout of 10000ms is coming from??
>>>
>>> When I change something with crm configure the cib process shoots up to
>>> 100% CPU and stays there for a while, and the node becomes more-or-less
>>> unresponsive, which may go some way to explaining why things time out. Is
>>> this normal? It doesn't explain why lrmd complains that something took
>>> longer than 10s when I set the timeout to 30s though, unless the interval
>>> somehow interacts with that?
>>
>> Ten seconds is an ad-hoc time and has nothing to do with specific timeouts.
>> lrmd logs a warning if an operation stays in the queue for longer than that.
>
> Ah. I leaped to the wrong conclusion there then.
>
>> How many resources do you have? You can also increase max-children (a
>> lrmd parameter), which is a number of operations that lrmd is allowed to run
>> concurrently (lrmadmin -p max-children n, by default it's set to 4).
>
> crm status says:
>
> Last updated: Thu Oct 4 19:42:30 2012
> Last change: Thu Oct 4 08:21:47 2012
> Stack: openais
> Current DC: <xxx> - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 5 Nodes configured, 5 expected votes
> 81 Resources configured.
>
> I'm not sure how many is considered a lot, but I can't think that 81 would
> rate very highly.

Yes, that is a lot for Pacemaker ... 5 nodes and each node has a status
section including operation results for every resource ... the cib can get
large quite fast.

Have you tried setting a higher "batch-limit" in your properties? Do you see
any corosync messages when applying such changes? There is a good chance you
also need to tune corosync timings.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>
>>> Versions of software are all from Debian Wheezy:
>>> corosync 1.4.2-3
>>> pacemaker 1.1.7-1
>>
>> I'd suggest to open a bugzilla and include hb_report (or crm_report,
>> whatever your distribution ships).
>>
>
> I plan on doing a bit more testing on the weekend to see if I can find a bit
> more information about exactly what is going on, otherwise any bug report is
> going to be a bit vague.
>
> Thanks
>
> James
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
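
To see how often operations sit in lrmd's queue and for how long, the warning quoted at the top of this thread can be pulled apart with a short script. This is only a sketch: the regular expression is derived from the single log line in this message, so the exact wording in other lrmd versions is an assumption, and the function name parse_queue_warning is made up for illustration.

```python
import re

# Pattern derived from the one lrmd warning quoted in this thread; other
# lrmd versions may word the message differently (assumption).
QUEUE_WARNING = re.compile(
    r"operation (?P<op>\w+)\[(?P<call_id>\d+)\] on (?P<resource>\S+) "
    r"for client (?P<client>\d+) stayed in operation list for "
    r"(?P<waited_ms>\d+) ms \(longer than (?P<threshold_ms>\d+) ms\)"
)


def parse_queue_warning(line):
    """Return the fields of an lrmd queue warning, or None if no match."""
    # Mail clients and syslog may wrap the message, so collapse whitespace.
    match = QUEUE_WARNING.search(" ".join(line.split()))
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and summed.
    for key in ("call_id", "client", "waited_ms", "threshold_ms"):
        fields[key] = int(fields[key])
    return fields


sample = (
    "Oct 4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the "
    "operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in "
    "operation list for 22000 ms (longer than 10000 ms)"
)
info = parse_queue_warning(sample)
print(info["resource"], info["op"], info["waited_ms"], info["threshold_ms"])
```

The waited_ms value is time spent queued inside lrmd, and threshold_ms is the fixed ten-second warning threshold described above, so neither should be compared against the 30s operation timeout in the primitive definition.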