On 02/27/2017 01:48 PM, Jeffrey Westgate wrote:
> I think I may be on to something. It seems that every time my boxes start
> showing increased host load, the preceding change that takes place is:
>
> crmd: info: throttle_send_command: New throttle mode: 0100 (was 0000)
>
> I'm attaching the last 50-odd lines from the corosync.log. It just happens
> that - at the moment - our host load on this box is coming back down. No
> host load issue (0.00 load) immediately preceding this part of the log.
>
> I know the log shows them in reverse order, but it shows them as the same log
> item, and printed at the same time. I'm assuming the throttle change takes
> place and that increases the load, not the other way around....
>
> So - what is the throttle mode?
>
> --
> Jeff Westgate
> DIS UNIX/Linux System Administrator
Actually it is the other way around. When Pacemaker detects high load on a
node, it "throttles" by reducing the number of operations it will execute
concurrently (to avoid making a bad situation worse).

So, what caused the load to go up is still a mystery.

There have been some cases where corosync started using 100% CPU, but since
you mentioned that processes aren't taking any more CPU, it doesn't sound
like the same issue.

> ------------------------------
>
> Message: 3
> Date: Mon, 27 Feb 2017 13:26:30 +0000
> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
> To: "users@clusterlabs.org" <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> Message-ID:
> <a36b14fa9aa67f4e836c0ee59dea89c4015b20c...@cm-sas-mbx-07.sas.arkgov.net>
> Content-Type: text/plain; charset="us-ascii"
>
> Thanks, Ken.
>
> Our late guru was the admin who set all this up, and it's been rock solid
> until recent oddities started cropping up. They still function fine -
> they've just developed some... quirks.
>
> I found the solution before I got your reply, which was essentially what we
> did; update all but pacemaker, reboot, stop pacemaker, update pacemaker,
> reboot. That process was necessary because they've been running sooo long,
> pacemaker would not stop. it would try, then seemingly stall after several
> minutes.
>
> We're good now, up-to-date-wise, and stuck only with the initial issue we
> were hoping to eliminate by updating/patching EVERYthing. And we honestly
> don't know what may be causing it.
>
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer,
> and we cannot set a clock by it - while the machine is 95% idle (or more
> according to 'top'), the host load shoots up to 50 or 60%. It takes about 20
> minutes to peak, and another 30 to 45 minutes to come back down to baseline,
> which is mostly 0.00.
> (attached hostload.pdf) This happens to both machines, randomly, and is
> concerning, as we'd like to find what's causing it and resolve it.
>
> We were hoping "uptime kernel bug", but patching has not helped. There seems
> to be no increase in the number of processes running, and the processes
> running do not take any more cpu time. They are DNS forwarding resolvers,
> but there is no correlation between dns requests and load increase -
> sometimes (like this morning) it rises around 1 AM when the dns load is
> minimal.
>
> The oddity is - these are the only two boxes with this issue, and we have a
> couple dozen at the same OS and level. Only these two, with this role and
> this particular package set have the issue.
>
> --
> Jeff

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
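
As a rough illustration of the throttling described in the reply above, the sketch below maps load average per core to a throttle mode and a reduced limit on concurrent operations. It is not Pacemaker's implementation: the thresholds, the job-limit fractions, the function names, and every mode value other than the 0000 and 0100 seen in the log line are assumptions made for this example. The point it shows is the direction of cause and effect: the load is measured first, and the mode change follows from it.

```python
import os

# Illustrative table only. The mode values 0x0000 and 0x0100 are taken from
# the log line quoted in this thread; the other mode values, the load
# thresholds, and the job-limit fractions are assumptions for this sketch.
MODES = [
    # (minimum load per core, mode value, fraction of the normal job limit)
    (3.0, 0x1000, 0.0),
    (2.0, 0x0100, 0.25),
    (1.5, 0x0010, 0.5),
    (1.0, 0x0001, 0.75),
    (0.0, 0x0000, 1.0),
]


def throttle_mode(load, cores):
    """Return (mode, fraction) for a given load average and core count."""
    per_core = load / max(cores, 1)
    for threshold, mode, fraction in MODES:
        if per_core >= threshold:
            return mode, fraction
    return 0x0000, 1.0


def job_limit(load, cores, normal_limit):
    """Return (mode, limit): how many operations may run concurrently."""
    mode, fraction = throttle_mode(load, cores)
    return mode, int(normal_limit * fraction)


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    mode, limit = job_limit(load1, cores, normal_limit=8)
    print(f"load={load1:.2f} cores={cores} mode={mode:04x} job limit={limit}")
```

Run on an idle node, a script like this would report mode 0000 with the full job limit; the mode only changes after the measured load has already risen, which matches the order of events in the reply.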