On 02/27/2017 01:48 PM, Jeffrey Westgate wrote:
> I think I may be on to something.  It seems that every time my boxes start 
> showing increased host load, the preceding change that takes place is:
> 
>  crmd:     info: throttle_send_command:       New throttle mode: 0100 (was 0000)
> 
> I'm attaching the last 50-odd lines from the corosync.log.  It just happens 
> that  - at the moment - our host load on this box is coming back down.  No 
> host load issue (0.00 load) immediately preceding this part of the log.
> 
> I know the log shows them in reverse order, but it shows them as the same log 
> item, and printed at the same time.  I'm assuming the throttle change takes 
> place and that increases the load, not the other way around....
> 
> So - what is the throttle mode?
> 
> --
> Jeff Westgate
> DIS UNIX/Linux System Administrator

Actually, it is the other way around: the load rises first. When
Pacemaker detects high load on a node, it "throttles" by reducing the
number of operations it will execute concurrently (to avoid making a
bad situation worse).
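
As for what the mode value itself means, here is a rough sketch that
decodes the log message. It assumes the hex values map to throttle levels
the way the throttle_state_e enum in crmd/throttle.c did around that time
(none/low/medium/high/extreme); that mapping is my reading of the source,
so check it against the version you have installed. The function and
variable names are made up for illustration.

```python
import re

# Assumed mapping, based on the throttle_state_e enum in Pacemaker's
# crmd/throttle.c; verify against your installed version.
THROTTLE_MODES = {
    0x0000: "none",
    0x0001: "low",
    0x0010: "medium",
    0x0100: "high",
    0x1000: "extreme",
}

# Matches the message quoted above, e.g. "New throttle mode: 0100 (was 0000)"
LINE_RE = re.compile(
    r"New throttle mode: ([0-9a-fA-F]{4}) \(was\s+([0-9a-fA-F]{4})\)"
)


def decode_throttle_line(line):
    """Return (old_level, new_level) for a throttle log line, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    new_mode, old_mode = (int(group, 16) for group in match.groups())
    return (
        THROTTLE_MODES.get(old_mode, "unknown"),
        THROTTLE_MODES.get(new_mode, "unknown"),
    )


sample = ("crmd:     info: throttle_send_command:       "
          "New throttle mode: 0100 (was 0000)")
print(decode_throttle_line(sample))
```

Under that assumption, your "0100 (was 0000)" line would mean the node
went from no throttling to the "high" level, which fits with Pacemaker
reacting to load that was already there.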

So, what caused the load to go up is still a mystery.

There have been some cases where corosync started using 100% CPU, but
since you mentioned that processes aren't taking any more CPU, it
doesn't sound like the same issue.

> ------------------------------
> Message: 3
> Date: Mon, 27 Feb 2017 13:26:30 +0000
> From: Jeffrey Westgate <jeffrey.westg...@arkansas.gov>
> To: "users@clusterlabs.org" <users@clusterlabs.org>
> Subject: Re: [ClusterLabs] Never join a list without a problem...
> 
> Thanks, Ken.
> 
> Our late guru was the admin who set all this up, and it's been rock solid 
> until recent oddities started cropping up.  They still function fine - 
> they've just developed some... quirks.
> 
> I found the solution before I got your reply, which was essentially what we 
> did: update all but pacemaker, reboot, stop pacemaker, update pacemaker, 
> reboot.  That process was necessary because they've been running sooo long, 
> pacemaker would not stop.  It would try, then seemingly stall after several 
> minutes.
> 
> We're good now, up-to-date-wise, and stuck only with the initial issue we 
> were hoping to eliminate by updating/patching EVERYthing.  And we honestly 
> don't know what may be causing it.
> 
> We use Nagios to monitor, and once every 20 to 40 hours - sometimes longer, 
> and we cannot set a clock by it - while the machine is 95% idle (or more 
> according to 'top'), the host load shoots up to 50 or 60%.  It takes about 20 
> minutes to peak, and another 30 to 45 minutes to come back down to baseline, 
> which is mostly 0.00.  (attached hostload.pdf)  This happens to both 
> machines, randomly, and is concerning, as we'd like to find what's causing it 
> and resolve it.
> 
> We were hoping "uptime kernel bug", but patching has not helped.  There seems 
> to be no increase in the number of processes running, and the processes 
> running do not take any more cpu time.  They are DNS forwarding resolvers, 
> but there is no correlation between dns requests and load increase - 
> sometimes (like this morning) it rises around 1 AM when the dns load is 
> minimal.
> 
> The oddity is - these are the only two boxes with this issue, and we have a 
> couple dozen at the same OS and level.  Only these two, with this role and 
> this particular package set have the issue.
> 
> --
> Jeff
