Since both nodes of the load-balanced cluster are doing the same thing, for reasons we still haven't identified, we've put atop on one and sysdig on the other. atop is running at 10-second slices, in the hope that it will catch something. While I was configuring it yesterday, that server went into its 'episode', but nothing in the atop log showed anything unusual. Nothing changed except the CPU load average; no other parameter increased.
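One thing that might be worth logging alongside atop: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones (state R), so load can climb while the CPU stays idle. Whether that is what is happening here is only a guess. Below is a minimal sketch, assuming a Linux /proc filesystem, that records the 1-minute load average next to the R and D task counts at the same 10-second interval. The function names and the output format are made up for illustration and are not part of atop or sysdig.

```python
import glob
import time
from collections import Counter


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_state(stat_text):
    """Return the one-letter task state from the text of a /proc stat file.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' instead of on whitespace.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(pattern="/proc/[0-9]*/task/[0-9]*/stat"):
    """Count task states across all threads (R = runnable, D = uninterruptible)."""
    counts = Counter()
    for path in glob.glob(pattern):
        try:
            with open(path) as fh:
                counts[parse_state(fh.read())] += 1
        except (OSError, IndexError):
            continue  # task exited between listing and reading
    return counts


def format_sample(timestamp, load1, counts):
    """Build one log line; the layout is an arbitrary choice for this sketch."""
    return f"{timestamp} load1={load1:.2f} R={counts['R']} D={counts['D']}"


def log_samples(interval=10, samples=6):
    """Print `samples` lines, one every `interval` seconds."""
    for i in range(samples):
        with open("/proc/loadavg") as fh:
            load1 = parse_loadavg(fh.read())[0]
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(format_sample(stamp, load1, count_states()), flush=True)
        if i + 1 < samples:
            time.sleep(interval)
```

Calling log_samples(interval=10, samples=8640) would cover roughly 24 hours. If D rises during an episode while R stays flat, that would point toward blocked I/O or a stuck kernel path and away from CPU demand.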
Frustrating.

________________________________________
From: Adam Spiers [aspi...@suse.com]
Sent: Wednesday, March 01, 2017 5:33 AM
To: Cluster Labs - All topics related to open-source clustering welcomed
Cc: Jeffrey Westgate
Subject: Re: [ClusterLabs] Never join a list without a problem...

Ferenc Wágner <wf...@niif.hu> wrote:
>Jeffrey Westgate <jeffrey.westg...@arkansas.gov> writes:
>
>> We use Nagios to monitor, and once every 20 to 40 hours - sometimes
>> longer, and we cannot set a clock by it - while the machine is 95%
>> idle (or more according to 'top'), the host load shoots up to 50 or
>> 60%. It takes about 20 minutes to peak, and another 30 to 45 minutes
>> to come back down to baseline, which is mostly 0.00. (attached
>> hostload.pdf) This happens to both machines, randomly, and is
>> concerning, as we'd like to find what's causing it and resolve it.
>
>Try running atop (http://www.atoptool.nl/). It collects and logs
>process accounting info, allowing you to step back in time and check
>resource usage in the past.

Nice, I didn't know atop could also log the collected data for future
analysis.

If you want to capture even more detail, sysdig is superb:

  http://www.sysdig.org/