Hi all

I have a set of servers running asterisk and some java apps which have (so far) unexplained spikes in load average.

A typical spike, which occurs at "random" times, sees the 1-minute load average go from around 4 to upwards of 50, sometimes approaching 200, within one second.

From the proc manpage, the 1-minute load average is the "number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1 minute".

I am collecting many different stats from /proc every second, but nothing I have found correlates with the spikes in load average. The process counts from /proc/stat and /proc/loadavg do not match up with the sudden spike. I have looked at memory paging, IRQs, number of threads, CPU states (intr/iowait/etc.), network traffic, disk I/O, etc., but no metric I have found so far changes behaviour at the same time the load average spikes.

As I am writing this, I have realized that I am not actually tracking the numbers which are the direct input to the load average. To do that, I would loop through all processes, extract the process state from /proc/<pid>/stat, and add up the counts per state. This would (hopefully) let me confirm that the load average numbers are "correct", and may also point to a cause (many processes waiting on I/O, or lots of the same process (asterisk or java) becoming runnable at the same time).
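
Something like the following is what I have in mind: a minimal sketch in Python (assuming python3 is available on the boxes). It walks /proc/<pid>/task/<tid>/stat rather than just /proc/<pid>/stat, since as far as I understand the kernel counts individual tasks (threads) in states R and D, not just top-level processes.

#!/usr/bin/env python3
# Count tasks in state R (runnable) and D (uninterruptible sleep, usually
# waiting on I/O) by walking /proc/<pid>/task/<tid>/stat, and print the
# counts alongside /proc/loadavg once per second for comparison.
import glob
import time

def count_states():
    counts = {"R": 0, "D": 0}
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                data = f.read()
        except OSError:
            continue  # task exited between listing and reading
        # The state field follows the command name, which is wrapped in
        # parentheses and may itself contain spaces, so split on the
        # last ')' before taking the first whitespace-separated field.
        state = data.rsplit(")", 1)[1].split()[0]
        if state in counts:
            counts[state] += 1
    return counts

if __name__ == "__main__":
    while True:
        c = count_states()
        with open("/proc/loadavg") as f:
            loadavg = f.read().strip()
        print(time.strftime("%H:%M:%S"),
              "R=%d D=%d" % (c["R"], c["D"]),
              "loadavg:", loadavg)
        time.sleep(1)

Running that alongside the existing per-second collection should at least show whether it is the R count or the D count that jumps when the load average does.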

While I do that, does anyone have other ideas on how to troubleshoot the cause of these very high load spikes?

Regards

Chris
