On Tue, Jul 19, 2011 at 11:52 AM, Steve Dibb <[email protected]> wrote:
> I've got two questions -- how do you guys usually go about monitoring
> this stuff? Monit can check the system general usage, but how do I know
> which applications are doing that?
You already got great suggestions for this question.
Monit, Munin, Cacti for general performance graphing/monitoring
Nagios, etc for host/service availability monitoring and notifications
Splunk or basic centralized syslog for log monitoring and analysis
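If you go the Monit route, a minimal sketch of the kind of memory checks
it can do looks something like this; the hostname, pidfile path, and
thresholds below are placeholders, not anything from your actual setup:

  set daemon 60                          # poll every 60 seconds
  check system web01.example.com
      if memory usage > 85% then alert
      if loadavg (5min) > 4 then alert
  check process apache with pidfile /var/run/apache2.pid
      if totalmem > 1024 MB for 2 cycles then alert

That at least gets you an alert while memory is climbing, instead of
finding out after the box has already fallen over.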
>
> My second question is, where in the world do you start to diagnose
> something like this? Looking at the system and apache logs, it looks
> like everything just STOPPED. There's no red flags that I can see, so
> I'm having a hard time diagnosing it.
Nobody touched on this question, likely because it's a pain to
identify sudden massive memory spikes like this.
* Consider the services that the machine provides.
Are any of them likely, or even able, to eat tons of memory in a very short time?
* Check your existing logs for all services.
Are there any indications of increased activity in any of the logs?
(A few one-liners for lining the logs up against an incident time follow
after this list.)
* Consider the timing and frequency of these failures.
Does it happen more than once? At regular intervals? Is it predictable?
* Look through your scheduled tasks (cron) for any jobs whose schedule
may coincide with this timing.
* Consider checking the system more frequently.
Run a loop that dumps a timestamped process snapshot every few seconds:
$ while true; do ps auxww > ps.$(date +%s); sleep 10; done
or something similar.
Increase the frequency of your existing monitoring, if possible.
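Once you do have the timestamp of one incident, a few one-liners along
these lines make it quick to line logs and cron jobs up against it. The
log paths are the usual Debian/Red Hat defaults and the 'Jul 19 03:1'
pattern is only an example window, so substitute your own:

# everything syslog captured in a ten-minute window (03:10-03:19 here)
$ grep 'Jul 19 03:1' /var/log/syslog /var/log/messages 2>/dev/null

# apache requests per minute (assumes the common combined log format),
# to spot a sudden burst of traffic around the incident
$ awk -F'[:[]' '{print $2":"$3":"$4}' /var/log/apache2/access.log \
    | sort | uniq -c | sort -rn | head

# system-wide and per-user cron jobs that could line up with that time
$ grep -v '^#' /etc/crontab /etc/cron.d/* 2>/dev/null
$ for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u $u 2>/dev/null; done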
Good Luck
--lonnie
_______________________________________________
UPHPU mailing list
[email protected]
http://uphpu.org/mailman/listinfo/uphpu
IRC: #uphpu on irc.freenode.net