On Tue, Jul 19, 2011 at 11:52 AM, Steve Dibb <[email protected]> wrote:
> I've got two questions -- how do you guys usually go about monitoring
> this stuff?  Monit can check the system's general usage, but how do I know
> which applications are doing that?

You already got great suggestions for this question.

Monit, Munin, or Cacti for general performance graphing/monitoring
Nagios, etc. for host/service availability monitoring and notifications
Splunk or basic centralized syslog for log monitoring and analysis
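
If you go the centralized-syslog route, getting logs off the box is
usually a one-line change.  A minimal sketch, assuming rsyslog and a
placeholder central host named "loghost":

    $ echo '*.* @loghost:514' | sudo tee /etc/rsyslog.d/50-forward.conf
    $ sudo service rsyslog restart

(Use @@ instead of @ to forward over TCP rather than UDP.)  This also
helps with freezes like yours: the last messages make it off the
machine even if the box locks up.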

>
> My second question is, where in the world do you start to diagnose
> something like this?  Looking at the system and apache logs, it looks
> like everything just STOPPED.  There are no red flags that I can see, so
> I'm having a hard time diagnosing it.

Nobody touched on this question, likely because sudden, massive
memory spikes like this are a pain to diagnose.
* Consider the services that the machine provides.
  Could any of them plausibly eat tons of memory in a very short time?
* Check your existing logs for all services.
  Are there any indications, in any log, of increased activity?
* Consider the timing and frequency of these failures.
  Does it happen more than once?  At regular intervals?  Is it predictable?
  * Look through your scheduled tasks (cron) for any jobs that
    coincide with this timing (see the one-liner after this list).
* Consider a more frequent system check.
  Run a loop to gather process data:
    $ while true; do ps auxww > ps.$(date +%s); sleep 10; done
  or something similar (an expanded sketch follows this list).
  Increase the frequency of your existing monitoring, if possible.
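
For the cron check, a quick way to eyeball everything scheduled on
the box (system crontabs plus every user's) -- a sketch, assuming
you're root:

    $ cat /etc/crontab /etc/cron.d/* 2>/dev/null
    $ for u in $(cut -d: -f1 /etc/passwd); do crontab -l -u $u 2>/dev/null; done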
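
And here's a slightly richer variant of the snapshot loop that also
records free memory and sorts processes by resident size, so the last
snapshot taken before a freeze points straight at the culprit.  A
sketch only; the ps.out/ directory name and 10-second interval are
arbitrary choices:

    $ mkdir -p ps.out
    $ while true; do
        { date; free -m; ps auxww --sort=-rss | head -25; } > ps.out/$(date +%s)
        sleep 10
      done

Afterwards, the snapshot with the latest timestamp before the freeze
is the interesting one:

    $ ls ps.out | sort -n | tail -3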

Good Luck
--lonnie
