Still at the client with the VLSI tools. Some of the users here run
heavy simulations (all userspace, almost zero kernel time), and at
times a single process can hog the entire system. I have no idea how
that happens: this is a fairly modern kernel (RHEL4's 2.6.9, with its
slightly older scheduler) and the Cadence tools are not using
lightweight processes, so all the load is on a single core (of a quad
Xeon). Yet once it starts, the whole machine chokes and all I can do
is hit reset.

step 1: I asked them all to nice their jobs down, but they are not very
happy to do so. I'm trying to educate them and get them to use wrappers
(I'm introducing Condor here anyway).
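
A minimal sketch of the kind of wrapper step 1 has in mind, so users
do not have to remember to nice anything themselves. It assumes
Python 3 is available on the machine (a stock RHEL4 box will not have
it); the script name nicewrap.py and the increment of 10 are made up
for illustration.

```python
import os
import subprocess
import sys

NICE_INCREMENT = 10  # assumed value; use whatever the site agrees on


def run_niced(argv, increment=NICE_INCREMENT):
    """Run argv at a lowered priority and return its exit status."""
    # preexec_fn runs in the child just before exec, so only the job
    # (not the wrapper itself) ends up with the higher nice value.
    return subprocess.call(argv, preexec_fn=lambda: os.nice(increment))


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: nicewrap.py command [args...]")
    sys.exit(run_niced(sys.argv[1:]))
```

Users would then start a job as "nicewrap.py <tool> <args>" instead of
calling the tool directly. Raising the nice value needs no special
privileges, so this works for ordinary accounts.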

step 2: I have set up root's .bashrc to renice my shell to -4, so I can
keep a session open for the next time this happens and at least be
able to run "top" and "kill".

step 3: I need a monitor that will alert on, and maybe kill or renice,
such processes when they pop up and drag the machine to a halt. Until I
find out who the culprit is I don't have a process name, so "monit" is
not a good choice. Any other good ideas?
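
One possible shape for the monitor in step 3, as a rough sketch only.
It samples /proc/<pid>/stat twice, works out each process's share of
one core from the utime and stime tick counters, and renices anything
above a threshold, so it needs no process name up front. It assumes
Python 3 and root privileges, and it assumes CPU share is the right
signal, which the symptoms above do not confirm. The threshold,
interval and target nice value are invented for illustration; the same
logic ports to C directly.

```python
import os
import time

CPU_THRESHOLD = 0.95   # assumed: fraction of one core that counts as hogging
SAMPLE_SECONDS = 5     # assumed sampling interval
TARGET_NICE = 10       # assumed nice value to push offenders to


def parse_stat(line):
    """Return (pid, comm, cpu_ticks, nice) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the last ')' instead of on whitespace alone.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    comm = line[lparen + 1:rparen]
    rest = line[rparen + 1:].split()       # rest[0] is field 3 (state)
    ticks = int(rest[11]) + int(rest[12])  # utime + stime (fields 14, 15)
    nice = int(rest[16])                   # field 19
    return pid, comm, ticks, nice


def snapshot(proc="/proc"):
    """Map pid -> (comm, cpu_ticks, nice) for every process readable now."""
    procs = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, ticks, nice = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # typically: process exited between listdir() and open()
        procs[pid] = (comm, ticks, nice)
    return procs


def find_hogs(before, after, seconds, hz, threshold=CPU_THRESHOLD):
    """Return [(pid, comm, share)] for not-yet-niced processes over threshold."""
    hogs = []
    for pid, (comm, ticks, nice) in after.items():
        if pid not in before or nice >= TARGET_NICE:
            continue
        share = (ticks - before[pid][1]) / (hz * seconds)
        if share >= threshold:
            hogs.append((pid, comm, share))
    return hogs


def main():
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    try:
        os.setpriority(os.PRIO_PROCESS, 0, -10)  # stay ahead of the hogs
    except PermissionError:
        pass  # not root: keep running at normal priority
    before = snapshot()
    while True:
        time.sleep(SAMPLE_SECONDS)
        after = snapshot()
        for pid, comm, share in find_hogs(before, after, SAMPLE_SECONDS, hz):
            print("renicing %d (%s): %.0f%% of a core" % (pid, comm, share * 100))
            try:
                os.setpriority(os.PRIO_PROCESS, pid, TARGET_NICE)
            except OSError:
                pass  # process gone, or not permitted
        before = after


if __name__ == "__main__":
    main()
```

The monitor raises its own priority first, for the same reason as the
root shell in step 2: otherwise it may not get scheduled when it is
needed most. Note that the tick counters cover all threads of a
process, so a multithreaded job can show a share above 1.0.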

step 4: how do I log this without overlogging? Some sort of smart
process-auditing daemon? I don't want to improvise with shell scripts
and cron, grepping the output of ps, because when the excrement impacts
the venta it may not be able to run (unless I hike crond's priority to
a negative nice). I need a small, reliable C proggy to do the right thing.
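
One simple way to avoid overlogging in step 4 is a per-process
cooldown: log the first sighting of an offender, then stay quiet about
that pid for a while and only report how many repeats were dropped.
The sketch below shows just that idea, assuming Python 3; the class
name ThrottledLog and the five-minute default are invented for
illustration, and it is not a full auditing daemon.

```python
import time


class ThrottledLog:
    """Emit a message per key at most once every `cooldown` seconds."""

    def __init__(self, cooldown=300.0, emit=print):
        self.cooldown = cooldown
        self.emit = emit
        self.last = {}        # key -> time of the last emitted message
        self.suppressed = {}  # key -> messages dropped since then

    def log(self, key, message, now=None):
        """Return True if the message was emitted, False if suppressed."""
        now = time.time() if now is None else now
        last = self.last.get(key)
        if last is not None and now - last < self.cooldown:
            self.suppressed[key] = self.suppressed.get(key, 0) + 1
            return False
        dropped = self.suppressed.pop(key, 0)
        if dropped:
            message += " (%d similar suppressed)" % dropped
        self.emit(message)
        self.last[key] = now
        return True
```

Keyed by pid, this gives one line when a process first misbehaves and
one summary line per cooldown period after that. Entries for dead pids
are never pruned here, which a long-running daemon would need to
handle.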

The obvious answer may be to set some ulimits on the users, but I don't
want to limit heavy processes that do NOT choke the system.

-- 
A meal best served cold
Ira Abramov
http://ira.abramov.org/email/
