Still at the client with the VLSI tools. Some of the users here run heavy simulations (all userspace, almost zero kernel time), and at times a single process can hog the entire system. I have no idea how that happens: this is a fairly modern kernel (RHEL4's 2.6.9, with its slightly older scheduler), and the Cadence tools do not use lightweight processes, so all the load lands on a single core of a quad Xeon. Yet once it starts, the whole machine chokes and all I can do is hit reset.
Step 1: I asked them all to nice down their jobs, but they are not very happy to. I'm trying to educate them and get them to use wrappers (I'm introducing Condor here anyway).

Step 2: I have set up root's .bashrc to renice my shell to -4, so I can keep a session open for the next time this happens and at least be able to run "top" and "kill".

Step 3: I need a monitor to alert, and maybe kill or renice, such processes when they pop up and drag the machine to a halt. Until I find out who the culprit is I don't have a process name, so "monit" is not a good choice. Any other good ideas?

Step 4: How do I log this without overlogging? Some sort of smart process-auditing daemon? I don't want to improvise with shell scripts and cron, grepping the output of ps, because when the excrement impacts the fan the cron job may not get to run (unless I hike crond's priority to a negative nice). I need a small, reliable C proggy to do the right thing.

The obvious option is to set some ulimits on the users, but I don't want to limit heavy processes that do NOT choke the system.

-- 
A meal best served cold
Ira Abramov
http://ira.abramov.org/email/
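
For steps 3 and 4, here is a minimal sketch of what such a small C watchdog might look like. It assumes CPU share is the right signal for spotting the culprit, which is not yet established. It samples utime+stime from /proc/<pid>/stat twice, computes the share of one CPU each process used in between, and renices and logs anything over a threshold. The function names (parse_stat, take_sample, cpu_fraction, watch_once), the MAX_PROCS limit and all thresholds are made up for illustration. It leaves alone any process that is already at or above the target nice value, so each offender is logged once rather than on every pass, and it skips anything with a negative nice, so the root session from step 2 is not touched.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <syslog.h>
#include <unistd.h>

#define MAX_PROCS 4096   /* hypothetical upper bound on tracked processes */

struct sample { pid_t pid; unsigned long ticks; long nice_val; };

/* Parse one /proc/<pid>/stat line: fills in utime+stime (clock ticks)
 * and the nice value. Returns 0 on success, -1 on a malformed line. */
int parse_stat(const char *line, unsigned long *ticks, long *nice_val)
{
    unsigned long utime, stime;
    /* The command name may contain spaces or ')', so skip to the last ')'. */
    const char *p = strrchr(line, ')');
    if (!p)
        return -1;
    /* Fields after comm: state, 5 ints, 5 unsigned, utime, stime,
     * cutime, cstime, priority, nice. */
    if (sscanf(p + 1, " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u"
                      " %lu %lu %*d %*d %*d %ld",
               &utime, &stime, nice_val) != 3)
        return -1;
    *ticks = utime + stime;
    return 0;
}

/* Record CPU ticks and nice value for every numeric entry in /proc.
 * Returns the number of processes sampled, or -1 if /proc is unreadable. */
int take_sample(struct sample *out, int max)
{
    DIR *d = opendir("/proc");
    struct dirent *e;
    int n = 0;
    if (!d)
        return -1;
    while (n < max && (e = readdir(d)) != NULL) {
        char path[300], line[1024];
        FILE *f;
        if (!isdigit((unsigned char)e->d_name[0]))
            continue;
        snprintf(path, sizeof path, "/proc/%s/stat", e->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;                 /* process exited in the meantime */
        if (fgets(line, sizeof line, f) &&
            parse_stat(line, &out[n].ticks, &out[n].nice_val) == 0) {
            out[n].pid = (pid_t)atol(e->d_name);
            n++;
        }
        fclose(f);
    }
    closedir(d);
    return n;
}

/* Share of one CPU used between two tick counts taken `secs` apart. */
double cpu_fraction(unsigned long before, unsigned long after,
                    double secs, long hz)
{
    if (after < before || secs <= 0 || hz <= 0)
        return 0.0;
    return (double)(after - before) / (secs * (double)hz);
}

/* One pass: sample, sleep, sample again, then renice and log each hog.
 * Only processes with 0 <= nice < new_nice are touched, so a process is
 * reniced and logged once, and negative-nice sessions are left alone. */
void watch_once(unsigned interval, double threshold, int new_nice)
{
    static struct sample a[MAX_PROCS], b[MAX_PROCS];
    long hz = sysconf(_SC_CLK_TCK);
    int na = take_sample(a, MAX_PROCS), nb, i, j;

    sleep(interval);
    nb = take_sample(b, MAX_PROCS);
    for (i = 0; i < nb; i++) {
        for (j = 0; j < na; j++) {
            double f;
            if (a[j].pid != b[i].pid)
                continue;
            f = cpu_fraction(a[j].ticks, b[i].ticks, (double)interval, hz);
            if (f >= threshold && b[i].nice_val >= 0 &&
                b[i].nice_val < new_nice) {
                syslog(LOG_WARNING, "pid %d used %.0f%% of a CPU, renicing to %d",
                       (int)b[i].pid, f * 100.0, new_nice);
                setpriority(PRIO_PROCESS, b[i].pid, new_nice);
            }
            break;
        }
    }
}
```

A main() would simply call something like watch_once(5, 0.9, 10) in an endless loop, and the program itself would be started at a negative nice so it still gets to run when the machine is choking. As written it treats every long-running full-core job as a hog, so it would also renice well-behaved heavy jobs. Telling those apart from the one that chokes the box needs a better signal than CPU share alone.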