On 29.08.2012 at 17:21, Brian Smith wrote:

> We use mem_free variable as a consumable. Then, we use a cronjob called
> memkiller that terminates jobs if they go over their requested (or default)
> memory allocation and
It would be more straightforward to use h_vmem directly. This is controlled by SGE, and a job exceeding the limit will be killed by SGE. If you set it up as a consumable at the exechost level, it could be set to the installed physical memory. Was there any reason to use mem_free?

-- Reuti

>
> 1. Swap space on node is used
> 2. Swap rate is greater than 100 I/Os per second
>
> The user gets emailed with a report if this happens.
>
> This has made dealing with the oom killer a thing of the past in our shop.
>
> We manage memory on the principle that swap should NEVER be used. If you're
> hitting oom killer, you're pretty far beyond that in terms of memory
> utilization; if performance is a consideration, MHO is you should be looking
> to schedule your memory usage accordingly. Oom killer shouldn't be a factor
> if memory is handled as a scheduler consideration.
>
> -Brian
>
> Brian Smith
> Sr. System Administrator
> Research Computing, University of South Florida
> 4202 E. Fowler Ave. SVC4010
> Office Phone: +1 813 974-1467
> Organization URL: http://rc.usf.edu
>
> On 08/29/2012 11:02 AM, Ben De Luca wrote:
>> I was wondering how people deal with oom conditions on their cluster.
>> We constantly have machines that die because the oom killer takes out
>> critical system services.
>>
>> Does anyone have any experience with the oom_adj proc value, or a patch
>> to grid to support it?
>>
>>
>> /proc/[pid]/oom_adj (since Linux 2.6.11)
>>     This file can be used to adjust the score used to select which process
>>     should be killed in an out-of-memory (OOM) situation. The kernel uses
>>     this value for a bit-shift operation of the process's oom_score value:
>>     valid values are in the range -16 to +15, plus the special value -17,
>>     which disables OOM-killing altogether for this process. A positive
>>     score increases the likelihood of this process being killed by the OOM-
>>     killer; a negative score decreases the likelihood.
>>     The default value for this file is 0; a new process inherits its parent's
>>     oom_adj setting. A process must be privileged (CAP_SYS_RESOURCE) to update
>>     this file.
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
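
For reference, the oom_adj interface quoted in the man page excerpt can be driven from a small script. The following is only a minimal sketch, not something Grid Engine ships: the function name set_oom_adj and the proc_root parameter are made up for illustration (proc_root exists only so the function can be tried against a scratch directory). It checks the value against the range given in the excerpt (-16 to +15, or the special value -17) and writes it to the per-process file.

```python
import os

# Limits taken from the quoted man page text for /proc/[pid]/oom_adj.
OOM_ADJ_MIN = -16
OOM_ADJ_MAX = 15
OOM_DISABLE = -17  # special value: never select this process for OOM-killing


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_adj after range-checking it.

    `proc_root` is a hypothetical parameter added for testing; on a real
    system it stays at "/proc". Returns the path that was written.
    """
    if value != OOM_DISABLE and not (OOM_ADJ_MIN <= value <= OOM_ADJ_MAX):
        raise ValueError(
            "oom_adj must be -17 or in the range -16..15, got %d" % value
        )
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as handle:
        handle.write("%d\n" % value)
    return path
```

A wrapper run as root could call, for example, set_oom_adj(pid_of_critical_daemon, -17) for each system service that must survive, where pid_of_critical_daemon is whatever PID the site looks up. As the excerpt notes, updating the file requires privilege (CAP_SYS_RESOURCE), and child processes inherit the setting, so jobs started by a protected daemon would need their value reset.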
