"schedd_job_info" does not scale by its very nature: the number of messages generated per job depends on the cluster size, and such messages are generated for every job. It is also questionable whether all scheduler decisions for each job and resource (queue instance) really need to be recorded, even temporarily. Hence the recommendation is to always turn it off (I think we changed the default to that in one of the last Sun versions). Alternatively, you can use "qalter -w p <jobid>" to find out why a particular job is not being scheduled (it produces similar messages, but only for that one job).
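For reference, a minimal sketch of both approaches (the "false" value and the grep step are assumptions about your particular Grid Engine version; the parameter and commands themselves are as discussed above):

    # show the current scheduler configuration, including schedd_job_info
    qconf -ssconf | grep schedd_job_info

    # open the scheduler configuration in an editor and set:
    #   schedd_job_info   false
    qconf -msconf

    # then, for a single pending job, ask why it is not being dispatched
    qalter -w p <jobid>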
Daniel

> On 09.02.2015 at 09:43, Remy Dernat <[email protected]> wrote:
>
>
> On 09/02/2015 03:56, Christopher Samuel wrote:
>> On 07/02/15 14:57, Alan Louis Scheinine wrote:
>>
>>> Only problem I've seen is that if a user allocates too much memory,
>>> OOM killer can kill maintenance processes such as a scheduler daemon.
>> This is why we disable overcommit. :-)
>>
> Hi,
>
> I already saw that problem on our master. The scheduler, SGE, ran out of
> memory and the OOM killer decided to kill it:
>
> Dec 1 15:01:07 cluster1 kernel: Out of memory: Kill process 7963
> (sge_qmaster) score 948 or sacrifice child
>
> I resolved that issue by disabling "schedd_job_info" in SGE with "qconf
> -msconf".
>
> However, this setting gives significant information about our jobs.
>
> How should I adjust the OOM killer? Should I set
> vm.overcommit_memory = 2
> ?
>
> Best regards,
>
> Rémy
>
> --
> Rémy Dernat
> MBB/ISE-M
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
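For completeness, the overcommit setting Rémy asks about would be applied roughly like this (whether strict accounting suits your workload is an assumption; jobs that make large sparse allocations may fail to start under mode 2):

    # disable memory overcommit (strict accounting)
    sysctl -w vm.overcommit_memory=2

    # make the setting persistent across reboots
    echo "vm.overcommit_memory = 2" >> /etc/sysctl.conf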
