Hi,

On 14.10.2015 at 19:53, Korzennik, Sylvain wrote:
> We found that after starting a large number of jobs (in the thousands),
> approx. 1% of new jobs fail because they get marked, after 1-3 seconds of
> execution time, as having exceeded either the CPU time limit or the memory
> limit. Neither condition is correct, as the jobs have barely started. My
> guess is that the part of the code that keeps track of CPU time and/or
> memory gets corrupted. We fix the problem by restarting the sge_execd
> daemons on all the compute nodes.
>
> On a possibly related issue, I also discovered that the usage line
> returned by qstat -j <jid>, i.e.:
>
> usage 1: cpu=91:18:14:27, mem=1818707.42063 GBs, io=1.35317, vmem=1.551G,
> maxvmem=1.553G
>
> also gets corrupted and is at times meaningless.
>
> In order to keep track of resource usage (esp. memory), I run
> qstat -j <jid> on all the jobs in a specific set of queues every 5 minutes
> and log the results. I found the following inconsistencies:
> - glitches in the value of cpu= (it should increase monotonically)
> - jumps in the value of mem=; at some point it drops by a large value, as
>   if a counter had overflowed
> - jumps in the value of maxvmem=; again, it starts high (10.89G), then
>   drops to 1.553G, which does not make sense. That drop appears at the
>   same time as the mem= glitch
>
> Some jobs on our cluster run for 60 to 90 days. I also found
> inconsistencies in the accounting file.

Maybe the additional group ID, which SGE uses to keep track of the resource
consumption of jobs, is getting reused too quickly. What range did you
specify when you installed SGE, and how many jobs run at the same time on
each exechost?

$ qconf -sconf | grep gid_range

Do any real groups occupy the same specified range, and do processes outside
of SGE use these GIDs too?

-- Reuti

> We run OGS/GE 2011.11p1 under Rocks 6.1.1. We had the same problem
> (sge_execd) with Rocks 5.x.
>
> Any pointers/hints/suggestions/etc. welcome.
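To spot the glitches Sylvain describes automatically, the logged usage lines
can be parsed and checked for non-monotonic cpu= and maxvmem= values. A
minimal sketch, assuming the usage line format quoted above (the helper names
and unit handling are my own, not part of qstat):

```python
import re

# Binary unit suffixes as qstat prints them (K/M/G/T); an assumption here.
_UNIT = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

def cpu_seconds(value):
    """Convert 'dd:hh:mm:ss' (or 'hh:mm:ss') into plain seconds."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 4:
        d, h, m, s = parts
    else:
        d = 0
        h, m, s = parts
    return ((d * 24 + h) * 60 + m) * 60 + s

def mem_bytes(value):
    """Convert a '1.553G'-style value into bytes; bare numbers stay as-is."""
    number, unit = re.fullmatch(r"([\d.]+)([KMGT]?)", value).groups()
    return float(number) * _UNIT.get(unit, 1)

def parse_usage(line):
    """Extract cpu seconds and maxvmem bytes from one qstat usage line."""
    fields = dict(part.split("=", 1) for part in line.split(", "))
    return {"cpu": cpu_seconds(fields["cpu"]),
            "maxvmem": mem_bytes(fields["maxvmem"])}

def glitches(samples):
    """Indices where cpu or maxvmem decreased versus the previous sample."""
    return [i for i in range(1, len(samples))
            if samples[i]["cpu"] < samples[i - 1]["cpu"]
            or samples[i]["maxvmem"] < samples[i - 1]["maxvmem"]]
```

Feeding the 5-minute log through glitches() would pinpoint exactly when the
counters jump backwards, which may help correlate the corruption with job
starts on the node.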
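Reuti's gid_range question can be checked mechanically: list any real groups
on an exechost whose GID falls inside the configured range. A small sketch
using the standard grp module; the range below is hypothetical and must be
replaced by the value `qconf -sconf` actually reports:

```python
import grp

# Hypothetical gid_range; substitute the range shown by `qconf -sconf`.
GID_RANGE = range(20000, 20101)

def overlapping_groups(gid_range=GID_RANGE):
    """Return (name, gid) pairs of real groups whose GID lies in gid_range."""
    return sorted((g.gr_name, g.gr_gid) for g in grp.getgrall()
                  if g.gr_gid in gid_range)

if __name__ == "__main__":
    for name, gid in overlapping_groups():
        print(f"group {name} (gid {gid}) collides with the SGE gid_range")
```

Any hits mean SGE's supplementary tracking GIDs collide with real groups, so
processes outside SGE could be charged to (or corrupt) a job's usage record.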
> Cheers,
> Sylvain

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users