Hi,

On 13.06.2011 at 15:12, Javier Lopez Cacheiro wrote:

> We have found a strange situation where GE 6.2u5 has allocated more resources 
> on a node than are available, leaving a consumable with a value lower than 0 
> (in this case the consumable is num_proc).
> 
> This is somewhat similar to an issue that was found some time ago in SGE 6.2 
> (issue 2091), but in that case it was related to MPI jobs with the fillup 
> allocation rule, and it was already solved in 6.2u3.
> 
> This case is different because it affects a non-MPI job rather than MPI jobs, 
> and it occurs only in certain circumstances that are still not clear.
> 
> In this case, at 06:13:57 the node already had 7 jobs running, consuming 24 
> units of num_proc. num_proc is configured as a consumable with a value of 24, 
> so at that time the remaining value of num_proc was 0. But 4 seconds later, at 
> 06:14:01, a new job requesting 24 num_proc was started on the node, leaving 
> the node with a value of -24 for num_proc.
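
For reference, the arithmetic described above can be written out as a small sketch. This is only a hypothetical model of consumable bookkeeping, not Grid Engine code. The function names are made up for illustration and the numbers are the ones reported in the message.

```python
# Hypothetical model of consumable accounting; not actual Grid Engine logic.

def remaining(capacity: int, consumed: int) -> int:
    """Units of the consumable still free on the node."""
    return capacity - consumed


def can_start(capacity: int, consumed: int, request: int) -> bool:
    """A job should only start if its request fits into what is left."""
    return request <= remaining(capacity, consumed)


capacity = 24   # num_proc configured as a consumable on the node
consumed = 24   # total used by the 7 jobs running at 06:13:57
request = 24    # num_proc requested by the job started at 06:14:01

print(remaining(capacity, consumed))            # 0
print(can_start(capacity, consumed, request))   # False
print(remaining(capacity, consumed + request))  # -24, the reported value
```

Under this model the job from 06:14:01 should have stayed pending, which is why a value of -24 points to an accounting or scheduling problem.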

num_proc is a (fixed) feature of a node and shouldn't be made consumable. Is 
there any reason why you don't use slots?

Nevertheless: do you request anything else with the -l option?

-- Reuti


> I don't know if anyone else has come across this same problem with 6.2u5 and 
> if there is a workaround for it.
> 
> [jlopez@svgd ~]$ qhost -q -j -h c5-11
> HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS
> -------------------------------------------------------------------------------
> global - - - - - - -
> compute-5-11 x86_64 -24 47.92 31.5G 9.0G 8.0G 0.0
> GRID_large BP 0/4/24
> 6667492 1.92242 STDIN compchem015 r 06/10/2011 06:13:30 MASTER
> 6667493 1.92241 STDIN compchem015 r 06/10/2011 06:13:41 MASTER
> 6667494 1.92241 STDIN compchem015 r 06/10/2011 06:13:47 MASTER
> 6667495 1.92241 STDIN compchem015 r 06/10/2011 06:13:57 MASTER
> GRID_small BP 0/0/24
> small BPC 0/10/24
> 6652641 11.27961 p1761-7 csebdmfa r 06/10/2011 06:14:01 MASTER
> 6655259 10.43999 p577-16 csebdmfa r 06/10/2011 06:12:26 MASTER
> 6667942 3.93900 AuLJ139 csmyslfs r 06/10/2011 06:12:46 MASTER
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> SLAVE
> g0-mem_small BPC 0/0/24
> offline BP 0/0/24
> 
> 
> Thanks in advance,
> Javier
> 
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

