On 11.06.2020 at 22:44, Chris Dagdigian wrote:

> 
> The root cause was strange so it's worth documenting here ...
> 
> I had created a new consumable and requestable resource called "gpu" 
> configured like this:
> 
> gpu                 gpu        INT       <=    YES         YES        NONE     0
> 
> And on host A I had set "complex_values gpu=1", on host B I set 
> "complex_values gpu=2", and so on across the cluster.
> 
> My mistake was setting the default value of the new complex entry to "NONE" 
> instead of "0", which is what you probably want when the attribute is of type 
> INT.
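> 
> In the column order that `qconf -sc` prints (name, shortcut, type, relop, 
> requestable, consumable, default, urgency), the corrected entry would read:
> 
>     gpu                 gpu        INT       <=    YES         YES        0        0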
> 
> But this was bizarre; basically I had a bad default value for a requestable 
> resource, and as soon as we set that value down at the execution host level it 
> instantly broke all of our parallel environments. The SGE scheduler was 
> treating my mistake as if I had created a requestable resource of type FORCED 
> or something. 

Aha, a couple of days ago I got a request by PM where someone swore that the 
configuration "h_vmem …  YES YES 0 0" had been working fine all the time. Only 
after my suggestion to set h_vmem on the exechost level to avoid 
oversubscription did all the jobs crash, because no memory was available 
(h_vmem = 0 was applied to each job as an automatically set limit).
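
To illustrate (a sketch, with an assumed default of 2G rather than the 
troublesome 0):

    h_vmem              h_vmem     MEMORY    <=    YES         YES        2G       0

plus a per-host capacity such as

    qconf -mattr exechost complex_values h_vmem=64G node01   # node01, 64G: placeholders

so that jobs not requesting h_vmem themselves are charged, and limited to, 
2G instead of 0.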

Essentially: the default value in a complex definition is ignored as long as 
there is nothing to consume from. As soon as a host or queue defines the 
resource in complex_values, the default is applied to every job that doesn't 
request the resource explicitly, so it has to be a valid value of the defined 
type.

-- Reuti


> 
> Strange but resolved now. 
> 
> Regards
> Chris
> 
> Reuti wrote on 6/11/20 4:17 PM:
>> Hi,
>> 
>> Any consumables in place like memory or other resource requests? Any output 
>> of `qalter -w v …` or "-w p"?
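>> 
>> (For illustration, with 42 standing in for a real pending job id:
>> 
>>     qalter -w p 42   # 42: placeholder job id
>> 
>> prints the scheduler's verification for the job against the cluster as it 
>> is, while `-w v` validates it against an assumed empty cluster.)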
>> 
>> -- Reuti
>> 
>> 
>> 
>>> On 11.06.2020 at 20:32, Chris Dagdigian <d...@sonsorol.org> wrote:
>>> 
>>> Hi folks,
>>> 
>>> Got a bewildering situation I've never seen before with simple SMP/threaded 
>>> PE setups.
>>> 
>>> I made a brand new PE called threaded:
>>> 
>>> $ qconf -sp threaded
>>> pe_name            threaded
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $pe_slots
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> qsort_args         NONE
>>> 
>>> 
>>> And I attached that to all.q on an IDLE grid and submitted a job with a 
>>> '-pe threaded 1' argument.
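>>> 
>>> (i.e. something along the lines of `qsub -pe threaded 1 ./job.sh`, with 
>>> ./job.sh standing in for the real script)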
>>> 
>>> However all "qstat -j" data is showing this scheduler decision line:
>>> 
>>> cannot run in PE "threaded" because it only offers 0 slots
>>> 
>>> 
>>> I'm sort of lost on how to debug this because I can't figure out how to 
>>> probe where SGE keeps track of PE-specific slots. With other stuff I can 
>>> look at the complex_values reported by execution hosts, or I can use the 
>>> "-F" argument to qstat to dump the live state and status of a requestable 
>>> resource, but I don't have any debug or troubleshooting ideas for figuring 
>>> out why SGE thinks there are 0 slots when a static PE on an idle cluster 
>>> has been set to contain 999 slots.
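>>> 
>>> (The probes I mean being, roughly: `qstat -F <resource>` for the live 
>>> per-queue value of a requestable resource, `qconf -se <host>` for a host's 
>>> complex_values, and `qstat -g c` for a per-queue slot summary; <resource> 
>>> and <host> are placeholders.)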
>>> 
>>> Anyone seen something like this before?  I don't think I've ever seen this 
>>> particular issue with an SGE parallel environment before ...
>>> 
>>> 
>>> Chris
>>> 
> 

