On 11.06.2020 at 22:44, Chris Dagdigian wrote:
> The root cause was strange, so it's worth documenting here ...
>
> I had created a new consumable and requestable resource called "gpu"
> configured like this:
>
>     gpu    gpu    INT    <=    YES    YES    NONE    0
>
> And on host A I had set "complex_values gpu=1", on host B I set
> "complex_values gpu=2", etc. across the cluster.
>
> My mistake was setting the default value of the new complex entry to
> "NONE" instead of "0", which is what you probably want when the
> attribute is of type INT.
>
> But this was bizarre; basically I had a bad default value for a
> requestable resource, and as soon as we set that value down at the
> execution host level it instantly broke all of our parallel
> environments. The SGE scheduler was treating my mistake as if I had
> created a requestable resource of type FORCED or something.

Aha, a couple of days ago I got a request in a PM where someone swore
that the configuration "h_vmem … YES YES 0 0" had been working fine all
the time. Only after my suggestion to add h_vmem at the exechost level
to avoid oversubscription did all the jobs crash, due to no memory being
available (h_vmem = 0 was applied this way as an automatically set
limit).

Essentially: the default value in a complex definition is ignored as
long as there is nothing to consume from. If it's not ignored, then the
type has to match.

-- Reuti

> Strange but resolved now.
>
> Regards
> Chris
>
>
> Reuti wrote on 6/11/20 4:17 PM:
>> Hi,
>>
>> Any consumables in place like memory or other resource requests? Any
>> output of `qalter -w v …` or "-w p"?
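[Editor's note: for reference, a minimal sketch of the fix Chris describes, assuming standard `qconf` usage; the exact column layout follows the complex(5) man page, and hostA/hostB stand in for the real execution hosts:]

```
# Show the current complex definitions; the "gpu" row columns are:
#   name  shortcut  type  relop  requestable  consumable  default  urgency
qconf -sc

# Edit the complex table so the default for "gpu" is 0 instead of NONE
# (opens $EDITOR; change the gpu line to):
#   gpu   gpu   INT   <=   YES   YES   0   0
qconf -mc

# Then attach the consumable capacity per execution host as before:
qconf -me hostA    # set: complex_values gpu=1
qconf -me hostB    # set: complex_values gpu=2
```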
>>
>> -- Reuti
>>
>>
>>> On 11.06.2020 at 20:32, Chris Dagdigian <d...@sonsorol.org> wrote:
>>>
>>> Hi folks,
>>>
>>> Got a bewildering situation I've never seen before with simple
>>> SMP/threaded PE techniques.
>>>
>>> I made a brand new PE called threaded:
>>>
>>> $ qconf -sp threaded
>>> pe_name            threaded
>>> slots              999
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $pe_slots
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      min
>>> accounting_summary FALSE
>>> qsort_args         NONE
>>>
>>> And I attached that to all.q on an IDLE grid and submitted a job
>>> with a '-pe threaded 1' argument.
>>>
>>> However, all "qstat -j" data is showing this scheduler decision
>>> line:
>>>
>>>     cannot run in PE "threaded" because it only offers 0 slots
>>>
>>> I'm sort of lost on how to debug this because I can't figure out how
>>> to probe where SGE is keeping track of PE-specific slots. With other
>>> stuff I can look at complex_values reported by execution hosts, or I
>>> can use an "-F" argument to qstat to dump the live state and status
>>> of a requestable resource, but I don't really have any debug or
>>> troubleshooting ideas for "how to figure out why SGE thinks there
>>> are 0 slots when the static PE on an idle cluster has been set to
>>> contain 999 slots".
>>>
>>> Anyone seen something like this before? I don't think I've ever seen
>>> this particular issue with an SGE parallel environment before ...
>>>
>>> Chris
>>>
>>> _______________________________________________
>>> users mailing list
>>> users@gridengine.org
>>> https://gridengine.org/mailman/listinfo/users
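[Editor's note: in the same spirit, a few standard SGE commands that can help narrow down a "PE offers 0 slots" message; <jobid> is a placeholder for the pending job's ID:]

```
# Per-queue cluster summary: total, used, and available slots
qstat -g c

# Dump the full resource state of every queue instance
qstat -F -f

# Ask the scheduler why a pending job cannot be dispatched
qalter -w v <jobid>   # verify against an otherwise empty cluster
qalter -w p <jobid>   # verify against the current cluster state

# Confirm the PE is actually referenced by the queue
qconf -sq all.q | grep pe_list
```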