`qstat -f` doesn't show any queue instances being disabled or in an alarm state? -- Reuti
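
A check along those lines, assuming gpu.q is the queue in question, might look like this (the -explain option may not be available in every qstat build):

$ qstat -f -q gpu.q              # "states" column: d=disabled, a=load alarm, E=error, u=unreachable
$ qstat -explain a -q gpu.q      # print the reason for any load-threshold alarm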
> Am 12.04.2018 um 21:31 schrieb Joshua Baker-LePain <[email protected]>:
>
> On Thu, 12 Apr 2018 at 10:15am, Joshua Baker-LePain wrote:
>
>> We're running SoGE 8.1.9 on a smallish (but growing) cluster. We've
>> recently added GPU nodes to the cluster. On each GPU node, a consumable
>> complex named 'gpu' is defined with the number of GPUs in the node. The
>> complex definition looks like this:
>>
>> #name    shortcut  type  relop  requestable  consumable  default  urgency
>> #-------------------------------------------------------------------------
>> gpu      gpu       INT   <=     YES          JOB         0        0
>>
>> We're frequently seeing GPU jobs stuck in 'qw' even when slots and
>> resources on GPU nodes are available. What appears to be happening is
>> that SGE is choosing a node that's full and then waiting for that node
>> to become available rather than switching to another node. For example:
>>
>> $ qstat -u "*" -q gpu.q
>> 370002 0.05778 C3D1000b2_ user1 r  04/11/2018 00:18:17 gpu.q@msg-iogpu10  5
>> 369728 0.05778 C3D4000b2_ user1 r  04/10/2018 18:00:24 gpu.q@msg-iogpu11  5
>> 371490 0.06613 class3d    user2 r  04/11/2018 20:50:02 gpu.q@msg-iogpu12  3
>> 367554 0.05778 C3D3000b2_ user1 r  04/08/2018 16:07:24 gpu.q@msg-iogpu3   3
>> 367553 0.05778 C3D2000b2_ user1 r  04/08/2018 17:56:54 gpu.q@msg-iogpu4   3
>> 367909 0.05778 C3D11k_b2Y user1 r  04/09/2018 00:04:24 gpu.q@msg-iogpu8   3
>> 371511 0.06613 class3d    user2 r  04/11/2018 21:45:02 gpu.q@msg-iogpu9   3
>> 371593 0.95000 refine_joi user3 qw 04/11/2018 23:05:57                    5
>>
>> Job 371593 has requested '-l gpu=2'. Nodes msg-iogpu2, 5, 6, and 7 have no
>> jobs in gpu.q on them and available gpu resources, e.g.:
>>
>> $ qhost -F -h msg-iogpu2
>> .
>> .
>>    hc:gpu=2.000000
>>
>> However, SGE seems to insist on running this job on msg-iogpu9, as seen by
>> these lines in the messages file for each scheduling run:
>>
>> 04/12/2018 09:59:47|worker|wynq1|E|debiting 2.000000 of gpu on host msg-iogpu9 for 1 slots would exceed remaining capacity of 0.000000
>> 04/12/2018 09:59:47|worker|wynq1|E|resources no longer available for start of job 371593.1
>>
>> From past experience, job 371593 will indeed wait until msg-iogpu9 becomes
>> available and run there. We do advise our users to set "-R y" for these
>> jobs -- is this a reservation issue? Where else should I look for clues?
>> Any ideas? I'm a bit flummoxed on this one...
>
> One last bit of info. Running 'qalter -w p' on the stuck job proves that it
> *should* be able to run:
>
> $ qalter -w p 371593
> verification: found possible assignment with 5 slots
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
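
One way to get the scheduler's own explanation for why job 371593 stays in 'qw' is to enable schedd_job_info (often left off because of its memory overhead) and then inspect the job. A sketch, assuming admin access on the qmaster; the host name is taken from the example above:

$ qconf -msconf          # set "schedd_job_info true" in the scheduler configuration
$ qstat -j 371593        # the "scheduling info:" section then lists, host by host, why the job was skipped
$ qconf -se msg-iogpu2   # confirm complex_values on the idle hosts actually defines gpu=<count>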

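On the "-R y" question: resource reservation only takes effect if max_reservation in the scheduler configuration is greater than zero, and the scheduler's reservation decisions can be traced by enabling monitoring. A sketch, assuming the default cell name "default":

$ qconf -ssconf | grep -E 'max_reservation|params'
$ qconf -msconf                                # e.g. max_reservation 32 and params MONITOR=1
$ tail -f $SGE_ROOT/default/common/schedule    # with MONITOR=1, each scheduling run logs its assignments and reservations here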