Hi,

On 01.03.2016 at 23:44, Michael Stauffer wrote:

> SoGE 8.1.8
> 
> I need to reboot my compute nodes after the glibc patch, and wanted to do so 
> nicely, i.e. wait for each node's jobs to finish before rebooting. I've done 
> this before and it worked, but now my setup is a little more complicated and 
> I changed my reinstall script.
> 
> I have a queue for qsub jobs and one for qlogin. Each is assigned a different 
> number of cores per node so that some nodes always have at least a couple 
> cores available for qlogin sessions, and some nodes are used only for qsub 
> jobs.
> 
> However my reinstall script (taken from the sge examples, listed below) does 
> its thing by submitting a job that requests all the cores on a node, so it 
> only runs when other jobs have completed. So I created a new queue called 
> reboot.q that is allotted all cores on all nodes. My understanding was that 
> the queues would cooperatively manage resources, so if a node was using, for 
> example, 8 cores for jobs on my qsub queue, then my reboot job that's 
> requesting 16 cores would wait until those jobs finish. 

Did you limit the overall slot count across all queues, either with a consumable 
complex at the exechost level ("complex_values slots=8") and/or with an RQS? 
Otherwise each queue can fill all the slots set in its own queue definition, 
independently of the other queues (essentially overloading the nodes).
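
A minimal sketch of the exechost-level limit, in case it helps. The host names 
and the slot count of 16 are placeholders I made up, not values from your 
cluster. The function only prints the `qconf -mattr` commands so they can be 
checked before anything is changed:

```shell
#!/bin/sh
# Print (do not run) one qconf command per host. Each command sets the
# host-level "slots" consumable, which caps the total number of slots
# used on that host across all queues.
# Usage: print_slot_caps SLOTS HOST...
print_slot_caps() {
    slots=$1
    shift
    for host in "$@"; do
        # -mattr changes a single attribute of an existing exechost object.
        echo "qconf -mattr exechost complex_values slots=$slots $host"
    done
}

# Hypothetical host names and core count, for illustration only.
print_slot_caps 16 compute-0-0 compute-0-1
```

Piping the output to `sh` would apply the changes; editing each host with 
`qconf -me <host>` should give the same result. The RQS route would be a rule 
along the lines of `limit hosts {*} to slots=$num_proc`.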


> But when I ran my script, all nodes rebooted for reinstall immediately. I 
> guess I don't understand things correctly? Can someone set me straight? How 
> do I do a node reboot only after jobs have finished under these circumstances?

What about attaching the "exclusive" complex (it needs to be defined manually in 
`qconf -mc`) to each exechost and requesting it when submitting the reboot job? 
Even a single slot would then be enough to get exclusive access to each node.
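
Roughly like this, as a sketch under assumptions: the complex line in the 
comments is the usual definition of an exclusive boolean consumable, the host 
name is a placeholder, and the queue and script path are taken from your 
script. The function only prints the submit command instead of running it:

```shell
#!/bin/sh
# One-time setup, done by hand (shown as comments because these commands
# change the cluster configuration):
#   qconf -mc      # add the line:
#                  #   exclusive  excl  BOOL  EXCL  YES  YES  0  1000
#   qconf -mattr exechost complex_values exclusive=true <host>

# Print the submit command for one host. No parallel environment or core
# count is requested: the exclusive request keeps the job pending until
# the host runs no other jobs.
# Usage: reboot_submit_cmd HOST
reboot_submit_cmd() {
    host=$1
    echo "qsub -p 1024 -l exclusive=true -q reboot.q@$host /root/admin/scripts/sge-reboot.qsub"
}

# Hypothetical host name, for illustration only.
reboot_submit_cmd compute-0-0
```

While such a job is running, the scheduler should also not place any other 
jobs on that host.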

-- Reuti


> script:
> 
> ME=`hostname`
> EXECHOSTS=`qconf -sel`
> 
> for TARGETHOST in $EXECHOSTS; do
>         if [ "$ME" == "$TARGETHOST" ]; then
>                 echo "Skipping $ME. This is the submission host"
>         else
>                 numprocs=`qconf -se $TARGETHOST | \
>                         awk '/^processors/ {print $2}'`
>                 /opt/rocks/bin/rocks set host boot $TARGETHOST action=install
>                 qsub -p 1024 -pe unihost $numprocs -binding linear:${numprocs} \
>                         -q reboot.q@$TARGETHOST \
>                         /root/admin/scripts/sge-reboot.qsub
>                 echo "Set $TARGETHOST for Reinstallation"
>         fi
> done
> 
> 
> Thanks
> 
> -M
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

