Hi,
On 01.03.2016 at 23:44, Michael Stauffer wrote:
> SoGE 8.1.8
>
> I need to reboot my compute nodes after the glibc patch, and wanted to do so
> nicely, i.e. wait for each node's jobs to finish before rebooting. I've done
> this before and it worked, but now my setup is a little more complicated and
> I changed my reinstall script.
>
> I have a queue for qsub jobs and one for qlogin. Each is assigned a different
> number of cores per node so that some nodes always have at least a couple
> cores available for qlogin sessions, and some nodes are used only for qsub
> jobs.
>
> However my reinstall script (taken from the sge examples, listed below) does
> its thing by submitting a job that requests all the cores on a node, so it
> only runs when other jobs have completed. So I created a new queue called
> reboot.q that is allotted all cores on all nodes. My understanding was that
> the queues would cooperatively manage resources, so if a node was using, for
> example, 8 cores for jobs on my qsub queue, then my reboot job that's
> requesting 16 cores would wait until those jobs finish.
Did you limit the overall slot count across all queues, with a consumable complex
at the exechost level ("complex_values slots=8") and/or with an RQS? Otherwise
each queue can hand out the full slot count given in its own queue definition,
independently of the other queues, and the nodes essentially get overloaded. That
would also explain the immediate reboots: reboot.q still had all of its slots free.
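As a rough sketch of the host-level limit: the helper below only prints the qconf call, so it can be reviewed before anything is changed. The host name node01 and the value 16 are placeholders, not taken from your setup.

```shell
#!/bin/sh
# Print (do not run) the qconf call that caps the total number of slots
# usable on one exec host, across all queues.
# The host name and slot count passed in below are hypothetical examples.
slot_cap_cmd() {
    host=$1
    slots=$2
    printf 'qconf -aattr exechost complex_values slots=%s %s\n' "$slots" "$host"
}

slot_cap_cmd node01 16
```

If the host already has a slots entry in complex_values, editing it with `qconf -me node01` is probably the safer route. The RQS alternative would be a rule set added with `qconf -arqs` containing a line like `limit hosts {*} to slots=$num_proc`.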
> But when I ran my script, all nodes rebooted for reinstall immediately. I
> guess I don't understand things correctly? Can someone set me straight? How
> do I do a node reboot only after jobs have finished under these circumstances?
What about attaching the "exclusive" complex (it needs to be defined manually in
`qconf -mc`) to each exechost and requesting it when submitting the reboot job?
Even a single slot would then be enough to get exclusive access to each node.
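A sketch of how that could look. The complex definition in the comment follows the example in complex(5); the host name node01 is a placeholder, and the queue and job script are the ones from your script. The function only prints the commands instead of running them.

```shell
#!/bin/sh
# Line to add in `qconf -mc`
# (columns: name shortcut type relop requestable consumable default urgency):
#   exclusive  excl  BOOL  EXCL  YES  YES  0  1000

# Print the two commands for one host: attach the complex to the exec host,
# then submit a one-slot reboot job that requests exclusive use of it.
# The host name passed in below is a hypothetical example.
exclusive_reboot_cmds() {
    host=$1
    printf 'qconf -aattr exechost complex_values exclusive=true %s\n' "$host"
    printf 'qsub -p 1024 -l exclusive=true -q reboot.q@%s /root/admin/scripts/sge-reboot.qsub\n' "$host"
}

exclusive_reboot_cmds node01
```

With the EXCL relation the job should only be dispatched once no other job is running on the host, so the `-pe unihost $numprocs` request would no longer be needed.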
-- Reuti
> script:
>
> ME=`hostname`
> EXECHOSTS=`qconf -sel`
>
> for TARGETHOST in $EXECHOSTS; do
>     if [ "$ME" == "$TARGETHOST" ]; then
>         echo "Skipping $ME. This is the submission host"
>     else
>         numprocs=`qconf -se $TARGETHOST | \
>             awk '/^processors/ {print $2}'`
>         /opt/rocks/bin/rocks set host boot $TARGETHOST action=install
>         qsub -p 1024 -pe unihost $numprocs -binding \
>             linear:${numprocs} -q reboot.q@$TARGETHOST \
>             /root/admin/scripts/sge-reboot.qsub
>         echo "Set $TARGETHOST for Reinstallation"
>     fi
> done
>
> Thanks
>
> -M
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users