On Tue, Mar 1, 2016 at 6:19 PM, Reuti <[email protected]> wrote:

> Hi,
>
> On 01.03.2016 at 23:44, Michael Stauffer wrote:
>
> > SoGE 8.1.8
> >
> > I need to reboot my compute nodes after the glibc patch, and wanted to
> > do so nicely, i.e. wait for each node's jobs to finish before rebooting.
> > I've done this before and it worked, but now my setup is a little more
> > complicated and I changed my reinstall script.
> >
> > I have a queue for qsub jobs and one for qlogin. Each is assigned a
> > different number of cores per node so that some nodes always have at
> > least a couple cores available for qlogin sessions, and some nodes are
> > used only for qsub jobs.
> >
> > However my reinstall script (taken from the sge examples, listed below)
> > does its thing by submitting a job that requests all the cores on a
> > node, so it only runs when other jobs have completed. So I created a new
> > queue called reboot.q that is allotted all cores on all nodes. My
> > understanding was that the queues would cooperatively manage resources,
> > so if a node was using, for example, 8 cores for jobs on my qsub queue,
> > then my reboot job that's requesting 16 cores would wait until those
> > jobs finish.
>
> Did you limit the overall slot count across all queues by a consumable
> complex on an exechost level ("complex_values slots=8") and/or with an RQS?
> Otherwise each queue can use all defined slots counts in each particular
> queue definition (and overload the nodes essentially).
>

No, at least not knowingly. I should also do this for regular usage to
avoid overloading the nodes. How do I actually set it up? I can't tell from
your description what the concrete steps are.
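
For readers following along, the RQS route mentioned above might look
roughly like the sketch below. It is only an illustration under
assumptions: the rule set name "host_slot_cap" is hypothetical, and the
limit line follows the dynamic-limit form shown in the
sge_resource_quota(5) man page. By default the script only prints the
qconf command instead of running it.

```shell
#!/bin/sh
# Sketch: a resource quota set that caps the slots used per host,
# summed over all queues. The name "host_slot_cap" is hypothetical.
# DRY_RUN=1 (default) only prints the qconf command; DRY_RUN=0 runs it.
DRY_RUN=${DRY_RUN:-1}

# Print the RQS definition. $num_proc is a Grid Engine load value,
# not a shell variable, hence the quoted here-document delimiter.
print_rqs() {
    cat <<'EOF'
{
   name         host_slot_cap
   description  "Cap slots per host across all queues"
   enabled      TRUE
   limit        hosts {*} to slots=$num_proc
}
EOF
}

# Write the definition to a file and register it with qconf -Arqs.
add_rqs() {
    rqs_file=$1
    print_rqs > "$rqs_file"
    if [ "$DRY_RUN" = 1 ]; then
        echo "qconf -Arqs $rqs_file"
    else
        qconf -Arqs "$rqs_file"
    fi
}

add_rqs "${TMPDIR:-/tmp}/host_slot_cap.rqs"
```

The braces around the host wildcard are meant to apply the limit to each
host separately rather than to all hosts together; `qconf -srqs` should
show the result once the rule set is registered.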

My queues look like this for 'slots' (e.g. for the qsub queue):

slots                 1,[compute-0-0.local=0],[compute-0-1.local=15], \
                      [compute-0-2.local=15],[compute-0-3.local=15], \
                      [compute-0-4.local=16],[compute-0-5.local=16], \
                      [compute-0-6.local=16],[compute-0-7.local=16], \
                      [compute-0-9.local=16],[compute-0-10.local=16], \
                      [compute-0-11.local=16],[compute-0-12.local=16], \
                      [compute-0-13.local=16],[compute-0-14.local=16], \
                      [compute-0-15.local=16],[compute-0-16.local=16], \
                      [compute-0-17.local=16],[compute-0-18.local=16], \
                      [compute-0-8.local=16],[compute-0-19.local=16], \
                      [compute-0-20.local=16]

complex_values        NONE

Do I do something similar for the complex_values parameter?
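
As a sketch of the exechost-level limit described in the quoted reply
(not a confirmed recipe for this cluster): complex_values would be set
on each execution host rather than in the queue, for example with
`qconf -aattr`. The slot counts below are assumed core counts, and the
script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: cap the total slots per execution host with a host-level
# consumable, so that all queues together cannot oversubscribe a node.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_host_slots HOST N: add slots=N to the host's complex_values.
set_host_slots() {
    run qconf -aattr exechost complex_values "slots=$2" "$1"
}

# Host names are taken from the queue definition above; the value 16
# is an assumption about the physical core count of each node.
for spec in compute-0-1.local:16 compute-0-4.local:16; do
    set_host_slots "${spec%%:*}" "${spec##*:}"
done
```

`qconf -se <host>` should then list slots=16 under complex_values for
that host.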

> > But when I ran my script, all nodes rebooted for reinstall immediately.
> > I guess I don't understand things correctly? Can someone set me
> > straight? How do I do a node reboot only after jobs have finished under
> > these circumstances?
>
> What about attaching the "exclusive" complex (needs to be defined manually
> in `qconf -mc`) to each exechost and request this when submitting the
> reboot job? Even one slot would be enough then to get exclusive access to
> each node.
>

This sounds great. Can you give me details on how to do this?

What values are needed for the complex configuration parameters? Something
like this?

name       shortcut  type  relop  requestable  consumable  default  urgency
exclusive  ex        BOOL  ==     YES          NO          0        0


How is it attached to each exechost?
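
A rough sketch of how the attachment and the request might look, under
the assumption that a complex named "exclusive" has already been defined
via `qconf -mc`. As far as I can tell from the complex(5) man page,
exclusive host access is normally built on the EXCL relation operator
with a consumable attribute rather than `==`, so the exact definition is
worth checking there. The queue name and job script path are the ones
from the script quoted below; nothing is executed unless DRY_RUN=0.

```shell
#!/bin/sh
# Sketch: attach an "exclusive" complex to execution hosts and request
# it for the reboot job. Assumes the complex already exists in qconf -mc.
# DRY_RUN=1 (default) only prints the commands; DRY_RUN=0 runs them.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# attach_exclusive HOST: add exclusive=true to the host's complex_values.
attach_exclusive() {
    run qconf -aattr exechost complex_values exclusive=true "$1"
}

# submit_reboot HOST: a one-slot job that requests the whole host, so it
# should only start once no other job is running there.
submit_reboot() {
    run qsub -p 1024 -l exclusive=true -q "reboot.q@$1" \
        /root/admin/scripts/sge-reboot.qsub
}

# Example hosts taken from the queue definition earlier in this message.
for host in compute-0-1.local compute-0-2.local; do
    attach_exclusive "$host"
    submit_reboot "$host"
done
```

With this approach the reboot job no longer needs the unihost parallel
environment or a full-node slot request, since the exclusive request
alone is meant to keep it pending until the node is empty.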

Thanks very much.

-M



> -- Reuti
>
> > script:
> >
> > ME=`hostname`
> >
> > EXECHOSTS=`qconf -sel`
> >
> > for TARGETHOST in $EXECHOSTS; do
> >
> >         if [ "$ME" == "$TARGETHOST" ]; then
> >
> >                 echo "Skipping $ME. This is the submission host"
> >
> >         else
> >
> >                 numprocs=`qconf -se $TARGETHOST | \
> >                         awk '/^processors/ {print $2}'`
> >
> >                 /opt/rocks/bin/rocks set host boot $TARGETHOST action=install
> >
> >                 qsub -p 1024 -pe unihost $numprocs -binding linear:${numprocs} -q reboot.q@$TARGETHOST \
> >                         /root/admin/scripts/sge-reboot.qsub
> >
> >                 echo "Set $TARGETHOST for Reinstallation"
> >
> >         fi
> >
> > done
> >
> > Thanks
> >
> > -M
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users