To clarify: "rank 0" in my previous email means the parallel job
launcher process (e.g. mpirun), which usually runs on the rank 0 machine.

A few years ago, we added code that lets every process receive the
suspend signal (tight-integration case only), but Sun did not integrate
it into the tree at the time. We will need to restart that discussion
and decide whether suspending parallel jobs really is a good idea.
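
To illustrate the difference between signalling only the launcher and
signalling everything it started, here is a minimal Python sketch. It is
not Grid Engine code: the helper names (start_in_own_group,
suspend_group, resume_group, demo) are made up for illustration, a
sleeping Python process stands in for the launcher, and it only covers
processes on one host. Ranks on other nodes would still need the signal
delivered there.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(argv):
    """Start a child as leader of a new session, so its pgid equals its pid."""
    return subprocess.Popen(argv, start_new_session=True)


def suspend_group(pgid):
    """Stop every process in the group, not just the group leader."""
    os.killpg(pgid, signal.SIGSTOP)


def resume_group(pgid):
    """Continue every process in the group."""
    os.killpg(pgid, signal.SIGCONT)


def demo():
    # Hypothetical stand-in for a launcher such as mpirun and its local ranks.
    child = start_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        suspend_group(child.pid)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)

        resume_group(child.pid)
        # WCONTINUED makes waitpid report a child resumed by SIGCONT.
        _, status = os.waitpid(child.pid, os.WCONTINUED)
        continued = os.WIFCONTINUED(status)
    finally:
        child.kill()
        child.wait()
    return stopped, continued
```

Calling demo() stops and resumes the whole process group on the local
host. Sending the signal to child.pid alone with os.kill would reach
only the launcher, which matches the behaviour described in this thread.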

Rayson



On Mon, Jun 11, 2012 at 4:21 PM, Rayson Ho <[email protected]> wrote:
> Only rank 0 of the job is suspended, if I recall correctly. It was
> designed that way because not all parallel jobs can handle
> suspend/restart correctly; for example, you can get TCP timeouts and
> similar failures.
>
> Rayson
>
>
>
> On Mon, Jun 11, 2012 at 3:53 PM, Joseph Farran <[email protected]> wrote:
>> Hi.
>>
>> With the help of this group, I've been able to make good progress on setting
>> up OGE 2011.11 with our cluster.
>>
>> I am testing the Suspend & Resume features. They work great for serial
>> jobs, but I am not able to get parallel jobs suspended.
>>
>> I created a simple Parallel Environment (PE) called mpi and submitted a
>> NAMD job to it, which runs just fine. I then tried suspending it using
>> the qmon 'suspend' button. qmon reports that the job was suspended, and
>> qstat confirms it with the 's' flag. However, on the nodes where NAMD is
>> running, NAMD continues to run.
>>
>> What am I missing with respect to being able to suspend PE jobs since it
>> works for serial jobs?
>>
>> Joseph
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
