On 11.06.2012 at 22:21, Rayson Ho wrote:

> Only rank 0 of the job is suspended if I recall correctly - it was
> designed specifically because not all parallel jobs are able to handle
> suspend/restart correctly - for example you can get TCP timeouts and
> things like those.

This also just came up on the MPICH2 list. I thought you had put it into OGE,
since there was this discussion some time ago:

https://arc.liv.ac.uk/trac/SGE/ticket/577

-- Reuti
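
For illustration, here is a minimal Python sketch of the effect described above: a stop signal delivered to a single process leaves its sibling processes running. It is not how Grid Engine implements suspension. The function name `stopped_after_signalling_first` is hypothetical, and the idle child processes are only stand-ins for the ranks of a parallel job. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


def stopped_after_signalling_first(count=3):
    """Start `count` idle children, SIGSTOP only the first one, and
    report which children are actually stopped."""
    # Stand-ins for the ranks of a parallel job: children that just sleep.
    procs = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(count)
    ]
    try:
        # Signal only the first process, analogous to stopping rank 0 alone.
        os.kill(procs[0].pid, signal.SIGSTOP)

        # Block until the kernel reports the first child as stopped.
        _, status = os.waitpid(procs[0].pid, os.WUNTRACED)
        states = [os.WIFSTOPPED(status)]

        # The other children received no signal, so a non-blocking check
        # reports no state change for them (pid == 0).
        for proc in procs[1:]:
            pid, status = os.waitpid(proc.pid, os.WUNTRACED | os.WNOHANG)
            states.append(pid != 0 and os.WIFSTOPPED(status))
        return states
    finally:
        # SIGKILL also terminates a stopped process, so cleanup is uniform.
        for proc in procs:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    print(stopped_after_signalling_first())
```

Only the first entry of the returned list is true. The remaining processes keep running unless each of them (or their whole process group) is signalled as well, which matches the observation that NAMD continues to run on the nodes.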


> Rayson
> 
> 
> 
> On Mon, Jun 11, 2012 at 3:53 PM, Joseph Farran <[email protected]> wrote:
>> Hi.
>> 
>> With the help of this group, I've been able to make good progress on setting
>> up OGE 2011.11 with our cluster.
>> 
>> I am testing the Suspend & Resume features. They work great for serial
>> jobs, but I am not able to get parallel jobs suspended.
>> 
>> I created a simple Parallel Environment (PE) called mpi and submitted a
>> NAMD job to it, and it runs just fine. I then tried suspending it using
>> the qmon 'suspend' button. It says that it suspended the job, and qstat
>> also confirms that the job is suspended with the 's' flag. However,
>> looking at the nodes on which NAMD is running, NAMD continues to run.
>> 
>> What am I missing with respect to being able to suspend PE jobs since it
>> works for serial jobs?
>> 
>> Joseph
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users

