To clarify: "rank 0" in my previous email means the parallel job launcher process (e.g. mpirun), which usually runs on the rank 0 machine.
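As a rough illustration of why stopping only the launcher leaves the rest of the job running, here is a minimal Python sketch. It is not Grid Engine code, and Grid Engine's actual signal delivery may differ. The `LAUNCHER` script is a hypothetical stand-in for something like mpirun, the helper names are made up for this demo, and reading `/proc/<pid>/stat` assumes Linux.

```python
import os
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for a launcher such as mpirun: it starts one worker,
# reports the worker's PID on stdout, and then waits for the worker to finish.
LAUNCHER = (
    "import subprocess, sys\n"
    "w = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(w.pid, flush=True)\n"
    "w.wait()\n"
)


def state(pid):
    """Return the Linux process state letter from /proc ('T' means stopped)."""
    with open(f"/proc/{pid}/stat") as f:
        # The state field comes right after the parenthesised command name.
        return f.read().rsplit(")", 1)[1].split()[0]


def wait_stopped(pid, timeout=5.0):
    """Poll until the process is stopped; return False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if state(pid) == "T":
            return True
        time.sleep(0.05)
    return False


def demo():
    # start_new_session=True makes the launcher a process-group leader,
    # so its PID is also the process-group ID shared with the worker.
    launcher = subprocess.Popen(
        [sys.executable, "-c", LAUNCHER],
        stdout=subprocess.PIPE,
        text=True,
        start_new_session=True,
    )
    worker_pid = int(launcher.stdout.readline())
    try:
        # Case 1: signal only the launcher; the worker keeps running.
        os.kill(launcher.pid, signal.SIGSTOP)
        launcher_only = (
            wait_stopped(launcher.pid),
            wait_stopped(worker_pid, timeout=0.5),
        )
        # Case 2: signal the whole process group; the worker stops too.
        os.killpg(launcher.pid, signal.SIGSTOP)
        whole_group = (wait_stopped(launcher.pid), wait_stopped(worker_pid))
    finally:
        # SIGKILL also terminates stopped processes, so cleanup always works.
        os.killpg(launcher.pid, signal.SIGKILL)
        launcher.wait()
        launcher.stdout.close()
    return launcher_only, whole_group


if __name__ == "__main__":
    # Each pair is (launcher stopped?, worker stopped?).
    print(demo())
```

This only covers processes on one machine. Workers on other nodes are outside the launcher's process group, so they would need the signal delivered to them separately.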
A few years ago, we added code to allow every process to get the suspend signal (only for the tight-integration case), but Sun at that time did not integrate it into the tree so we will need to start the discussion again and see if it really is a good idea to suspend parallel jobs.

Rayson

On Mon, Jun 11, 2012 at 4:21 PM, Rayson Ho <[email protected]> wrote:
> Only rank 0 of the job is suspended if I recall correctly - it was
> designed specifically because not all parallel jobs are able to handle
> suspend/restart correctly - for example you can get TCP timeouts and
> things like those.
>
> Rayson
>
> On Mon, Jun 11, 2012 at 3:53 PM, Joseph Farran <[email protected]> wrote:
>> Hi.
>>
>> With the help of this group, I've been able to make good progress on setting
>> up OGE 2011.11 with our cluster.
>>
>> I am testing the Suspend & Resume features and it works great for serial
>> jobs but not able to get Parallel jobs suspended.
>>
>> I created a simple Parallel Environment (PE) called mpi and I submitted a
>> NAMD job to it and it runs just fine. I then tried suspending it using
>> qmon 'suspend' button and it says that it suspended the job and qstat also
>> confirms that job is suspended with the 's' flag, however looking at the
>> nodes on which NAMD is running, NAMD continues to run.
>>
>> What am I missing with respect to being able to suspend PE jobs since it
>> works for serial jobs?
>>
>> Joseph
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
