|
Ron,

I am happy to report that NAMD survived just fine.

I started a 16-core (slots) NAMD job on two 8-core nodes, did a "skill -STOP namd2" on both nodes, waited a while, and then did a "skill -CONT namd2"; NAMD resumed without problems. I also stopped NAMD on just one node and it resumed just fine. I did this multiple times with no issues.

I should also mention that with Torque/Maui, which is what I am familiar with, we have never had any issues suspending and resuming NAMD jobs. I am not sure what implementation they use, or whether it is also a simple kill signal.

Joseph

On 6/11/2012 6:00 PM, Ron Chen wrote:
Hi Joseph,

Only a few people have asked for this feature in the past, and since Sun (I think it was Andy) told us that suspending PE jobs can cause issues, the code was never changed in the original Grid Engine or in OGS/GE.

To help us (and also you) understand the behaviour of suspending PE jobs, we need to do some manual testing. Can you run a small NAMD job that spans 2 or more nodes, and then on each node:

- run ps to look for the PIDs of the NAMD processes of that job
- prepare to send a STOP signal to each one
- when you have finished typing all those kill -STOP commands, press ENTER on all the nodes with as little delay as you can

Then wait for a while, maybe 15+ minutes or longer, and send the CONT signal to resume the tasks. See if NAMD continues to run, and let us know the result. In effect, you are suspending the PE job by hand.

As mentioned by others, TCP timeouts can be an issue, and in fact some checkpoint/restart libraries do not support TCP socket connections.

-Ron

----- Original Message -----
From: Joseph Farran <[email protected]>
To: Rayson Ho <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Monday, June 11, 2012 5:17 PM
Subject: Re: [gridengine users] PE Job Suspend / Resume

Thanks for the clarification.

This is a NAMD run, so I am launching it via "charmrun" and not mpirun. If the OGE code suspends via rank 0, I would think that charmrun and/or any other parallel job would suspend as well, no?

I will try an mpirun job next to see whether it behaves differently and suspends correctly.

Joseph

On 06/11/2012 01:32 PM, Rayson Ho wrote:

To clarify: rank 0 in the previous email = the parallel job launcher (e.g. mpirun) process, usually running on the rank 0 machine.
A few years ago, we added code to allow every process to get the suspend signal (only for the tight-integration case), but Sun at that time did not integrate it into the tree, so we will need to start the discussion again and see if it really is a good idea to suspend parallel jobs.

Rayson

On Mon, Jun 11, 2012 at 4:21 PM, Rayson Ho <[email protected]> wrote:

Only rank 0 of the job is suspended, if I recall correctly. It was designed that way specifically because not all parallel jobs are able to handle suspend/restart correctly; for example, you can get TCP timeouts and similar problems.

Rayson

On Mon, Jun 11, 2012 at 3:53 PM, Joseph Farran <[email protected]> wrote:

Hi.

With the help of this group, I've been able to make good progress on setting up OGE 2011.11 with our cluster.

I am testing the Suspend & Resume features. They work great for serial jobs, but I am not able to get parallel jobs suspended.

I created a simple Parallel Environment (PE) called mpi, submitted a NAMD job to it, and it runs just fine. I then tried suspending it using the qmon 'suspend' button. qmon says that it suspended the job, and qstat also confirms that the job is suspended with the 's' flag; however, looking at the nodes on which NAMD is running, NAMD continues to run.

What am I missing with respect to being able to suspend PE jobs, since it works for serial jobs?

Joseph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users |
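
For anyone repeating the manual test Ron describes (find the PIDs on each node, send STOP to each, wait, then send CONT), the per-node part could be scripted roughly as below. This is only a sketch under assumptions: the function names find_pids, signal_pids and proc_state are made up for illustration, the process name namd2 is taken from Joseph's skill commands, and the state check assumes a Linux node with /proc (falling back to ps otherwise). It is not what Grid Engine does internally.

```shell
#!/bin/sh
# Sketch of the manual suspend/resume test, to be run on each node of the job.
# All function names here are hypothetical helpers, not part of Grid Engine.

# List the PIDs of the current user's processes whose name is exactly $1.
find_pids() {
    pgrep -u "$(id -u)" -x "$1"
}

# Send signal $1 (for example STOP or CONT) to every PID given after it.
signal_pids() {
    sig="$1"
    shift
    for pid in "$@"; do
        kill -s "$sig" "$pid" || return 1
    done
}

# Print the one-letter state of PID $1 (T means stopped by a signal).
proc_state() {
    if [ -r "/proc/$1/status" ]; then
        awk '/^State:/ {print $2}' "/proc/$1/status"
    else
        ps -o stat= -p "$1" | tr -d ' ' | cut -c1
    fi
}

# Example use on one node (commented out so that sourcing this file is harmless):
#   pids=$(find_pids namd2)
#   signal_pids STOP $pids      # suspend; every PID should now report state T
#   sleep 900                   # wait 15 minutes, as suggested in the thread
#   signal_pids CONT $pids      # resume
```

Checking proc_state for each PID after the STOP is a cheap way to confirm that the processes really are stopped, rather than relying on what qstat reports for the job.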
