Re: [OMPI users] How to restart a job twice

2008-04-24 Thread Josh Hursey
Tamer, I'm confident that this particular problem is now fixed in the trunk (r18276). If you are interested in the details of the bug and how it was fixed, the commit message is fairly detailed: https://svn.open-mpi.org/trac/ompi/changeset/18276 Let me know if this patch fixes things. Like

Re: [OMPI users] How to restart a job twice

2008-04-24 Thread Josh Hursey
Tamer, Another user contacted me off list yesterday with a similar problem with the current trunk. I have been able to reproduce this, and am currently trying to debug it again. It seems to occur more often with builds without the checkpoint thread (--disable-ft-thread). It seems to be a

Re: [OMPI users] How to restart a job twice

2008-04-24 Thread Tamer
Josh, Thank you for your help. I was able to do the following with r18241: start the parallel job; checkpoint and restart; checkpoint and restart; checkpoint, but failed to restart with the following message: ompi-restart ompi_global_snapshot_23800.ckpt [dhcp-119-202.caltech.edu:23650] [[45699,1],

Re: [OMPI users] How to restart a job twice

2008-04-22 Thread Josh Hursey
Tamer, This should now be fixed in r18241. Though I was able to replicate this bug, it only occurred sporadically for me. It seemed to be caused by some socket descriptor caching that was not properly cleaned up by the restart procedure. My testing appears to conclude that this bug is now

Re: [OMPI users] How to restart a job twice

2008-04-22 Thread Josh Hursey
Tamer, Just wanted to update you on my progress. I am able to reproduce something similar to this problem. I am currently working on a solution to it. I'll let you know when it is available, probably in the next day or two. Thank you for the bug report. Cheers, Josh On Apr 18, 2008, at

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Hi Josh: I am running on Linux Fedora Core 7, kernel 2.6.23.15-80.fc7. The machine is dual-core with shared memory, so it's not even a cluster. I downloaded r18208 and built it with the following options: ./configure --prefix=/usr/local/openmpi-with-checkpointing-r18208 --with-ft=cr --with-blc

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Josh Hursey
This problem has come up in the past and may have been fixed since r14519. Can you update to r18208 and see if the error still occurs? A few other questions will help me try to reproduce the problem. Can you tell me more about the configuration of the system you are running on (number

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Thanks Josh, I tried what you suggested with my existing r14519, and I was able to checkpoint the restarted job but was never able to restart it. I looked up the PID for 'orterun' and checkpointed the restarted job, but when I tried to restart from that point I got the following error: ompi-re

Re: [OMPI users] How to restart a job twice

2008-04-18 Thread Josh Hursey
When you use 'ompi-restart' to restart a job, it fork/execs a completely new job using the restarted processes for the ranks. However, instead of calling the 'mpirun' process, ompi-restart currently calls 'orterun'. These two programs are exactly the same (mpirun is a symbolic link to orteru

[OMPI users] How to restart a job twice

2008-04-18 Thread Tamer
Dear all, I installed the developer's version r14519 and was able to get it running. I successfully checkpointed a parallel job and restarted it. My question is how can I checkpoint the restarted job? The problem is once the original job is terminated and restarted later on, the mpirun does