Re: [OMPI users] Checkpoint/Restart error

2010-02-01 Thread Josh Hursey
Thanks for the bug report. There are a couple of places in the code that, in a sense, hard-code '/tmp' as the temporary directory. It shouldn't be too hard to fix, since there is a common function used in the code to discover the 'true' temporary directory (which defaults to /tmp). Of course
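A minimal sketch of the lookup pattern described above: consult the environment for the temporary directory and fall back to /tmp only when nothing is set. This is not Open MPI's actual function; the name `resolve_tmpdir`, its parameters, and the choice to consult only `TMPDIR` are assumptions made for illustration.

```python
import os


def resolve_tmpdir(environ=None, default="/tmp"):
    """Return the temporary directory to use.

    Hypothetical helper, not part of Open MPI: it honors TMPDIR when it
    is set to a non-empty value and otherwise returns the default.
    """
    # Allow a mapping to be passed in so the lookup can be tested
    # without touching the real process environment.
    env = os.environ if environ is None else environ
    value = env.get("TMPDIR")
    if value:
        return value
    return default


if __name__ == "__main__":
    # Every tool that needs the session directory would call the same
    # helper instead of hard-coding '/tmp'.
    print(resolve_tmpdir())
```

If the launcher and the checkpoint tool both resolved the directory through one helper like this, they would agree on the location whenever they run with the same environment.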

Re: [OMPI users] Checkpoint/Restart error

2010-01-29 Thread Jazcek Braden
Josh, I was following this thread as I had similar symptoms and discovered a peculiar error. When I launch the program, Open MPI follows the $TMPDIR environment variable and puts the session information in the $TMPDIR directory. However, ompi-checkpoint seems to require the sessions file to b

Re: [OMPI users] Checkpoint/Restart error

2010-01-25 Thread Josh Hursey
I tested the 1.4.1 release, and everything worked fine for me (tested a few different configurations of nodes/environments). The ompi-checkpoint error you cited is usually caused by one of two things: - The PID specified is wrong (which I don't think is the case here) - The session

Re: [OMPI users] Checkpoint/Restart error

2010-01-25 Thread Andreea Costea
So? Anyone? Any clue? To summarize: - installed OpenMPI 1.4.1 on a fresh CentOS 5 - mpirun works but ompi-checkpoint throws this error: ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405 - on another VM I have OpenMPI 1.3.3 installed. Checkpointing works fine as guest but has the previous

Re: [OMPI users] Checkpoint/Restart error

2010-01-19 Thread Andreea Costea
I noticed one more thing. As I still have some VMs that have OpenMPI version 1.3.3 installed, I started to use those machines until I fix the problem with 1.4.1. While checkpointing on one of these VMs, I realized that checkpointing as a guest works fine and checkpointing as root outputs the same

Re: [OMPI users] Checkpoint/Restart error

2010-01-18 Thread Andreea Costea
Well... I decided to install a fresh OS to be sure that there is no OpenMPI version conflict. So I formatted one of my VMs, did a fresh CentOS install, installed BLCR 0.8.2 and OpenMPI 1.4.1 and the result: the same. mpirun works but ompi-checkpoint has that error at line 405: [[35906,0],0] ORTE_E

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
It's almost midnight here, so I left home, but I will try it tomorrow. There were some directories left after "make uninstall". I will give more details tomorrow. Thanks Jeff, Andreea On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres wrote: > On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote: > >

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Jeff Squyres
On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote: > - I wanted to update to version 1.4.1 and I uninstalled previous version like > this: make uninstall, and then manually deleted all the leftover files. The > directory where I installed was /usr/local I'll let Josh answer your CR questions,

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
I don't know what else I should try... because it worked on 1.3.3 doing exactly the same steps. I tried to install it both with an active eth interface and an inactive one. I am running on a virtual machine that has CentOS as OS. Any suggestions? Thanks, Andreea On Fri, Jan 15, 2010 at 9:07 PM,

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
I tried the new version that was uploaded today. I still have that error, except that it is now at line 405 instead of 399. Maybe if I give more details: - I first had OpenMPI version 1.3.3 with BLCR installed: mpirun, ompi-checkpoint and ompi-restart worked with that version. - I wanted to update to

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
Hi... still not working. Though I uninstalled OpenMPI with make uninstall and I manually deleted all other files, I still have the same error when checkpointing. Any idea? Thanks, Andreea On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote: > On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote

Re: [OMPI users] Checkpoint/Restart error

2010-01-14 Thread Joshua Hursey
On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote: > Hi, > > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have > downloaded today. When I want to checkpoint I am having the following error > message: > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line

[OMPI users] Checkpoint/Restart error

2010-01-14 Thread Andreea Costea
Hi, I wanted to try the C/R feature in OpenMPI version 1.4.1 that I downloaded today. When I try to checkpoint, I get the following error message: [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399 HNP with PID 2337 Not found! I tried the same thing with vers