Re: [OMPI users] Checkpoint/Restart error
Thanks for the bug report. There are a couple of places in the code that, in a sense, hard-code '/tmp' as the temporary directory. It shouldn't be too hard to fix, since there is a common function used in the code to discover the 'true' temporary directory (which defaults to /tmp). Of course, there might be other complexities once I dig into the problem.

I don't know when I will get to this, but I filed a ticket about it if you want to track it:
https://svn.open-mpi.org/trac/ompi/ticket/2208

Thanks again,
Josh

On Jan 29, 2010, at 4:41 PM, Jazcek Braden wrote:

> Josh,
>
> I was following this thread as I had similar symptoms and discovered a
> peculiar error. When I launch the program, openmpi follows the $TMPDIR
> environment variable and puts the session information in the $TMPDIR
> directory. However, ompi-checkpoint seems to require the sessions file
> to be in /tmp, ignoring $TMPDIR.
>
> Is this by design, and what would I have to do to get it to obey the
> $TMPDIR environment variable?
>
> --
> Jazcek Braden

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
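To see whether a given shell is likely to hit this mismatch, a small check along the following lines may help. It is only a sketch: `effective_tmpdir` and `check_tmpdir_for_checkpoint` are hypothetical helper names, not part of Open MPI, and the logic assumes (as reported in this thread) that mpirun follows $TMPDIR while ompi-checkpoint looks under /tmp.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Open MPI).

# Print the temporary directory a tool honouring $TMPDIR would use,
# falling back to /tmp when TMPDIR is unset or empty.
effective_tmpdir() {
    printf '%s\n' "${TMPDIR:-/tmp}"
}

# Report whether that directory is /tmp. Per the thread, ompi-checkpoint
# searches /tmp, so any other value suggests it may miss the session data.
check_tmpdir_for_checkpoint() {
    dir=$(effective_tmpdir)
    case "$dir" in
        /tmp|/tmp/) echo "ok: session directory expected under /tmp" ;;
        *) echo "warning: temporary directory is $dir, not /tmp" ;;
    esac
}
```

Running `check_tmpdir_for_checkpoint` in the shell that launches mpirun shows whether the session directory is expected to land somewhere other than /tmp.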
Re: [OMPI users] Checkpoint/Restart error
Josh,

I was following this thread as I had similar symptoms and discovered a peculiar error. When I launch the program, openmpi follows the $TMPDIR environment variable and puts the session information in the $TMPDIR directory. However, ompi-checkpoint seems to require the sessions file to be in /tmp, ignoring $TMPDIR.

Is this by design, and what would I have to do to get it to obey the $TMPDIR environment variable?

--
Jazcek Braden
Re: [OMPI users] Checkpoint/Restart error
I tested the 1.4.1 release, and everything worked fine for me (tested a few different configurations of nodes/environments).

The ompi-checkpoint error you cited is usually caused by one of two things:
- The PID specified is wrong (which I don't think is the case here).
- The session directory cannot be found in /tmp.

So I think the problem is the latter. The session directory looks something like:
  /tmp/openmpi-sessions-USERNAME@LOCALHOST_0

Within this directory the mpirun process places its contact information. ompi-checkpoint uses this contact information to connect to the job. If it cannot find it, then it errors out. (We definitely need a better error message here. I filed a ticket [1].)

We usually do not recommend running Open MPI as a root user, so I would strongly recommend that you do not. With a regular user, check the location of the session directory. Make sure that it is in /tmp on the node where 'mpirun' and 'ompi-checkpoint' are run.

-- Josh

[1] https://svn.open-mpi.org/trac/ompi/ticket/2189

On Jan 25, 2010, at 5:48 AM, Andreea Costea wrote:

> So? Anyone? Any clue?
>
> To summarize:
> - installed OpenMPI 1.4.1 on a fresh CentOS 5
> - mpirun works, but ompi-checkpoint throws this error:
>   ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
> - on another VM I have OpenMPI 1.3.3 installed. Checkpointing works fine
>   as guest but gives the previously mentioned error as root. Both root and
>   guest show the same output from "ompi_info --param all all" except for
>   $HOME (which only matters for mca_component_path, mca_param_files,
>   snapc_base_global_snapshot_dir)
>
> Thanks,
> Andreea
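The session-directory check described above could be scripted roughly as follows. This is a sketch under assumptions: `list_session_dirs` is a hypothetical helper, and the name pattern is taken only from the example path in the message (openmpi-sessions-USERNAME@LOCALHOST_0), so other Open MPI versions may lay things out differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): list the session directories
# belonging to a user under a base directory.
#   $1 = base directory (default: /tmp)
#   $2 = user name      (default: the current user)
list_session_dirs() {
    base=${1:-/tmp}
    user=${2:-$(id -un)}
    # Pattern assumed from the example in the thread:
    #   /tmp/openmpi-sessions-USERNAME@LOCALHOST_0
    for d in "$base"/openmpi-sessions-"$user"@*; do
        # An unmatched glob stays literal, so test that it is a real directory.
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
        fi
    done
    return 0
}
```

If `list_session_dirs /tmp` prints nothing while mpirun is running, but `list_session_dirs "$TMPDIR"` prints a directory, the session data is most likely outside the place ompi-checkpoint searches.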
Re: [OMPI users] Checkpoint/Restart error
So? Anyone? Any clue?

To summarize:
- installed OpenMPI 1.4.1 on a fresh CentOS 5
- mpirun works, but ompi-checkpoint throws this error:
  ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
- on another VM I have OpenMPI 1.3.3 installed. Checkpointing works fine as guest but gives the previously mentioned error as root. Both root and guest show the same output from "ompi_info --param all all" except for $HOME (which only matters for mca_component_path, mca_param_files, snapc_base_global_snapshot_dir)

Thanks,
Andreea

On Tue, Jan 19, 2010 at 9:01 PM, Andreea Costea wrote:

> I noticed one more thing. As I still have some VMs that have OpenMPI
> version 1.3.3 installed, I started to use those machines until I fix the
> problem with 1.4.1. While checkpointing on one of these VMs I realized
> that checkpointing as a guest works fine and checkpointing as root outputs
> the same error as in 1.4.1:
> ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
>
> I logged the output of "ompi_info --param all all", which I ran for root
> and for another user, and the only differences were in these parameters:
>
> mca_component_path
> mca_param_files
> snapc_base_global_snapshot_dir
>
> All 3 params differ because of $HOME.
> One more thing: I don't have the directory $HOME/.openmpi
>
> Ideas?
>
> Thanks,
> Andreea
>
> On Tue, Jan 19, 2010 at 12:51 PM, Andreea Costea wrote:
>
>> Well... I decided to install a fresh OS to be sure that there is no
>> OpenMPI version conflict. So I formatted one of my VMs, did a fresh CentOS
>> install, installed BLCR 0.8.2 and OpenMPI 1.4.1, and the result: the same.
>> mpirun works but ompi-checkpoint has that error at line 405:
>>
>> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
>> 405
>>
>> As for the files remaining after uninstalling: Jeff, you were right. There
>> is no file left, just some empty directories.
>>
>> What might be the problem behind that ORTE_ERROR_LOG error?
>>
>> Thanks,
>> Andreea
>>
>> On Fri, Jan 15, 2010 at 11:47 PM, Andreea Costea wrote:
>>
>>> It's almost midnight here, so I left home, but I will try it tomorrow.
>>> There were some directories left after "make uninstall". I will give more
>>> details tomorrow.
>>>
>>> Thanks Jeff,
>>> Andreea
>>>
>>> On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres wrote:
>>>
>>>> On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:
>>>>
>>>> > - I wanted to update to version 1.4.1 and I uninstalled previous
>>>> > version like this: make uninstall, and then manually deleted all the
>>>> > left over files. the directory where I installed was /usr/local
>>>>
>>>> I'll let Josh answer your CR questions, but I did want to ask about this
>>>> point. AFAIK, "make uninstall" removes *all* Open MPI files. For example:
>>>>
>>>> -----
>>>> [7:25] $ cd /path/to/my/OMPI/tree
>>>> [7:25] $ make install > /dev/null
>>>> [7:26] $ find /tmp/bogus/ -type f | wc
>>>>     646     646   28082
>>>> [7:26] $ make uninstall > /dev/null
>>>> [7:27] $ find /tmp/bogus/ -type f | wc
>>>>       0       0       0
>>>> [7:27] $
>>>> -----
>>>>
>>>> I realize that some *directories* are left in $prefix, but there should
>>>> be no *files* left. Are you seeing something different?
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
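The root-versus-guest comparison of "ompi_info --param all all" output mentioned above can be reproduced with a plain diff of two saved dumps. The following is a minimal sketch; `diff_param_dumps` and the file names in the usage comment are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that differ between two saved
# parameter dumps ("<" = only in the first file, ">" = only in the second).
diff_param_dumps() {
    # diff exits with status 1 when the files differ and grep exits with 1
    # when nothing matches; neither is an error for this purpose.
    diff "$1" "$2" | grep '^[<>]' || true
}

# Usage (file names are illustrative):
#   ompi_info --param all all > /tmp/params.root    # run as root
#   ompi_info --param all all > /tmp/params.guest   # run as the other user
#   diff_param_dumps /tmp/params.root /tmp/params.guest
```

Only the differing lines are printed, which makes it easy to confirm whether anything besides the $HOME-dependent parameters changes between the two users.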
Re: [OMPI users] Checkpoint/Restart error
On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:

> - I wanted to update to version 1.4.1 and I uninstalled previous version like
> this: make uninstall, and then manually deleted all the left over files. the
> directory where I installed was /usr/local

I'll let Josh answer your CR questions, but I did want to ask about this point. AFAIK, "make uninstall" removes *all* Open MPI files. For example:

-----
[7:25] $ cd /path/to/my/OMPI/tree
[7:25] $ make install > /dev/null
[7:26] $ find /tmp/bogus/ -type f | wc
    646     646   28082
[7:26] $ make uninstall > /dev/null
[7:27] $ find /tmp/bogus/ -type f | wc
      0       0       0
[7:27] $
-----

I realize that some *directories* are left in $prefix, but there should be no *files* left. Are you seeing something different?

--
Jeff Squyres
jsquy...@cisco.com
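The same before/after check can be wrapped in a small function so it is easy to repeat against any install prefix. This is a sketch only; `count_files_under` is a hypothetical name, and the prefix in the usage comment is just the one used in this thread.

```shell
#!/bin/sh
# Hypothetical helper: count regular files (not directories) below a prefix,
# mirroring the "find ... -type f | wc" check shown above.
count_files_under() {
    # tr strips the padding some wc implementations add around the number.
    find "$1" -type f | wc -l | tr -d ' '
}

# Usage (prefix is illustrative):
#   make uninstall > /dev/null
#   count_files_under /usr/local    # on a shared prefix, other software's files are counted too
```

On a prefix shared with other software, such as /usr/local, a non-zero count is expected; a dedicated prefix like the /tmp/bogus example gives a cleaner answer.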
Re: [OMPI users] Checkpoint/Restart error
I don't know what else I should try... because it worked on 1.3.3 doing exactly the same steps. I tried to install it both with an active eth interface and an inactive one. I am running on a virtual machine that has CentOS as its OS.

Any suggestions?

Thanks,
Andreea

On Fri, Jan 15, 2010 at 9:07 PM, Andreea Costea wrote:

> I tried the new version that was uploaded today. I still have that error,
> just that now it is at line 405 instead of 399.
>
> Maybe if I give more details:
> - I first had OpenMPI version 1.3.3 with BLCR installed: mpirun,
> ompi-checkpoint and ompi-restart worked with that version.
> - I wanted to update to version 1.4.1 and I uninstalled the previous version
> like this: make uninstall, and then manually deleted all the left over
> files. The directory where I installed was /usr/local
> - I installed 1.4.1 in the same directory: /usr/local. Paths set
> correctly to /usr/local/bin and /usr/local/lib
> - mpirun works, ompi-checkpoint gives the following error:
> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
> 405
> HNP with PID 7899 Not found!
>
> I would appreciate any help,
> Andreea
Re: [OMPI users] Checkpoint/Restart error
I tried the new version that was uploaded today. I still have that error, just that now it is at line 405 instead of 399.

Maybe if I give more details:
- I first had OpenMPI version 1.3.3 with BLCR installed: mpirun, ompi-checkpoint and ompi-restart worked with that version.
- I wanted to update to version 1.4.1 and I uninstalled the previous version like this: make uninstall, and then manually deleted all the left over files. The directory where I installed was /usr/local
- I installed 1.4.1 in the same directory: /usr/local. Paths set correctly to /usr/local/bin and /usr/local/lib
- mpirun works, ompi-checkpoint gives the following error:
  [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
  HNP with PID 7899 Not found!

I would appreciate any help,
Andreea

On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:

> Hi...
> still not working. Though I uninstalled OpenMPI with make uninstall and I
> manually deleted all other files, I still have the same error when
> checkpointing.
>
> Any idea?
>
> Thanks,
> Andreea
Re: [OMPI users] Checkpoint/Restart error
Hi...
still not working. Though I uninstalled OpenMPI with make uninstall and I manually deleted all other files, I still have the same error when checkpointing.

Any idea?

Thanks,
Andreea

On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:

> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>
> > Hi,
> >
> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
> > downloaded today. When I want to checkpoint I am having the following
> > error message:
> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
> > 399
> > HNP with PID 2337 Not found!
>
> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
> no error at line 399, in 1.3.3 there is. Check your installation of Open
> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
> problems.
>
> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
> your path/lib_path any longer.
>
> -- Josh
>
> > I tried the same thing with version 1.3.3 and it works perfectly.
> >
> > Any idea why?
> >
> > thanks,
> > Andreea
Re: [OMPI users] Checkpoint/Restart error
On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:

> Hi,
>
> I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
> downloaded today. When I want to checkpoint I am having the following error
> message:
> [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
> HNP with PID 2337 Not found!

This looks like an error coming from the 1.3.3 install. In 1.4.1 there is no error at line 399, in 1.3.3 there is. Check your installation of Open MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected problems.

Try a clean installation of 1.4.1 and double check that 1.3.3 is not in your path/lib_path any longer.

-- Josh

> I tried the same thing with version 1.3.3 and it works perfectly.
>
> Any idea why?
>
> thanks,
> Andreea
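One way to double check for a mixed install is to list every copy of a tool that is reachable through PATH, not just the first one the shell would pick. The helper below is a sketch; `find_all_in_path` is a hypothetical name, and it assumes a POSIX-style sh or bash.

```shell
#!/bin/sh
# Hypothetical helper: print every executable named $1 found in the
# directories listed in PATH, in search order.
find_all_in_path() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    IFS=$old_ifs
}

# Usage:
#   find_all_in_path mpirun
#   find_all_in_path ompi-checkpoint
```

More than one line of output for mpirun or ompi-checkpoint suggests that two installations are visible at once. The library search path (for example LD_LIBRARY_PATH) would need a similar inspection, which this sketch does not cover.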
[OMPI users] Checkpoint/Restart error
Hi,

I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have downloaded today. When I want to checkpoint I am having the following error message:

[[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 399
HNP with PID 2337 Not found!

I tried the same thing with version 1.3.3 and it works perfectly.

Any idea why?

thanks,
Andreea