I tested the 1.4.1 release, and everything worked fine for me (tested a few different configurations of nodes/environments).

The ompi-checkpoint error you cited is usually caused by one of two things:
 - The PID specified is wrong (which I don't think is the case here).
 - The session directory cannot be found in /tmp.

So I think the problem is the latter. The session directory looks something like:
  /tmp/openmpi-sessions-USERNAME@LOCALHOST_0
Within this directory the mpirun process places its contact information. ompi-checkpoint uses this contact information to connect to the job. If it cannot find it, it errors out. (We definitely need a better error message here; I filed a ticket [1].)

We usually do not recommend running Open MPI as root, so I would strongly suggest running as a regular user instead.

With a regular user, check the location of the session directory. Make sure that it is in /tmp on the node where 'mpirun' and 'ompi-checkpoint' are run.
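
As a rough sketch of that check (not an Open MPI tool): the helper below, with the made-up name find_session_dirs, lists directories matching the openmpi-sessions-USERNAME@HOST_N pattern shown above. The default base directory of /tmp and the use of "id -un" for the user name are assumptions; adjust them if your setup differs.

```shell
# Hypothetical helper: print session directories for a user under a base dir.
# Usage: find_session_dirs [BASE_DIR] [USERNAME]
find_session_dirs() {
    base="${1:-/tmp}"
    user="${2:-$(id -un)}"
    found=1
    # Session directories are named openmpi-sessions-USERNAME@HOST_N
    for d in "$base"/openmpi-sessions-"$user"@*; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            found=0
        fi
    done
    # Exit status 0 if at least one directory was found, 1 otherwise
    return "$found"
}

if ! find_session_dirs /tmp; then
    echo "no session directory for $(id -un) under /tmp" >&2
fi
```

If nothing is printed for the user who started mpirun, ompi-checkpoint run as that user on that node would presumably have no contact information to find.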

-- Josh

[1] https://svn.open-mpi.org/trac/ompi/ticket/2189

On Jan 25, 2010, at 5:48 AM, Andreea Costea wrote:

So? anyone? any clue?

To summarize:
- installed Open MPI 1.4.1 on a fresh CentOS 5
- mpirun works but ompi-checkpoint throws this error:
ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
- on another VM I have Open MPI 1.3.3 installed. Checkpointing works fine as guest but gives the previously mentioned error as root. Both root and guest show the same output from "ompi_info --param all all" except for values that depend on $HOME (which only affects mca_component_path, mca_param_files, and snapc_base_global_snapshot_dir)


Thanks,
Andreea


On Tue, Jan 19, 2010 at 9:01 PM, Andreea Costea <andre.cos...@gmail.com> wrote:
I noticed one more thing. As I still have some VMs with Open MPI version 1.3.3 installed, I started using those machines until I fix the problem with 1.4.1. While checkpointing on one of these VMs, I realized that checkpointing as a guest works fine, while checkpointing as root outputs the same error as in 1.4.1:

ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405

I logged the output of "ompi_info --param all all", which I ran as root and as another user, and the only differences were in these parameters:

mca_component_path
mca_param_files
snapc_base_global_snapshot_dir

All 3 params differ because of the $HOME.
One more thing: I don't have the directory $HOME/.openmpi

Ideas?

Thanks,
Andreea





On Tue, Jan 19, 2010 at 12:51 PM, Andreea Costea <andre.cos...@gmail.com> wrote:
Well... I decided to install a fresh OS to be sure that there is no Open MPI version conflict. So I formatted one of my VMs, did a fresh CentOS install, and installed BLCR 0.8.2 and Open MPI 1.4.1. The result is the same: mpirun works, but ompi-checkpoint has that error at line 405:

[[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405

As for the files remaining after uninstalling: Jeff, you were right. There are no files left, just some empty directories.

What might be causing that ORTE_ERROR_LOG error?

Thanks,
Andreea

On Fri, Jan 15, 2010 at 11:47 PM, Andreea Costea <andre.cos...@gmail.com> wrote:
It's almost midnight here, so I left home, but I will try it tomorrow.
There were some directories left after "make uninstall". I will give more details tomorrow.

Thanks Jeff,
Andreea


On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:

> - I wanted to update to version 1.4.1, so I uninstalled the previous version with "make uninstall" and then manually deleted all the leftover files. The directory where I installed was /usr/local.

I'll let Josh answer your CR questions, but I did want to ask about this point. AFAIK, "make uninstall" removes *all* Open MPI files. For example:

-----
[7:25] $ cd /path/to/my/OMPI/tree
[7:25] $ make install > /dev/null
[7:26] $ find /tmp/bogus/ -type f | wc
   646     646   28082
[7:26] $ make uninstall > /dev/null
[7:27] $ find /tmp/bogus/ -type f | wc
     0       0       0
[7:27] $
-----

I realize that some *directories* are left in $prefix, but there should be no *files* left. Are you seeing something different?

--
Jeff Squyres
jsquy...@cisco.com


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



