Greetings Patrick. Many thanks for the detailed run-down; sorry I didn't reply earlier.

This is quite definitely a known problem, and I'm pretty sure we have an open ticket on it (I'm on a plane right now and can't check the web-based bug tracker). We have a solution in mind for the issue, but it hasn't been done yet, mainly because it hasn't bubbled up high enough in priority and no one has had the time to code it up.

How high of a priority is the ability to re-home an OMPI installation for you?


On Dec 8, 2006, at 8:53 AM, Patrick Jessee wrote:


Hello. For OpenMPI 1.1.2, I've come across a situation where the --prefix syntax does not seem to be working. I've investigated the issue by stepping through the mpirun startup in a debugger. Below is a summary of the problem and details about the investigation (along with a prospective fix).

Summary of the problem
======================

When starting an openMPI run with the --prefix option, the MPI application does not start up correctly in certain situations. An important point is that this problem behavior is masked (and not seen) if the openMPI libraries are available at the compile/install-time location defined by OPAL_PKGLIBDIR (defined in opal/include/opal/install_dirs.h). So in debugging the problem, it is important to move the openMPI installation from the installed location, and then set the --prefix value to the new location. In addition, LD_LIBRARY_PATH needs to be set to the new location so mpirun can find liborte.so and libopal.so at program load time. (--prefix can't help mpirun with liborte.so and libopal.so because (a) these libs are dynamically linked into mpirun and are needed at program load time, and (b) the --prefix arg isn't processed until after load time. Thus LD_LIBRARY_PATH is needed for mpirun, but this is tangential.)

The behavior that is seen is the following output:

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_sds_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS
:
:
--------------------------------------------------------------------------
Open RTE was unable to initialize properly.  The error occurred while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--------------------------------------------------------------------------


Investigation of the problem
============================

As mentioned before, I've looked at mpirun in the debugger. The instances of mpirun (and the MPI app) find the dynamically linked libraries (liborte.so, libopal.so) just fine, but they do not locate the dynamically loaded ones (the ones in lib/openmpi such as mca_paffinity_linux.so, etc.). The --prefix directory does not seem to be getting used to open the libraries in lib/openmpi.
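
To make the distinction concrete, here is a minimal sketch of a search over a colon-separated component path. It is only an illustration of why a component is missed when its directory is absent from the path: find_component is a hypothetical helper, not Open MPI's actual loader, and it assumes a POSIX system (it uses access()).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical illustration, not Open MPI's loader: walk a colon-separated
 * search path and stop at the first directory that contains `filename`.
 * On success the full path is written to `out` and 0 is returned.
 * Returns -1 if no directory contains the file. "~" is not expanded. */
int find_component(const char *search_path, const char *filename,
                   char *out, size_t outlen)
{
    const char *start = search_path;

    assert(filename != NULL && out != NULL);

    while (start != NULL && *start != '\0') {
        const char *end = strchr(start, ':');
        size_t dirlen = (end != NULL) ? (size_t)(end - start) : strlen(start);

        /* Skip empty entries such as the middle of "a::b". */
        if (dirlen > 0) {
            int n = snprintf(out, outlen, "%.*s/%s",
                             (int)dirlen, start, filename);
            if (n > 0 && (size_t)n < outlen && access(out, F_OK) == 0) {
                return 0;
            }
        }
        start = (end != NULL) ? end + 1 : NULL;
    }

    if (outlen > 0) {
        out[0] = '\0';
    }
    return -1;
}
```

With a search path that names only the old install directory, a call such as find_component(path, "mca_paffinity_linux.so", buf, sizeof buf) would return -1 after the installation has been moved, which matches the behavior described above.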

It appears that the location to search is getting set in mca_base_open.c around line 68 (1.1.2):

asprintf(&value, "%s:~/.openmpi/components", OPAL_PKGLIBDIR);
mca_base_param_component_path =
    mca_base_param_reg_string_name("mca", "component_path",
                                   "Path where to look for Open MPI and ORTE components",
                                   false, false, value, NULL);


Here, OPAL_PKGLIBDIR is a fixed, compile-time location. It appears that the --prefix directory (actually <prefix_dir>/lib/openmpi) needs to be appended, if not prepended, to the component_path. Alternatively, the static OPAL_PKGLIBDIR directory could just be replaced by the runtime value of <prefix_dir>/lib/openmpi.
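
The "prepend" option could be sketched as follows. This is a standalone illustration under stated assumptions, not a patch against Open MPI: build_component_path is a hypothetical helper, and how the --prefix value would actually reach this point in mca_base_open.c is left open, so it is simply passed in as an argument.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of Open MPI: build a component search path
 * that places <prefix>/lib/openmpi ahead of the compile-time directory.
 * `prefix` may be NULL or empty when no --prefix was given.
 * Returns a malloc'd string the caller must free, or NULL on failure. */
char *build_component_path(const char *prefix, const char *pkglibdir)
{
    const char *user_dir = "~/.openmpi/components";
    int has_prefix = (prefix != NULL && strlen(prefix) > 0);
    char *path;
    int len;

    assert(pkglibdir != NULL);

    /* First pass: measure the required length. */
    if (has_prefix) {
        len = snprintf(NULL, 0, "%s/lib/openmpi:%s:%s",
                       prefix, pkglibdir, user_dir);
    } else {
        len = snprintf(NULL, 0, "%s:%s", pkglibdir, user_dir);
    }
    if (len < 0) {
        return NULL;
    }

    path = malloc((size_t)len + 1);
    if (path == NULL) {
        return NULL;
    }

    /* Second pass: format into the allocated buffer. */
    if (has_prefix) {
        snprintf(path, (size_t)len + 1, "%s/lib/openmpi:%s:%s",
                 prefix, pkglibdir, user_dir);
    } else {
        snprintf(path, (size_t)len + 1, "%s:%s", pkglibdir, user_dir);
    }
    return path;
}
```

Without a prefix, the result is the same string the 1.1.2 code builds; with one, the relocated directory is searched first and the compile-time directory remains as a fallback.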

I've compiled in a quick fix to libopal.so to see if the approach addressed the issue. I didn't see how to get access to the --prefix directory at this point, so I just prepended getenv("LD_LIBRARY_PATH") to "value" and added <prefix_dir>/lib/openmpi to LD_LIBRARY_PATH before starting the app (note: this is just a way of verifying that if the --prefix directory was used here, it would address the issue; this is not a proposed solution. The <prefix_dir>/lib/openmpi should be used directly). Anyway, this fixed the issue and the application was able to start.
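
For reference, a verification hack of this kind might look roughly like the sketch below. It is a reconstruction under assumptions rather than the code that was actually compiled in: debug_component_path is a hypothetical name, and the compile-time directory is passed as an argument instead of coming from OPAL_PKGLIBDIR.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical debugging aid, not a proposed fix: prepend the contents of
 * LD_LIBRARY_PATH (if set and non-empty) to the component search path.
 * Returns a malloc'd string the caller must free, or NULL on failure. */
char *debug_component_path(const char *pkglibdir)
{
    const char *user_dir = "~/.openmpi/components";
    const char *ld = getenv("LD_LIBRARY_PATH");
    int have_ld = (ld != NULL && strlen(ld) > 0);
    char *path;
    int len;

    assert(pkglibdir != NULL);

    /* Measure first, then allocate and format. */
    if (have_ld) {
        len = snprintf(NULL, 0, "%s:%s:%s", ld, pkglibdir, user_dir);
    } else {
        len = snprintf(NULL, 0, "%s:%s", pkglibdir, user_dir);
    }
    if (len < 0) {
        return NULL;
    }

    path = malloc((size_t)len + 1);
    if (path == NULL) {
        return NULL;
    }

    if (have_ld) {
        snprintf(path, (size_t)len + 1, "%s:%s:%s", ld, pkglibdir, user_dir);
    } else {
        snprintf(path, (size_t)len + 1, "%s:%s", pkglibdir, user_dir);
    }
    return path;
}
```

Because LD_LIBRARY_PATH is inherited by child processes only if the launcher forwards it, a hack like this helps confirm the diagnosis but is not a substitute for using <prefix_dir>/lib/openmpi directly.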

In applying this fix, I also found that it was not only important for mca_base_param_component_path to include the <prefix_dir>/lib/openmpi directory in the instances of mpirun and the MPI app, but also in all instances of orted before they dynamically load libraries.
----

In summary, it seems that this issue can be resolved by applying the --prefix directory (<prefix_dir>/lib/openmpi) to mca_base_param_component_path in instances of mpirun, orted, and the MPI app.

Any help in getting this fix implemented in the code base would be very much appreciated, and I'll be happy to provide any more information or help.

Regards,

Patrick

P.S. Even with the fix, a (non-fatal) message is printed. It's probably a tangential issue, but thought it was worth mentioning. Again, the --prefix directory probably needs to be used somewhere in place of a static directory. The message is:

--------------------------------------------------------------------------
Sorry!  You were supposed to get help about:
 rds:no-hostfile
from the file:
 help-rds-hostfile.txt
But I couldn't find any file matching that name.  Sorry!
--------------------------------------------------------------------------
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
