Re: [OMPI devel] Open MPI v1.3.4rc4 is out
That's interesting... Works great now that carto is built. Why is carto now required?

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 4:11 PM, David Gunter wrote:

> Oh, good catch. I'm not sure who updates the platform files or who would have added the "carto" option to the no_build. It's the only difference between the 1.3.4 platform files and the previous ones, save for some compiler flags.
>
> -david
Re: [OMPI devel] Open MPI v1.3.4rc4 is out
Oh, good catch. I'm not sure who updates the platform files or who would have added the "carto" option to the no_build. It's the only difference between the 1.3.4 platform files and the previous ones, save for some compiler flags.

-david

--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory

On Nov 5, 2009, at 3:55 PM, Jeff Squyres wrote:

> I see:
>
>   enable_mca_no_build=carto,crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem
>
> Which means that you're directing all carto components not to build at all. It looks like carto is now required...?
Re: [OMPI devel] Open MPI v1.3.4rc4 is out
I see:

  enable_mca_no_build=carto,crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem

Which means that you're directing all carto components not to build at all. It looks like carto is now required...?

On Nov 5, 2009, at 5:38 PM, Samuel K. Gutierrez wrote:

> Hi Jeff,
>
> This is how I configured my build.
>
>   ./configure --with-platform=./contrib/platform/lanl/rr-class/optimized-panasas \
>     --prefix=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4 \
>     --libdir=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4/lib64
>
> I'll send the build log shortly.
>
> Thanks!

--
Jeff Squyres
jsquy...@cisco.com
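For anyone hitting the same failure with a local platform file, the fix implied above would presumably be to drop carto from the no-build list so its components are built again. A sketch of the corrected line, assuming the same LANL platform file:

  # "carto" removed from enable_mca_no_build so the carto framework builds:
  enable_mca_no_build=crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem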
Re: [OMPI devel] Open MPI v1.3.4rc4 is out
Hi Jeff,

This is how I configured my build.

  ./configure --with-platform=./contrib/platform/lanl/rr-class/optimized-panasas \
    --prefix=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4 \
    --libdir=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4/lib64

I'll send the build log shortly.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:

> How did you build?
>
> I see one carto component named "auto_detect" in the 1.3.4 source tree, but I don't see it in your ompi_info output.
>
> Did that component not build?
Re: [OMPI devel] Open MPI v1.3.4rc4 is out
I used one of the LANL platform files to build:

  $ configure --with-platform=contrib/platform/lanl/rr-class/debug-panasas-nocell

Did the same thing with the non-debug platform file and it dies in the same location.

-david

--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory

On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:

> How did you build?
Re: [OMPI devel] Open MPI v1.3.4rc4 is out
How did you build?

I see one carto component named "auto_detect" in the 1.3.4 source tree, but I don't see it in your ompi_info output.

Did that component not build?

On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:

> Hi All,
>
> I just built OMPI 1.3.4rc4 on one of our Roadrunner machines. When I try to launch a simple MPI job, I get the following:
>
>   [rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for carto components
>   [rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto components
>   [rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto components
>   [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component selected!
>   --------------------------------------------------------------------------
>   It looks like opal_init failed for some reason; your parallel process is
>   likely to abort. There are many reasons that a parallel process can
>   fail during opal_init; some of which are due to configuration or
>   environment problems. This failure appears to be an internal failure;
>   here's some additional information (which may only be relevant to an
>   Open MPI developer):
>
>     opal_carto_base_select failed
>     --> Returned value -13 instead of OPAL_SUCCESS
>   --------------------------------------------------------------------------
>   [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 77
>   [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 541
>
> This may be an issue on our end regarding a runtime parameter that isn't set correctly. See attached. Please let me know if you need any more info.
>
> Thanks!
>
> --
> Samuel K. Gutierrez
> Los Alamos National Laboratory

--
Jeff Squyres
jsquy...@cisco.com
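A quick way to verify which carto components a given install actually built is ompi_info; a minimal sketch:

  # If the carto framework built, its components are listed here
  # (per the message above, "auto_detect" is the one to look for):
  $ ompi_info | grep carto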
Re: [OMPI devel] Open MPI v1.3.4rc4 is out
I, too, have tried various builds of the rc4 release. It's dying during orterun. Specifically, here's the call chain where things fall apart:

  orterun -> orte_init -> opal_init -> opal_carto_base_select -> mca_base_select

  54     for (item = opal_list_get_first(components_available);
  55          item != opal_list_get_end(components_available);
  56          item  = opal_list_get_next(item) ) {
  57         cli = (mca_base_component_list_item_t *) item;
  58         component = (mca_base_component_t *) cli->cli_component;

The code is failing on line #55, i.e., item must be getting set to the end on the first pass through. The code then jumps to line #107 and passes the NULL test there:

  107    if (NULL == *best_component) {
  108        opal_output_verbose(5, output_id,
  109                            "mca:base:select:(%5s) No component selected!",
  110                            type_name);
  111        /*
  112         * Still close the non-selected components
  113         */
  114        mca_base_components_close(0, /* Pass 0 to keep this from closing the output handle */
  115                                  components_available,
  116                                  NULL);
  117        return OPAL_ERR_NOT_FOUND;
  118    }

-david

--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory

Sam Gutierrez wrote:

> Hi All,
>
> I just built OMPI 1.3.4rc4 on one of our Roadrunner machines. When I try to launch a simple MPI job, I get the following:
>
>   [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component selected!
>
> This may be an issue on our end regarding a runtime parameter that isn't set correctly. See attached. Please let me know if you need any more info.

On Nov 4, 2009, at 3:00 PM, Jeff Squyres wrote:

> The latest-n-greatest is available here:
>
>   http://www.open-mpi.org/software/ompi/v1.3/
>
> Please beat it up and look for problems!
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
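For readers not familiar with opal_list: it is a circular, sentinel-based list, so on an empty list "get_first" already equals "get_end" and the loop body above never executes -- which is exactly what happens when every carto component was excluded from the build. A minimal standalone sketch of that pattern (illustrative names, not the real OPAL API):

  /* Circular list with a sentinel: first == end when the list is empty. */
  #include <stdio.h>

  struct item { struct item *next; };
  struct list { struct item sentinel; };

  static void         list_init(struct list *l)  { l->sentinel.next = &l->sentinel; }
  static struct item *list_first(struct list *l) { return l->sentinel.next; }
  static struct item *list_end(struct list *l)   { return &l->sentinel; }

  int main(void)
  {
      struct list components_available;
      list_init(&components_available);   /* no components built => empty list */

      int visited = 0;
      for (struct item *it = list_first(&components_available);
           it != list_end(&components_available);
           it = it->next) {
          visited++;                       /* a component would be selected here */
      }
      printf("components visited: %d\n", visited);  /* prints 0 */
      return 0;
  }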
[OMPI devel] Fwd: [hwloc-announce] Hardware Locality (hwloc) v0.9.2 released
Just in case you aren't on the hwloc announcement list, we finally released v0.9.2. See the announcement below for details.

Begin forwarded message:

From: "Jeff Squyres (jsquyres)"
Date: November 5, 2009 10:12:28 AM EST
To: "Hardware Locality Announcement List"
Subject: [hwloc-announce] Hardware Locality (hwloc) v0.9.2 released

The Hardware Locality (hwloc) team is pleased to announce the release of v0.9.2 (we made some trivial documentation-only changes after the v0.9.1 tarballs were posted publicly, and have therefore re-released with the version "v0.9.2").

  http://www.open-mpi.org/projects/hwloc/

(mirrors will update shortly)

hwloc provides command line tools and a C API to obtain the hierarchical map of key computing elements, such as: NUMA memory nodes, shared caches, processor sockets, processor cores, and processor "threads". hwloc also gathers various attributes such as cache and memory information, and is portable across a variety of different operating systems and platforms. hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms.

*** Note that the hwloc project represents the merger of the libtopology project from INRIA and the Portable Linux Processor Affinity (PLPA) sub-project from Open MPI. *Both of these prior projects are now deprecated.* The hwloc v0.9.1/v0.9.2 release is essentially a "re-branding" of the libtopology code base, but with both a few genuinely new features and a few PLPA-like features added in. More new features and more PLPA-like features will be added to hwloc over time.

hwloc supports the following operating systems:

  * Linux (including old kernels not having sysfs topology information, with knowledge of cpusets, offline cpus, and Kerrighed support)
  * Solaris
  * AIX
  * Darwin / OS X
  * OSF/1 (a.k.a., Tru64)
  * HP-UX
  * Microsoft Windows

hwloc only reports the number of processors on unsupported operating systems; no topology information is available.

hwloc is available under the BSD license.

--
Jeff Squyres
jsquy...@cisco.com
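As a flavor of the C API described above, a minimal sketch of loading a topology and counting cores. It is written against the later hwloc 1.x API (hwloc_topology_init/load, hwloc_get_nbobjs_by_type); names in the v0.9 series announced here may differ:

  /* Sketch: probe the machine and count processor cores with hwloc.
   * Assumes the hwloc 1.x API, not necessarily the v0.9 one. */
  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;

      hwloc_topology_init(&topology);   /* allocate a topology context */
      hwloc_topology_load(topology);    /* probe the current machine */

      int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
      printf("cores detected: %d\n", ncores);

      hwloc_topology_destroy(topology);
      return 0;
  }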
Re: [OMPI devel] orte_rml_base_select failed
I think you must be accidentally mixing Open MPI versions -- the file "orte/runtime/orte_system_init.c" does not exist in the Open MPI v1.3 series. It did exist, however, back in the Open MPI 1.2 series.

Could you double check that the OMPI that is installed (and is being found/used) on host-desktop1 is the same version as all the others?

On Nov 5, 2009, at 7:18 AM, Amit Sharma wrote:

> I had built OMPI with "-mca rml_base_verbose 10 -mca oob_base_verbose 10" but still no luck.
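A minimal sketch of checking for the version mix described above (paths depend on the local install):

  # On host-desktop1 and on a working node, compare what is actually found:
  $ which mpirun ompi_info
  $ mpirun --version        # should report the same Open MPI version everywhere
  $ ompi_info | head -n 3   # shows the version of the install actually in use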
Re: [OMPI devel] orte_rml_base_select failed
I had built OMPI with "-mca rml_base_verbose 10 -mca oob_base_verbose 10" but still no luck.

On some machine, where mpirun is working properly, it is giving correct debug messages as below:

  # mpirun -mca rml_base_verbose 10 -mca oob_base_verbose 10 arch
  [linux] mca: base: components_open: Looking for rml components
  [linux] mca: base: components_open: opening rml components
  [linux] mca: base: components_open: found loaded component oob
  [linux] mca: base: components_open: component oob has no register function
  [linux] mca: base: components_open: Looking for oob components
  [linux] mca: base: components_open: opening oob components
  [linux] mca: base: components_open: found loaded component tcp
  [linux] mca: base: components_open: component tcp has no register function
  [linux] mca: base: components_open: component tcp open function successful
  [linux] mca: base: components_open: component oob open function successful
  [linux] orte_rml_base_select: initializing rml component oob
  [linux] [[55739,0],0] rml:base:update:contact:info got uri 3652911104.0;tcp://128.88.143.227:39207
  x86_64
  [linux] mca: base: close: component tcp closed
  [linux] mca: base: close: unloading component tcp
  [linux] mca: base: close: component oob closed
  [linux] mca: base: close: unloading component oob
  #

But on the problem reported machine, still the problem is same. It is not showing the debug messages. Directly it is giving the error as below:

  # mpirun arch
  [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
  --------------------------------------------------------------------------
  It looks like orte_init failed for some reason; your parallel process is
  likely to abort. There are many reasons that a parallel process can
  fail during orte_init; some of which are due to configuration or
  environment problems. This failure appears to be an internal failure;
  here's some additional information (which may only be relevant to an
  Open MPI developer):

    orte_rml_base_select failed
    --> Returned value -13 instead of ORTE_SUCCESS
  --------------------------------------------------------------------------
  [host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
  [host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
  --------------------------------------------------------------------------
  Open RTE was unable to initialize properly. The error occured while
  attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
  --------------------------------------------------------------------------

Not getting the root cause of failure. Please guide.

Regards,
Amit Sharma
Sr. Software Engineer, Wipro Technologies, Bangalore

From: rhc.open...@gmail.com [mailto:rhc.open...@gmail.com] On Behalf Of Ralph Castain
Sent: Tuesday, November 03, 2009 11:08 PM
To: amit.shar...@wipro.com; Open MPI Developers
Subject: Re: [OMPI devel] orte_rml_base_select failed

No parameter will help - the issue is that we couldn't find a TCP interface to use for wiring up the job. First thing you might check is that you have a TCP interface alive and active - can be the loopback interface, but you need at least something.

If you do have an interface, then you might rebuild OMPI with --enable-debug so you can get some diagnostics. Then run the job again with -mca rml_base_verbose 10 -mca oob_base_verbose 10 and see what diagnostic error messages emerge.

On Tue, Nov 3, 2009 at 4:42 AM, Amit Sharma wrote:

> Hi,
>
> I am using open-mpi version 1.3.2. on SLES 11 machine. I have built it simply like ./configure => make => make install. I am facing the following error with mpirun on some machines.
> Root # mpirun -np 2 ls
> [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_rml_base_select failed
>   --> Returned value -13 instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
> [host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
> --------------------------------------------------------------------------
> Open RTE was unable to initialize properly. The error occured while
> attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
> --------------------------------------------------------------------------
>
> Can you please guide me to resolve this issue. Is there any run time environme
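Ralph's suggested diagnostics from above, spelled out as a command sequence (a sketch; note the -mca options are mpirun runtime parameters, not configure flags):

  # Rebuild with debug support so the verbose MCA diagnostics are compiled in:
  $ ./configure --enable-debug && make && make install

  # Re-run with the verbosity knobs suggested above:
  $ mpirun -mca rml_base_verbose 10 -mca oob_base_verbose 10 arch

  # Confirm at least one TCP interface (loopback is enough) is up:
  $ ip addr show   # or: ifconfig -a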
Re: [OMPI devel] MPI_Grequest_start and MPI_Wait clarification
Hi Jeff,

On Mon, 2 Nov 2009 21:15:15 -0500, Jeff Squyres wrote:

> I had to go re-read that whole section on generalized requests; I agree with your analysis. Could you open a ticket and submit a patch? You might want to look at the back ends to MPI_TEST[_ANY] and MPI_WAIT_ANY as well (if you haven't already).

I had a look at MPI_WAIT_ANY and MPI_TEST_ANY and they also suffer from the same bug. I've submitted a ticket (#2093) and attached a patch to it for all of them.

Regards,
Chris
--
cy...@au.ibm.com
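For context, a minimal sketch of the generalized-request pattern under discussion. In real code the MPI_Grequest_complete call would come from another thread or progress engine rather than inline; here it is inlined for brevity:

  /* Generalized request (MPI-2): user callbacks plus an explicit
   * MPI_Grequest_complete that MPI_Wait (and MPI_Waitany/MPI_Testany,
   * per the thread above) must observe. */
  #include <mpi.h>
  #include <stdio.h>

  static int query_fn(void *extra_state, MPI_Status *status)
  {
      /* Fill in the status for the completed operation. */
      MPI_Status_set_elements(status, MPI_BYTE, 0);
      MPI_Status_set_cancelled(status, 0);
      status->MPI_SOURCE = MPI_UNDEFINED;
      status->MPI_TAG    = MPI_UNDEFINED;
      return MPI_SUCCESS;
  }

  static int free_fn(void *extra_state)                 { return MPI_SUCCESS; }
  static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }

  int main(int argc, char **argv)
  {
      MPI_Request req;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &req);
      MPI_Grequest_complete(req);   /* mark the operation complete */
      MPI_Wait(&req, &status);      /* must return once complete */
      printf("generalized request completed\n");
      MPI_Finalize();
      return 0;
  }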