Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread Samuel K. Gutierrez
That's interesting...  Works great now that carto is built.  Why is  
carto now required?


--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 4:11 PM, David Gunter wrote:

Oh, good catch.  I'm not sure who updates the platform files or who  
would have added the "carto" option to the no_build.  It's the only  
difference between the 1.3.4 platform files and the previous
ones, save for some compiler flags.


-david

--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory




On Nov 5, 2009, at 3:55 PM, Jeff Squyres wrote:


I see:

enable_mca_no_build=carto,crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem


Which means that you're directing all carto components not to build  
at all.


It looks like carto is now required...?


On Nov 5, 2009, at 5:38 PM, Samuel K. Gutierrez wrote:


Hi Jeff,

This is how I configured my build.

./configure --with-platform=./contrib/platform/lanl/rr-class/optimized-panasas --prefix=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4 --libdir=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4/lib64

I'll send the build log shortly.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:

> How did you build?
>
> I see one carto component named "auto_detect" in the 1.3.4 source
> tree, but I don't see it in your ompi_info output.
>
> Did that component not build?
>
>
> On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:
>
>> Hi All,
>>
>> I just built OMPI 1.3.4rc4 on one of our Roadrunner machines.  When I
>> try to launch a simple MPI job, I get the following:
>>
>> [rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
>> carto components
>> [rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
>> components
>> [rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
>> components
>> [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
>> selected!
>> --
>> It looks like opal_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_carto_base_select failed
>>   --> Returned value -13 instead of OPAL_SUCCESS
>> --
>> [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>> found in file runtime/orte_init.c at line 77
>> [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>> found in file orterun.c at line 541
>>
>> This may be an issue on our end regarding a runtime parameter that
>> isn't set correctly.  See attached.  Please let me know if you need
>> any more info.
>>
>> Thanks!
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>>
>> 
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread David Gunter
Oh, good catch.  I'm not sure who updates the platform files or who  
would have added the "carto" option to the no_build.  It's the only  
difference between the 1.3.4 platform files and the previous ones,
save for some compiler flags.


-david

--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory




On Nov 5, 2009, at 3:55 PM, Jeff Squyres wrote:


I see:

enable_mca_no_build=carto,crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem


Which means that you're directing all carto components not to build  
at all.


It looks like carto is now required...?


On Nov 5, 2009, at 5:38 PM, Samuel K. Gutierrez wrote:


Hi Jeff,

This is how I configured my build.

./configure --with-platform=./contrib/platform/lanl/rr-class/optimized-panasas --prefix=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4 --libdir=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4/lib64

I'll send the build log shortly.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:

> How did you build?
>
> I see one carto component named "auto_detect" in the 1.3.4 source
> tree, but I don't see it in your ompi_info output.
>
> Did that component not build?
>
>
> On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:
>
>> Hi All,
>>
>> I just built OMPI 1.3.4rc4 on one of our Roadrunner machines.  When I
>> try to launch a simple MPI job, I get the following:
>>
>> [rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
>> carto components
>> [rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
>> components
>> [rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
>> components
>> [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
>> selected!
>> --
>> It looks like opal_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_carto_base_select failed
>>   --> Returned value -13 instead of OPAL_SUCCESS
>> --
>> [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>> found in file runtime/orte_init.c at line 77
>> [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>> found in file orterun.c at line 541
>>
>> This may be an issue on our end regarding a runtime parameter that
>> isn't set correctly.  See attached.  Please let me know if you need
>> any more info.
>>
>> Thanks!
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>>
>> 
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread Jeff Squyres

I see:

enable_mca_no_build=carto,crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem


Which means that you're directing all carto components not to build at  
all.


It looks like carto is now required...?
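
A minimal sketch of the workaround, assuming the platform file only needs
carto dropped from the exclusion list (the rest of the file is unchanged and
not shown here):

  enable_mca_no_build=crs,routed-direct,routed-linear,snapc,pml-dr,pml-crcp2,pml-crcpw,pml-v,pml-example,crcp,pml-cm,filem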


On Nov 5, 2009, at 5:38 PM, Samuel K. Gutierrez wrote:


Hi Jeff,

This is how I configured my build.

./configure --with-platform=./contrib/platform/lanl/rr-class/optimized-panasas --prefix=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4 --libdir=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4/lib64

I'll send the build log shortly.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:

> How did you build?
>
> I see one carto component named "auto_detect" in the 1.3.4 source
> tree, but I don't see it in your ompi_info output.
>
> Did that component not build?
>
>
> On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:
>
>> Hi All,
>>
>> I just built OMPI 1.3.4rc4 on one of our Roadrunner machines.  When I
>> try to launch a simple MPI job, I get the following:
>>
>> [rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
>> carto components
>> [rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
>> components
>> [rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
>> components
>> [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
>> selected!
>> --
>> It looks like opal_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_carto_base_select failed
>>   --> Returned value -13 instead of OPAL_SUCCESS
>> --
>> [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>> found in file runtime/orte_init.c at line 77
>> [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>> found in file orterun.c at line 541
>>
>> This may be an issue on our end regarding a runtime parameter that
>> isn't set correctly.  See attached.  Please let me know if you need
>> any more info.
>>
>> Thanks!
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>>
>> 
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread Samuel K. Gutierrez

Hi Jeff,

This is how I configured my build.

./configure --with-platform=./contrib/platform/lanl/rr-class/optimized-panasas --prefix=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4 --libdir=/usr/projects/hpctools/samuel/local/rr-dev/apps/openmpi/gcc/ompi-1.3.4rc4/lib64


I'll send the build log shortly.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:


How did you build?

I see one carto component named "auto_detect" in the 1.3.4 source  
tree, but I don't see it in your ompi_info output.


Did that component not build?


On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:


Hi All,

I just built OMPI 1.3.4rc4 on one of our Roadrunner machines.  When I
try to launch a simple MPI job, I get the following:

[rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
carto components
[rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
components
[rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
components
[rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
selected!
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_carto_base_select failed
  --> Returned value -13 instead of OPAL_SUCCESS
--
[rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
found in file runtime/orte_init.c at line 77
[rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
found in file orterun.c at line 541

This may be an issue on our end regarding a runtime parameter that
isn't set correctly.  See attached.  Please let me know if you need
any more info.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory






--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread David Gunter

I used one of the LANL platform files to build,

$ configure --with-platform=contrib/platform/lanl/rr-class/debug-panasas-nocell


Did the same thing with the non-debug platform file and it dies in the  
same location.


-david

--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory




On Nov 5, 2009, at 3:07 PM, Jeff Squyres wrote:


How did you build?

I see one carto component named "auto_detect" in the 1.3.4 source  
tree, but I don't see it in your ompi_info output.


Did that component not build?


On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:


Hi All,

I just built OMPI 1.3.4rc4 on one of our Roadrunner machines.  When I
try to launch a simple MPI job, I get the following:

[rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
carto components
[rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
components
[rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
components
[rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
selected!
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_carto_base_select failed
  --> Returned value -13 instead of OPAL_SUCCESS
--
[rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
found in file runtime/orte_init.c at line 77
[rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
found in file orterun.c at line 541

This may be an issue on our end regarding a runtime parameter that
isn't set correctly.  See attached.  Please let me know if you need
any more info.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory






--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread Jeff Squyres

How did you build?

I see one carto component named "auto_detect" in the 1.3.4 source  
tree, but I don't see it in your ompi_info output.


Did that component not build?
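
A quick, hedged way to check is to filter the component listing (the grep
pattern is just a filter, not a specific ompi_info option):

  # If the auto_detect carto component was built, it should show up in
  # the component listing.
  ompi_info | grep -i carto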


On Nov 4, 2009, at 7:20 PM, Samuel K. Gutierrez wrote:


Hi All,

I just built OMPI 1.3.4rc4 on one of our Roadrunner machines.  When I
try to launch a simple MPI job, I get the following:

[rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
carto components
[rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
components
[rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
components
[rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
selected!
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

   opal_carto_base_select failed
   --> Returned value -13 instead of OPAL_SUCCESS
--
[rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
found in file runtime/orte_init.c at line 77
[rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
found in file orterun.c at line 541

This may be an issue on our end regarding a runtime parameter that
isn't set correctly.  See attached.  Please let me know if you need
any more info.

Thanks!
--
Samuel K. Gutierrez
Los Alamos National Laboratory






--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Open MPI v1.3.4rc4 is out

2009-11-05 Thread David Gunter
I, too, have tried various builds of the rc4 release.  It's dying  
during orterun.


Specifically, here's the call chain where things fall apart:

orterun -> orte_init -> opal_init -> opal_carto_base_select -> mca_base_select


54     for (item  = opal_list_get_first(components_available);
55          item != opal_list_get_end(components_available);
56          item  = opal_list_get_next(item) ) {
57         cli = (mca_base_component_list_item_t *) item;
58         component = (mca_base_component_t *) cli->cli_component;

The code is failing on line #55, i.e. item must be getting set to the  
end on the first pass through.  The code then jumps to line #107 and  
passes the NULL test there:


107     if (NULL == *best_component) {
108         opal_output_verbose(5, output_id,
109                             "mca:base:select:(%5s) No component selected!",
110                             type_name);
111         /*
112          * Still close the non-selected components
113          */
114         mca_base_components_close(0, /* Pass 0 to keep this from closing the output handle */
115                                   components_available,
116                                   NULL);
117         return OPAL_ERR_NOT_FOUND;
118     }
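
For what it's worth, a hedged way to confirm that no carto components made it
into the build (assuming components were built as DSOs; the install prefix
below is a placeholder):

  # DSO components are installed as mca_<framework>_<component>.so under
  # $prefix/lib/openmpi, so an empty carto listing matches the empty
  # components_available list seen above.
  ls /path/to/ompi-install/lib/openmpi | grep carto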


-david
--
David Gunter
HPC-3: Infrastructure Team
Los Alamos National Laboratory



Sam Gutierrez wrote:

>   Hi All,

>  I just built OMPI 1.3.4rc4 on one of our Roadrunner machines. When I
>  try to launch a simple MPI job, I get the following:

>  [rra011a.rr.lanl.gov:31601] mca: base: components_open: Looking for
>  carto components
>  [rra011a.rr.lanl.gov:31601] mca: base: components_open: opening carto
>  components
>  [rra011a.rr.lanl.gov:31601] mca:base:select: Auto-selecting carto
>  components
>  [rra011a.rr.lanl.gov:31601] mca:base:select:(carto) No component
>  selected!
>  --
>  It looks like opal_init failed for some reason; your parallel process is
>  likely to abort. There are many reasons that a parallel process can
>  fail during opal_init; some of which are due to configuration or
>  environment problems. This failure appears to be an internal failure;
>  here's some additional information (which may only be relevant to an
>  Open MPI developer):

> opal_carto_base_select failed
> --> Returned value -13 instead of OPAL_SUCCESS
>  --

>  [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>  found in file runtime/orte_init.c at line 77
>  [rra011a.rr.lanl.gov:31601] [[INVALID],INVALID] ORTE_ERROR_LOG: Not
>  found in file orterun.c at line 541

>  This may be an issue on our end regarding a runtime parameter that
>  isn't set correctly. See attached. Please let me know if you need
>  any more info.

>  Thanks!

>  --
Samuel K. Gutierrez
Los Alamos National Laboratory



On Nov 4, 2009, at 3:00 PM, Jeff Squyres wrote:
> The latest-n-greatest is available here:
>
> http://www.open-mpi.org/software/ompi/v1.3/
>
> Please beat it up and look for problems!
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
>
> ___
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel 


[OMPI devel] Fwd: [hwloc-announce] Hardware Locality (hwloc) v0.9.2 released

2009-11-05 Thread Jeff Squyres
Just in case you aren't on the hwloc announcement list, we finally  
released v0.9.2.  See the announcement below for details.



Begin forwarded message:


From: "Jeff Squyres (jsquyres)" 
Date: November 5, 2009 10:12:28 AM EST
To: "Hardware Locality Announcement List" mpi.org>

Subject: [hwloc-announce] Hardware Locality (hwloc) v0.9.2 released
Reply-To: 

The Hardware Locality (hwloc) team is pleased to announce the release
of v0.9.2 (we made some trivial documentation-only changes after the
v0.9.1 tarballs were posted publicly, and have therefore re-released
with the version "v0.9.2").

http://www.open-mpi.org/projects/hwloc/
(mirrors will update shortly)

hwloc provides command line tools and a C API to obtain the
hierarchical map of key computing elements, such as: NUMA memory
nodes, shared caches, processor sockets, processor cores, and
processor "threads".  hwloc also gathers various attributes such as
cache and memory information, and is portable across a variety of
different operating systems and platforms.
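
As a quick, hedged example, the bundled lstopo tool prints the detected
hierarchy (the install prefix below is a placeholder, and the tool name is
assumed to match current releases):

  /opt/hwloc/bin/lstopo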

hwloc primarily aims at helping high-performance computing (HPC)
applications, but is also applicable to any project seeking to exploit
code and/or data locality on modern computing platforms.

*** Note that the hwloc project represents the merger of the
libtopology project from INRIA and the Portable Linux Processor
Affinity (PLPA) sub-project from Open MPI.  *Both of these prior
projects are now deprecated.*  The hwloc v0.9.1/v0.9.2 release is
essentially a "re-branding" of the libtopology code base, but with
both a few genuinely new features and a few PLPA-like features added
in.  More new features and more PLPA-like features will be added to
hwloc over time.

hwloc supports the following operating systems:

  * Linux (including old kernels not having sysfs topology information,
    with knowledge of cpusets, offline cpus, and Kerrighed support)
  * Solaris
  * AIX
  * Darwin / OS X
  * OSF/1 (a.k.a., Tru64)
  * HP-UX
  * Microsoft Windows

hwloc only reports the number of processors on unsupported operating
systems; no topology information is available.

hwloc is available under the BSD license.

--
Jeff Squyres
jsquy...@cisco.com

___
hwloc-announce mailing list
hwloc-annou...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-announce




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] orte_rml_base_select failed

2009-11-05 Thread Jeff Squyres
I think you must be accidentally mixing Open MPI versions -- the file  
"orte/runtime/orte_system_init.c" does not exist in the Open MPI v1.3  
series.  It did exist, however, back in the Open MPI 1.2 series.


Could you double check that the OMPI that is installed (and is being  
found/used) on host-desktop1 is the same version as all the others?



On Nov 5, 2009, at 7:18 AM, Amit Sharma wrote:

I had built OMPI with "-mca rml_base_verbose 10 -mca  
oob_base_verbose 10" but still no luck. On some machine, where  
mpirun is working properly, it is giving correct debug messages as  
below:


# mpirun -mca rml_base_verbose 10 -mca oob_base_verbose 10 arch
[linux] mca: base: components_open: Looking for rml components
[linux] mca: base: components_open: opening rml components
[linux] mca: base: components_open: found loaded component oob
[linux] mca: base: components_open: component oob has no register  
function

[linux] mca: base: components_open: Looking for oob components
[linux] mca: base: components_open: opening oob components
[linux] mca: base: components_open: found loaded component tcp
[linux] mca: base: components_open: component tcp has no register  
function
[linux] mca: base: components_open: component tcp open function  
successful
[linux] mca: base: components_open: component oob open function  
successful

[linux] orte_rml_base_select: initializing rml component oob
[linux] [[55739,0],0] rml:base:update:contact:info got uri  
3652911104.0;tcp://128.88.143.227:39207

x86_64
[linux] mca: base: close: component tcp closed
[linux] mca: base: close: unloading component tcp
[linux] mca: base: close: component oob closed
[linux] mca: base: close: unloading component oob
#

But on the problem reported machine, still the problem is same. It  
is not showing the debug messages. Directly it is giving the error  
as below:


 # mpirun arch

[NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at
line 182
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_rml_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS

--
[host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42 [host-desktop1:09127] [NO-NAME]
ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--
Not getting the root cause of failure. Please guide.


Regards,
Amit Sharma
Sr. Software Engineer,
Wipro Technologies, Bangalore



From: rhc.open...@gmail.com [mailto:rhc.open...@gmail.com] On Behalf  
Of Ralph Castain

Sent: Tuesday, November 03, 2009 11:08 PM
To: amit.shar...@wipro.com; Open MPI Developers
Subject: Re: [OMPI devel] orte_rml_base_select failed

No parameter will help - the issue is that we couldn't find a TCP  
interface to use for wiring up the job. First thing you might check  
is that you have a TCP interface alive and active - can be the  
loopback interface, but you need at least something.


If you do have an interface, then you might rebuild OMPI with
--enable-debug so you can get some diagnostics. Then run the job again
with


 -mca rml_base_verbose 10 -mca oob_base_verbose 10

and see what diagnostic error messages emerge.


On Tue, Nov 3, 2009 at 4:42 AM, Amit Sharma   
wrote:



Hi,

I am using open-mpi version 1.3.2. on SLES 11 machine. I have built it
simply like ./configure => make => make install.

I am facing the following error with mpirun on some machines.

Root # mpirun -np 2 ls

[NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at
line 182
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_rml_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS

--
[host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42 [host-desktop1:091

Re: [OMPI devel] orte_rml_base_select failed

2009-11-05 Thread Amit Sharma
I had built OMPI with "-mca rml_base_verbose 10 -mca oob_base_verbose 10"
but still no luck. On some machine, where mpirun is working properly, it is
giving correct debug messages as below:

# mpirun -mca rml_base_verbose 10 -mca oob_base_verbose 10 arch
[linux] mca: base: components_open: Looking for rml components
[linux] mca: base: components_open: opening rml components
[linux] mca: base: components_open: found loaded component oob
[linux] mca: base: components_open: component oob has no register function
[linux] mca: base: components_open: Looking for oob components
[linux] mca: base: components_open: opening oob components
[linux] mca: base: components_open: found loaded component tcp
[linux] mca: base: components_open: component tcp has no register function
[linux] mca: base: components_open: component tcp open function successful
[linux] mca: base: components_open: component oob open function successful
[linux] orte_rml_base_select: initializing rml component oob
[linux] [[55739,0],0] rml:base:update:contact:info got uri
3652911104.0;tcp://128.88.143.227:39207
x86_64
[linux] mca: base: close: component tcp closed
[linux] mca: base: close: unloading component tcp
[linux] mca: base: close: component oob closed
[linux] mca: base: close: unloading component oob
#

But on the problem reported machine, still the problem is same. It is not
showing the debug messages. Directly it is giving the error as below:

 # mpirun arch

[NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at
line 182
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_rml_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS

--
[host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42 [host-desktop1:09127] [NO-NAME]
ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--

Not getting the root cause of failure. Please guide.



Regards,
Amit Sharma
Sr. Software Engineer,
Wipro Technologies, Bangalore



  _  

From: rhc.open...@gmail.com [mailto:rhc.open...@gmail.com] On Behalf Of
Ralph Castain
Sent: Tuesday, November 03, 2009 11:08 PM
To: amit.shar...@wipro.com; Open MPI Developers
Subject: Re: [OMPI devel] orte_rml_base_select failed


No parameter will help - the issue is that we couldn't find a TCP interface
to use for wiring up the job. First thing you might check is that you have a
TCP interface alive and active - can be the loopback interface, but you need
at least something.

If you do have an interface, then you might rebuild OMPI with --enable-debug
so you can get some diagnostics. Then run the job again with

 -mca rml_base_verbose 10 -mca oob_base_verbose 10

and see what diagnostic error messages emerge.
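
Put together, that checklist might look like the following sketch (the
install prefix and test program are placeholders):

  # 1. Confirm at least one TCP interface (loopback is enough) is up:
  /sbin/ifconfig -a

  # 2. Rebuild Open MPI with debugging enabled:
  ./configure --enable-debug --prefix=/opt/openmpi-debug
  make all install

  # 3. Re-run with verbose RML/OOB selection output:
  /opt/openmpi-debug/bin/mpirun -np 2 \
      -mca rml_base_verbose 10 -mca oob_base_verbose 10 hostname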



On Tue, Nov 3, 2009 at 4:42 AM, Amit Sharma  wrote:




Hi,

I am using open-mpi version 1.3.2. on SLES 11 machine. I have built it
simply like ./configure => make => make install.

I am facing the following error with mpirun on some machines.

Root # mpirun -np 2 ls

[NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at
line 182
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

orte_rml_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS

--
[host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42 [host-desktop1:09127] [NO-NAME]
ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--
Open RTE was unable to initialize properly. The error occured while
attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--

Can you please guide me to resolve this issue. Is there any run time
environme

Re: [OMPI devel] MPI_Grequest_start and MPI_Wait clarification

2009-11-05 Thread Christopher Yeoh
Hi Jeff,

On Mon, 2 Nov 2009 21:15:15 -0500
Jeff Squyres  wrote:
> 
> I had to go re-read that whole section on generalized requests; I  
> agree with your analysis.  Could you open a ticket and submit a  
> patch?  You might want to look at the back ends to MPI_TEST[_ANY]
> and MPI_WAIT_ANY as well (if you haven't already).

I had a look at MPI_WAIT_ANY and MPI_TEST_ANY and they also suffer from
the same bug. I've submitted a ticket (#2093) and attached a patch to it
for all of them.
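
For context, here is a minimal sketch of the kind of code path at issue: a
generalized request that is completed and then waited on, so MPI_Wait must
invoke the query callback.  The callbacks below are illustrative assumptions,
not the test case attached to the ticket.

  #include <mpi.h>
  #include <stdio.h>

  /* Query callback: called by MPI_Wait/MPI_Test to fill in the status. */
  static int query_fn(void *extra_state, MPI_Status *status)
  {
      MPI_Status_set_elements(status, MPI_BYTE, 0);
      MPI_Status_set_cancelled(status, 0);
      status->MPI_SOURCE = MPI_UNDEFINED;
      status->MPI_TAG = MPI_UNDEFINED;
      return MPI_SUCCESS;
  }

  static int free_fn(void *extra_state) { return MPI_SUCCESS; }

  static int cancel_fn(void *extra_state, int complete) { return MPI_SUCCESS; }

  int main(int argc, char **argv)
  {
      MPI_Request req;
      MPI_Status status;

      MPI_Init(&argc, &argv);

      /* Create a generalized request; normally some asynchronous work
       * would finish it later. */
      MPI_Grequest_start(query_fn, free_fn, cancel_fn, NULL, &req);

      /* Mark it complete immediately (illustration only). */
      MPI_Grequest_complete(req);

      /* MPI_Wait should call query_fn to populate 'status'. */
      MPI_Wait(&req, &status);

      printf("generalized request completed\n");
      MPI_Finalize();
      return 0;
  }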

Regards,

Chris
-- 
cy...@au.ibm.com