Re: [OMPI devel] hwloc error

2014-09-15 Thread Alina Sklarevich
Thanks Ralph,
adding --hetero-nodes to the command line solved this issue.

Alina.

On Mon, Sep 15, 2014 at 6:51 AM, Ralph Castain  wrote:

> Try adding --hetero-nodes to your mpirun cmd line
>
> On Sep 14, 2014, at 8:25 AM, Alina Sklarevich 
> wrote:
>
> Hello,
>
>
> I am using ompi-v1.8 and have come across the following error:
>
>
> --------------------------------------------------------------------------
>
> Open MPI tried to bind a new process, but something went wrong.  The
>
> process was killed without launching the target application.  Your job
>
> will now abort.
>
>
>   Local host:vegas17
>
>   Application name:  trivial/test_get__trivial/c_hello
>
>   Error message: hwloc_set_cpubind returned "Error" for bitmap "0,16"
>
>   Location:  odls_default_module.c:551
>
> --------------------------------------------------------------------------
>
>
> This happens when running a simple trivial test with the following command
> line:
>
>
> mpirun --map-by node --bind-to core -display-map -np 2 -mca pml ob1
> …/trivial/test_get__trivial/c_hello
>
>
> What seems to eliminate this error is changing the binding policy from
> core to none (--bind-to none).
>
> The only nodes that issue this error are the ones that have GPUs in them.
>
> When running the same command line on other non-GPU nodes, there is no
> error.
>
> I’m using Slurm to allocate the nodes.
>
>
> Has anyone seen this issue or know what’s wrong here?
>
>
> Thanks,
>
> Alina.
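
For anyone who wants to poke at this outside of mpirun, here is a minimal
standalone hwloc reproducer (a hypothetical sketch, not part of Alina's
report). It assumes the "0,16" in the error message is hwloc's list syntax
for PUs 0 and 16, and simply asks hwloc to bind the current process to them,
which shows whether the failure comes from hwloc / the kernel cpuset on the
GPU nodes or from Open MPI's mapping logic:

/* cpubind_check.c -- hypothetical reproducer, not from the original report */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* same PU list that appeared in the odls error message */
    hwloc_bitmap_list_sscanf(set, "0,16");

    if (hwloc_set_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) < 0)
        fprintf(stderr, "hwloc_set_cpubind failed: %s\n", strerror(errno));
    else
        printf("binding to PUs 0,16 succeeded\n");

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}

Building it with something like "gcc cpubind_check.c -o cpubind_check -lhwloc"
(exact flags depend on the install) and running it inside the same Slurm
allocation on a GPU node should show whether the bind itself is refused there.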


[OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Rolf vandeVaart
I wonder if anyone else is seeing this failure. I am not sure when this
started, but it happens only on the trunk. Here is a link to my failures, as
well as an example below. A variety of nonblocking collectives fail like this.


http://mtt.open-mpi.org/index.php?do_redir=2208


[rvandevaart@drossetti-ivy0 collective]$ mpirun --mca btl self,sm,tcp -host 
drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 iallreduce
--------------------------------------------------------------------------

ML detected an unrecoverable error on intrinsic communicator MPI_COMM_WORLD

The program will now abort
--------------------------------------------------------------------------
[drossetti-ivy0.nvidia.com:04664] 3 more processes have sent help message 
help-mpi-coll-ml.txt / coll-ml-check-fatal-error
[rvandevaart@drossetti-ivy0 collective]$
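
For reference, a minimal nonblocking-allreduce program along these lines (a
sketch only, not the actual MTT test source) exercises the same MPI_Iallreduce
path when launched with the mpirun command above:

/* iallreduce_min.c -- hypothetical stand-in for the failing test */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval = -1;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every rank contributes its rank number; the sum is checked below */
    sendval = rank;
    MPI_Iallreduce(&sendval, &recvval, 1, MPI_INT, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (recvval != size * (size - 1) / 2)
        fprintf(stderr, "rank %d: got %d, expected %d\n",
                rank, recvval, size * (size - 1) / 2);
    else if (rank == 0)
        printf("iallreduce result OK (%d)\n", recvval);

    MPI_Finalize();
    return 0;
}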




[OMPI devel] External loopback

2014-09-15 Thread Håkon Bugge
From time to time I have a need to run Open MPI apps using the openib btl on a
single node, where port 1 on the HCA is connected to port 2 on the same HCA.

Using a vintage 1.5.4, my command line would read:

mpiexec --mca btl self,openib --mca btl_openib_cpc_include oob \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:1 ./a.out  : \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:2 ./a.out


Recently I needed a newer Open MPI, so I compiled and installed version 1.8.2.
That is when the problems began ;-) Apparently, the old (and in my opinion
nice) "oob" connection management method has disappeared. However, after
modifying the command line to:

mpiexec --mca btl self,openib --mca btl_openib_cpc_include udcm \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:1 ./a.out : \
   -np 1 /usr/bin/env OMPI_MCA_btl_openib_if_include=mlx4_0:2 ./a.out


I get tons of:

connect/btl_openib_connect_udcm.c:1390:udcm_find_endpoint] could not find 
endpoint with port: 1, lid: 4608, msg_type: 100

Interestingly, the lid shown here is the lid of port 2 (when port numbers
start at 1). I suspect the printout above counts ports from zero.

Anyway, must I go back to an older Open MPI that supports "oob", or is there a
flaw in my command line?


Thanks, Håkon
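
As a side check on the port/lid numbering question, a small libibverbs program
(a hypothetical helper, not something from this thread) can print the LID the
HCA reports for each physical port, which makes it easy to see whether the lid
in the udcm message really belongs to port 2:

/* lid_check.c -- hypothetical helper; assumes the first device returned is
 * the HCA of interest (mlx4_0 in the command lines above) */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "cannot open %s\n", ibv_get_device_name(devs[0]));
        return 1;
    }

    struct ibv_device_attr dev_attr;
    ibv_query_device(ctx, &dev_attr);

    /* query every physical port (verbs numbers them from 1) */
    for (uint8_t p = 1; p <= dev_attr.phys_port_cnt; p++) {
        struct ibv_port_attr attr;
        if (ibv_query_port(ctx, p, &attr) == 0)
            printf("%s port %u: lid %u\n",
                   ibv_get_device_name(devs[0]), (unsigned)p,
                   (unsigned)attr.lid);
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Comparing its output (built with "gcc lid_check.c -o lid_check -libverbs")
against the lid printed by udcm should confirm whether the error message
counts ports from zero.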



Re: [OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Pritchard Jr., Howard
Hi Rolf,

This may be related to change set 32659.

If you back this change out, do the tests pass?


Howard







Re: [OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Rolf vandeVaart
Confirmed that trunk version r32658 does pass the test.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Pritchard Jr., 
Howard
Sent: Monday, September 15, 2014 4:16 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] coll ml error with some nonblocking collectives

Hi Rolf,

This may be related to change set 32659.

If you back this change out, do the tests pass?


Howard







[OMPI devel] removing cnos support from ompi

2014-09-15 Thread Pritchard Jr., Howard
Hi Folks,

I'd like to rip out the cnos ess/alps code from ompi. It's dead - no
one is using CNOS (old Cray XT systems) - and it's very confusing
to leave around.

Any objections?

Howard


-
Howard Pritchard
HPC-5
Los Alamos National Laboratory




Re: [OMPI devel] removing cnos support from ompi

2014-09-15 Thread Ralph Castain
No objection from me - that pretty much belongs to you folks

On Sep 15, 2014, at 2:51 PM, Pritchard Jr., Howard  wrote:

> Hi Folks,
>  
> I’d like to rip out the cnos ess/alps code from ompi. It’s dead - no
> one is using CNOS (old Cray XT systems) - and it’s very confusing
> to leave around.
>  
> Any objections?
>  
> Howard
>  
>  
> -
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>  
>  



Re: [OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Pritchard Jr., Howard
Hi Rolf,

Okay.  I'll work with ORNL folks to see how to really fix this.

Howard


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Monday, September 15, 2014 3:10 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] coll ml error with some nonblocking collectives

Confirmed that trunk version r32658 does pass the test.




Re: [OMPI devel] coll ml error with some nonblocking collectives

2014-09-15 Thread Gilles Gouaillardet
Howard and Rolf,

I initially reported the issue at
http://www.open-mpi.org/community/lists/devel/2014/09/15767.php

r32659 is neither a fix nor a regression; it simply aborts instead of
doing OBJ_RELEASE(mpi_comm_world).
/* my point here is that we should focus on the root cause and not the
consequence */

First, this is a race condition, so one run is not enough to conclude the
problem is fixed.
Second, if you do not configure with --enable-debug, there might be silent
data corruption with undefined results when the bug is hit, and an undefined
result can mean the test passes.

Bottom line, and imho:
- if your test passes without r32659, it just means you were lucky ...
- an abort with an understandable error message is better than silent
corruption

Last but not least, r32659 was acked for v1.8 (#4888).
coll/ml priority is now zero in v1.8, and that is likely the only reason why
you do not see any errors in that branch.

Cheers,

Gilles

On Tue, Sep 16, 2014 at 8:33 AM, Pritchard Jr., Howard 
wrote:

>  Hi Rolf,
>
>
>
> Okay.  I’ll work with ORNL folks to see how to really fix this.
>
>
>
> Howard