Hi,
What OFED version do you use?
(ofed_info -s)
On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota wrote:
> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the
> following warning upon execution, which did not appear before the upgrade.
>
> WARNING: It appears that your OpenFabr
Hi,
I just compiled without CUDA, and the result is the same: no output,
and it exits with code 65.
[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 => (0x7fff3ab31000)
libmpi.so.1 =>
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1
(0x7fab9ec
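As a quick sanity check when a run exits silently with a nonzero code, it is worth confirming that the mpirun on the PATH comes from the same installation as the libmpi that ldd reports; a generic sketch, not from the original thread:
which mpirun
mpirun --version
echo $LD_LIBRARY_PATH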
Maxime,
Can you run with:
mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples//ring_c
On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:
> Hi,
> I just compiled without CUDA, and the result is the same: no output,
> exits with code
Here it is
Le 2014-08-18 12:30, Joshua Ladd a écrit :
mpirun -np 4 --mca plm_base_verbose 10
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose
10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm
components
[helios-login1:27853] mca: base: compone
This is all on one node, yes?
Try adding the following:
-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
Lots of garbage, but it should tell us what is going on.
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault
wrote:
> Here it is
> Le 2014-08-18 12:30, Joshua Ladd
This is all on one node indeed.
Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca
state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee
output_ringc_verbose.txt
Maxime
Le 2014-08-18 12:48, Ralph Castain a écrit :
This is all on one nod
Ah...now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it
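Putting this together with the earlier flags, the full command would presumably be:
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 -mca oob_base_verbose 10 ring_c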
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault
wrote:
> This is all on one node indeed.
>
> Attached is the output of
> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_
Here it is.
Maxime
Le 2014-08-18 12:59, Ralph Castain a écrit :
Ah...now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault
wrote:
This is all on one node indeed.
Attached is th
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect:
connection failed: Co
Indeed, that makes sense now.
Why isn't Open MPI attempting to connect over the loopback interface for
same-node communication? This used to work with 1.6.5.
Maxime
Le 2014-08-18 13:11, Ralph Castain a écrit :
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[heli
Yeah, there are some issues with the internal connection logic that need to get
fixed. We haven't had many cases where it's been an issue, but a couple like
this have cropped up - enough that I need to set aside some time to fix it.
My apologies for the problem.
On Aug 18, 2014, at 10:31 AM, M
Ok, I confirm that with
mpiexec -mca oob_tcp_if_include lo ring_c
it works.
It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c
We have 4 interfaces on this node.
- lo, the loopback interface
- ib0, infiniband
- eth2, a management network
- eth3, the public network
It seems that mpiexec atte
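To make the workaround stick without typing the flag on every run, Open MPI can also read the setting from the per-user MCA parameter file; a minimal sketch, assuming the standard per-user location is in use:
echo "oob_tcp_if_include = lo,ib0" >> $HOME/.openmpi/mca-params.conf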
Indeed odd - I'm afraid that this is just the kind of case that has been
causing problems. I think I've figured out the problem, but have been buried
with my "day job" for the last few weeks and unable to pursue it.
On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault
wrote:
> Ok, I confirm th
I get "ofed_info: command not found". Note that I don't install the entire
OFED, but do a component-wise installation by doing "apt-get install
infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and
utilities.
> Hi,
> What OFED version do you use?
> (ofed_info -s)
>
>
> On Su
Most likely you are installing an old OFED that does not have this parameter.
Try:
#modinfo mlx4_core
and see if it is there.
I would suggest installing the latest OFED or Mellanox OFED.
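For example, modinfo lists the parameters the installed driver module actually exposes; a generic check (the grep pattern matches modinfo's "parm:" lines):
modinfo mlx4_core | grep -i parm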
On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota wrote:
> I get "ofed_info: command not found". Note that I don't install t
Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of
derailed into two problems, one of which has been addressed, I figured I
would start a new, simpler and more precise one.
I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pa
Just to help reduce the scope of the problem, can you retest with a
non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the
configure line to help with the stack trace?
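A configure line along these lines should produce such a build; the prefix is illustrative, and the key points are passing --enable-debug and omitting --with-cuda:
./configure --prefix=$HOME/openmpi-1.8.1-debug --enable-debug
make && make install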
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonn
Try the following:
export MALLOC_CHECK_=1
and then run it again
Kind regards,
Alex Granovsky
-Original Message-
From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
Since my previ
Same thing:
[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node
cudampi_simple
malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Si
It's building... to be continued tomorrow morning.
Le 2014-08-18 16:45, Rolf vandeVaart a écrit :
Just to help reduce the scope of the problem, can you retest with a
non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the
configure line to help with the stack trace?