from:"Maxime Boissonneault"

I am also filing a bug at Adaptive Computing since, while I do set 
CUDA_VISIBLE_DEVICES myself, the default value set by Torque in that 
case is also wrong.


Maxime

Le 2014-08-19 10:47, Rolf vandeVaart a écrit :

Glad it was solved.  I will submit a bug at NVIDIA as that does not seem like a 
very friendly way to handle that error.


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Tuesday, August 19, 2014 10:39 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
I believe I found what the problem was. My script set the
CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the
GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

instead of
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Sorry for the false bug and thanks for directing me toward the solution.

Maxime


Le 2014-08-19 09:15, Rolf vandeVaart a écrit :

Hi:
This problem does not appear to have anything to do with MPI. We are

getting a SEGV during the initial call into the CUDA driver.  Can you log on to
gpu-k20-08, compile your simple program without MPI, and run it there?  Also,
maybe run dmesg on gpu-k20-08 and see if there is anything in the log?

Also, does your program run if you just run it on gpu-k20-07?

Can you include the output from nvidia-smi on each node?

Thanks,
Rolf


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Tuesday, August 19, 2014 8:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not
give me much more information.
[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
Prefix:
/software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
Internal debug support: yes
Memory debugging support: no


Is there something I need to do at run time to get more information
out of it ?


[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by
ppr:1:node cudampi_simple [gpu-k20-08:46045] *** Process received
signal *** [gpu-k20-08:46045] Signal: Segmentation fault (11)
[gpu-k20-08:46045] Signal code: Address not mapped (1)
[gpu-k20-08:46045] Failing at address: 0x8 [gpu-k20-08:46045] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
[gpu-k20-08:46045] [ 1]
/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
[gpu-k20-08:46045] [ 2]
/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
[gpu-k20-08:46045] [ 3]
/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
[gpu-k20-08:46045] [ 4]
/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
[gpu-k20-08:46045] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df
4965]
[gpu-k20-08:46045] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df
4a0a]
[gpu-k20-08:46045] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df
4a3b]
[gpu-k20-08:46045] [ 8]
/software-
gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0
f647] [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received
signal *** [gpu-k20-07:61816] Signal: Segmentation fault (11)
[gpu-k20-07:61816] Signal code: Address not mapped (1)
[gpu-k20-07:61816] Failing at address: 0x8 [gpu-k20-07:61816] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
[gpu-k20-07:61816] [ 1]
/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
[gpu-k20-07:61816] [ 2]
/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
[gpu-k20-07:61816] [ 3]
/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
[gpu-k20-07:61816] [ 4]
/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
[gpu-k20-07:61816] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b
6965]
[gpu-k20-07:61816] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b
6a0a]
[gpu-k20-07:61816] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b
6a3b]
[gpu-k20-07:61816] [ 8]
/software-
gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d
1647
]
[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae] [gpu-k20-07:61816]
[10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
[gpu-k20-07:61816] [11] cudampi_simple[0x400699] [gpu-k20-07:61816]
*** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
[gpu-k20-08:46045] [11] cudampi_simple[0x400699] [gpu-k20-08:46045]
*** End of error message ***
-
- mpiexec noticed that process rank 1 with PID 46045 on node
gpu-k20-08 exited on signal 11 (Segmentation fault).
-
-


Thanks,

Maxime


Le 2014-08-18 16:45, Rolf vandeVaart a écrit :

Just to help reduce the scope

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes


Hi,
I believe I found what the problem was. My script set the 
CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the 
GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

instead of
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

Sorry for the false bug and thanks for directing me toward the solution.

Maxime


Le 2014-08-19 09:15, Rolf vandeVaart a écrit :

Hi:
This problem does not appear to have anything to do with MPI. We are getting a 
SEGV during the initial call into the CUDA driver.  Can you log on to 
gpu-k20-08, compile your simple program without MPI, and run it there?  Also, 
maybe run dmesg on gpu-k20-08 and see if there is anything in the log?

Also, does your program run if you just run it on gpu-k20-07?

Can you include the output from nvidia-smi on each node?

Thanks,
Rolf


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Tuesday, August 19, 2014 8:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me
much more information.
[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
   Prefix:
/software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
   Internal debug support: yes
Memory debugging support: no


Is there something I need to do at run time to get more information out
of it ?


[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by
ppr:1:node
cudampi_simple
[gpu-k20-08:46045] *** Process received signal ***
[gpu-k20-08:46045] Signal: Segmentation fault (11)
[gpu-k20-08:46045] Signal code: Address not mapped (1)
[gpu-k20-08:46045] Failing at address: 0x8
[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
[gpu-k20-08:46045] [ 1]
/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
[gpu-k20-08:46045] [ 2]
/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
[gpu-k20-08:46045] [ 3]
/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
[gpu-k20-08:46045] [ 4]
/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
[gpu-k20-08:46045] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
[gpu-k20-08:46045] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
[gpu-k20-08:46045] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
[gpu-k20-08:46045] [ 8]
/software-
gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
[gpu-k20-07:61816] Signal: Segmentation fault (11)
[gpu-k20-07:61816] Signal code: Address not mapped (1)
[gpu-k20-07:61816] Failing at address: 0x8
[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
[gpu-k20-07:61816] [ 1]
/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
[gpu-k20-07:61816] [ 2]
/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
[gpu-k20-07:61816] [ 3]
/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
[gpu-k20-07:61816] [ 4]
/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
[gpu-k20-07:61816] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
[gpu-k20-07:61816] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
[gpu-k20-07:61816] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
[gpu-k20-07:61816] [ 8]
/software-
gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647
]
[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-07:61816] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
[gpu-k20-07:61816] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
[gpu-k20-08:46045] *** End of error message ***
--
mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08
exited on signal 11 (Segmentation fault).
--


Thanks,

Maxime


Le 2014-08-18 16:45, Rolf vandeVaart a écrit :

Just to help reduce the scope of the problem, can you retest with a non

CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the
configure line to help with the stack trace?



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes


Hi,
I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give 
me much more information.

[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
  Prefix: 
/software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda

  Internal debug support: yes
Memory debugging support: no


Is there something I need to do at run time to get more information out 
of it ?



[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node 
cudampi_simple

[gpu-k20-08:46045] *** Process received signal ***
[gpu-k20-08:46045] Signal: Segmentation fault (11)
[gpu-k20-08:46045] Signal code: Address not mapped (1)
[gpu-k20-08:46045] Failing at address: 0x8
[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
[gpu-k20-08:46045] [ 1] 
/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
[gpu-k20-08:46045] [ 2] 
/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
[gpu-k20-08:46045] [ 3] 
/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
[gpu-k20-08:46045] [ 4] 
/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
[gpu-k20-08:46045] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
[gpu-k20-08:46045] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
[gpu-k20-08:46045] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
[gpu-k20-08:46045] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]

[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
[gpu-k20-07:61816] Signal: Segmentation fault (11)
[gpu-k20-07:61816] Signal code: Address not mapped (1)
[gpu-k20-07:61816] Failing at address: 0x8
[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
[gpu-k20-07:61816] [ 1] 
/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
[gpu-k20-07:61816] [ 2] 
/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
[gpu-k20-07:61816] [ 3] 
/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
[gpu-k20-07:61816] [ 4] 
/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
[gpu-k20-07:61816] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
[gpu-k20-07:61816] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
[gpu-k20-07:61816] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
[gpu-k20-07:61816] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]

[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-07:61816] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]

[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
[gpu-k20-07:61816] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
[gpu-k20-08:46045] *** End of error message ***
--
mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 
exited on signal 11 (Segmentation fault).

--


Thanks,

Maxime


Le 2014-08-18 16:45, Rolf vandeVaart a écrit :

Just to help reduce the scope of the problem, can you retest with a non 
CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the 
configure line to help with the stack trace?



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory, and then
free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following stack
trace :
[gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal:
Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped
(1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Indeed, there were those to problems. I took the code from here and 
simplified it.

http://cudamusing.blogspot.ca/2011/08/cuda-mpi-and-infiniband.html

However, even with the modified code here http://pastebin.com/ax6g10GZ

The symptoms are still the same.

Maxime


Le 2014-08-19 07:59, Alex A. Granovsky a écrit :
Also you need to check return code from cudaMalloc before calling 
cudaFree -

the pointer may be invalid as you did not initialized cuda properly.

Alex

-Original Message- From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 2:19 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Same thing :

[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node
cudampi_simple
malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Signal: Segmentation fault (11)
[gpu-k20-07:47628] Signal code: Address not mapped (1)
[gpu-k20-07:47628] Failing at address: 0x8
[gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710]
[gpu-k20-07:47628] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf]
[gpu-k20-07:47628] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83]
[gpu-k20-07:47628] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da]
[gpu-k20-07:47628] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933]
[gpu-k20-07:47628] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965] 


[gpu-k20-07:47628] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a] 


[gpu-k20-07:47628] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b] 


[gpu-k20-07:47628] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532] 


[gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:47628] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d]
[gpu-k20-07:47628] [11] cudampi_simple[0x400699]
[gpu-k20-07:47628] *** End of error message ***
... (same segfault from the other node)

Maxime


Le 2014-08-18 16:52, Alex A. Granovsky a écrit :

Try the following:

export MALLOC_CHECK_=1

and then run it again

Kind regards,
Alex Granovsky



-Original Message- From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory,
and then free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following
stack trace :
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] 


[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] 


[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] 


[gpu-k20-07:40041] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] 


[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware)
or OpenMPI 1.8.1 (cuda aware).

I know this is more than likely a problem with Cuda than with OpenMPI
(since it does the same for two different versions), but I figured I
would ask here if somebody has a clue of what might be going on. I have
yet to be able to fill a bug report on NVidia's website for Cuda.


Thanks,








--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes


It's building... to be continued tomorrow morning.

Le 2014-08-18 16:45, Rolf vandeVaart a écrit :

Just to help reduce the scope of the problem, can you retest with a non 
CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the 
configure line to help with the stack trace?



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory, and then
free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following stack
trace :
[gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal:
Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped
(1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8]
/software-
gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] ***
End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or
OpenMPI 1.8.1 (cuda aware).

I know this is more than likely a problem with Cuda than with OpenMPI (since
it does the same for two different versions), but I figured I would ask here if
somebody has a clue of what might be going on. I have yet to be able to fill a
bug report on NVidia's website for Cuda.


Thanks,


--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-
mpi.org/community/lists/users/2014/08/25064.php

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25065.php



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes


Same thing :

[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node 
cudampi_simple

malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Signal: Segmentation fault (11)
[gpu-k20-07:47628] Signal code: Address not mapped (1)
[gpu-k20-07:47628] Failing at address: 0x8
[gpu-k20-07:47628] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b14cf850710]
[gpu-k20-07:47628] [ 1] 
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2b14d4e9facf]
[gpu-k20-07:47628] [ 2] 
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2b14d4e65a83]
[gpu-k20-07:47628] [ 3] 
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2b14d4d972da]
[gpu-k20-07:47628] [ 4] 
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2b14d4d83933]
[gpu-k20-07:47628] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b14cf0cf965]
[gpu-k20-07:47628] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b14cf0cfa0a]
[gpu-k20-07:47628] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b14cf0cfa3b]
[gpu-k20-07:47628] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2b14cf0f0532]

[gpu-k20-07:47628] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:47628] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b14cfa7cd1d]

[gpu-k20-07:47628] [11] cudampi_simple[0x400699]
[gpu-k20-07:47628] *** End of error message ***
... (same segfault from the other node)

Maxime


Le 2014-08-18 16:52, Alex A. Granovsky a écrit :

Try the following:

export MALLOC_CHECK_=1

and then run it again

Kind regards,
Alex Granovsky



-Original Message- From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simple one.

I reduced the code to the minimal that would reproduce the bug. I have
pasted it here :
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory,
and then free memory and finalize MPI. Nothing else.

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following
stack trace :
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965] 


[gpu-k20-07:40041] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a] 


[gpu-k20-07:40041] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b] 


[gpu-k20-07:40041] [ 8]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532] 


[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware)
or OpenMPI 1.8.1 (cuda aware).

I know this is more than likely a problem with Cuda than with OpenMPI
(since it does the same for two different versions), but I figured I
would ask here if somebody has a clue of what might be going on. I have
yet to be able to fill a bug report on NVidia's website for Cuda.


Thanks,





--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Segfault with MPI + Cuda on multiple nodes


Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda 
derailed into two problems, one of which has been addressed, I figured I 
would start a new, more precise and simple one.


I reduced the code to the minimal that would reproduce the bug. I have 
pasted it here :

http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI and cudaMalloc memory, 
and then free memory and finalize MPI. Nothing else.


When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following 
stack trace :

[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] 
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] 
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] 
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] 
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]

[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]

[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) 
or OpenMPI 1.8.1 (cuda aware).


I know this is more than likely a problem with Cuda than with OpenMPI 
(since it does the same for two different versions), but I figured I 
would ask here if somebody has a clue of what might be going on. I have 
yet to be able to fill a bug report on NVidia's website for Cuda.



Thanks,


--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Ok, I confirm that with
mpiexec -mca oob_tcp_if_include lo ring_c

it works.

It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c

We have 4 interfaces on this node.
- lo, the local loop
- ib0, infiniband
- eth2, a management network
- eth3, the public network

It seems that mpiexec attempts to use the two addresses that do not work 
(eth2, eth3) and does not use the two that do work (ib0 and lo). 
However, according to the logs sent previously, it does see ib0 (despite 
not seeing lo), but does not attempt to use it.



On the compute nodes, we have eth0 (management), ib0 and lo, and it 
works. I am unsure why it does work on the compute nodes and not on the 
login nodes. The only difference is the presence of a public interface 
on the login node.


Maxime


Le 2014-08-18 13:37, Ralph Castain a écrit :

Yeah, there are some issues with the internal connection logic that need to get 
fixed. We haven't had many cases where it's been an issue, but a couple like 
this have cropped up - enough that I need to set aside some time to fix it.

My apologies for the problem.


On Aug 18, 2014, at 10:31 AM, Maxime Boissonneault 
 wrote:


Indeed, that makes sense now.

Why isn't OpenMPI attempting to connect with the local loop for same node ? 
This used to work with 1.6.5.

Maxime

Le 2014-08-18 13:11, Ralph Castain a écrit :

Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
[[63019,0],0]


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237

The initial attempt is here

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, and so if that connection got 
rejected the proc would abort. Should we be using a different network? If so, 
telling us via the oob_tcp_if_include param would be the solution.


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
 wrote:


Here it is.

Maxime

Le 2014-08-18 12:59, Ralph Castain a écrit :

Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt


Maxime

Le 2014-08-18 12:48, Ralph Castain a écrit :

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
Le 2014-08-18 12:30, Joshua Ladd a écrit :

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Quer

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Indeed, that makes sense now.

Why isn't OpenMPI attempting to connect with the local loop for same 
node ? This used to work with 1.6.5.


Maxime

Le 2014-08-18 13:11, Ralph Castain a écrit :

Yep, that pinpointed the problem:

[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer 
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: 
connection failed: Connection refused (111)
[helios-login1:28558] [[63019,1],0] tcp_peer_close for [[63019,0],0] sd 11 
state CONNECTING
[helios-login1:28558] [[63019,1],0] tcp:lost connection called for peer 
[[63019,0],0]


The apps are trying to connect back to mpirun using the following addresses:

tcp://10.10.1.3,132.219.137.36,10.12.1.3:34237

The initial attempt is here

[helios-login1:28558] [[63019,1],0] orte_tcp_peer_try_connect: attempting to 
connect to proc [[63019,0],0] on 10.10.1.3:34237 - 0 retries

I know there is a failover bug in the 1.8 series, and so if that connection got 
rejected the proc would abort. Should we be using a different network? If so, 
telling us via the oob_tcp_if_include param would be the solution.


On Aug 18, 2014, at 10:04 AM, Maxime Boissonneault 
 wrote:


Here it is.

Maxime

Le 2014-08-18 12:59, Ralph Castain a écrit :

Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt


Maxime

Le 2014-08-18 12:48, Ralph Castain a écrit :

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
Le 2014-08-18 12:30, Joshua Ladd a écrit :

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25052.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Here it is.

Maxime

Le 2014-08-18 12:59, Ralph Castain a écrit :

Ah...now that showed the problem. To pinpoint it better, please add

-mca oob_base_verbose 10

and I think we'll have it

On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault 
 wrote:


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt


Maxime

Le 2014-08-18 12:48, Ralph Castain a écrit :

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
Le 2014-08-18 12:30, Joshua Ladd a écrit :

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25052.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25053.php


--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25054.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25055.php



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



output_ringc_verbose2.txt.gz
Description: GNU Zip compressed data

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


This is all one one node indeed.

Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca 
state_base_verbose 5 -mca errmgr_base_verbose 5  ring_c |& tee 
output_ringc_verbose.txt



Maxime

Le 2014-08-18 12:48, Ralph Castain a écrit :

This is all on one node, yes?

Try adding the following:

-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5

Lot of garbage, but should tell us what is going on.

On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault 
 wrote:


Here it is
Le 2014-08-18 12:30, Joshua Ladd a écrit :

mpirun -np 4 --mca plm_base_verbose 10

[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm components
[helios-login1:27853] mca: base: components_register: found loaded component 
isolated
[helios-login1:27853] mca: base: components_register: component isolated has no 
register or open function
[helios-login1:27853] mca: base: components_register: found loaded component rsh
[helios-login1:27853] mca: base: components_register: component rsh register 
function successful
[helios-login1:27853] mca: base: components_register: found loaded component tm
[helios-login1:27853] mca: base: components_register: component tm register 
function successful
[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated open 
function successful
[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open function 
successful
[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open function 
successful
[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. Query 
failed to return a module
[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25052.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25053.php



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique



output_ringc_verbose.txt.gz
Description: GNU Zip compressed data

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Here it is
Le 2014-08-18 12:30, Joshua Ladd a écrit :
mpirun -np 4 --mca plm_base_verbose 10 
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 
10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm 
components
[helios-login1:27853] mca: base: components_register: found loaded 
component isolated
[helios-login1:27853] mca: base: components_register: component isolated 
has no register or open function
[helios-login1:27853] mca: base: components_register: found loaded 
component rsh
[helios-login1:27853] mca: base: components_register: component rsh 
register function successful
[helios-login1:27853] mca: base: components_register: found loaded 
component tm
[helios-login1:27853] mca: base: components_register: component tm 
register function successful

[helios-login1:27853] mca: base: components_open: opening plm components
[helios-login1:27853] mca: base: components_open: found loaded component 
isolated
[helios-login1:27853] mca: base: components_open: component isolated 
open function successful

[helios-login1:27853] mca: base: components_open: found loaded component rsh
[helios-login1:27853] mca: base: components_open: component rsh open 
function successful

[helios-login1:27853] mca: base: components_open: found loaded component tm
[helios-login1:27853] mca: base: components_open: component tm open 
function successful

[helios-login1:27853] mca:base:select: Auto-selecting plm components
[helios-login1:27853] mca:base:select:(  plm) Querying component [isolated]
[helios-login1:27853] mca:base:select:(  plm) Query of component 
[isolated] set priority to 0

[helios-login1:27853] mca:base:select:(  plm) Querying component [rsh]
[helios-login1:27853] mca:base:select:(  plm) Query of component [rsh] 
set priority to 10

[helios-login1:27853] mca:base:select:(  plm) Querying component [tm]
[helios-login1:27853] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module

[helios-login1:27853] mca:base:select:(  plm) Selected component [rsh]
[helios-login1:27853] mca: base: close: component isolated closed
[helios-login1:27853] mca: base: close: unloading component isolated
[helios-login1:27853] mca: base: close: component tm closed
[helios-login1:27853] mca: base: close: unloading component tm
[helios-login1:27853] mca: base: close: component rsh closed
[helios-login1:27853] mca: base: close: unloading component rsh
[mboisson@helios-login1 examples]$ echo $?
65


Maxime

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Hi,
I just did compile without Cuda, and the result is the same. No output, 
exits with code 65.


[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 =>  (0x7fff3ab31000)
libmpi.so.1 => 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 
(0x7fab9ec2a000)

libpthread.so.0 => /lib64/libpthread.so.0 (0x00381c00)
libc.so.6 => /lib64/libc.so.6 (0x00381bc0)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x00381c80)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x00381c40)
libopen-rte.so.7 => 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 
(0x7fab9e932000)

libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x00391820)
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x003917e0)
libz.so.1 => /lib64/libz.so.1 (0x00381cc0)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x00382100)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x00382300)
libopen-pal.so.6 => 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 
(0x7fab9e64a000)

libdl.so.2 => /lib64/libdl.so.2 (0x00381b80)
librt.so.1 => /lib64/librt.so.1 (0x0035b360)
libm.so.6 => /lib64/libm.so.6 (0x003c25a0)
libutil.so.1 => /lib64/libutil.so.1 (0x003f7100)
/lib64/ld-linux-x86-64.so.2 (0x00381b40)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003917a0)
libgcc_s.so.1 => 
/software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x7fab9e433000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 
(0x00382240)

libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00382140)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00381e40)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00382180)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 
(0x003821c0)

libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00382200)
libresolv.so.2 => /lib64/libresolv.so.2 (0x00381dc0)
libselinux.so.1 => /lib64/libselinux.so.1 (0x00381d00)
[mboisson@helios-login1 examples]$ mpiexec ring_c
[mboisson@helios-login1 examples]$ echo $?
65


Maxime


Le 2014-08-16 06:22, Jeff Squyres (jsquyres) a écrit :

Just out of curiosity, I saw that one of the segv stack traces involved the 
cuda stack.

Can you try a build without CUDA and see if that resolves the problem?



On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault 
 wrote:


Hi Jeff,

Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit :

On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault 
 wrote:


Correct.

Can it be because torque (pbs_mom) is not running on the head node and mpiexec 
attempts to contact it ?

Not for Open MPI's mpiexec, no.

Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM 
stuff (i.e., Torque stuff) if it sees the environment variable markers 
indicating that it's inside a Torque job.  If not, it just uses rsh/ssh (or 
localhost launch in your case, since you didn't specify any hosts).

If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI 
"hostname" command from Linux), then something is seriously borked with your Open MPI 
installation.

mpirun -np 4 hostname works fine :
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0


Try running with:

 mpirun -np 4 --mca plm_base_verbose 10 hostname

This should show the steps OMPI is trying to take to launch the 4 copies of 
"hostname" and potentially give some insight into where it's hanging.

Also, just to make sure, you have ensured that you're compiling everything with 
a single compiler toolchain, and the support libraries from that specific 
compiler toolchain are available on any server on which you're running (to 
include the head node and compute nodes), right?

Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the 
same results). Almost every software (that is compiler, toolchain, etc.) is 
installed on lustre, from sources and is the same on both the login (head) node 
and the compute.

The few differences between the head node and the compute :
1) Computes are in RAMFS - login is installed on disk
2) Computes and login node have different hardware configuration (computes have 
GPUs, head node does not).
3) Login node has MORE CentOS6 packages than computes (such as the -devel 
packages, some fonts/X11 libraries, etc.), but all the packages that are on the 
computes are also on the login node.


And you've verified that PAT

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-16 Thread Maxime Boissonneault

There is indeed also a problem with MPI + Cuda.

This problem however is deeper, since it happens with Mvapich2 1.9,
OpenMPI 1.6.5/1.8.1/1.8.2rc4, Cuda 5.5.22/6.0.37. From my tests,
everything works fine with MPI + Cuda on a single node, but as soon as I
got to MPI + Cuda accross nodes, I get segv. I suspect something either
with the ofed (we use linux ofed rdma, not the Mellanox stack) or the
nvidia drivers (we are a couple minor versions behind). My next step is
to try and upgrade those.

I do not think this problem is related to not being able to run ring_c
on the head node however, because it runs fine with 1.6.5 and ring_c
does not involve cuda.

Maxime

Le 2014-08-16 06:22, Jeff Squyres (jsquyres) a écrit :

Just out of curiosity, I saw that one of the segv stack traces involved the
cuda stack.

Can you try a build without CUDA and see if that resolves the problem?

On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault
wrote:

Hi Jeff,

Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit :

On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault
wrote:

Correct.

Can it be because torque (pbs_mom) is not running on the head node and mpiexec
attempts to contact it ?

Not for Open MPI's mpiexec, no.

Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM
stuff (i.e., Torque stuff) if it sees the environment variable markers
indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or
localhost launch in your case, since you didn't specify any hosts).

If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI
"hostname" command from Linux), then something is seriously borked with your Open MPI
installation.

mpirun -np 4 hostname works fine :
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0

Try running with:

mpirun -np 4 --mca plm_base_verbose 10 hostname

This should show the steps OMPI is trying to take to launch the 4 copies of
"hostname" and potentially give some insight into where it's hanging.

Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the
same results). Almost every software (that is compiler, toolchain, etc.) is
installed on lustre, from sources and is the same on both the login (head) node
and the compute.

The few differences between the head node and the compute :
1) Computes are in RAMFS - login is installed on disk
2) Computes and login node have different hardware configuration (computes have
GPUs, head node does not).
3) Login node has MORE CentOS6 packages than computes (such as the -devel
packages, some fonts/X11 libraries, etc.), but all the packages that are on the
computes are also on the login node.

As per the content of "env.out" in the archive, yes. They point to the OMPI
1.8.2rc4 installation directories, on lustre, and are shared between the head node and
the compute nodes.

Maxime
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/08/25043.php

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

Hi Jeff,

Le 2014-08-15 17:50, Jeff Squyres (jsquyres) a écrit :

On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault
wrote:

Correct.

Can it be because torque (pbs_mom) is not running on the head node and mpiexec
attempts to contact it ?

Not for Open MPI's mpiexec, no.

If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI
"hostname" command from Linux), then something is seriously borked with your Open MPI
installation.

Try running with:

mpirun -np 4 --mca plm_base_verbose 10 hostname

This should show the steps OMPI is trying to take to launch the 4 copies of
"hostname" and potentially give some insight into where it's hanging.

Also, just to make sure, you have ensured that you're compiling everything with
a single compiler toolchain, and the support libraries from that specific
compiler toolchain are available on any server on which you're running (to
include the head node and compute nodes), right?
Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6
with the same results). Almost every software (that is compiler,
toolchain, etc.) is installed on lustre, from sources and is the same on
both the login (head) node and the compute.

The few differences between the head node and the compute :
1) Computes are in RAMFS - login is installed on disk
2) Computes and login node have different hardware configuration
(computes have GPUs, head node does not).
3) Login node has MORE CentOS6 packages than computes (such as the
-devel packages, some fonts/X11 libraries, etc.), but all the packages
that are on the computes are also on the login node.

And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the
Open MPI installation that you expect it to point to. E.g., if you "ldd ring_c", it
shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you
expect. Etc.
As per the content of "env.out" in the archive, yes. They point to the
OMPI 1.8.2rc4 installation directories, on lustre, and are shared
between the head node and the compute nodes.

Maxime

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Correct.

Can it be because torque (pbs_mom) is not running on the head node and 
mpiexec attempts to contact it ?


Maxime

Le 2014-08-15 17:31, Joshua Ladd a écrit :
But OMPI 1.8.x does run the ring_c program successfully on your 
compute node, right? The error only happens on the front-end login 
node if I understood you correctly.


Josh


On Fri, Aug 15, 2014 at 5:20 PM, Maxime Boissonneault 
<mailto:maxime.boissonnea...@calculquebec.ca>> wrote:


Here are the requested files.

In the archive, you will find the output of configure, make, make
install as well as the config.log, the environment when running
ring_c and the ompi_info --all.

Just for a reminder, the ring_c example compiled and ran, but
produced no output when running and exited with code 65.

Thanks,

Maxime

Le 2014-08-14 15:26, Joshua Ladd a écrit :

One more, Maxime, can you please make sure you've covered
everything here:

http://www.open-mpi.org/community/help/

Josh


On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd
mailto:jladd.m...@gmail.com>> wrote:

And maybe include your LD_LIBRARY_PATH

Josh


On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd
mailto:jladd.m...@gmail.com>> wrote:

Can you try to run the example code "ring_c" across nodes?

Josh


    On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault
mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

Yes,
Everything has been built with GCC 4.8.x, although x
might have changed between the OpenMPI 1.8.1 build
and the gromacs build. For OpenMPI 1.8.2rc4 however,
it was the exact same compiler for everything.

Maxime

Le 2014-08-14 14:57, Joshua Ladd a écrit :

Hmmm...weird. Seems like maybe a mismatch between
libraries. Did you build OMPI with the same compiler
as you did GROMACS/Charm++?

I'm stealing this suggestion from an old Gromacs
forum with essentially the same symptom:

"Did you compile Open MPI and Gromacs with the same
compiler (i.e. both gcc and the same version)? You
write you tried different OpenMPI versions and
different GCC versions but it is unclear whether
those match. Can you provide more detail how you
compiled (including all options you specified)? Have
you tested any other MPI program linked against
those Open MPI versions? Please make sure (e.g. with
ldd) that the MPI and pthread library you compiled
against is also used for execution. If you compiled
and run on different hosts, check whether the error
still occurs when executing on the build host."

http://redmine.gromacs.org/issues/1025

Josh




    On Thu, Aug 14, 2014 at 2:40 PM, Maxime
Boissonneault mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

I just tried Gromacs with two nodes. It crashes,
but with a different error. I get
[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not
mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-13:142156] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
[gpu-k20-13:142156] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
[gpu-k20-13:142156] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83]
[gpu-k20-13:142156] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da]
[gpu-k20-13:142156] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933]
[gpu-k20-13:142156] [ 5]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965]
[gpu-k20-13:142156] [ 6]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a]
[gpu-k20-13:142156] [ 7]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b]
[gpu-k20-13:142156] [ 8]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a]
[gpu-k20-13:142156] [ 9]

/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Here are the requested files.

In the archive, you will find the output of configure, make, make 
install as well as the config.log, the environment when running ring_c 
and the ompi_info --all.


Just for a reminder, the ring_c example compiled and ran, but produced 
no output when running and exited with code 65.


Thanks,

Maxime

Le 2014-08-14 15:26, Joshua Ladd a écrit :

One more, Maxime, can you please make sure you've covered everything here:

http://www.open-mpi.org/community/help/

Josh


On Thu, Aug 14, 2014 at 3:18 PM, Joshua Ladd <mailto:jladd.m...@gmail.com>> wrote:


And maybe include your LD_LIBRARY_PATH

Josh


On Thu, Aug 14, 2014 at 3:16 PM, Joshua Ladd mailto:jladd.m...@gmail.com>> wrote:

Can you try to run the example code "ring_c" across nodes?

Josh


On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault
mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

Yes,
Everything has been built with GCC 4.8.x, although x might
have changed between the OpenMPI 1.8.1 build and the
gromacs build. For OpenMPI 1.8.2rc4 however, it was the
exact same compiler for everything.

Maxime

Le 2014-08-14 14:57, Joshua Ladd a écrit :

Hmmm...weird. Seems like maybe a mismatch between
libraries. Did you build OMPI with the same compiler as
you did GROMACS/Charm++?

I'm stealing this suggestion from an old Gromacs forum
with essentially the same symptom:

"Did you compile Open MPI and Gromacs with the same
compiler (i.e. both gcc and the same version)? You write
you tried different OpenMPI versions and different GCC
versions but it is unclear whether those match. Can you
provide more detail how you compiled (including all
options you specified)? Have you tested any other MPI
program linked against those Open MPI versions? Please
make sure (e.g. with ldd) that the MPI and pthread
library you compiled against is also used for execution.
If you compiled and run on different hosts, check whether
the error still occurs when executing on the build host."

http://redmine.gromacs.org/issues/1025

Josh




    On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault
mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

I just tried Gromacs with two nodes. It crashes, but
with a different error. I get
[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-13:142156] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
[gpu-k20-13:142156] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
[gpu-k20-13:142156] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83]
[gpu-k20-13:142156] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da]
[gpu-k20-13:142156] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933]
[gpu-k20-13:142156] [ 5]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965]
[gpu-k20-13:142156] [ 6]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a]
[gpu-k20-13:142156] [ 7]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b]
[gpu-k20-13:142156] [ 8]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a]
[gpu-k20-13:142156] [ 9]

/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5]
[gpu-k20-13:142156] [10]

/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be]
[gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb]
[gpu-k20-13:142156] [12]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d]
[gpu-k20-13:142156] [13] mdrunmpi[0x407be1]
[gpu-k20-13:142156] *** End of error message ***

--
mpiexec noticed that process rank 0 with PID 142156
on node gpu-k20-1

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

Hi,
I solved the warning that appeared with OpenMPI 1.6.5 on the login node.
I increased the registrable memory.

Now, with OpenMPI 1.6.5, it does not give any warning. Yet, with OpenMPI
1.8.1 and OpenMPI 1.8.2rc4, it still exits with error code 65 and does
not produce the normal output.

I will recompile it from scratch and provide all the information
requested on the help webpage.

Cheers,

Maxime

Le 2014-08-15 11:58, Maxime Boissonneault a écrit :

Hi Josh,
The ring_c example does not work on our login node :
[mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c
[mboisson@helios-login1 examples]$ echo $?
65

[mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:/software-gpu/cuda/6.0.37/lib64:/software-gpu/cuda/6.0.37/lib:/software6/compilers/gcc/4.8/lib64:/software6/compilers/gcc/4.8/lib:/software6/apps/buildtools/20140527/lib64:/software6/apps/buildtools/20140527/lib

It does work on our compute nodes however.

If I compile and run this with OpenMPI 1.6.5, it gives a warning, but
it does work on our login note :

[mboisson@helios-login1 examples]$ mpiexec ring_c
--
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI
jobs to

run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module

parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host: helios-login1
Registerable memory: 32768 MiB
Total memory:65457 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting

Could the warning be causing a failure with OpenMPI 1.8.x ?
I suspect it does work on our compute nodes because they are
configured to allow more locked pages. I do not understand however how
a simple ring test should require that much memory.

Maxime

Le 2014-08-14 15:16, Joshua Ladd a écrit :

Can you try to run the example code "ring_c" across nodes?

Josh

On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault
<mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

Yes,
Everything has been built with GCC 4.8.x, although x might have
changed between the OpenMPI 1.8.1 build and the gromacs build.
For OpenMPI 1.8.2rc4 however, it was the exact same compiler for
everything.

Maxime

Le 2014-08-14 14:57, Joshua Ladd a écrit :

Hmmm...weird. Seems like maybe a mismatch between libraries. Did
you build OMPI with the same compiler as you did GROMACS/Charm++?

I'm stealing this suggestion from an old Gromacs forum with
essentially the same symptom:

"Did you compile Open MPI and Gromacs with the same compiler
(i.e. both gcc and the same version)? You write you tried
different OpenMPI versions and different GCC versions but it is
unclear whether those match. Can you provide more detail how you
compiled (including all options you specified)? Have you tested
any other MPI program linked against those Open MPI versions?
Please make sure (e.g. with ldd) that the MPI and pthread
library you compiled against is also used for execution. If you
compiled and run on different hosts, check whether the error
still occurs when executing on the build host."

http://redmine.gromacs.org/issues/1025

Josh

On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault
mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

I just tried Gromacs with two nodes. It crashes, but with a
different error. I get
[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-13:142156] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
[gpu-k20-13:142156] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
[gp

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Hi Josh,
The ring_c example does not work on our login node :
[mboisson@helios-login1 examples]$ mpiexec -np 10 ring_c
[mboisson@helios-login1 examples]$ echo $?
65

[mboisson@helios-login1 examples]$ echo $LD_LIBRARY_PATH
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib:/usr/lib64/nvidia:/software-gpu/cuda/6.0.37/lib64:/software-gpu/cuda/6.0.37/lib:/software6/compilers/gcc/4.8/lib64:/software6/compilers/gcc/4.8/lib:/software6/apps/buildtools/20140527/lib64:/software6/apps/buildtools/20140527/lib


It does work on our compute nodes however.



If I compile and run this with OpenMPI 1.6.5, it gives a warning, but it 
does work on our login note :

[mboisson@helios-login1 examples]$ mpiexec ring_c
--
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:  helios-login1
  Registerable memory: 32768 MiB
  Total memory:65457 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting



Could the warning be causing a failure with OpenMPI 1.8.x ?
I suspect it does work on our compute nodes because they are configured 
to allow more locked pages. I do not understand however how a simple 
ring test should require that much memory.



Maxime




Le 2014-08-14 15:16, Joshua Ladd a écrit :

Can you try to run the example code "ring_c" across nodes?

Josh


On Thu, Aug 14, 2014 at 3:14 PM, Maxime Boissonneault 
<mailto:maxime.boissonnea...@calculquebec.ca>> wrote:


Yes,
Everything has been built with GCC 4.8.x, although x might have
changed between the OpenMPI 1.8.1 build and the gromacs build. For
OpenMPI 1.8.2rc4 however, it was the exact same compiler for
everything.

Maxime

Le 2014-08-14 14:57, Joshua Ladd a écrit :

Hmmm...weird. Seems like maybe a mismatch between libraries. Did
you build OMPI with the same compiler as you did GROMACS/Charm++?

I'm stealing this suggestion from an old Gromacs forum with
essentially the same symptom:

"Did you compile Open MPI and Gromacs with the same compiler
(i.e. both gcc and the same version)? You write you tried
different OpenMPI versions and different GCC versions but it is
unclear whether those match. Can you provide more detail how you
compiled (including all options you specified)? Have you tested
any other MPI program linked against those Open MPI versions?
Please make sure (e.g. with ldd) that the MPI and pthread library
you compiled against is also used for execution. If you compiled
and run on different hosts, check whether the error still occurs
when executing on the build host."

http://redmine.gromacs.org/issues/1025

Josh




    On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault
mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

I just tried Gromacs with two nodes. It crashes, but with a
different error. I get
[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-13:142156] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
[gpu-k20-13:142156] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
[gpu-k20-13:142156] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83]
[gpu-k20-13:142156] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da]
[gpu-k20-13:142156] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933]
[gpu-k20-13:142156] [ 5]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965]
[gpu-k20-13:142156] [ 6]

/software-gpu/cud

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Yes,
Everything has been built with GCC 4.8.x, although x might have changed 
between the OpenMPI 1.8.1 build and the gromacs build. For OpenMPI 
1.8.2rc4 however, it was the exact same compiler for everything.


Maxime

Le 2014-08-14 14:57, Joshua Ladd a écrit :
Hmmm...weird. Seems like maybe a mismatch between libraries. Did you 
build OMPI with the same compiler as you did GROMACS/Charm++?


I'm stealing this suggestion from an old Gromacs forum with 
essentially the same symptom:


"Did you compile Open MPI and Gromacs with the same compiler (i.e. 
both gcc and the same version)? You write you tried different OpenMPI 
versions and different GCC versions but it is unclear whether those 
match. Can you provide more detail how you compiled (including all 
options you specified)? Have you tested any other MPI program linked 
against those Open MPI versions? Please make sure (e.g. with ldd) that 
the MPI and pthread library you compiled against is also used for 
execution. If you compiled and run on different hosts, check whether 
the error still occurs when executing on the build host."


http://redmine.gromacs.org/issues/1025

Josh




On Thu, Aug 14, 2014 at 2:40 PM, Maxime Boissonneault 
<mailto:maxime.boissonnea...@calculquebec.ca>> wrote:


I just tried Gromacs with two nodes. It crashes, but with a
different error. I get
[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-13:142156] [ 0]
/lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
[gpu-k20-13:142156] [ 1]
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
[gpu-k20-13:142156] [ 2]
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83]
[gpu-k20-13:142156] [ 3]
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da]
[gpu-k20-13:142156] [ 4]
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933]
[gpu-k20-13:142156] [ 5]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965]
[gpu-k20-13:142156] [ 6]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a]
[gpu-k20-13:142156] [ 7]
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b]
[gpu-k20-13:142156] [ 8]

/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a]
[gpu-k20-13:142156] [ 9]

/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5]
[gpu-k20-13:142156] [10]

/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be]
[gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb]
[gpu-k20-13:142156] [12]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d]
[gpu-k20-13:142156] [13] mdrunmpi[0x407be1]
[gpu-k20-13:142156] *** End of error message ***
--
mpiexec noticed that process rank 0 with PID 142156 on node
gpu-k20-13 exited on signal 11 (Segmentation fault).
--



We do not have MPI_THREAD_MULTIPLE enabled in our build, so
Charm++ cannot be using this level of threading. The configure
line for OpenMPI was
./configure --prefix=$PREFIX \
  --with-threads --with-verbs=yes --enable-shared
--enable-static \
--with-io-romio-flags="--with-file-system=nfs+lustre" \
   --without-loadleveler --without-slurm --with-tm \
   --with-cuda=$(dirname $(dirname $(which nvcc)))

Maxime


Le 2014-08-14 14:20, Joshua Ladd a écrit :

What about between nodes? Since this is coming from the OpenIB
BTL, would be good to check this.

Do you know what the MPI thread level is set to when used with
the Charm++ runtime? Is it MPI_THREAD_MULTIPLE? The OpenIB BTL is
not thread safe.

Josh


    On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boissonneault
mailto:maxime.boissonnea...@calculquebec.ca>> wrote:

Hi,
I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37
on a single node, with 8 ranks and multiple OpenMP threads.

Maxime


Le 2014-08-14 14:15, Joshua Ladd a écrit :

Hi, Maxime

Just curious, are you able to run a vanilla MPI program? Can
you try one one of the example programs in the "examples"
subdirectory. Looks like a threading issue to me.

Thanks,

Josh



___ users
mailing list us...@open-mpi.org <mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users

Link to this 
post:http://

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

I just tried Gromacs with two nodes. It crashes, but with a different 
error. I get

[gpu-k20-13:142156] *** Process received signal ***
[gpu-k20-13:142156] Signal: Segmentation fault (11)
[gpu-k20-13:142156] Signal code: Address not mapped (1)
[gpu-k20-13:142156] Failing at address: 0x8
[gpu-k20-13:142156] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac5d070c710]
[gpu-k20-13:142156] [ 1] 
/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac5ddfbcacf]
[gpu-k20-13:142156] [ 2] 
/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac5ddf82a83]
[gpu-k20-13:142156] [ 3] 
/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac5ddeb42da]
[gpu-k20-13:142156] [ 4] 
/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac5ddea0933]
[gpu-k20-13:142156] [ 5] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac5d0930965]
[gpu-k20-13:142156] [ 6] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac5d0930a0a]
[gpu-k20-13:142156] [ 7] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac5d0930a3b]
[gpu-k20-13:142156] [ 8] 
/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaDriverGetVersion+0x4a)[0x2ac5d094602a]
[gpu-k20-13:142156] [ 9] 
/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_print_version_info_gpu+0x55)[0x2ac5cf9a90b5]
[gpu-k20-13:142156] [10] 
/software-gpu/apps/gromacs/4.6.5_gcc/lib/libgmxmpi.so.8(gmx_log_open+0x17e)[0x2ac5cf54b9be]

[gpu-k20-13:142156] [11] mdrunmpi(cmain+0x1cdb)[0x43b4bb]
[gpu-k20-13:142156] [12] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac5d1534d1d]

[gpu-k20-13:142156] [13] mdrunmpi[0x407be1]
[gpu-k20-13:142156] *** End of error message ***
--
mpiexec noticed that process rank 0 with PID 142156 on node gpu-k20-13 
exited on signal 11 (Segmentation fault).

--



We do not have MPI_THREAD_MULTIPLE enabled in our build, so Charm++ 
cannot be using this level of threading. The configure line for OpenMPI was

./configure --prefix=$PREFIX \
  --with-threads --with-verbs=yes --enable-shared --enable-static \
  --with-io-romio-flags="--with-file-system=nfs+lustre" \
   --without-loadleveler --without-slurm --with-tm \
   --with-cuda=$(dirname $(dirname $(which nvcc)))

Maxime


Le 2014-08-14 14:20, Joshua Ladd a écrit :
What about between nodes? Since this is coming from the OpenIB BTL, 
would be good to check this.


Do you know what the MPI thread level is set to when used with the 
Charm++ runtime? Is it MPI_THREAD_MULTIPLE? The OpenIB BTL is not 
thread safe.


Josh


On Thu, Aug 14, 2014 at 2:17 PM, Maxime Boissonneault 
<mailto:maxime.boissonnea...@calculquebec.ca>> wrote:


Hi,
I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a
single node, with 8 ranks and multiple OpenMP threads.

Maxime


Le 2014-08-14 14:15, Joshua Ladd a écrit :

Hi, Maxime

Just curious, are you able to run a vanilla MPI program? Can you
try one one of the example programs in the "examples"
subdirectory. Looks like a threading issue to me.

Thanks,

Josh



___ users mailing
list us...@open-mpi.org <mailto:us...@open-mpi.org> Subscription:
http://www.open-mpi.org/mailman/listinfo.cgi/users

Link to this 
post:http://www.open-mpi.org/community/lists/users/2014/08/25023.php




___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/08/25024.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25025.php



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Hi,
I ran gromacs successfully with OpenMPI 1.8.1 and Cuda 6.0.37 on a 
single node, with 8 ranks and multiple OpenMP threads.


Maxime


Le 2014-08-14 14:15, Joshua Ladd a écrit :

Hi, Maxime

Just curious, are you able to run a vanilla MPI program? Can you try 
one one of the example programs in the "examples" subdirectory. Looks 
like a threading issue to me.


Thanks,

Josh



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25023.php

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Hi,
I just did with 1.8.2rc4 and it does the same :

[mboisson@helios-login1 simplearrayhello]$ ./hello
[helios-login1:11739] *** Process received signal ***
[helios-login1:11739] Signal: Segmentation fault (11)
[helios-login1:11739] Signal code: Address not mapped (1)
[helios-login1:11739] Failing at address: 0x30
[helios-login1:11739] [ 0] /lib64/libpthread.so.0[0x381c00f710]
[helios-login1:11739] [ 1] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xfa238)[0x7f7166a04238]
[helios-login1:11739] [ 2] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xfbad4)[0x7f7166a05ad4]
[helios-login1:11739] [ 3] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f71669ffddf]
[helios-login1:11739] [ 4] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe4773)[0x7f71669ee773]
[helios-login1:11739] [ 5] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f71669e46a8]
[helios-login1:11739] [ 6] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f71669e3fd1]
[helios-login1:11739] [ 7] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f71669e275f]
[helios-login1:11739] [ 8] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e602f)[0x7f7166af002f]
[helios-login1:11739] [ 9] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f7166aedc26]
[helios-login1:11739] [10] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e3)[0x7f7166988863]
[helios-login1:11739] [11] 
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f71669a86fd]

[helios-login1:11739] [12] ./hello(LrtsInit+0x72)[0x4fcf02]
[helios-login1:11739] [13] ./hello(ConverseInit+0x70)[0x4ff680]
[helios-login1:11739] [14] ./hello(main+0x27)[0x470767]
[helios-login1:11739] [15] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d]

[helios-login1:11739] [16] ./hello[0x470b71]
[helios-login1:11739] *** End of error message



Maxime

Le 2014-08-14 10:04, Jeff Squyres (jsquyres) a écrit :

Can you try the latest 1.8.2 rc tarball?  (just released yesterday)

 http://www.open-mpi.org/software/ompi/v1.8/



On Aug 14, 2014, at 8:39 AM, Maxime Boissonneault 
 wrote:


Hi,
I compiled Charm++ 6.6.0rc3 using
./build charm++ mpi-linux-x86_64 smp --with-production

When compiling the simple example
mpi-linux-x86_64-smp/tests/charm++/simplearrayhello/

I get a segmentation fault that traces back to OpenMPI :
[mboisson@helios-login1 simplearrayhello]$ ./hello
[helios-login1:01813] *** Process received signal ***
[helios-login1:01813] Signal: Segmentation fault (11)
[helios-login1:01813] Signal code: Address not mapped (1)
[helios-login1:01813] Failing at address: 0x30
[helios-login1:01813] [ 0] /lib64/libpthread.so.0[0x381c00f710]
[helios-login1:01813] [ 1] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf78f8)[0x7f2cd1f6b8f8]
[helios-login1:01813] [ 2] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf8f64)[0x7f2cd1f6cf64]
[helios-login1:01813] [ 3] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f2cd1f672af]
[helios-login1:01813] [ 4] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe1ad7)[0x7f2cd1f55ad7]
[helios-login1:01813] [ 5] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f2cd1f4bf28]
[helios-login1:01813] [ 6] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f2cd1f4b851]
[helios-login1:01813] [ 7] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f2cd1f4a03f]
[helios-login1:01813] [ 8] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e0d17)[0x7f2cd2054d17]
[helios-login1:01813] [ 9] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f2cd20529d6]
[helios-login1:01813] [10] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e4)[0x7f2cd1ef0c14]
[helios-login1:01813] [11] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f2cd1f1065d]
[helios-login1:01813] [12] ./hello(LrtsInit+0x72)[0x4fcf02]
[helios-login1:01813] [13] ./hello(ConverseInit+0x70)[0x4ff680]
[helios-login1:01813] [14] ./hello(main+0x27)[0x470767]
[helios-login1:01813] [15] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d]
[helios-login1:01813] [16] ./hello[0x470b71]


Anyone has a clue how to fix this ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Running a hybrid MPI+openMP program


Hi,
You DEFINITELY need to disable OpenMPI's new default binding. Otherwise, 
your N threads will run on a single core. --bind-to socket would be my 
recommendation for hybrid jobs.


Maxime


Le 2014-08-14 10:04, Jeff Squyres (jsquyres) a écrit :

I don't know much about OpenMP, but do you need to disable Open MPI's default 
bind-to-core functionality (I'm assuming you're using Open MPI 1.8.x)?

You can try "mpirun --bind-to none ...", which will have Open MPI not bind MPI 
processes to cores, which might allow OpenMP to think that it can use all the cores, and 
therefore it will spawn num_cores threads...?


On Aug 14, 2014, at 9:50 AM, Oscar Mojica  wrote:


Hello everybody

I am trying to run a hybrid mpi + openmp program in a cluster.  I created a 
queue with 14 machines, each one with 16 cores. The program divides the work 
among the 14 processors with MPI and within each processor a loop is also 
divided into 8 threads for example, using openmp. The problem is that when I 
submit the job to the queue the MPI processes don't divide the work into 
threads and the program prints the number of threads  that are working within 
each process as one.

I made a simple test program that uses openmp and  I logged in one machine of 
the fourteen. I compiled it using gfortran -fopenmp program.f -o exe,  set the 
OMP_NUM_THREADS environment variable equal to 8  and when I ran directly in the 
terminal the loop was effectively divided among the cores and for example in 
this case the program printed the number of threads equal to 8

This is my Makefile
  
# Start of the makefile

# Defining variables
objects = inv_grav3d.o funcpdf.o gr3dprm.o fdjac.o dsvd.o
#f90comp = /opt/openmpi/bin/mpif90
f90comp = /usr/bin/mpif90
#switch = -O3
executable = inverse.exe
# Makefile
all : $(executable)
$(executable) : $(objects)  
$(f90comp) -fopenmp -g -O -o $(executable) $(objects)
rm $(objects)
%.o: %.f
$(f90comp) -c $<
# Cleaning everything
clean:
rm $(executable)
#   rm $(objects)
# End of the makefile

and the script that i am using is

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -pe orte 14
#$ -N job
#$ -q new.q

export OMP_NUM_THREADS=8
/usr/bin/time -f "%E" /opt/openmpi/bin/mpirun -v -np $NSLOTS ./inverse.exe

am I forgetting something?

Thanks,

Oscar Fabian Mojica Ladino
Geologist M.S. in  Geophysics
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/25016.php





--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1


Note that if I do the same build with OpenMPI 1.6.5, it works flawlessly.

Maxime


Le 2014-08-14 08:39, Maxime Boissonneault a écrit :

Hi,
I compiled Charm++ 6.6.0rc3 using
./build charm++ mpi-linux-x86_64 smp --with-production

When compiling the simple example
mpi-linux-x86_64-smp/tests/charm++/simplearrayhello/

I get a segmentation fault that traces back to OpenMPI :
[mboisson@helios-login1 simplearrayhello]$ ./hello
[helios-login1:01813] *** Process received signal ***
[helios-login1:01813] Signal: Segmentation fault (11)
[helios-login1:01813] Signal code: Address not mapped (1)
[helios-login1:01813] Failing at address: 0x30
[helios-login1:01813] [ 0] /lib64/libpthread.so.0[0x381c00f710]
[helios-login1:01813] [ 1] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf78f8)[0x7f2cd1f6b8f8]
[helios-login1:01813] [ 2] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf8f64)[0x7f2cd1f6cf64]
[helios-login1:01813] [ 3] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f2cd1f672af]
[helios-login1:01813] [ 4] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe1ad7)[0x7f2cd1f55ad7]
[helios-login1:01813] [ 5] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f2cd1f4bf28]
[helios-login1:01813] [ 6] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f2cd1f4b851]
[helios-login1:01813] [ 7] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f2cd1f4a03f]
[helios-login1:01813] [ 8] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e0d17)[0x7f2cd2054d17]
[helios-login1:01813] [ 9] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f2cd20529d6]
[helios-login1:01813] [10] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e4)[0x7f2cd1ef0c14]
[helios-login1:01813] [11] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f2cd1f1065d]

[helios-login1:01813] [12] ./hello(LrtsInit+0x72)[0x4fcf02]
[helios-login1:01813] [13] ./hello(ConverseInit+0x70)[0x4ff680]
[helios-login1:01813] [14] ./hello(main+0x27)[0x470767]
[helios-login1:01813] [15] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d]

[helios-login1:01813] [16] ./hello[0x470b71]


Anyone has a clue how to fix this ?

Thanks,




--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Segmentation fault in OpenMPI 1.8.1


Hi,
I compiled Charm++ 6.6.0rc3 using
./build charm++ mpi-linux-x86_64 smp --with-production

When compiling the simple example
mpi-linux-x86_64-smp/tests/charm++/simplearrayhello/

I get a segmentation fault that traces back to OpenMPI :
[mboisson@helios-login1 simplearrayhello]$ ./hello
[helios-login1:01813] *** Process received signal ***
[helios-login1:01813] Signal: Segmentation fault (11)
[helios-login1:01813] Signal code: Address not mapped (1)
[helios-login1:01813] Failing at address: 0x30
[helios-login1:01813] [ 0] /lib64/libpthread.so.0[0x381c00f710]
[helios-login1:01813] [ 1] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf78f8)[0x7f2cd1f6b8f8]
[helios-login1:01813] [ 2] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xf8f64)[0x7f2cd1f6cf64]
[helios-login1:01813] [ 3] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_btl_openib_connect_base_select_for_local_port+0xcf)[0x7f2cd1f672af]
[helios-login1:01813] [ 4] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0xe1ad7)[0x7f2cd1f55ad7]
[helios-login1:01813] [ 5] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_btl_base_select+0x168)[0x7f2cd1f4bf28]
[helios-login1:01813] [ 6] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_r2_component_init+0x11)[0x7f2cd1f4b851]
[helios-login1:01813] [ 7] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_bml_base_init+0x7f)[0x7f2cd1f4a03f]
[helios-login1:01813] [ 8] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(+0x1e0d17)[0x7f2cd2054d17]
[helios-login1:01813] [ 9] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(mca_pml_base_select+0x3b6)[0x7f2cd20529d6]
[helios-login1:01813] [10] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(ompi_mpi_init+0x4e4)[0x7f2cd1ef0c14]
[helios-login1:01813] [11] 
/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/libmpi.so.1(MPI_Init_thread+0x15d)[0x7f2cd1f1065d]

[helios-login1:01813] [12] ./hello(LrtsInit+0x72)[0x4fcf02]
[helios-login1:01813] [13] ./hello(ConverseInit+0x70)[0x4ff680]
[helios-login1:01813] [14] ./hello(main+0x27)[0x470767]
[helios-login1:01813] [15] 
/lib64/libc.so.6(__libc_start_main+0xfd)[0x381bc1ed1d]

[helios-login1:01813] [16] ./hello[0x470b71]


Anyone has a clue how to fix this ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Filem could not be found for one user

2014-08-11 Thread Maxime Boissonneault


Hi,
I am getting a weird error when running mpiexec with one user :

[mboisson@gpu-k20-14 helios_test]$ mpiexec -np 2 mdrunmpi -ntomp 10 -s 
prod_s6_01kcal_bb_dr -deffnm testout

--
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  gpu-k20-14
Framework: filem
Component: rsh
--
[gpu-k20-14:205673] mca: base: components_register: registering filem 
components
[gpu-k20-14:205673] [[56298,0],0] ORTE_ERROR_LOG: Not found in file 
ess_hnp_module.c at line 673

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_filem_base_open failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--


What is weird is that this same command works for other users, on the 
same node.


Anyone know what might be going on here ?

Thanks,

--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] How to keep multiple installations at same time

2014-08-05 Thread Maxime Boissonneault




The Environment Modules package user base is not negligible,
including many universities, research centers, national labs,
ans private companies, in the US and around the world.
How does the user base of LMod compare?


The user base certainly is much larger for Environment Modules than LMod.
But, as a user of both Lmod and Environment Modules, I can tell you the 
following :

Regardless of any virtues that LMod may have,
currently I don't see any reason to switch to LMod,
install everything over again
Nothing needs reinstalling. Lmod understands Tcl modules and can work 
fine with your old module tree.

, troubleshoot it,
learn Lua, migrate my modules from Tcl,
Again, migration to Lua is not required. Tcl modules gets converted on 
the fly.

educate my users and convince them to use a new
package to achieve the same exact thing that they currently have,

Very little education has to be done. The commands are the same :
module avail
module load/add
module unload/remove
module use
...

and in the end gain little if any
relevant/useful/new functionality.
If you do not want to make any changes, in the way you organize modules, 
then don't. You will also get no benefit from changing to Lmod in that 
situation.


If you do want to use new features, then there are plenty. Most notably is
- the possibility to organize modules in hierarchy   (which you do not 
HAVE to do, but in my opinion, is much more intuitive).
- the possibility to cache the module structure (and avoid reading it 
from a parallel filesystem every time a user type a module command).
- the possibility to color-code modules so that users can find what they 
want easier out of hundreds of modules


IF you do use hierarchy, you get the added benefit of avoiding user 
mistakes such as


"
module load gcc openmpi_gcc
module unload gcc
module load intel

... why is my MPI not working!
"

IF you do use hierarchy, you get the added benefit of not having silly 
module names such as

fftw/3.3_gcc4.8_openmpi1.6.3
fftw/3.3_gcc4.6_openmpi1.8.1
...

Again, you do NOT have to, but the benefits much outweight the changes 
that need to be made to get them.


My 2 cents,

Maxime Boissonneault



My two cents of opinion
Gus Correa


On 08/05/2014 12:54 PM, Ralph Castain wrote:

Check the repo - hasn't been touched in a very long time

On Aug 5, 2014, at 9:42 AM, Fabricio Cannini  wrote:


On 05-08-2014 13:10, Ralph Castain wrote:
Since modules isn't a supported s/w package any more, you might 
consider using LMOD instead:


https://www.tacc.utexas.edu/tacc-projects/lmod


Modules isn't supported anymore? :O

Could you please send a link about it ?
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24918.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24919.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/08/24924.php



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] poor performance using the openib btl

2014-06-25 Thread Maxime Boissonneault


Hi,
I recovered the name of the option that caused problems for us. It is 
--enable-mpi-thread-multiple


This option enables threading within OPAL, which was bugged (at least in 
1.6.x series). I don't know if it has been fixed in 1.8 series.


I do not see your configure line in the attached file, to see if it was 
enabled or not.


Maxime

Le 2014-06-25 10:46, Fischer, Greg A. a écrit :


Attached are the results of "grep thread" on my configure output. 
There appears to be some amount of threading, but is there anything I 
should look for in particular?


I see Mike Dubman's questions on the mailing list website, but his 
message didn't appear to make it to my inbox. The answers to his 
questions are:


[binford:fischega] $ rpm -qa | grep ofed

ofed-doc-1.5.4.1-0.11.5

ofed-kmp-default-1.5.4.1_3.0.76_0.11-0.11.5

ofed-1.5.4.1-0.11.5

Distro: SLES11 SP3

HCA:

[binf102:fischega] $ /usr/sbin/ibstat

CA 'mlx4_0'

CA type: MT26428

Command line (path and LD_LIBRARY_PATH are set correctly):

mpirun -x LD_LIBRARY_PATH -mca btl openib,sm,self -mca 
btl_openib_verbose 1 -np 31 $CTF_EXEC


*From:*users [mailto:users-boun...@open-mpi.org] *On Behalf Of *Maxime 
Boissonneault

*Sent:* Tuesday, June 24, 2014 6:41 PM
*To:* Open MPI Users
*Subject:* Re: [OMPI users] poor performance using the openib btl

What are your threading options for OpenMPI (when it was built) ?

I have seen OpenIB BTL completely lock when some level of threading is 
enabled before.


Maxime Boissonneault


Le 2014-06-24 18:18, Fischer, Greg A. a écrit :

Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was
having getting openib to work with Torque (see "openib segfaults
with Torque", June 6, 2014). The issues were related to Torque
imposing restrictive limits on locked memory, and have since been
resolved.

However, now that I've had some time to test the applications, I'm
seeing abysmal performance over the openib layer. Applications run
with the tcp btl execute about 10x faster than with the openib
btl. Clearly something still isn't quite right.

I tried running with "-mca btl_openib_verbose 1", but didn't see
anything resembling a smoking gun. How should I go about
determining the source of the problem? (This uses the same OpenMPI
Version 1.8.1 / SLES11 SP3 / GCC 4.8.3 setup discussed previously.)

Thanks,

Greg




___

users mailing list

us...@open-mpi.org  <mailto:us...@open-mpi.org>

Subscription:http://www.open-mpi.org/mailman/listinfo.cgi/users

Link to this 
post:http://www.open-mpi.org/community/lists/users/2014/06/24697.php




--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24700.php



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] poor performance using the openib btl

2014-06-24 Thread Maxime Boissonneault


What are your threading options for OpenMPI (when it was built) ?

I have seen OpenIB BTL completely lock when some level of threading is 
enabled before.


Maxime Boissonneault


Le 2014-06-24 18:18, Fischer, Greg A. a écrit :


Hello openmpi-users,

A few weeks ago, I posted to the list about difficulties I was having 
getting openib to work with Torque (see "openib segfaults with 
Torque", June 6, 2014). The issues were related to Torque imposing 
restrictive limits on locked memory, and have since been resolved.


However, now that I've had some time to test the applications, I'm 
seeing abysmal performance over the openib layer. Applications run 
with the tcp btl execute about 10x faster than with the openib btl. 
Clearly something still isn't quite right.


I tried running with "-mca btl_openib_verbose 1", but didn't see 
anything resembling a smoking gun. How should I go about determining 
the source of the problem? (This uses the same OpenMPI Version 1.8.1 / 
SLES11 SP3 / GCC 4.8.3 setup discussed previously.)


Thanks,

Greg



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/06/24697.php



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-23 Thread Maxime Boissonneault


Hi,
I've been following this thread because it may be relevant to our setup.

Is there a drawback of having orte_hetero_nodes=1 as default MCA 
parameter ? Is there a reason why the most generic case is not assumed ?


Maxime Boissonneault

Le 2014-06-20 13:48, Ralph Castain a écrit :

Put "orte_hetero_nodes=1" in your default MCA param file - uses can override by 
setting that param to 0


On Jun 20, 2014, at 10:30 AM, Brock Palen  wrote:


Perfection!  That appears to do it for our standard case.

Now I know how to set MCA options by env var or config file.  How can I make 
this the default, that then a user can override?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 1:21 PM, Ralph Castain  wrote:


I think I begin to grok at least part of the problem. If you are assigning 
different cpus on each node, then you'll need to tell us that by setting 
--hetero-nodes otherwise we won't have any way to report that back to mpirun 
for its binding calculation.

Otherwise, we expect that the cpuset of the first node we launch a daemon onto 
(or where mpirun is executing, if we are only launching local to mpirun) 
accurately represents the cpuset on every node in the allocation.

We still might well have a bug in our binding computation - but the above will 
definitely impact what you said the user did.

On Jun 20, 2014, at 10:06 AM, Brock Palen  wrote:


Extra data point if I do:

[brockp@nyx5508 34241]$ mpirun --report-bindings --bind-to core hostname
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

  Bind to: CORE
  Node:nyx5513
  #processes:  2
  #cpus:  1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--

[brockp@nyx5508 34241]$ mpirun -H nyx5513 uptime
13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
13:01:37 up 31 days, 23:06,  0 users,  load average: 10.13, 10.90, 12.38
[brockp@nyx5508 34241]$ mpirun -H nyx5513 --bind-to core hwloc-bind --get
0x0010
0x1000
[brockp@nyx5508 34241]$ cat $PBS_NODEFILE | grep nyx5513
nyx5513
nyx5513

Interesting, if I force bind to core, MPI barfs saying there is only 1 cpu 
available, PBS says it gave it two, and if I force (this is all inside an 
interactive job) just on that node hwloc-bind --get I get what I expect,

Is there a way to get a map of what MPI thinks it has on each host?

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
XSEDE Campus Champion
bro...@umich.edu
(734)936-1985



On Jun 20, 2014, at 12:38 PM, Brock Palen  wrote:


I was able to produce it in my test.

orted affinity set by cpuset:
[root@nyx5874 ~]# hwloc-bind --get --pid 103645
0xc002

This mask (1, 14,15) which is across sockets, matches the cpu set setup by the 
batch system.
[root@nyx5874 ~]# cat /dev/cpuset/torque/12719806.nyx.engin.umich.edu/cpus
1,14-15

The ranks though were then all set to the same core:

[root@nyx5874 ~]# hwloc-bind --get --pid 103871
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103872
0x8000
[root@nyx5874 ~]# hwloc-bind --get --pid 103873
0x8000

Which is core 15:

report-bindings gave me:
You can see how a few nodes were bound to all the same core, the last one in 
each case.  I only gave you the results for the hose nyx5874.

[nyx5526.engin.umich.edu:23726] MCW rank 0 is not bound (or bound to all 
available processors)
[nyx5878.engin.umich.edu:103925] MCW rank 8 is not bound (or bound to all 
available processors)
[nyx5533.engin.umich.edu:123988] MCW rank 1 is not bound (or bound to all 
available processors)
[nyx5879.engin.umich.edu:102808] MCW rank 9 is not bound (or bound to all 
available processors)
[nyx5874.engin.umich.edu:103645] MCW rank 41 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 42 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5874.engin.umich.edu:103645] MCW rank 43 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5888.engin.umich.edu:117400] MCW rank 11 is not bound (or bound to all 
available processors)
[nyx5786.engin.umich.edu:30004] MCW rank 19 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5786.engin.umich.edu:30004] MCW rank 18 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 24 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 25 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5594.engin.umich.edu:33884] MCW rank 26 bound to socket 1[core 15[hwt 0]]: 
[./././././././.][./././././././B]
[nyx5798

Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI

2014-05-27 Thread Maxime Boissonneault


Answers inline too.

2) Is the absence of btl_openib_have_driver_gdr an indicator of something
missing ?

Yes, that means that somehow the GPU Direct RDMA is not installed correctly. 
All that check does is make sure that the file 
/sys/kernel/mm/memory_peers/nv_mem/version exists.  Does that exist?


It does not. There is no
/sys/kernel/mm/memory_peers/


3) Are the default parameters, especially the rdma limits and such, optimal for
our configuration ?

That is hard to say.  GPU Direct RDMA does not work well when the GPU and IB card are not 
"close" on the system. Can you run "nvidia-smi topo -m" on your system?

nvidia-smi topo -m
gives me the error
[mboisson@login-gpu01 ~]$ nvidia-smi topo -m
Invalid combination of input arguments. Please run 'nvidia-smi -h' for help.

I could not find anything related to topology in the help. However, I 
can tell you the following which I believe to be true

- GPU0 and GPU1 are on PCIe bus 0, socket 0
- GPU2 and GPU3 are on PCIe bus 1, socket 0
- GPU4 and GPU5 are on PCIe bus 2, socket 1
- GPU6 and GPU7 are on PCIe bus 3, socket 1

There is one IB card which I believe is on socket 0.


I know that we do not have the Mellanox Ofed. We use the Linux RDMA from 
CentOS 6.5. However, should that completely disable GDR within a single 
node ? i.e. does GDR _have_ to go through IB ? I would assume that our 
lack of Mellanox OFED would result in no-GDR inter-node, but GDR intra-node.



Thanks


--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Advices for parameter tuning for CUDA-aware MPI

2014-05-23 Thread Maxime Boissonneault

async_send" 
(current value: "true", data source: default, level: 9 dev/all, type: bool)
 MCA btl: parameter "btl_openib_cuda_async_recv" 
(current value: "true", data source: default, level: 9 dev/all, type: bool)
 MCA btl: informational "btl_openib_have_cuda_gdr" 
(current value: "true", data source: default, level: 5 tuner/detail, 
type: bool)
 MCA btl: parameter "btl_openib_want_cuda_gdr" (current 
value: "false", data source: default, level: 9 dev/all, type: bool)
 MCA btl: parameter "btl_openib_cuda_eager_limit" 
(current value: "0", data source: default, level: 5 tuner/detail, type: 
size_t)
 MCA btl: parameter "btl_openib_cuda_rdma_limit" 
(current value: "18446744073709551615", data source: default, level: 5 
tuner/detail, type: size_t)
 MCA btl: parameter "btl_vader_cuda_eager_limit" 
(current value: "0", data source: default, level: 5 tuner/detail, type: 
size_t)
 MCA btl: parameter "btl_vader_cuda_rdma_limit" 
(current value: "18446744073709551615", data source: default, level: 5 
tuner/detail, type: size_t)
MCA coll: parameter "coll_ml_config_file" (current 
value: 
"/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/openmpi/mca-coll-ml.config", 
data source: default, level: 9 dev/all, type: string)
  MCA io: informational 
"io_romio_complete_configure_params" (current value: 
"--with-file-system=nfs+lustre  FROM_OMPI=yes 
CC='/software6/compilers/gcc/4.8/bin/gcc -std=gnu99' CFLAGS='-O3 
-DNDEBUG -finline-functions -fno-strict-aliasing -pthread' CPPFLAGS=' 
-I/software-gpu/src/openmpi-1.8.1/opal/mca/hwloc/hwloc172/hwloc/include 
-I/software-gpu/src/openmpi-1.8.1/opal/mca/event/libevent2021/libevent 
-I/software-gpu/src/openmpi-1.8.1/opal/mca/event/libevent2021/libevent/include' 
FFLAGS='' LDFLAGS=' ' --enable-shared --enable-static 
--with-file-system=nfs+lustre 
--prefix=/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37 
--disable-aio", data source: default, level: 9 dev/all, type: string)

[login-gpu01.calculquebec.ca:11486] mca: base: close: unloading component Q


--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] openmpi configuration error?

2014-05-16 Thread Maxime Boissonneault

Instead of using the outdated and not maintained Module environment, why 
not use Lmod : https://www.tacc.utexas.edu/tacc-projects/lmod


It is a drop-in replacement for Module environment that supports all of 
their features and much, much more, such as :

- module hierarchies
- module properties and color highlighting (we use it to higlight 
bioinformatic modules or tools for example)
- module caching (very useful for a parallel filesystem with tons of 
modules)
- path priorities (useful to make sure personal modules take precendence 
over system modules)

- export module tree to json

It works like a charm, understand both TCL and Lua modules and is 
actively developped and debugged. There are litteraly new features every 
month or so. If it does not do what you want, odds are that the 
developper will add it shortly (I've had it happen).


Maxime

Le 2014-05-16 17:58, Douglas L Reeder a écrit :

Ben,

You might want to use module (source forge) to manage paths to 
different mpi implementations. It is fairly easy to set up and very 
robust for this type of problem. You would remove contentious 
application paths from you standard PATH and then use module to switch 
them in and out as needed.


Doug Reeder
On May 16, 2014, at 3:39 PM, Ben Lash <mailto:b...@rice.edu>> wrote:


My cluster has just upgraded to a new version of MPI, and I'm using 
an old one. It seems that I'm having trouble compiling due to the 
compiler wrapper file moving (full error here: 
http://pastebin.com/EmwRvCd9)
"Cannot open configuration file 
/opt/apps/openmpi/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt"


I've found the file on the cluster at 
 /opt/apps/openmpi/retired/1.4.4-intel/share/openmpi/mpif90-wrapper-data.txt

How do I tell the old mpi wrapper where this file is?
I've already corrected one link to mpich -> 
/opt/apps/openmpi/retired/1.4.4-intel/, which is in the software I'm 
trying to recompile's lib folder 
(/home/bl10/CMAQv5.0.1/lib/x86_64/ifort). Thanks for any ideas. I 
also tried changing $pkgdatadir based on what I read here:
http://www.open-mpi.org/faq/?category=mpi-apps#default-wrapper-compiler-flags 



Thanks.

--Ben L
___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Question about scheduler support

2014-05-16 Thread Maxime Boissonneault


Le 2014-05-16 09:06, Jeff Squyres (jsquyres) a écrit :

On May 15, 2014, at 8:00 PM, Fabricio Cannini  wrote:


Nobody is disagreeing that one could find a way to make CMake work - all we are 
saying is that (a) CMake has issues too, just like autotools, and (b) we have 
yet to see a compelling reason to undertake the transition...which would have 
to be a *very* compelling one.

I was simply agreeing with Maxime about why it could work. ;)

But if you and the other devels are fine with it, i'm fine too.

FWIW, simply for my own curiosity's sake, if someone could confirm deny whether 
cmake:

1. Supports the following compiler suites: GNU (that's a given, I assume), 
Clang, OS X native (which is variants of GNU and Clang), Absoft, PGI, Intel, 
Cray, HP-UX, Oracle Solaris (Linux and Solaris), Tru64, Microsoft Visual, IBM 
BlueGene (I think that's gcc, but am not entirely sure).  (some of these matter 
mainly to hwloc, not necessarily OMPI)
I have built projects with CMake using GNU, Intel, PGI, OS X native. 
CMake claims to make MSV projects, so I'm assuming MS Visual works. I 
can't say about the others.

2. Bootstrap a tarball such that an end user does not need to have cmake 
installed.

That, I have no clue, but they do have a page about bootstrapping cmake 
itself

http://www.cmake.org/cmake/help/install.html
I am not sure if this is what you mean.

If there is no existing CMake installation, a bootstrap script is provided:

   ./bootstrap
make
make install

(Note: the make install step is optional, cmake will run from the build 
directory.)


According to this, you could have a tarball including CMake and instruct 
the users to run some variant of (or make your own bootstrap script 
including this)

 ./bootstrap && make && ./cmake . && make && make install

Now that I think about it, OpenFOAM uses CMake and bootstraps it if it 
is not install, so it is certainly possible.



Maxime

Re: [OMPI users] Question about scheduler support


Le 2014-05-15 18:27, Jeff Squyres (jsquyres) a écrit :

On May 15, 2014, at 6:14 PM, Fabricio Cannini  wrote:


Alright, but now I'm curious as to why you decided against it.
Could please elaborate on it a bit ?

OMPI has a long, deep history with the GNU Autotools.  It's a very long, 
complicated story, but the high points are:

1. The GNU Autotools community has given us very good support over the years.
2. The GNU Autotools support all compilers that we want to support, including 
shared library support (others did not, back in 2004 when we started OMPI).
3. The GNU Autotools can fully bootstrap a tarball such that the end user does 
not need to have the GNU Autotools installed to build an OMPI tarball.

You mean some people do NOT have GNU Autotools ? :P

Jokes aside, CMake has certainly matured enough for point #2 and is used 
by very big projects (KDE for example). Not sure about point #3 though. 
I am wondering though, how do you handle Windows with OpenMPI and GNU 
Autotools ? I know CMake is famous for being cross-plateform (that's 
what the C means) and can generate builds for Windows, Visual Studio and 
such.


In any case, I do not see any need to change from one toolchain to 
another, although I have seen many projects providing both and that did 
not seem to be too much of a hassle. It's still probably more work than 
what you want to get into though.


Maxime

Re: [OMPI users] Question about scheduler support





Please allow me to chip in my $0.02 and suggest to not reinvent the 
wheel, but instead consider to migrate the build system to cmake :


http://www.cmake.org/
I agree that menu-wise, CMake does a pretty good job with ccmake, and is 
much, much easier to create than autoconf/automake/m4 stuff (I speak 
from experience).


However, for the command-line arguments, I find cmake non-intuitive and 
pretty cumbersome. As example, to say

--with-tm=/usr/local/torque

with CMAKE, you would have to do something like
-DWITH_TM:STRING=/usr/local/torque


Maxime

Re: [OMPI users] Question about scheduler support

A file would do the trick, but from my experience of building programs, 
I always prefer configure options. Maybe just an option

--disable-optional

that disables anything that is optional and non-explicitely requested.

Maxime

Le 2014-05-15 08:22, Bennet Fauber a écrit :

Would a separate file that contains each scheduler option and is
included by configure do the trick?  It might read

include-slurm=YES
include-torque=YES
etc.

If all options are set to default to YES, then the people who want no
options are satisfied, but those of us who would like to change the
config would have an easy and scriptable way to change the option
using sed or whatever.

I agree with Maxime about requiring an interactive system to turn
things off.  It makes things difficult to script and document exactly
what was done.  I think providing the kitchen sink is fine for
default, but a simple switch or config file that flips it to including
nothing that wasn't requested might satisfy the other side.

I suspect that something similar would (or could) be part of a menu
configuration scheme, so the menu could be tacked on later, if it
turns out to be desired, and the menu would just modify the list of
things to build, so any work toward that scheme might not be lost.

-- bennet



On Thu, May 15, 2014 at 7:41 AM, Maxime Boissonneault
 wrote:

Le 2014-05-15 06:29, Jeff Squyres (jsquyres) a écrit :

I think Ralph's email summed it up pretty well -- we unfortunately have
(at least) two distinct groups of people who install OMPI:

a) those who know exactly what they want and don't want anything else
b) those who don't know exactly what they want and prefer to have
everything installed, and have OMPI auto-select at run time exactly what to
use based on the system on which it's running

We've traditionally catered to the b) crowd, and made some
not-very-easy-to-use capabilities for the a) crowd (i.e., you can manually
disable each plugin you don't want to build via configure, but the syntax is
fairly laborious).

Ralph and I talked about the possibility of something analogous to "make
menuconfig" for Linux kernels, where you get a menu-like system (UI TBD) to
pick exactly what options you want/don't want.  That will output a text
config file that can be fed to configure, something along the lines of

./configure --only-build-exactly-this-stuff=file-output-from-menuconfig

This idea is *very* rough; I anticipate that it will change quite a bit
over time, and it'll take us a bit of time to design and implement it.

A menu-like system is not going to be very useful at least for us, since we
script all of our installations. Scripting a menu is not very handy.

Maxime




On May 14, 2014, at 8:56 PM, Bennet Fauber  wrote:


I think Maxime's suggestion is sane and reasonable.  Just in case
you're taking ha'penny's worth from the groundlings.  I think I would
prefer not to have capability included that we won't use.

-- bennet



On Wed, May 14, 2014 at 7:43 PM, Maxime Boissonneault
 wrote:

For the scheduler issue, I would be happy with something like, if I ask
for
support for X, disable support for Y, Z and W. I am assuming that very
rarely will someone use more than one scheduler.

Maxime

Le 2014-05-14 19:09, Ralph Castain a écrit :

Jeff and I have talked about this and are approaching a compromise.
Still
more thinking to do, perhaps providing new configure options to "only
build
what I ask for" and/or a tool to support a menu-driven selection of
what to
build - as opposed to today's "build everything you don't tell me to
not-build"

Tough set of compromises as it depends on the target audience. Sys
admins
prefer the "build only what I say", while users (who frequently aren't
that
familiar with the inners of a system) prefer the "build all" mentality.


On May 14, 2014, at 3:16 PM, Ralph Castain  wrote:


Indeed, a quick review indicates that the new policy for scheduler
support was not uniformly applied. I'll update it.

To reiterate: we will only build support for a scheduler if the user
specifically requests it. We did this because we are increasingly
seeing
distros include header support for various schedulers, and so just
finding
the required headers isn't enough to know that the scheduler is
intended for
use. So we wind up building a bunch of useless modules.


On May 14, 2014, at 3:09 PM, Ralph Castain  wrote:


FWIW: I believe we no longer build the slurm support by default,
though
I'd have to check to be sure. The intent is definitely not to do so.

The plan we adjusted to a while back was to *only* build support for
schedulers upon request. Can't swear that they are all correctly
updated,
but that was the intent.


On May 14, 2014, at 2:52 PM, Jeff Squyres (jsquyres)
 wrote:


Here's a bit of our rational, from the README file:

   Note that for many of Open MPI

Re: [OMPI users] Question about scheduler support


Le 2014-05-15 06:29, Jeff Squyres (jsquyres) a écrit :

I think Ralph's email summed it up pretty well -- we unfortunately have (at 
least) two distinct groups of people who install OMPI:

a) those who know exactly what they want and don't want anything else
b) those who don't know exactly what they want and prefer to have everything 
installed, and have OMPI auto-select at run time exactly what to use based on 
the system on which it's running

We've traditionally catered to the b) crowd, and made some not-very-easy-to-use 
capabilities for the a) crowd (i.e., you can manually disable each plugin you 
don't want to build via configure, but the syntax is fairly laborious).

Ralph and I talked about the possibility of something analogous to "make 
menuconfig" for Linux kernels, where you get a menu-like system (UI TBD) to pick 
exactly what options you want/don't want.  That will output a text config file that can 
be fed to configure, something along the lines of

   ./configure --only-build-exactly-this-stuff=file-output-from-menuconfig

This idea is *very* rough; I anticipate that it will change quite a bit over 
time, and it'll take us a bit of time to design and implement it.
A menu-like system is not going to be very useful at least for us, since 
we script all of our installations. Scripting a menu is not very handy.


Maxime





On May 14, 2014, at 8:56 PM, Bennet Fauber  wrote:


I think Maxime's suggestion is sane and reasonable.  Just in case
you're taking ha'penny's worth from the groundlings.  I think I would
prefer not to have capability included that we won't use.

-- bennet



On Wed, May 14, 2014 at 7:43 PM, Maxime Boissonneault
 wrote:

For the scheduler issue, I would be happy with something like, if I ask for
support for X, disable support for Y, Z and W. I am assuming that very
rarely will someone use more than one scheduler.

Maxime

Le 2014-05-14 19:09, Ralph Castain a écrit :

Jeff and I have talked about this and are approaching a compromise.  Still
more thinking to do, perhaps providing new configure options to "only build
what I ask for" and/or a tool to support a menu-driven selection of what to
build - as opposed to today's "build everything you don't tell me to
not-build"

Tough set of compromises as it depends on the target audience. Sys admins
prefer the "build only what I say", while users (who frequently aren't that
familiar with the inners of a system) prefer the "build all" mentality.


On May 14, 2014, at 3:16 PM, Ralph Castain  wrote:


Indeed, a quick review indicates that the new policy for scheduler
support was not uniformly applied. I'll update it.

To reiterate: we will only build support for a scheduler if the user
specifically requests it. We did this because we are increasingly seeing
distros include header support for various schedulers, and so just finding
the required headers isn't enough to know that the scheduler is intended for
use. So we wind up building a bunch of useless modules.


On May 14, 2014, at 3:09 PM, Ralph Castain  wrote:


FWIW: I believe we no longer build the slurm support by default, though
I'd have to check to be sure. The intent is definitely not to do so.

The plan we adjusted to a while back was to *only* build support for
schedulers upon request. Can't swear that they are all correctly updated,
but that was the intent.


On May 14, 2014, at 2:52 PM, Jeff Squyres (jsquyres)
 wrote:


Here's a bit of our rational, from the README file:

  Note that for many of Open MPI's --with- options, Open MPI will,
  by default, search for header files and/or libraries for .  If
  the relevant files are found, Open MPI will built support for ;
  if they are not found, Open MPI will skip building support for .
  However, if you specify --with- on the configure command line
and
  Open MPI is unable to find relevant support for , configure will
  assume that it was unable to provide a feature that was specifically
  requested and will abort so that a human can resolve out the issue.

In some cases, we don't need header or library files.  For example,
with SLURM and LSF, our native support is actually just fork/exec'ing the
SLURM/LSF executables under the covers (e.g., as opposed to using rsh/ssh).
So we can basically *always* build them.  So we do.

In general, OMPI builds support for everything that it can find on the
rationale that a) we can't know ahead of time exactly what people want, and
b) most people want to just "./configure && make -j 32 install" and be done
with it -- so build as much as possible.


On May 14, 2014, at 5:31 PM, Maxime Boissonneault
 wrote:


Hi Gus,
Oh, I know that, what I am refering to is that slurm and loadleveler
support are enabled by default, and it seems that if we're using
Torque/Moab, we have no use for slurm an

Re: [OMPI users] Question about scheduler support

2014-05-14 Thread Maxime Boissonneault

For the scheduler issue, I would be happy with something like, if I ask
for support for X, disable support for Y, Z and W. I am assuming that
very rarely will someone use more than one scheduler.

Maxime

Le 2014-05-14 19:09, Ralph Castain a écrit :

Jeff and I have talked about this and are approaching a compromise. Still more thinking to do,
perhaps providing new configure options to "only build what I ask for" and/or a tool to
support a menu-driven selection of what to build - as opposed to today's "build everything you
don't tell me to not-build"

Tough set of compromises as it depends on the target audience. Sys admins prefer the "build
only what I say", while users (who frequently aren't that familiar with the inners of a
system) prefer the "build all" mentality.

On May 14, 2014, at 3:16 PM, Ralph Castain wrote:

Indeed, a quick review indicates that the new policy for scheduler support was
not uniformly applied. I'll update it.

To reiterate: we will only build support for a scheduler if the user
specifically requests it. We did this because we are increasingly seeing
distros include header support for various schedulers, and so just finding the
required headers isn't enough to know that the scheduler is intended for use.
So we wind up building a bunch of useless modules.

On May 14, 2014, at 3:09 PM, Ralph Castain wrote:

FWIW: I believe we no longer build the slurm support by default, though I'd
have to check to be sure. The intent is definitely not to do so.

The plan we adjusted to a while back was to *only* build support for schedulers
upon request. Can't swear that they are all correctly updated, but that was the
intent.

On May 14, 2014, at 2:52 PM, Jeff Squyres (jsquyres) wrote:

Here's a bit of our rational, from the README file:

Note that for many of Open MPI's --with- options, Open MPI will,
by default, search for header files and/or libraries for . If
the relevant files are found, Open MPI will built support for ;
if they are not found, Open MPI will skip building support for .
However, if you specify --with- on the configure command line and
Open MPI is unable to find relevant support for , configure will
assume that it was unable to provide a feature that was specifically
requested and will abort so that a human can resolve out the issue.

In some cases, we don't need header or library files. For example, with SLURM
and LSF, our native support is actually just fork/exec'ing the SLURM/LSF
executables under the covers (e.g., as opposed to using rsh/ssh). So we can
basically *always* build them. So we do.

In general, OMPI builds support for everything that it can find on the rationale that a) we can't
know ahead of time exactly what people want, and b) most people want to just "./configure
&& make -j 32 install" and be done with it -- so build as much as possible.

On May 14, 2014, at 5:31 PM, Maxime Boissonneault
wrote:

Hi Gus,
Oh, I know that, what I am refering to is that slurm and loadleveler support
are enabled by default, and it seems that if we're using Torque/Moab, we have
no use for slurm and loadleveler support.

My point is not that it is hard to compile it with torque support, my point is
that it is compiling support for many schedulers while I'm rather convinced
that very few sites actually use multiple schedulers at the same time.

Maxime

Le 2014-05-14 16:51, Gus Correa a écrit :

On 05/14/2014 04:25 PM, Maxime Boissonneault wrote:

Hi,
I was compiling OpenMPI 1.8.1 today and I noticed that pretty much every
single scheduler has its support enabled by default at configure (except
the one I need, which is Torque). Is there a reason for that ? Why not
have a single scheduler enabled and require to specify it at configure
time ?

Is there any reason for me to build with loadlever or slurm if we're
using torque ?

Thanks,

Maxime Boisssonneault

Hi Maxime

I haven't tried 1.8.1 yet.
However, for all previous versions of OMPI I tried, up to 1.6.5,
all it took to configure it with Torque support was to point configure
to the Torque installation directory (which is non-standard in my case):

--with-tm=/opt/torque/bla/bla

My two cents,
Gus Correa

_______
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open

Re: [OMPI users] Question about scheduler support

2014-05-14 Thread Maxime Boissonneault


Hi Gus,
Oh, I know that, what I am refering to is that slurm and loadleveler 
support are enabled by default, and it seems that if we're using 
Torque/Moab, we have no use for slurm and loadleveler support.


My point is not that it is hard to compile it with torque support, my 
point is that it is compiling support for many schedulers while I'm 
rather convinced that very few sites actually use multiple schedulers at 
the same time.



Maxime

Le 2014-05-14 16:51, Gus Correa a écrit :

On 05/14/2014 04:25 PM, Maxime Boissonneault wrote:

Hi,
I was compiling OpenMPI 1.8.1 today and I noticed that pretty much every
single scheduler has its support enabled by default at configure (except
the one I need, which is Torque). Is there a reason for that ? Why not
have a single scheduler enabled and require to specify it at configure
time ?

Is there any reason for me to build with loadlever or slurm if we're
using torque ?

Thanks,

Maxime Boisssonneault


Hi Maxime

I haven't tried 1.8.1 yet.
However, for all previous versions of OMPI I tried, up to 1.6.5,
all it took to configure it with Torque support was to point configure
to the Torque installation directory (which is non-standard in my case):

--with-tm=/opt/torque/bla/bla

My two cents,
Gus Correa

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Question about scheduler support

2014-05-14 Thread Maxime Boissonneault


Hi,
I was compiling OpenMPI 1.8.1 today and I noticed that pretty much every 
single scheduler has its support enabled by default at configure (except 
the one I need, which is Torque). Is there a reason for that ? Why not 
have a single scheduler enabled and require to specify it at configure 
time ?


Is there any reason for me to build with loadlever or slurm if we're 
using torque ?


Thanks,

Maxime Boisssonneault

Re: [OMPI users] checkpoint/restart facility - blcr

2014-02-27 Thread Maxime Boissonneault

I heard that c/r support in OpenMPI was being dropped after version 
1.6.x. Is this not still the case ?


Maxime Boissonneault

Le 2014-02-27 13:09, George Bosilca a écrit :
Both were supported at some point. I'm not sure if any is still in a 
workable state in the trunk today. However, there is an ongoing effort 
to reinstate the coordinated approach.


  George.

On Feb 27, 2014, at 18:50 , basma a.azeem <mailto:basmaabdelaz...@hotmail.com>> wrote:


i have a question about the checkpoint/restart facility of BLCR with 
OPEN MPI , does the checkpoint/restart solution as a whole can be 
considered as a coordinated or uncoordinated approach

___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-17 Thread Maxime Boissonneault


Hi,
Do you have thread multiples enabled in your OpenMPI installation ?

Maxime Boissonneault

Le 2013-12-16 17:40, Noam Bernstein a écrit :

Has anyone tried to use openmpi 1.7.3 with the latest CentOS kernel
(well, nearly latest: 2.6.32-431.el6.x86_64), and especially with infiniband?

I'm seeing lots of weird slowdowns, especially when using infiniband,
but even when running with "--mca btl self,sm" (it's much worse with
IB, though), so I was wondering if anyone else has tested this kernel yet?

Once I have some more detailed information I'll follow up.

Noam
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] Problem compiling against torque 4.2.4

2013-12-04 Thread Maxime Boissonneault


Hi,
You are probably missing the moab-torque-devel package (or torque-devel 
package if there is one).


You need the *-devel to have the headers in order to compile against torque.

Maxime

Le 2013-12-04 15:06, Matt Burgess a écrit :


Hello,

I can't seem to compile openmpi version 1.6.5 against torque 4.2.4. 
Here's the configure line I'm using:


./configure --with-tm=/dg/local/cots/torque/torque-4.2.4/

The relevant portion of config.log appears to be:

configure:92031: checking --with-tm value

configure:92051: result: sanity check ok 
(/dg/local/cots/torque/torque-4.2.4/)


configure:92076: checking for pbs-config

configure:92086: result: 
/dg/local/cots/torque/torque-4.2.4//bin/pbs-config


configure:92099: ess_tm_CPPFLAGS from pbs-config:

configure:92122: ess_tm_LDFLAGS from pbs-config:

configure:92145: ess_tm_LIBS from pbs-config:

configure:92160: checking tm.h usability

configure:92160: gcc -c -DNDEBUG -g -O2 -finline-functions 
-fno-strict-aliasing -pthread 
-I/root/openmpi-1.6.5/opal/mca/hwloc/hwloc132/hwloc/include conftest.c >&5


conftest.c:597:16: error: tm.h: No such file or directory

configure:92160: $? = 1

Thanks in advance for any help anybody can provide.

DigitalGlobe logo



http://www.digitalglobe.com/images/dg_02.gif

*Matt Burgess***

/Linux/HPC Engineer/
+1.303.684.1132 office
+1.919.355.8672 mobile
matt.burg...@digitalglobe.com <mailto:matt.burg...@digitalglobe.com>

This electronic communication and any attachments may contain confidential and 
proprietary
information of DigitalGlobe, Inc. If you are not the intended recipient, or an 
agent or employee
responsible for delivering this communication to the intended recipient, or if 
you have received
this communication in error, please do not print, copy, retransmit, disseminate 
or
otherwise use the information. Please indicate to the sender that you have 
received this
communication in error, and delete the copy you received. DigitalGlobe reserves 
the
right to monitor any electronic communication sent or received by its 
employees, agents
or representatives.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
---------
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] MPI_THREAD_MULTIPLE causes deadlock in simple MPI_Barrier case (ompi 1.6.5 and 1.7.3)

2013-11-29 Thread Maxime Boissonneault


Hi Jean-François ;)
Do you have the same behavior if you disable openib at run time ? :

--mca btl ^openib

My experience with the OpenIB BTL is that its inner threading is bugged.

Maxime

Le 2013-11-28 19:21, Jean-Francois St-Pierre a écrit :

Hi,
I've compiled ompi1.6.5 with multi-thread support (using Intel
compilers 12.0.4.191, but I get the same result with gcc) :

./configure --with-tm/opt/torque --with-openib
--enable-mpi-thread-multiple CC=icc CXX=icpc F77=ifort FC=ifort

And i've built a simple sample code that only does the Init and one
MPI_Barrier. The core of the code is :

   setbuf(stdout, NULL);
   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
   fprintf(stdout,"%6d: Provided thread support %d ", getpid(), provided);
   int flag,claimed;
   MPI_Is_thread_main( &flag );
   MPI_Query_thread( &claimed );

   fprintf(stdout,"%6d: Before Comm_rank, flag %d, claimed %d \n",
getpid(), flag, claimed);
   MPI_Comm_rank(MPI_COMM_WORLD, &gRank);

   fprintf(stdout,"%6d: Comm_size rank %d\n",getpid(), gRank);
   MPI_Comm_size(MPI_COMM_WORLD, &gNTasks);

   fprintf(stdout,"%6d: Before Barrier\n", getpid());
   MPI_Barrier( MPI_COMM_WORLD );

   fprintf(stdout,"%6d: After Barrier\n", getpid());
   MPI_Finalize();

When I launch it on 2 nodes (mono-core per node) I get this sample output :

/***  Output
  mpirun -pernode -np 2 sample_code
  7356: Provided thread support 3 MPI_THREAD_MULTIPLE
  7356: Before Comm_rank, flag 1, claimed 3
  7356: Comm_size rank 0
  7356: Before Barrier
  26277: Provided thread support 3 MPI_THREAD_MULTIPLE
  26277: Before Comm_rank, flag 1, claimed 3
  26277: Comm_size rank 1
  26277: Before Barrier
  ^Cmpirun: killing job...
  */

The deadlock never gets over the MPI_Barrier when I use
MPI_THREAD_MULTIPLE, but it runs fine using MPI_THREAD_SERIALIZED .  I
get the same behavior with ompi 1.7.3. I don't get a deadlock when the
2 MPI processes are hosted on the same node.

In attachement, you'll find my config.out, config.log, environment
variables on the execution node, both make.out, sample code and output
etc.

Thanks,

Jeff


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Very high latency with openib btl

2013-02-15 Thread Maxime Boissonneault


Hi again,
I managed to reproduce the "bug" with a simple case (see the cpp file 
attached).


I am running this on 2 nodes with 8 cores each. If I run with
mpiexec ./test-mpi-latency.out

then the MPI_Ssend operations take about ~1e-5 second for intra-node 
ranks, and ~11 seconds for inter-node ranks. Note that 11 seconds is 
roughly the time required to execute the loop that is after the 
MPI_Recv. The average time required for the MPI_Ssend to return is 5.1 
seconds.


If I run with :
mpiexec --mca btl ^openib ./test-mpi-latency.out

then intra-node communications take ~0.5-1e-5 seconds, while internode 
communications take ~1e-6 seconds, for an average of ~5e-5 seconds.


I compiled this with gcc 4.7.2 + openmpi 1.6.3, as well as gcc 4.6.1 + 
openmpi 1.4.5. Both show the same behavior.


However, on the same machine, with gcc 4.6.1 + mvapich2/1.8, the latency 
is always quite low.


The fact that mvapich2 does not show this behavior points out to a 
problem with the openib btl within openmpi, and not with our setup.


Can anyone try to reproduce this on a different machine ?

Thanks,

Maxime Boissonneault

Le 2013-02-15 14:29, Maxime Boissonneault a écrit :

Hi again,
I found out that if I add an
MPI_Barrier after the MPI_Recv part, then there is no minute-long 
latency.


Is it possible that even if MPI_Recv returns, the openib btl does not 
guarantee that the acknowledgement is sent promptly ? In other words, 
is it possible that the computation following the MPI_Recv delays the 
acknowledgement ? If so, is it supposed to be this way, or is it 
normal, and why isn't the same behavior observed with the tcp btl ?


Maxime Boissonneault


Le 2013-02-14 11:50, Maxime Boissonneault a écrit :

Hi,
I have a strange case here. The application is "plink" 
(http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The 
computation/communication pattern of the application is the following :


1- MPI_Init
2- Some single rank computation
3- MPI_Bcast
4- Some single rank computation
5- MPI_Barrier
6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a 
time.

6- other ranks use MPI_Recv
7- Some single rank computation
8- other ranks send result to rank 0 with MPI_Ssend
8- rank 0 receives data with MPI_Recv
9- rank 0 analyses result
10- MPI_Finalize

The amount of data being sent is of the order of the kilobytes, and 
we have IB.


The problem we observe is in step 6. I've output dates before and 
after each MPI operation. With the openib btl, the behavior I observe 
is that :

- rank 0 starts sending
- rank n receives almost instantly, and MPI_Recv returns.
- rank 0's MPI_Ssend often returns _minutes_.

It looks like the acknowledgement from rank n takes minutes to reach 
rank 0.


Now, the tricky part is that if I disable the openib btl to use 
instead tcp over IB, there is no such latency and the acknowledgement 
comes back within a fraction of a second. Also, if rank 0 and rank n 
are on the same node, the acknowledgement is also quasi-instantaneous 
(I guess it goes through the SM btl instead of openib).


I tried to reproduce this in a simple case, but I observed no such 
latency. The duration that I got for the whole communication is of 
the order of milliseconds.


Does anyone have an idea of what could cause such very high latencies 
when using the OpenIB BTL ?


Also, I tried replacing step 6 by explicitly sending a confirmation :
- rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
- rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0

In this case also, rank n's MPI_Isend executes quasi-instantaneously, 
and rank 0's MPI_Recv only returns a few minutes later.


Thanks,

Maxime Boissonneault






--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include "mpi.h"

using namespace std;
static struct timespec start, end, duration;
static int my_rank, nrank;
static int my_mpi_tag_send=0;
void clock_start() {
clock_gettime(CLOCK_MONOTONIC,&start);
}
double clock_end(const string & op, int rank_print=0) {
double duration_in_sec;
clock_gettime(CLOCK_MONOTONIC,&end);
duration.tv_sec = end.tv_sec - start.tv_sec;
duration.tv_nsec = end.tv_nsec - start.tv_nsec;
while (duration.tv_nsec > 10) { duration.tv_sec++; 
duration.tv_nsec -= 10; }
while (duration.tv_nsec < 0) { duration.tv_sec--; duration.tv_nsec += 
10; }
duration_in_sec = duration.tv_sec + 
double(duration.tv_nsec)/10.;
if (my_rank == rank_print)
cout << "Operation \"" << op << "\" done. Took: " << 
duration_in_sec << " seconds." << endl;
return durat

Re: [OMPI users] Very high latency with openib btl

2013-02-15 Thread Maxime Boissonneault


Hi again,
I found out that if I add an
MPI_Barrier after the MPI_Recv part, then there is no minute-long latency.

Is it possible that even if MPI_Recv returns, the openib btl does not 
guarantee that the acknowledgement is sent promptly ? In other words, is 
it possible that the computation following the MPI_Recv delays the 
acknowledgement ? If so, is it supposed to be this way, or is it normal, 
and why isn't the same behavior observed with the tcp btl ?


Maxime Boissonneault


Le 2013-02-14 11:50, Maxime Boissonneault a écrit :

Hi,
I have a strange case here. The application is "plink" 
(http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The 
computation/communication pattern of the application is the following :


1- MPI_Init
2- Some single rank computation
3- MPI_Bcast
4- Some single rank computation
5- MPI_Barrier
6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a 
time.

6- other ranks use MPI_Recv
7- Some single rank computation
8- other ranks send result to rank 0 with MPI_Ssend
8- rank 0 receives data with MPI_Recv
9- rank 0 analyses result
10- MPI_Finalize

The amount of data being sent is of the order of the kilobytes, and we 
have IB.


The problem we observe is in step 6. I've output dates before and 
after each MPI operation. With the openib btl, the behavior I observe 
is that :

- rank 0 starts sending
- rank n receives almost instantly, and MPI_Recv returns.
- rank 0's MPI_Ssend often returns _minutes_.

It looks like the acknowledgement from rank n takes minutes to reach 
rank 0.


Now, the tricky part is that if I disable the openib btl to use 
instead tcp over IB, there is no such latency and the acknowledgement 
comes back within a fraction of a second. Also, if rank 0 and rank n 
are on the same node, the acknowledgement is also quasi-instantaneous 
(I guess it goes through the SM btl instead of openib).


I tried to reproduce this in a simple case, but I observed no such 
latency. The duration that I got for the whole communication is of the 
order of milliseconds.


Does anyone have an idea of what could cause such very high latencies 
when using the OpenIB BTL ?


Also, I tried replacing step 6 by explicitly sending a confirmation :
- rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
- rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0

In this case also, rank n's MPI_Isend executes quasi-instantaneously, 
and rank 0's MPI_Recv only returns a few minutes later.


Thanks,

Maxime Boissonneault



--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Very high latency with openib btl

2013-02-14 Thread Maxime Boissonneault


Hi,
I have a strange case here. The application is "plink" 
(http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml). The 
computation/communication pattern of the application is the following :


1- MPI_Init
2- Some single rank computation
3- MPI_Bcast
4- Some single rank computation
5- MPI_Barrier
6- rank 0 sends data to each other rank with MPI_Ssend, one rank at a time.
6- other ranks use MPI_Recv
7- Some single rank computation
8- other ranks send result to rank 0 with MPI_Ssend
8- rank 0 receives data with MPI_Recv
9- rank 0 analyses result
10- MPI_Finalize

The amount of data being sent is of the order of the kilobytes, and we 
have IB.


The problem we observe is in step 6. I've output dates before and after 
each MPI operation. With the openib btl, the behavior I observe is that :

- rank 0 starts sending
- rank n receives almost instantly, and MPI_Recv returns.
- rank 0's MPI_Ssend often returns _minutes_.

It looks like the acknowledgement from rank n takes minutes to reach 
rank 0.


Now, the tricky part is that if I disable the openib btl to use instead 
tcp over IB, there is no such latency and the acknowledgement comes back 
within a fraction of a second. Also, if rank 0 and rank n are on the 
same node, the acknowledgement is also quasi-instantaneous (I guess it 
goes through the SM btl instead of openib).


I tried to reproduce this in a simple case, but I observed no such 
latency. The duration that I got for the whole communication is of the 
order of milliseconds.


Does anyone have an idea of what could cause such very high latencies 
when using the OpenIB BTL ?


Also, I tried replacing step 6 by explicitly sending a confirmation :
- rank 0 does MPI_Isend to rank n followed by MPI_Recv from rank n
- rank n does MPI_Recv from rank 0 followed by MPI_Isend to rank 0

In this case also, rank n's MPI_Isend executes quasi-instantaneously, 
and rank 0's MPI_Recv only returns a few minutes later.


Thanks,

Maxime Boissonneault

Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-30 Thread Maxime Boissonneault


Le 2013-01-29 21:02, Ralph Castain a écrit :


On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault 
<mailto:maxime.boissonnea...@calculquebec.ca>> wrote:


While our filesystem and management nodes are on UPS, our compute 
nodes are not. With one average generic (power/cooling mostly) 
failure every one or two months, running for weeks is just asking for 
trouble. If you add to that typical dimm/cpu/networking failures (I 
estimated about 1 node goes down per day because of some sort 
hardware failure, for a cluster of 960 nodes). With these numbers, a 
job running on 32 nodes for 7 days has a ~35% chance of failing 
before it is done.


I've been running this in my head all day - it just doesn't fit 
experience, which really bothered me. So I spent a little time running 
the calculation, and I came up with a number much lower (more like 
around 5%). I'm not saying my rough number is correct, but it is at 
least a little closer to what we see in the field.


Given that there are a lot of assumptions required when doing these 
calculations, I would like to suggest you conduct a very simply and 
quick experiment before investing tons of time on FT solutions. All 
you have to do is:


Thanks for the calculation. However, this is a cluster that I manage, I 
do not use it per say, and running such statistical jobs on a large part 
of the cluster for a long period of time is impossible. We do have the 
numbers however. The cluster has 960 nodes. We experience roughly one 
power or cooling failure per month or two months. Assuming one such 
failure per two months, if you run for 1 month, you have a 50% chance 
your job will be killed before it ends. If you run for 2 weeks, 25%, 
etc. These are very rough estimates obviously, but it is way more than 5%.


In addition to that, we have a failure rate of ~0.1%/day, meaning that 
out of 960, on average, one node will have a hardware failure every day. 
Most of the time, this is a failure of one of the dimms. Considering 
each node has 12 dimms of 2GB of memory, it means a dimm failure rate of 
~0.0001 per day. I don't know if that's bad or not, but this is roughly 
what we have.
If it turns out you see power failure problems, then a simple, 
low-cost, ride-thru power stabilizer might be a good solution. 
Flywheels and capacitor-based systems can provide support for 
momentary power quality issues at reasonably low costs for a cluster 
of your size.
I doubt there is anything low cost for a 330 kW system, and in any case, 
hardware upgrade is not an option since this a mid-life cluster. Again, 
as I said, the filesystem (2 x 500 TB lustre partitions) and the 
management nodes are on UPS, but there is no way to put the compute 
nodes on UPS.


If your node hardware is the problem, or you decide you do want/need 
to pursue an FT solution, then you might look at the OMPI-based 
solutions from parties such as http://fault-tolerance.org or the 
MPICH2 folks.

Thanks for the tip.

Best regards,

Maxime

Re: [OMPI users] Checkpointing an MPI application with OMPI


Hi George,
The problem here is not the bandwidth, but the number of IOPs. I wrote 
to the BLCR list, and they confirmed that :
"While ideally the checkpoint would be written in sizable chunks, the 
current code in BLCR will issue a single write operation for each 
contiguous range of user memory, and many quite small writes for various 
meta-data and non-memory state (registers, signal-handlers,etc).  As 
show in Table 1 of the paper cited above, the writes in the 10s of KB 
range will dominate performance."


(Reference being : X. Ouyang, R. Rajachandrasekhar, X. Besseron, H. 
Wang, J. Huang and D. K. Panda, CRFS: A Lightweight User-Level 
Filesystem for Generic Checkpoint/Restart, Int'l Conference on Parallel 
Processing (ICPP '11), Sept. 2011. (PDF 
<http://nowlab.cse.ohio-state.edu/publications/conf-papers/2011/ouyangx-icpp2011.pdf>))


We did run parallel IO benchmarks. Our filesystem can reach a speed of 
~15GB/s, but only with large IO operations (at least bigger than 1MB, 
optimally in the 100MB-1GB range). For small (<1MB) operations, the 
filesystem is considerably slower. I believe this is precisely what is 
killing the performance here.


Not sure there is anything to be done about it.

Best regards,


Maxime

Le 2013-01-28 15:40, George Bosilca a écrit :

At the scale you address you should have no trouble with the C/R if
the file system is correctly configured. We get more bandwidth per
node out of an NFS over 1Gb/s at 32 nodes. Have you run some parallel
benchmarks on your cluster ?

  George.

PS: You can some MPI I/O benchmarks at
http://www.mcs.anl.gov/~thakur/pio-benchmarks.html



On Mon, Jan 28, 2013 at 2:04 PM, Ralph Castain  wrote:

On Jan 28, 2013, at 10:53 AM, Maxime Boissonneault 
 wrote:


Le 2013-01-28 13:15, Ralph Castain a écrit :

On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault 
 wrote:


Le 2013-01-28 12:46, Ralph Castain a écrit :

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault 
 wrote:


Hello Ralph,
I agree that ideally, someone would implement checkpointing in the application 
itself, but that is not always possible (commercial applications, use of 
complicated libraries, algorithms with no clear progression points at which you 
can interrupt the algorithm and start it back from there).

Hmmm...well, most apps can be adjusted to support it - we have some very 
complex apps that were updated that way. Commercial apps are another story, but 
we frankly don't find much call for checkpointing those as they typically just 
don't run long enough - especially if you are only running 256 ranks, so your 
cluster is small. Failure rates just don't justify it in such cases, in our 
experience.

Is there some particular reason why you feel you need checkpointing?

This specific case is that the jobs run for days. The risk of a hardware or 
power failure for that kind of duration goes too high (we allow for no more 
than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, but for a 
small one like you describe, we find the probability that a job will be 
interrupted even during a multi-week app is vanishingly small.

FWIW: I do work with the financial industry where we regularly run apps that 
execute non-stop for about a month with no reported failures. Are you actually 
seeing failures, or are you anticipating them?

While our filesystem and management nodes are on UPS, our compute nodes are 
not. With one average generic (power/cooling mostly) failure every one or two 
months, running for weeks is just asking for trouble.

Wow, that is high


If you add to that typical dimm/cpu/networking failures (I estimated about 1 
node goes down per day because of some sort hardware failure, for a cluster of 
960 nodes).

That is incredibly high


With these numbers, a job running on 32 nodes for 7 days has a ~35% chance of 
failing before it is done.

I've never seen anything like that behavior in practice - a 32 node cluster 
typically runs for quite a few months without a failure. It is a typical size 
for the financial sector, so we have a LOT of experience with such clusters.

I suspect you won't see anything like that behavior...


Having 24GB of ram per node, even if a 32 nodes job takes close to 100% of the 
ram, that's merely 640 GB of data. Writing that on a lustre filesystem capable 
of reaching ~15GB/s should take no more than a few minutes if written 
correctly. Right now, I am getting a few minutes for a hundredth of this amount 
of data!


Agreed - but I don't think you'll get that bandwidth for checkpointing. I 
suspect you'll find that checkpointing really has troubles when scaling, which 
is why you don't see it used in production (at least, I haven't). Mostly used 
for research by a handful of organizations, which is why we haven't been too 
concerned about its loss.



While it is true we can dig through the code of the library to make

Re: [OMPI users] Checkpointing an MPI application with OMPI

Le 2013-01-28 13:15, Ralph Castain a écrit :

On Jan 28, 2013, at 9:52 AM, Maxime Boissonneault
wrote:

Le 2013-01-28 12:46, Ralph Castain a écrit :

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
wrote:

Hello Ralph,
I agree that ideally, someone would implement checkpointing in the application
itself, but that is not always possible (commercial applications, use of
complicated libraries, algorithms with no clear progression points at which you
can interrupt the algorithm and start it back from there).

Hmmm...well, most apps can be adjusted to support it - we have some very
complex apps that were updated that way. Commercial apps are another story, but
we frankly don't find much call for checkpointing those as they typically just
don't run long enough - especially if you are only running 256 ranks, so your
cluster is small. Failure rates just don't justify it in such cases, in our
experience.

Is there some particular reason why you feel you need checkpointing?

This specific case is that the jobs run for days. The risk of a hardware or
power failure for that kind of duration goes too high (we allow for no more
than 48 hours of run time).

I'm surprised by that - we run with UPS support on the clusters, but for a
small one like you describe, we find the probability that a job will be
interrupted even during a multi-week app is vanishingly small.

FWIW: I do work with the financial industry where we regularly run apps that
execute non-stop for about a month with no reported failures. Are you actually
seeing failures, or are you anticipating them?
While our filesystem and management nodes are on UPS, our compute nodes
are not. With one average generic (power/cooling mostly) failure every
one or two months, running for weeks is just asking for trouble. If you
add to that typical dimm/cpu/networking failures (I estimated about 1
node goes down per day because of some sort hardware failure, for a
cluster of 960 nodes). With these numbers, a job running on 32 nodes for
7 days has a ~35% chance of failing before it is done.

Having 24GB of ram per node, even if a 32 nodes job takes close to 100%
of the ram, that's merely 640 GB of data. Writing that on a lustre
filesystem capable of reaching ~15GB/s should take no more than a few
minutes if written correctly. Right now, I am getting a few minutes for
a hundredth of this amount of data!

While it is true we can dig through the code of the library to make it
checkpoint, BLCR checkpointing just seemed easier.

I see - just be aware that checkpoint support in OMPI will disappear in v1.7
and there is no clear timetable for restoring it.

That is very good to know. Thanks for the information. It is too bad though.

There certainly must be a better way to write the information down on disc
though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering
involved ?

I don't know - that's all done in BLCR, I believe. Either way, it isn't
something we can address due to the loss of our supporter for c/r.

I suppose I should contact BLCR instead then.

For the disk op problem, I think that's the way to go - though like I said, I
could be wrong and the disk writes could be something we do inside OMPI. I'm
not familiar enough with the c/r code to state it with certainty.

Thank you,

Maxime

Sorry we can't be of more help :-(
Ralph

Thanks,

Maxime

Le 2013-01-28 10:58, Ralph Castain a écrit :

Our c/r person has moved on to a different career path, so we may not have
anyone who can answer this question.

What we can say is that checkpointing at any significant scale will always be a losing
proposition. It just takes too long and hammers the file system. People have been working
on extending the capability with things like "burst buffers" (basically putting
an SSD in front of the file system to absorb the checkpoint surge), but that hasn't
become very common yet.

Frankly, what people have found to be the "best" solution is for your app to periodically
write out its intermediate results, and then take a flag that indicates "read prior
results" when it starts. This minimizes the amount of data being written to the disk. If done
correctly, you would only lose whatever work was done since the last intermediate result was
written - which is about equivalent to losing whatever works was done since the last checkpoint.

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
wrote:

Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

Re: [OMPI users] Checkpointing an MPI application with OMPI

Le 2013-01-28 12:46, Ralph Castain a écrit :

On Jan 28, 2013, at 8:25 AM, Maxime Boissonneault
wrote:

Is there some particular reason why you feel you need checkpointing?
This specific case is that the jobs run for days. The risk of a hardware
or power failure for that kind of duration goes too high (we allow for
no more than 48 hours of run time).
While it is true we can dig through the code of the library to make it
checkpoint, BLCR checkpointing just seemed easier.

There certainly must be a better way to write the information down on disc
though. Doing 500 IOPs seems to be completely broken. Why isn't there buffering
involved ?

I don't know - that's all done in BLCR, I believe. Either way, it isn't
something we can address due to the loss of our supporter for c/r.

I suppose I should contact BLCR instead then.

Thank you,

Maxime

Sorry we can't be of more help :-(
Ralph

Thanks,

Maxime

Le 2013-01-28 10:58, Ralph Castain a écrit :

Our c/r person has moved on to a different career path, so we may not have
anyone who can answer this question.

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
wrote:

Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

First, some details about the tests :
- The only filesystem available on the nodes are 1) one tmpfs, 2) one lustre
shared filesystem (tested to be able to provide ~15GB/s for writing and support
~40k IOPs).
- The job was running with 8 or 16 MPI ranks on nodes with 8 cores (1 or 2
nodes). Each MPI rank was using approximately 200MB of memory.
- I was doing checkpoints with ompi-checkpoint and restarting with ompi-restart.
- I was starting with mpirun -am ft-enable-cr
- The nodes are monitored by ganglia, which allows me to see the number of IOPs
and the read/write speed on the filesystem.

I tried a few different mca settings, but I consistently observed that :
- The checkpoints lasted ~4-5 minutes each time
- During checkpoint, each node (8 ranks) was doing ~500 IOPs, and writing at
~15MB/s.

I am worried by the number of IOPs and the very slow writing speed. This was a
very small test. We have jobs running with 128 or 256 MPI ranks, each using 1-2
GB of ram per rank. With such jobs, the overall number of IOPs would reach tens
of thousands and would completely overload our lustre filesystem. Moreover,
with 15MB/s per node, the checkpointing process would take hours.

How can I improve on that ? Is there an MCA setting that I am missing ?

Thanks,

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
-
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

Re: [OMPI users] Checkpointing an MPI application with OMPI

Hello Ralph,
I agree that ideally, someone would implement checkpointing in the
application itself, but that is not always possible (commercial
applications, use of complicated libraries, algorithms with no clear
progression points at which you can interrupt the algorithm and start it
back from there).

There certainly must be a better way to write the information down on
disc though. Doing 500 IOPs seems to be completely broken. Why isn't
there buffering involved ?

Thanks,

Maxime

Le 2013-01-28 10:58, Ralph Castain a écrit :

Our c/r person has moved on to a different career path, so we may not have
anyone who can answer this question.

HTH
Ralph

On Jan 28, 2013, at 7:47 AM, Maxime Boissonneault
wrote:

Hello,
I am doing checkpointing tests (with BLCR) with an MPI application compiled
with OpenMPI 1.6.3, and I am seeing behaviors that are quite strange.

How can I improve on that ? Is there an MCA setting that I am missing ?

Thanks,

--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
-----
Maxime Boissonneault
Analyste de calcul - Calcul Québec, Université Laval
Ph. D. en physique

[OMPI users] Checkpointing an MPI application with OMPI