[OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-04 Thread Fengguang Song
Hi,

I'm running into a problem when using Open MPI 1.5.1 on a GPU cluster. My program
uses MPI to exchange data between nodes, and uses cudaMemcpyAsync to move data
between the host and the GPU within a node. When the MPI message size is less
than 1MB, everything works fine. However, when the message size is greater than
1MB, the program hangs (i.e., based on my trace, an MPI send never reaches its
destination).
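
For concreteness, the communication pattern might look roughly like the minimal
sketch below (the buffer names, the 4MB message size, and the ring-style
exchange are illustrative only, not taken from the actual program; note that
this sketch uses the same pinned host buffer for both the CUDA copy and the MPI
exchange):

#include <mpi.h>
#include <cuda_runtime.h>

/* Simplified illustration of the pattern: data is staged through a pinned
 * host buffer with cudaMemcpyAsync, then exchanged between nodes with MPI.
 * The 4MB size is above the 1MB threshold where the hang appears. */
int main(int argc, char **argv)
{
    const size_t n = 4 * 1024 * 1024;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    void *d_buf, *h_buf;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_buf, n);
    cudaMallocHost(&h_buf, n);                     /* pinned host buffer */

    /* GPU -> host within the node ... */
    cudaMemcpyAsync(h_buf, d_buf, n, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    /* ... then host <-> host between nodes (ring exchange). */
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    MPI_Sendrecv_replace(h_buf, (int)n, MPI_BYTE, next, 0,
                         prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* host -> GPU within the node. */
    cudaMemcpyAsync(d_buf, h_buf, n, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    MPI_Finalize();
    return 0;
}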

The issue may be related to locked-memory contention between Open MPI and CUDA.
Does anyone have experience solving this problem? Which MCA parameters should I
tune so that messages larger than 1MB do not cause the hang? Any help would be
appreciated.

Thanks,
Fengguang


Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-05 Thread Brice Goglin

Hello,

I may have seen the same problem when testing GPUDirect. Do you use the same
host buffer for copying from/to the GPU and for sending/receiving on the
network? If so, you need a GPUDirect-enabled kernel and Mellanox drivers, but
that only helps for messages below 1MB.

You can work around the problem with one of the following solutions:
* add --mca btl_openib_flags 304 to force OMPI to always send/recv through an
intermediate (internal) buffer, but this will also decrease performance for
messages below 1MB
* use different host buffers for the GPU and for the network, and manually copy
between them (see the sketch below)
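
A minimal sketch of that second workaround, as a helper called from an MPI+CUDA
program, could look like the following (the buffer names, the plain malloc'd
buffer on the MPI side, and the blocking MPI_Send are assumptions for
illustration):

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>

/* Workaround 2 sketch: the buffer that CUDA pins (h_gpu) is kept separate from
 * the buffer that MPI/openib sees (h_net); data is copied between them with a
 * plain memcpy so CUDA and the IB stack do not register the same pages. */
static void send_from_gpu(const void *d_buf, size_t n, int dest,
                          MPI_Comm comm, cudaStream_t stream)
{
    void *h_gpu, *h_net;

    cudaMallocHost(&h_gpu, n);    /* pinned: used only for cudaMemcpyAsync */
    h_net = malloc(n);            /* pageable: used only by MPI */

    cudaMemcpyAsync(h_gpu, d_buf, n, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    memcpy(h_net, h_gpu, n);      /* manual copy between the two host buffers */
    MPI_Send(h_net, (int)n, MPI_BYTE, dest, 0, comm);

    free(h_net);
    cudaFreeHost(h_gpu);
}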

I never got any reply from NVIDIA/Mellanox/here when I reported this
problem with GPUDirect and messages larger than 1MB.
http://www.open-mpi.org/community/lists/users/2011/03/15823.php

Brice



Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-05 Thread Fengguang Song
Hi Brice,

Thank you! I saw your previous discussion and have actually tried "--mca
btl_openib_flags 304". Unfortunately, it didn't solve the problem. In our case,
the MPI buffer is already different from the cudaMemcpy buffer, and we do copy
between them manually. I'm still trying to figure out how to configure
Open MPI's MCA parameters to solve the problem...

Thanks,
Fengguang






Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-06 Thread Rolf vandeVaart
Hi Fengguang:

It is odd that you still see the problem even when running with the openib flag
set as Brice indicated.  Just to be extra sure there is no typo in your flag
setting, maybe you can verify with the ompi_info command like this:

ompi_info -mca btl_openib_flags 304 -param btl openib | grep btl_openib_flags

With the 304 setting, all communication travels through the regular
send/receive protocol on IB.  The message is broken into a 12K fragment,
followed by however many 64K fragments it takes to move the rest of the
message.

I will try to find time to reproduce the other 1 Mbyte issue that Brice
reported.

Rolf



PS: Not sure if you are interested, but in the trunk you can configure in
support for sending and receiving GPU buffers directly.  There are still many
performance issues to be worked out, but I thought I would mention it.





Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-06 Thread Fengguang Song
Hi Rolf,

I double-checked the flag just now. It was set correctly, but the hanging
problem was still there. However, I found another way to avoid the hang:
setting the environment variable CUDA_NIC_INTEROP to 1 solves the issue.
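
The variable just needs to be in the environment before CUDA initializes, e.g.
exported in the job script before mpirun. A sketch of setting it inside the
program instead (assuming it only has to be visible before the first CUDA call,
which I have not verified) would be:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* Assumption: CUDA_NIC_INTEROP only needs to be set before the CUDA
     * driver initializes; exporting it in the shell before mpirun is the
     * safer, usual approach. */
    setenv("CUDA_NIC_INTEROP", "1", 1);

    MPI_Init(&argc, &argv);
    cudaSetDevice(0);            /* first CUDA call after the variable is set */
    /* ... rest of the MPI + CUDA program ... */
    MPI_Finalize();
    return 0;
}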

Thanks,
Fengguang

