Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-30 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rolf
>vandeVaart
>Sent: Monday, March 30, 2015 9:37 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>>Sent: Sunday, March 29, 2015 10:11 PM
>>To: Open MPI Users
>>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting
>>GPU arrays between multiple GPUs
>>
>>Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
>>> >-Original Message-
>>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>>> >Givon
>>> >Sent: Friday, March 27, 2015 3:47 PM
>>> >To: us...@open-mpi.org
>>> >Subject: [OMPI users] segfault during MPI_Isend when transmitting
>>> >GPU arrays between multiple GPUs
>>> >
>>> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded
>>> >today) built against OpenMPI 1.8.4 with CUDA support activated to
>>> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
>>> >generation). Each MPI process is associated with a single GPU; the
>>> >process has a run loop that starts several Isends to transmit the
>>> >contents of GPU arrays to destination processes and several Irecvs
>>> >to receive data from source processes into GPU arrays on the process'
>>> >GPU. Some of the sends/recvs use one tag, while the remainder use a
>>> >second tag. A single Waitall invocation is used to wait for all of
>>> >these sends and receives to complete before the next iteration of the loop
>>> >can commence. All GPU arrays are preallocated before the run loop starts.
>>> >While this pattern works most of the time, it sometimes fails with a
>>> >segfault that appears to occur during an Isend:
>>
>>(snip)
>>
>>> >Any ideas as to what could be causing this problem?
>>> >
>>> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>>>
>>> Hi Lev:
>>>
>>> I am not sure what is happening here but there are a few things we
>>> can do to try and narrow things down.
>>> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error
>>>    will go away?
>>
>>Yes - that appears to be the case.
>>
>>> 2. Do you know whether, when you see this error, it happens on the first pass
>>>    through your communications?  That is, you mention that there are multiple
>>>    iterations through the loop, and I am wondering whether the failures occur
>>>    on the first pass through the loop.
>>
>>When the segfault occurs, it appears to always happen during the second
>>iteration of the loop, i.e., at least one slew of Isends (and
>>presumably Irecvs) is successfully performed.
>>
>>Some more details regarding the Isends: each process starts two Isends
>>for each destination process to which it transmits data. The Isends use
>>two different tags, respectively; one is passed None (by design), while
>>the other is passed the pointer to a GPU array with nonzero length. The
>>segfault appears to occur during the latter Isend.
>>--
>
>Lev, can you send me the test program off-list? I may try to create a C version
>of the test and see if I can reproduce the problem.
>Not sure at this point what is happening.
>
>Thanks,
>Rolf
>
We figured out what was going on, so I am posting the answer here in case others
run into the same issue.

After running for a while, some files related to CUDA IPC may get left behind in
the /dev/shm directory.  These stale files can cause later runs to fail with
errors (or SEGVs) when calling some CUDA APIs.  The workaround is to clear out
that directory periodically.
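
For anyone who wants to automate the workaround, below is a minimal cleanup
sketch in Python. It is only a sketch: the glob pattern for the leftover files
is an assumption and may differ by CUDA version, and it should only be run when
no CUDA or MPI jobs are active on the node.

import glob
import os

# Remove stale CUDA IPC files left behind in /dev/shm.
# NOTE: the 'cuda*' pattern is an assumption; check what actually
# accumulates on your system before deleting anything, and only run
# this when no CUDA/MPI jobs are active.
for path in glob.glob('/dev/shm/cuda*'):
    try:
        os.remove(path)
    except OSError:
        pass  # already gone, or owned by another user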

This issue is fixed in CUDA 7.0.
Rolf


Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-30 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Sunday, March 29, 2015 10:11 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
>> >-Original Message-
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Friday, March 27, 2015 3:47 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
>> >arrays between multiple GPUs
>> >
>> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded
>> >today) built against OpenMPI 1.8.4 with CUDA support activated to
>> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
>> >generation). Each MPI process is associated with a single GPU; the
>> >process has a run loop that starts several Isends to transmit the
>> >contents of GPU arrays to destination processes and several Irecvs to
>> >receive data from source processes into GPU arrays on the process'
>> >GPU. Some of the sends/recvs use one tag, while the remainder use a
>> >second tag. A single Waitall invocation is used to wait for all of
>> >these sends and receives to complete before the next iteration of the loop
>> >can commence. All GPU arrays are preallocated before the run loop starts.
>> >While this pattern works most of the time, it sometimes fails with a
>> >segfault that appears to occur during an Isend:
>
>(snip)
>
>> >Any ideas as to what could be causing this problem?
>> >
>> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>>
>> Hi Lev:
>>
>> I am not sure what is happening here but there are a few things we can
>> do to try and narrow things down.
>> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error
>>will go away?
>
>Yes - that appears to be the case.
>
>> 2. Do you know whether, when you see this error, it happens on the first pass
>>    through your communications?  That is, you mention that there are multiple
>>    iterations through the loop, and I am wondering whether the failures occur
>>    on the first pass through the loop.
>
>When the segfault occurs, it appears to always happen during the second
>iteration of the loop, i.e., at least one slew of Isends (and presumably Irecvs)
>is successfully performed.
>
>Some more details regarding the Isends: each process starts two Isends for
>each destination process to which it transmits data. The Isends use two
>different tags, respectively; one is passed None (by design), while the other is
>passed the pointer to a GPU array with nonzero length. The segfault appears
>to occur during the latter Isend.
>--

Lev, can you send me the test program off-list? I may try to create a C version
of the test and see if I can reproduce the problem.
Not sure at this point what is happening.

Thanks,
Rolf




Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-29 Thread Lev Givon
Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
> >-Original Message-
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Friday, March 27, 2015 3:47 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
> >arrays between multiple GPUs
> >
> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today)
> >built against OpenMPI 1.8.4 with CUDA support activated to asynchronously
> >send GPU arrays between multiple Tesla GPUs (Fermi generation). Each MPI
> >process is associated with a single GPU; the process has a run loop that starts
> >several Isends to transmit the contents of GPU arrays to destination
> >processes and several Irecvs to receive data from source processes into GPU
> >arrays on the process' GPU. Some of the sends/recvs use one tag, while the
> >remainder use a second tag. A single Waitall invocation is used to wait for all of
> >these sends and receives to complete before the next iteration of the loop
> >can commence. All GPU arrays are preallocated before the run loop starts.
> >While this pattern works most of the time, it sometimes fails with a segfault
> >that appears to occur during an Isend:

(snip)

> >Any ideas as to what could be causing this problem?
> >
> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>
> Hi Lev:
>
> I am not sure what is happening here but there are a few things we can do to
> try and narrow things down.
> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error
>will go away?

Yes - that appears to be the case.

> 2. Do you know whether, when you see this error, it happens on the first pass
>    through your communications?  That is, you mention that there are multiple
>    iterations through the loop, and I am wondering whether the failures occur
>    on the first pass through the loop.

When the segfault occurs, it appears to always happen during the second
iteration of the loop, i.e., at least one slew of Isends (and presumably Irecvs)
is successfully performed.

Some more details regarding the Isends: each process starts two Isends for each
destination process to which it transmits data. The Isends use two different
tags, respectively; one is passed None (by design), while the other is passed
the pointer to a GPU array with nonzero length. The segfault appears to occur
during the latter Isend.
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-27 Thread Rolf vandeVaart
Hi Lev:
I am not sure what is happening here but there are a few things we can do to
try and narrow things down.
1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error
   will go away? (An example invocation is shown after this list.)
2. Do you know whether, when you see this error, it happens on the first pass
   through your communications?  That is, you mention that there are multiple
   iterations through the loop, and I am wondering whether the failures occur
   on the first pass through the loop.
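
For reference, a quick way to try item 1 is something like the following (the
process count and application name are placeholders for whatever you normally
run):

mpirun --mca btl_smcuda_use_cuda_ipc 0 -np 2 python app.py

Setting the environment variable OMPI_MCA_btl_smcuda_use_cuda_ipc=0 before
launching should have the same effect.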

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Friday, March 27, 2015 3:47 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today)
>built against OpenMPI 1.8.4 with CUDA support activated to asynchronously
>send GPU arrays between multiple Tesla GPUs (Fermi generation). Each MPI
>process is associated with a single GPU; the process has a run loop that starts
>several Isends to transmit the contents of GPU arrays to destination
>processes and several Irecvs to receive data from source processes into GPU
>arrays on the process' GPU. Some of the sends/recvs use one tag, while the
>remainder use a second tag. A single Waitall invocation is used to wait for all of
>these sends and receives to complete before the next iteration of the loop
>can commence. All GPU arrays are preallocated before the run loop starts.
>While this pattern works most of the time, it sometimes fails with a segfault
>that appears to occur during an Isend:
>
>[myhost:05471] *** Process received signal ***
>[myhost:05471] Signal: Segmentation fault (11)
>[myhost:05471] Signal code:  (128)
>[myhost:05471] Failing at address: (nil)
>[myhost:05471] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x2ac2bb176340]
>[myhost:05471] [ 1] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x1f6b18)[0x2ac2c48bfb18]
>[myhost:05471] [ 2] /usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x16dcc3)[0x2ac2c4836cc3]
>[myhost:05471] [ 3] /usr/lib/x86_64-linux-gnu/libcuda.so.1(cuIpcGetEventHandle+0x5d)[0x2ac2c480bccd]
>[myhost:05471] [ 4] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_common_cuda_construct_event_and_handle+0x27)[0x2ac2c27d3087]
>[myhost:05471] [ 5] /opt/openmpi-1.8.4/lib/libmpi.so.1(ompi_free_list_grow+0x199)[0x2ac2c277b8e9]
>[myhost:05471] [ 6] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_mpool_gpusm_register+0xf4)[0x2ac2c28c9fd4]
>[myhost:05471] [ 7] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_rdma_cuda_btls+0xcd)[0x2ac2c28f8afd]
>[myhost:05471] [ 8] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_send_request_start_cuda+0xbf)[0x2ac2c28f8d5f]
>[myhost:05471] [ 9] /opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_isend+0x60e)[0x2ac2c28eb6fe]
>[myhost:05471] [10] /opt/openmpi-1.8.4/lib/libmpi.so.1(MPI_Isend+0x137)[0x2ac2c27b7cc7]
>[myhost:05471] [11] /home/lev/Work/miniconda/envs/MYENV/lib/python2.7/site-packages/mpi4py/MPI.so(+0xd3bb2)[0x2ac2c24b3bb2]
>(Python-related debug lines omitted.)
>
>Any ideas as to what could be causing this problem?
>
>I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2015/03/26553.php


[OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-27 Thread Lev Givon
I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today) built
against OpenMPI 1.8.4 with CUDA support activated to asynchronously send GPU
arrays between multiple Tesla GPUs (Fermi generation). Each MPI process is
associated with a single GPU; the process has a run loop that starts several
Isends to transmit the contents of GPU arrays to destination processes and
several Irecvs to receive data from source processes into GPU arrays on the
process' GPU. Some of the sends/recvs use one tag, while the remainder use a
second tag. A single Waitall invocation is used to wait for all of these sends
and receives to complete before the next iteration of the loop can commence. All
GPU arrays are preallocated before the run loop starts. While this pattern works
most of the time, it sometimes fails with a segfault that appears to occur
during an Isend:

[myhost:05471] *** Process received signal ***
[myhost:05471] Signal: Segmentation fault (11)
[myhost:05471] Signal code:  (128)
[myhost:05471] Failing at address: (nil)
[myhost:05471] [ 0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x2ac2bb176340]
[myhost:05471] [ 1]
/usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x1f6b18)[0x2ac2c48bfb18]
[myhost:05471] [ 2]
/usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x16dcc3)[0x2ac2c4836cc3]
[myhost:05471] [ 3]
/usr/lib/x86_64-linux-gnu/libcuda.so.1(cuIpcGetEventHandle+0x5d)[0x2ac2c480bccd]
[myhost:05471] [ 4]
/opt/openmpi-1.8.4/lib/libmpi.so.1(mca_common_cuda_construct_event_and_handle+0x27)[0x2ac2c27d3087]
[myhost:05471] [ 5]
/opt/openmpi-1.8.4/lib/libmpi.so.1(ompi_free_list_grow+0x199)[0x2ac2c277b8e9]
[myhost:05471] [ 6]
/opt/openmpi-1.8.4/lib/libmpi.so.1(mca_mpool_gpusm_register+0xf4)[0x2ac2c28c9fd4]
[myhost:05471] [ 7]
/opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_rdma_cuda_btls+0xcd)[0x2ac2c28f8afd]
[myhost:05471] [ 8]
/opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_send_request_start_cuda+0xbf)[0x2ac2c28f8d5f]
[myhost:05471] [ 9]
/opt/openmpi-1.8.4/lib/libmpi.so.1(mca_pml_ob1_isend+0x60e)[0x2ac2c28eb6fe]
[myhost:05471] [10]
/opt/openmpi-1.8.4/lib/libmpi.so.1(MPI_Isend+0x137)[0x2ac2c27b7cc7]
[myhost:05471] [11]
/home/lev/Work/miniconda/envs/MYENV/lib/python2.7/site-packages/mpi4py/MPI.so(+0xd3bb2)[0x2ac2c24b3bb2]
(Python-related debug lines omitted.)

Any ideas as to what could be causing this problem?

I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
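
For illustration, here is a minimal, self-contained sketch of the pattern
described above (not my actual code): it uses host NumPy buffers as stand-ins
for the PyCUDA GPU arrays, every process exchanges with every other process,
and the tag values, buffer sizes, and iteration count are arbitrary.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

TAG_A, TAG_B = 1, 2  # two tags, as in the description above; values are arbitrary

# Preallocate all buffers once, before the run loop starts.
send_a = np.full(1024, rank, dtype=np.float64)
send_b = np.full(16, rank, dtype=np.float64)
recv_a = {p: np.empty(1024, dtype=np.float64) for p in range(size) if p != rank}
recv_b = {p: np.empty(16, dtype=np.float64) for p in range(size) if p != rank}

for step in range(10):
    reqs = []
    for peer in range(size):
        if peer == rank:
            continue
        # Two Isends per destination (one per tag) plus the matching Irecvs.
        reqs.append(comm.Isend([send_a, MPI.DOUBLE], dest=peer, tag=TAG_A))
        reqs.append(comm.Isend([send_b, MPI.DOUBLE], dest=peer, tag=TAG_B))
        reqs.append(comm.Irecv([recv_a[peer], MPI.DOUBLE], source=peer, tag=TAG_A))
        reqs.append(comm.Irecv([recv_b[peer], MPI.DOUBLE], source=peer, tag=TAG_B))
    # A single Waitall per iteration; the next pass does not start until
    # every send and receive in this batch has completed.
    MPI.Request.Waitall(reqs)
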
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/