Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Rolf vandeVaart
Answers below...
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Thursday, May 21, 2015 2:19 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:
>> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
>>
>> (snip)
>>
>> > I see that you mentioned you are starting 4 MPS daemons.  Are you
>> > following the instructions here?
>> >
>> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
>>
>> Yes - also
>>
>> https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
>>
>> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems
>> > for CUDA IPC. Since you are using CUDA 7 there is no more need to
>> > start multiple daemons. You simply leave CUDA_VISIBLE_DEVICES
>> > untouched and start a single MPS control daemon which will handle all
>> > GPUs.  Can you try that?
>>
>> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value
>> should be passed to all MPI processes.
There is no need to do anything with CUDA_MPS_PIPE_DIRECTORY with CUDA 7.  
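
For reference, a minimal single-daemon setup per node under CUDA 7 might look
roughly like this (an untested sketch; the EXCLUSIVE_PROCESS step is optional
and assumes you have permission to change the compute mode):

  # optional: allow only MPS clients to use the GPUs
  nvidia-smi -c EXCLUSIVE_PROCESS
  # start one control daemon for all GPUs on the node; leave
  # CUDA_VISIBLE_DEVICES and CUDA_MPS_PIPE_DIRECTORY unset
  nvidia-cuda-mps-control -d
  # ... run the MPI job as usual ...
  # shut MPS down afterwards
  echo quit | nvidia-cuda-mps-control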

>>
>> Several questions related to your comment above:
>>
>> - Should the MPI processes select and initialize the GPUs they respectively
>>   need to access as they normally would when MPS is not in use?
Yes.  

>> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS
>>   (and hence the client processes)? I ask because SLURM uses
>>   CUDA_VISIBLE_DEVICES to control GPU resource allocation, and I would like
>>   to run my program (and the MPS control daemon) on a cluster via SLURM.
Yes, I believe that is true.  
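
As a rough, untested sketch: the control daemon picks up CUDA_VISIBLE_DEVICES
from its environment when it starts, so starting it from inside the SLURM
allocation should restrict MPS (and hence the clients) to the GPUs SLURM handed
out. The device list and job size below are only placeholders:

  # SLURM normally sets CUDA_VISIBLE_DEVICES for the allocation, e.g.
  # CUDA_VISIBLE_DEVICES=0,1; the daemon started here inherits it
  nvidia-cuda-mps-control -d
  mpirun -np 16 python your_program.py   # clients connect to that daemon
  echo quit | nvidia-cuda-mps-control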

>> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
>>   MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU
>>   setting with CUDA 6.5 even when one starts multiple MPS control daemons as
>>   described in the aforementioned blog post?
>
>Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to
>solve the problem when IPC is enabled.
>--
Glad to see this worked.  And you are correct that CUDA IPC will not work
between devices if they are segregated by the use of CUDA_VISIBLE_DEVICES, as is
done with MPS under CUDA 6.5.

Rolf


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Lev Givon
Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:
> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
> 
> (snip)
> 
> > I see that you mentioned you are starting 4 MPS daemons.  Are you following
> > the instructions here?
> > 
> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
> >  
> 
> Yes - also
> https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
> 
> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for
> > CUDA IPC. Since you are using CUDA 7 there is no more need to start multiple
> > daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single
> > MPS control daemon which will handle all GPUs.  Can you try that?  
> 
> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value should be
> passed to all MPI processes. 
> 
> Several questions related to your comment above:
> 
> - Should the MPI processes select and initialize the GPUs they respectively
>   need to access as they normally would when MPS is not in use?
> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS
>   (and hence the client processes)? I ask because SLURM uses
>   CUDA_VISIBLE_DEVICES to control GPU resource allocation, and I would like to
>   run my program (and the MPS control daemon) on a cluster via SLURM.
> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
>   MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU
>   setting with CUDA 6.5 even when one starts multiple MPS control daemons as
>   described in the aforementioned blog post?

Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to solve
the problem when IPC is enabled.
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Lev Givon
Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:

(snip)

> I see that you mentioned you are starting 4 MPS daemons.  Are you following
> the instructions here?
> 
> http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
>  

Yes - also
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

> This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA
> IPC. Since you are using CUDA 7 there is no more need to start multiple
> daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single
> MPS control daemon which will handle all GPUs.  Can you try that?  

I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value should be
passed to all MPI processes. 

Several questions related to your comment above:

- Should the MPI processes select and initialize the GPUs they respectively need
  to access as they normally would when MPS is not in use?
- Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS (and
  hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES to
  control GPU resource allocation, and I would like to run my program (and the
  MPS control daemon) on a cluster via SLURM.
- Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
  MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU setting
  with CUDA 6.5 even when one starts multiple MPS control daemons as described
  in the aforementioned blog post?

> Because of this question, we realized we need to update our documentation as
> well.
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-20 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 10:25 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
>> >-Original Message-
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Tuesday, May 19, 2015 6:30 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>> >1.8.5 with CUDA 7.0 and Multi-Process Service
>> >
>> >I'm encountering intermittent errors while trying to use the
>> >Multi-Process Service with CUDA 7.0 for improving concurrent access
>> >to a Kepler K20Xm GPU by multiple MPI processes that perform
>> >GPU-to-GPU communication with each other (i.e., GPU pointers are
>> >passed to the MPI transmission primitives).
>> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI
>> >1.8.5, which is in turn built against CUDA 7.0. In my current
>> >configuration, I have 4 MPS server daemons running, each of which
>> >controls access to one of 4 GPUs; the MPI processes spawned by my
>> >program are partitioned into 4 groups (which might contain different
>> >numbers of processes) that each talk to a separate daemon. For
>> >certain transmission patterns between these processes, the program
>> >runs without any problems. For others (e.g., 16 processes partitioned into
>> >4 groups), however, it dies with the following error:
>> >
>> >[node05:20562] Failed to register remote memory, rc=-1
>> >--
>> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable
>> >error and will cause the program to abort.
>> >  cuIpcOpenMemHandle return value:   21199360
>> >  address: 0x1
>> >Check the cuda.h file for what the return value means. Perhaps a
>> >reboot of the node will clear the problem.
>
>(snip)
>
>> >After the above error occurs, I notice that /dev/shm/ is littered with
>> >cuda.shm.* files. I tried cleaning up /dev/shm before running my
>> >program, but that doesn't seem to have any effect upon the problem.
>> >Rebooting the machine also doesn't have any effect. I should also add
>> >that my program runs without any error if the groups of MPI processes
>> >talk directly to the GPUs instead of via MPS.
>> >
>> >Does anyone have any ideas as to what could be going on?
>>
>> I am not sure why you are seeing this.  One thing that is clear is
>> that you have found a bug in the error reporting.  The error message
>> is a little garbled and I see a bug in what we are reporting. I will fix that.
>>
>> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc
>> 0.  My expectation is that you will not see any errors, but may lose
>> some performance.
>
>The error does indeed go away when IPC is disabled, although I do want to
>avoid degrading the performance of data transfers between GPU memory
>locations.
>
>> What does your hardware configuration look like?  Can you send me
>> output from "nvidia-smi topo -m"
>--

I see that you mentioned you are starting 4 MPS daemons.  Are you following the 
instructions here?

http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
 

This relies on setting CUDA_VISIBLE_DEVICES which can cause problems for CUDA 
IPC. Since you are using CUDA 7 there is no more need to start multiple 
daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single MPS 
control daemon which will handle all GPUs.  Can you try that?  Because of this 
question, we realized we need to update our documentation as well.

Thanks,
Rolf




Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Lev Givon
Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
> >-Original Message-
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with
> >each other (i.e., GPU pointers are passed to the MPI transmission 
> >primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I have
> >4 MPS server daemons running, each of which controls access to one of 4 GPUs;
> >the MPI processes spawned by my program are partitioned into 4 groups
> >(which might contain different numbers of processes) that each talk to a
> >separate daemon. For certain transmission patterns between these
> >processes, the program runs without any problems. For others (e.g., 16
> >processes partitioned into 4 groups), however, it dies with the following 
> >error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program runs
> >without any error if the groups of MPI processes talk directly to the GPUs
> >instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this.  One thing that is clear is that you
> have found a bug in the error reporting.  The error message is a little
> garbled and I see a bug in what we are reporting. I will fix that.
> 
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My
> expectation is that you will not see any errors, but may lose some
> performance.

The error does indeed go away when IPC is disabled, although I do want to
avoid degrading the performance of data transfers between GPU memory locations.

> What does your hardware configuration look like?  Can you send me output from
> "nvidia-smi topo -m"
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Lev Givon
Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
> 
> >-Original Message-
> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
> >Sent: Tuesday, May 19, 2015 6:30 PM
> >To: us...@open-mpi.org
> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
> >1.8.5 with CUDA 7.0 and Multi-Process Service
> >
> >I'm encountering intermittent errors while trying to use the Multi-Process
> >Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
> >by multiple MPI processes that perform GPU-to-GPU communication with
> >each other (i.e., GPU pointers are passed to the MPI transmission 
> >primitives).
> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
> >which is in turn built against CUDA 7.0. In my current configuration, I have
> >4 MPS server daemons running, each of which controls access to one of 4 GPUs;
> >the MPI processes spawned by my program are partitioned into 4 groups
> >(which might contain different numbers of processes) that each talk to a
> >separate daemon. For certain transmission patterns between these
> >processes, the program runs without any problems. For others (e.g., 16
> >processes partitioned into 4 groups), however, it dies with the following 
> >error:
> >
> >[node05:20562] Failed to register remote memory, rc=-1
> >--
> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
> >will cause the program to abort.
> >  cuIpcOpenMemHandle return value:   21199360
> >  address: 0x1
> >Check the cuda.h file for what the return value means. Perhaps a reboot of
> >the node will clear the problem.

(snip)

> >After the above error occurs, I notice that /dev/shm/ is littered with
> >cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
> >but that doesn't seem to have any effect upon the problem. Rebooting the
> >machine also doesn't have any effect. I should also add that my program runs
> >without any error if the groups of MPI processes talk directly to the GPUs
> >instead of via MPS.
> >
> >Does anyone have any ideas as to what could be going on?
>
> I am not sure why you are seeing this.  One thing that is clear is that you
> have found a bug in the error reporting.  The error message is a little
> garbled and I see a bug in what we are reporting. I will fix that.
> 
> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My
> expectation is that you will not see any errors, but may lose some
> performance.
> 
> What does your hardware configuration look like?  Can you send me output from
> "nvidia-smi topo -m"

        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-23
GPU1    PHB      X      SOC     SOC     0-23
GPU2    SOC     SOC      X      PHB     0-23
GPU3    SOC     SOC     PHB      X      0-23

Legend:

  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
-- 
Lev Givon
Bionet Group | Neurokernel Project
http://www.columbia.edu/~lev/
http://lebedov.github.io/
http://neurokernel.github.io/



Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Rolf vandeVaart
I am not sure why you are seeing this.  One thing that is clear is that you 
have found a bug in the error reporting.  The error message is a little garbled 
and I see a bug in what we are reporting. I will fix that.

If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My 
expectation is that you will not see any errors, but may lose some performance.
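
For example (the process count and script name below are just placeholders for
your own job):

  mpirun -np 16 --mca btl_smcuda_use_cuda_ipc 0 python your_program.py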

What does your hardware configuration look like?  Can you send me output from 
"nvidia-smi topo -m"

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 6:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>1.8.5 with CUDA 7.0 and Multi-Process Service
>
>I'm encountering intermittent errors while trying to use the Multi-Process
>Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
>by multiple MPI processes that perform GPU-to-GPU communication with
>each other (i.e., GPU pointers are passed to the MPI transmission primitives).
>I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
>which is in turn built against CUDA 7.0. In my current configuration, I have 4
>MPS server daemons running, each of which controls access to one of 4 GPUs;
>the MPI processes spawned by my program are partitioned into 4 groups
>(which might contain different numbers of processes) that each talk to a
>separate daemon. For certain transmission patterns between these
>processes, the program runs without any problems. For others (e.g., 16
>processes partitioned into 4 groups), however, it dies with the following 
>error:
>
>[node05:20562] Failed to register remote memory, rc=-1
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   21199360
>  address: 0x1
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file
>pml_ob1_recvreq.c at line 477
>---
>Child job 2 terminated normally, but 1 process returned a non-zero exit code..
>Per user-direction, the job has been aborted.
>---
>[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
>mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
>[node05:20564] Failed to register remote memory, rc=-1 [node05:20564]
>[[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20566] Failed to register remote memory, rc=-1 [node05:20566]
>[[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20567] Failed to register remote memory, rc=-1 [node05:20567]
>[[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
>mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>[node05:20569] Failed to register remote memory, rc=-1 [node05:20569]
>[[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20571] Failed to register remote memory, rc=-1 [node05:20571]
>[[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20572] Failed to register remote memory, rc=-1 [node05:20572]
>[[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>
>After the above error occurs, I notice that /dev/shm/ is littered with
>cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
>but that doesn't seem to have any effect upon the problem. Rebooting the
>machine also doesn't have any effect. I should also add that my program runs
>without any error if the groups of MPI processes talk directly to the GPUs
>instead of via MPS.
>
>Does anyone have any ideas as to what could be going on?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-mpi.org/community/lists/users/2015/05/26881.php