Thanks Andreas for the hint. Actually, what I am trying to do is a bit more
complex than that. I have two Python processes running on two GPUs. In a
simpler setting, I have an array x in gpu0's Python process that needs to be
transferred to gpu1's process, and vice versa.

I solved it with this scheme:
* allocate host memory
* memcpy from device to host (gpu0 to host; gpu1 to host)
* send/receive the host-memory objects to the Python process on the other
gpu
* memcpy from host to device within the respective gpu

The solution and the output from a sample run follow. Now I wonder whether
this can be improved further. One question is whether the device-to-host
copy can be eliminated, because I need to transfer several theano tensors
between multiple (up to 4) GPUs, and I need to do this quite frequently
(say every nth mini-batch) during training.
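
Since this exchange happens every nth mini-batch, one thing that might help
is cutting the pickling cost of send_pyobj / recv_pyobj: pyzmq can send any
buffer-interface object (such as the pinned numpy array) without copying,
and the staging buffers can be allocated once and reused. A rough sketch
follows; the helper name and its arguments are made up, and it assumes both
sides already know the dtype and shape of the tensor being exchanged.

# Sketch only: ship the pinned staging buffer as raw bytes instead of a
# pickled object; x_host / y_host are page-locked buffers created once.
import numpy as np
import pycuda.driver as drv

def exchange(sock, x_gpu, y_gpu_copy, x_host, y_host):
    drv.memcpy_dtoh(x_host, x_gpu.ptr)   # synchronous, so x_host is ready to send
    sock.send(x_host, copy=False)        # zero-copy send of the raw buffer
    msg = sock.recv()                    # raw bytes from the peer process
    y_host[:] = np.frombuffer(msg, dtype=y_host.dtype).reshape(y_host.shape)
    drv.memcpy_htod(y_gpu_copy.ptr, y_host)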

Note: Not all of the GPUs are P2P capable, so memcpy_peer wouldn't work.
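
For what it's worth, here is a small sketch of how to check which device
pairs actually support P2P (this assumes your PyCUDA build exposes
Device.can_access_peer, i.e. it was built against CUDA 4.0 or newer):

# Sketch: report which GPU pairs could use peer-to-peer access
import pycuda.driver as drv

drv.init()
n = drv.Device.count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = drv.Device(i).can_access_peer(drv.Device(j))
            print "gpu%d -> gpu%d P2P: %s" % (i, j, ok)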

import multiprocessing as mp
import numpy as np
import zmq
import time

import pycuda
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray

def proc1():
    import theano  # importing theano here sets up the GPU for this process
    sock = zmq.Context().socket(zmq.PAIR)
    sock.connect('tcp://localhost:5003')

    drv.init()
    ctx = drv.Context.attach()  # attach to the context theano already created

    x_gpu = gpuarray.to_gpu(np.random.rand(8))
    y_gpu_copy = gpuarray.zeros_like(x_gpu)

    # page-locked staging buffer on the host
    x_host = drv.pagelocked_zeros_like(x_gpu)
    drv.memcpy_dtoh_async(x_host, x_gpu.ptr)
    drv.Context.synchronize()  # make sure the async copy finished before sending

    # exchange the host buffers with the other process
    sock.send_pyobj(x_host)
    y_host_copy = sock.recv_pyobj()
    drv.memcpy_htod_async(y_gpu_copy.ptr, y_host_copy)

    print "Proc-1: value before transfer\n", x_gpu
    print "Proc-1: value after transfer\n", y_gpu_copy
    print "Proc-1: sum after transfer\n", x_gpu + y_gpu_copy

    ctx.detach()

def proc2():
    import theano  # importing theano here sets up the GPU for this process
    sock = zmq.Context().socket(zmq.PAIR)
    sock.bind('tcp://*:5003')

    drv.init()
    ctx = drv.Context.attach()  # attach to the context theano already created

    y_gpu = gpuarray.to_gpu(np.random.rand(8) * 0.9)
    x_gpu_copy = gpuarray.zeros_like(y_gpu)

    # page-locked staging buffer on the host
    y_host = drv.pagelocked_zeros_like(y_gpu)
    drv.memcpy_dtoh_async(y_host, y_gpu.ptr)
    drv.Context.synchronize()  # make sure the async copy finished before sending

    # exchange the host buffers with the other process
    sock.send_pyobj(y_host)
    x_host_copy = sock.recv_pyobj()
    drv.memcpy_htod_async(x_gpu_copy.ptr, x_host_copy)

    time.sleep(10)  # crude delay so the two processes' outputs do not interleave
    print "\nProc-2: value before transfer\n", y_gpu
    print "Proc-2: value after transfer\n", x_gpu_copy
    print "Proc-2: sum after transfer\n", y_gpu + x_gpu_copy

    ctx.detach()

if __name__ == '__main__':
    p1 = mp.Process(target=proc1)
    p2 = mp.Process(target=proc2)

    p1.start()
    p2.start()

    p1.join()
    p2.join()

Here is the output from a sample run. As expected, the sum values in the
two processes are the same at the end.

[dccxc090] ~/multi-GPUs $ /opt/share/Python-2.7.9/bin/python
multi_pycuda_d2d_demo.py
Using gpu device 0: Tesla K40m (CNMeM is disabled)
Using gpu device 1: Tesla K40m (CNMeM is disabled)
Proc-1: value before transfer
[ 0.64424104  0.98413032  0.46654151  0.40943486  0.6895878   0.81006672
  0.00907435  0.88727554]
Proc-1: value after transfer
[ 0.57981693  0.88571729  0.41988736  0.36849138  0.62062902  0.72906005
  0.00816691  0.79854798]
Proc-1: sum after transfer
[ 1.22405797  1.86984761  0.88642887  0.77792624  1.31021682  1.53912676
  0.01724126  1.68582352]

Proc-2: value before transfer
[ 0.57981693  0.88571729  0.41988736  0.36849138  0.62062902  0.72906005
  0.00816691  0.79854798]
Proc-2: value after transfer
[ 0.64424104  0.98413032  0.46654151  0.40943486  0.6895878   0.81006672
  0.00907435  0.88727554]
Proc-2: sum after transfer
[ 1.22405797  1.86984761  0.88642887  0.77792624  1.31021682  1.53912676
  0.01724126  1.68582352]

- Baskaran

On Wed, Nov 11, 2015 at 2:40 AM, Andreas Kloeckner <li...@informa.tiker.net>
wrote:

> Baskaran Sankaran <baskar...@gmail.com> writes:
>
> > Hi all,
> >
> > I am looking for a solution for exchanging some tensors between two gpus,
> > that do not have P2P enabled. Assuming two GPUs on the same node, I
> guess I
> > have to do it in two steps; first copy to host memory from GPU (gpu-0)
> and
> > then copy from host memory to the other GPU (gpu-1). However it is not
> > exactly clear to me as to how I can go about this.
>
> (1) Allocate memory on host
> (2) memcpy(host mem, gpu0_mem)
> (3) memcpy(gpu1_mem, host_mem)
> (4) (Optionally) free host mem
>
> Not sure what you're asking...
>
> Andreas
>
