Just to add a concrete and simple example that I guess will clarify my
situation: the following code creates two buffers on the host side, one
page-locked and the other an ordinary one, and then copies to/from a GPU
buffer, measuring performance with events.
It's really simple indeed; there's no execution on multiple GPUs, but I
would expect it to run in more or less the same time on the server
using just one of the Teslas.
However, it takes less than a second to run on my laptop and nearly 15
seconds on the server!
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np


def benchmark(up):
    """Up is a boolean flag. If True, the benchmark copies from host
    to device; if False, it copies the other way round.
    """
    # Buffer size: 10 MiB
    size = 10 * 1024 * 1024
    # Host and device buffers, equally shaped. We don't care about
    # their contents.
    cpu_buff = np.empty(size, np.dtype('u1'))
    cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
    gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)
    # Events for measuring execution time; the first two are for the
    # non-pinned buffer, the last two for the pinned (page-locked) one.
    startn = cuda.Event()
    endn = cuda.Event()
    startl = cuda.Event()
    endl = cuda.Event()
    if up:
        startn.record()
        cuda.memcpy_htod(gpu_buff, cpu_buff)
        endn.record()
        endn.synchronize()
        t1 = endn.time_since(startn)
        startl.record()
        cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
        endl.record()
        endl.synchronize()
        t2 = endl.time_since(startl)
        print "From host to device benchmark results:\n"
        print "Time for copying from normal host mem: %i ms\n" % t1
        print "Time for copying from pinned host mem: %i ms\n" % t2
        diff = t1 - t2
        if diff > 0:
            print "Copy from pinned memory was %i ms faster\n" % diff
        else:
            # Negate diff so we don't print a negative number of ms
            print "Copy from pinned memory was %i ms slower\n" % -diff
    else:
        startn.record()
        cuda.memcpy_dtoh(cpu_buff, gpu_buff)
        endn.record()
        endn.synchronize()
        t1 = endn.time_since(startn)
        startl.record()
        cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
        endl.record()
        endl.synchronize()
        t2 = endl.time_since(startl)
        print "From device to host benchmark results:\n"
        print "Time for copying to normal host mem: %i ms\n" % t1
        print "Time for copying to pinned host mem: %i ms\n" % t2
        diff = t1 - t2
        if diff > 0:
            print "Copy to pinned memory was %i ms faster\n" % diff
        else:
            print "Copy to pinned memory was %i ms slower\n" % -diff


benchmark(up=False)
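For what it's worth, here is a back-of-the-envelope estimate (plain Python, no CUDA needed) of what a 10 MiB copy *should* cost. The bandwidth figures are assumptions on my part (rough, typical PCIe 2.0 numbers), not measurements from either machine, but even under pessimistic assumptions the copy should finish in a few milliseconds, not seconds:

```python
# Rough estimate of host<->device copy time for the 10 MiB buffer used
# in the benchmark above. Bandwidths below are assumed ballpark figures
# for PCIe 2.0 (pageable vs pinned), not measured values.
size_bytes = 10 * 1024 * 1024          # same buffer size as the benchmark

pageable_bw = 3e9                      # assumed ~3 GB/s for pageable memory
pinned_bw = 6e9                        # assumed ~6 GB/s for pinned memory

t_pageable_ms = size_bytes / pageable_bw * 1000
t_pinned_ms = size_bytes / pinned_bw * 1000

print("Expected pageable copy: %.2f ms" % t_pageable_ms)   # a few ms
print("Expected pinned copy:   %.2f ms" % t_pinned_ms)
```

So even if the server's bus were several times slower than assumed here, that would still be nowhere near 15 seconds, which is why I suspect something other than raw bandwidth.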
On Mon, Jul 30, 2012 at 3:22 PM, Leandro Demarco Vedelago
<[email protected]> wrote:
> ---------- Forwarded message ----------
> From: Leandro Demarco Vedelago <[email protected]>
> Date: Mon, Jul 30, 2012 at 2:57 PM
> Subject: Re: [PyCUDA] Performance Issues
> To: Brendan Wood <[email protected]>, [email protected]
>
>
> Brendan:
> Basically, all the examples compute the dot product of two large
> vectors, but each example introduces some new concept (pinned
> memory, streams, etc.).
> The last example is the one that incorporates multiple GPUs.
>
> As for the work done, I am generating the data randomly and making
> some tests at the end on the host side, which considerably increases
> execution time, but as these are "learning examples" I was not
> especially worried about that. Still, I would have expected that,
> given that the server has far more powerful hardware (the 3 Tesla
> C2075s, 4 Intel Xeons with 6 cores each, and 48 GB of RAM), programs
> would run faster, in particular this last example, which is designed
> to work with multiple GPUs.
>
> I compiled and ran the bandwidthTest and deviceQuery samples from
> the SDK and they both passed, if that is what you meant.
>
> Now answering to Andreas:
> Yes, I'm using one thread per GPU (the way it's done in the
> wiki example) and yes, the server has way more than 3 CPUs. As for
> the SCHED_BLOCKING_SYNC flag, should I pass it as an argument when
> creating each device's context? And what does this flag do?
>
> Thank you both for your answers
>
> On Mon, Jul 30, 2012 at 12:47 AM, Brendan Wood <[email protected]> wrote:
>> Hi Leandro,
>>
>> Without knowing exactly what examples you're running, it may be hard to
>> say what the problem is. In fact, you may not really have a problem.
>>
>> How much work is being done in each example program? Is it enough to
>> really work the GPU, or is communication and other overhead dominating
>> runtime? Note that laptops may have lower communication latency over
>> the PCI bus than desktops/servers, which can make small programs run
>> much faster on laptops regardless of how much processing power the GPU
>> has.
>>
>> Have you tried running the sample code from the SDK, so that you can
>> verify that it's not a code problem?
>>
>> Regards,
>>
>> Brendan Wood
>>
>>
>> On Sun, 2012-07-29 at 23:59 -0300, Leandro Demarco Vedelago wrote:
>>> Hello: I've been reading and learning CUDA in the last few weeks and
>>> last week I started writing (translating to Pycuda from Cuda-C) some
>>> examples taken from the book "Cuda by Example".
>>> I started coding on a laptop with a single NVIDIA GPU (a GTX 560M,
>>> if I remember right) running Windows 7.
>>>
>>> But in the project I'm currently working at, we intend to run (py)cuda
>>> on a multi-gpu server that has three Tesla C2075 cards.
>>>
>>> So I installed Ubuntu server 10.10 (with no GUI) and managed to
>>> install and get running the very same examples I ran on the single-gpu
>>> laptop. However, they run really slowly; in some cases they take
>>> three times longer than on the laptop. And this happens with most,
>>> if not all, of the examples I wrote.
>>>
>>> I thought it could be a driver issue, but I double-checked and I've
>>> installed the correct ones, meaning those listed in the CUDA Zone
>>> section of nvidia.com for 64-bit Linux. So I'm kind of lost right
>>> now and was wondering if anyone has run into this or a similar
>>> problem on a server.
>>>
>>> Sorry for the English, but it's not my native language.
>>>
>>> Thanks in advance, Leandro Demarco
>>>
>>> _______________________________________________
>>> PyCUDA mailing list
>>> [email protected]
>>> http://lists.tiker.net/listinfo/pycuda
>>
>>