Ok, I think you found the source of my problem, Apostolis.

I profiled the execution both on the server and on the laptop: the
memcpy calls with non-pinned memory were considerably faster on the
server's Tesla than on the laptop's GT 540M, and pinned-memory
transfers took about the same time on both.

From your previous email, I decided to take a look at the (py)cuda
initialization. So I removed the pycuda.autoinit import and did the
initialization "by hand" to perform some rustic time-measuring. I
added the following lines at the start of the benchmark() function:
   print "Starting initialization"
   cuda.init()
   dev = cuda.Device(0)
   ctx = dev.make_context()
   print "Initialization finished"


So I ran this modified code. On the laptop it executed pretty fast,
with less than a second elapsing between the two prints. But when I
ran it on the server, about 10 seconds passed between the first print
and the last one.

After receiving your last e-mail I ran nvidia-smi first and then the
program, with no changes. But then I tried leaving nvidia-smi looping
with the -l argument and running the program on another tty and, to my
surprise, it ran in a little less than 2 seconds, against the nearly
15 when nvidia-smi isn't running.
This is still slower than the laptop, but this particular code is not
optimized for multi-GPU, and there could be other factors, like
communication latency over the PCI bus (which, as I was told on this
list, is sometimes lower on laptops) and the fact that I am executing
remotely via ssh.
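
My current guess (if I read the driver docs right, so take it with a
grain of salt): without X, the kernel module is released when the last
client exits, so every run pays the full driver initialization again.
If that's the cause, enabling persistence mode should do the same job
as the looping nvidia-smi:

```shell
# Enable persistence mode on all GPUs (needs root); the driver state
# stays loaded after the last client exits, so CUDA context creation
# no longer pays the driver re-initialization on every run.
nvidia-smi -pm 1
```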

As for what you told me about /dev/nvidia, I had to create those
device nodes previously: since I didn't install a GUI they weren't
created at boot time, so CUDA programs would not detect the devices (I
ran into this after finishing the CUDA installation, when the
deviceQuery example from the SDK reported that no CUDA-capable devices
were found).

Any further ideas on why running nvidia-smi at the same time boosts
initialization so much? You've been really helpful and I appreciate
it, even if you can't help me any more (I'll just have to wait for
those damned Nvidia forums to come back :) )


On Tue, Jul 31, 2012 at 1:52 PM, Apostolis Glenis <[email protected]> wrote:
> I think it is the same case.
> The NVIDIA driver is initialized when X-windows starts, or at the first
> execution of a GPU program.
> Could you try running nvidia-smi first and then your program?
> I have read somewhere (I think on the thrust-users mailing list) that you
> have to load /dev/nvidia first, or something like that.
> The closest thing I could find was this:
> http://www.gpugrid.net/forum_thread.php?id=266
>
>
> 2012/7/31 Leandro Demarco Vedelago <[email protected]>
>>
>> Apostolis, I'm not using X windows, as I did not install any GUI on the
>> server
>>
>> On Tue, Jul 31, 2012 at 11:46 AM, Apostolis Glenis
>> <[email protected]> wrote:
>> > Maybe it has to do with the initialization of the GPU, if another
>> > GPU is responsible for X windows.
>> >
>> >
>> > 2012/7/31 Leandro Demarco Vedelago <[email protected]>
>> >>
>> >> Just to add a concrete and simple example that I guess will clarify
>> >> my situation. The following code creates two buffers on the host
>> >> side, one page-locked and the other a common one, and then copies
>> >> to/from a GPU buffer, using events for time measuring.
>> >> It's really simple indeed; there's no execution on multiple GPUs,
>> >> but I would expect it to run in more or less the same time on the
>> >> server using just one of the Teslas. However, it takes less than a
>> >> second to run on my laptop and nearly 15 seconds on the server!
>> >>
>> >> import pycuda.driver as cuda
>> >> import pycuda.autoinit
>> >> import numpy as np
>> >>
>> >> def benchmark(up):
>> >>         """ up is a boolean flag: if True, the benchmark copies from
>> >>         host to device; if False, it runs the other way round.
>> >>         """
>> >>
>> >>         # Buffer size
>> >>         size = 10*1024*1024
>> >>
>> >>         # Host and device buffers, equally shaped. We don't care
>> >>         # about their contents.
>> >>         cpu_buff = np.empty(size, np.dtype('u1'))
>> >>         cpu_locked_buff = cuda.pagelocked_empty(size, np.dtype('u1'))
>> >>         gpu_buff = cuda.mem_alloc(cpu_buff.nbytes)
>> >>
>> >>         # Events for measuring execution time; first two for the
>> >>         # non-pinned buffer, last two for the pinned (locked) one
>> >>         startn = cuda.Event()
>> >>         endn = cuda.Event()
>> >>         startl = cuda.Event()
>> >>         endl = cuda.Event()
>> >>
>> >>         if up:
>> >>                 startn.record()
>> >>                 cuda.memcpy_htod(gpu_buff, cpu_buff)
>> >>                 endn.record()
>> >>                 endn.synchronize()
>> >>                 t1 = endn.time_since(startn)
>> >>
>> >>                 startl.record()
>> >>                 cuda.memcpy_htod(gpu_buff, cpu_locked_buff)
>> >>                 endl.record()
>> >>                 endl.synchronize()
>> >>                 t2 = endl.time_since(startl)
>> >>
>> >>                 print "From host to device benchmark results:\n"
>> >>                 print "Time for copying from normal host mem: %i ms\n" % t1
>> >>                 print "Time for copying from pinned host mem: %i ms\n" % t2
>> >>
>> >>                 diff = t1 - t2
>> >>                 if diff > 0:
>> >>                         print "Copy from pinned memory was %i ms faster\n" % diff
>> >>                 else:
>> >>                         print "Copy from pinned memory was %i ms slower\n" % -diff
>> >>
>> >>         else:
>> >>                 startn.record()
>> >>                 cuda.memcpy_dtoh(cpu_buff, gpu_buff)
>> >>                 endn.record()
>> >>                 endn.synchronize()
>> >>                 t1 = endn.time_since(startn)
>> >>
>> >>                 startl.record()
>> >>                 cuda.memcpy_dtoh(cpu_locked_buff, gpu_buff)
>> >>                 endl.record()
>> >>                 endl.synchronize()
>> >>                 t2 = endl.time_since(startl)
>> >>
>> >>                 print "From device to host benchmark results:\n"
>> >>                 print "Time for copying to normal host mem: %i ms\n" % t1
>> >>                 print "Time for copying to pinned host mem: %i ms\n" % t2
>> >>
>> >>                 diff = t1 - t2
>> >>                 if diff > 0:
>> >>                         print "Copy to pinned memory was %i ms faster\n" % diff
>> >>                 else:
>> >>                         print "Copy to pinned memory was %i ms slower\n" % -diff
>> >>
>> >> benchmark(up=False)
>> >>
>> >>
>> >> On Mon, Jul 30, 2012 at 3:22 PM, Leandro Demarco Vedelago
>> >> <[email protected]> wrote:
>> >> > ---------- Forwarded message ----------
>> >> > From: Leandro Demarco Vedelago <[email protected]>
>> >> > Date: Mon, Jul 30, 2012 at 2:57 PM
>> >> > Subject: Re: [PyCUDA] Performance Issues
>> >> > To: Brendan Wood <[email protected]>, [email protected]
>> >> >
>> >> >
>> >> > Brendan:
>> >> > Basically, all the examples are computing the dot product of 2 large
>> >> > vectors. But in each example some new concept is introduced (pinned
>> >> > memory, streams, etc).
>> >> > The last example is the one that incorporates multiple-gpu.
>> >> >
>> >> > As for the work done, I am generating the data randomly and making
>> >> > some tests at the end on the host side, which considerably
>> >> > increases execution time, but as these are "learning examples" I
>> >> > was not especially worried about it. I would have expected,
>> >> > though, that given the server's far more powerful hardware (the
>> >> > three Tesla C2075s, four Intel Xeons with 6 cores each, and 48 GB
>> >> > RAM) the programs would run faster, in particular this last
>> >> > example, which is designed to work with multiple GPUs.
>> >> >
>> >> > I compiled and ran the bandwidthTest and deviceQuery samples from
>> >> > the SDK and they both passed, if that is what you meant.
>> >> >
>> >> > Now answering Andreas:
>> >> > yes, I'm using one thread per GPU (the way it's done in the wiki
>> >> > example) and yes, the server has way more than 3 CPUs. As for the
>> >> > SCHED_BLOCKING_SYNC flag, should I pass it as an argument for each
>> >> > device context? What does this flag do?
>> >> >
>> >> > Thank you both for your answers
>> >> >
>> >> > On Mon, Jul 30, 2012 at 12:47 AM, Brendan Wood
>> >> > <[email protected]>
>> >> > wrote:
>> >> >> Hi Leandro,
>> >> >>
>> >> >> Without knowing exactly what examples you're running, it may be
>> >> >> hard to say what the problem is.  In fact, you may not really
>> >> >> have a problem.
>> >> >>
>> >> >> How much work is being done in each example program?  Is it
>> >> >> enough to really work the GPU, or is communication and other
>> >> >> overhead dominating runtime?  Note that laptops may have lower
>> >> >> communication latency over the PCI bus than desktops/servers,
>> >> >> which can make small programs run much faster on laptops
>> >> >> regardless of how much processing power the GPU has.
>> >> >>
>> >> >> Have you tried running the sample code from the SDK, so that you can
>> >> >> verify that it's not a code problem?
>> >> >>
>> >> >> Regards,
>> >> >>
>> >> >> Brendan Wood
>> >> >>
>> >> >>
>> >> >> On Sun, 2012-07-29 at 23:59 -0300, Leandro Demarco Vedelago wrote:
>> >> >>> Hello: I've been reading and learning CUDA over the last few
>> >> >>> weeks, and last week I started writing (translating from CUDA C
>> >> >>> to PyCUDA) some examples taken from the book "CUDA by Example".
>> >> >>> I started coding on a laptop with just one Nvidia GPU (a GTX
>> >> >>> 560M, if my memory is all right) running Windows 7.
>> >> >>>
>> >> >>> But in the project I'm currently working on, we intend to run
>> >> >>> (Py)CUDA on a multi-GPU server that has three Tesla C2075 cards.
>> >> >>>
>> >> >>> So I installed Ubuntu Server 10.10 (with no GUI) and managed to
>> >> >>> install and run the very same examples I ran on the single-GPU
>> >> >>> laptop. However, they run really slowly; in some cases they take
>> >> >>> 3 times longer than on the laptop. And this happens with most,
>> >> >>> if not all, of the examples I wrote.
>> >> >>>
>> >> >>> I thought it could be a driver issue, but I double-checked and
>> >> >>> I've installed the correct ones, meaning those listed in the
>> >> >>> CUDA Zone section of nvidia.com for 64-bit Linux. So I'm kind of
>> >> >>> lost right now and was wondering if anyone has had this or a
>> >> >>> somewhat similar problem running on a server.
>> >> >>>
>> >> >>> Sorry for the English, but it's not my native language.
>> >> >>>
>> >> >>> Thanks in advance, Leandro Demarco
>> >> >>>
>> >> >>> _______________________________________________
>> >> >>> PyCUDA mailing list
>> >> >>> [email protected]
>> >> >>> http://lists.tiker.net/listinfo/pycuda
>> >> >>
>> >> >>
>> >>
>> >
>> >
>
>
