Thanks a lot, I see... It is amazing how a "trivial" thing for a serial code becomes a "nightmare" on a GPU.
Thanks again,

Fran.

On Apr 11, 2012, at 20:23, "Pazzula, Dominic J" <dominic.j.pazz...@citi.com> wrote:

> So far, I cannot improve upon what you originally had.
>
> I made this change to see what happened:
>
> grid_gpu_template = """
> __global__ void grid(float *values, int size, float *temp_grid)
> {
>     unsigned int id = threadIdx.x;
>     unsigned int bid = blockIdx.x;
>     const uint threads = %(max_number_of_threads)s;
>
>     int i, bin;
>     const uint interv = %(interv)s;
>     __shared__ float grd[interv];
>     unsigned int row = bid*interv;
>     unsigned int idx;
>
>     /* Zero this block's shared histogram. */
>     for (i = 0; i < interv; i++) {
>         grd[i] = 0;
>     }
>
>     /* With block=(1,1,1), one thread per block walks its slice. */
>     for (i = 0; i < threads; i++) {
>         idx = bid*threads + i;
>         if (idx < size) {
>             bin = (int)(values[idx]*interv);
>             if (bin == interv) {
>                 bin = interv - 1;
>             }
>             grd[bin] += 1;
>         }
>     }
>
>     /* Copy the block's histogram out to global memory. */
>     for (i = 0; i < interv; i++) {
>         temp_grid[row+i] = grd[i];
>     }
> }
> """
>
> (called like grid(values_gpu, np.int32(number_of_points), temp_grid_gpu,
> grid=(blocks,1), block=(1,1,1)))
>
> This was about 3.5 times slower.
>
> I commented out the write to grd[bin] (/*grd[bin] += 1;*/) and it ran in
> less than 10 ms. What this tells me is that the contention of writing to
> the shared memory far outweighs the slow write to global memory.
>
> Maybe someone else can find a faster way...
>
> -----Original Message-----
> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
> Sent: Wednesday, April 11, 2012 11:54 AM
> To: Pazzula, Dominic J [ICG-IT]
> Subject: Re: [PyCUDA] Histograms with PyCUDA
>
> That would be great!
>
> Thanks a lot for your help, I really appreciate it.
>
> Best,
>
> Fran.
>
> On 11/04/2012, at 18:51, Pazzula, Dominic J wrote:
>
>> n/p. I am going to have some time later today. I'm going to look
>> harder at this and make sure we are not going down a rabbit hole.
>>
>> -----Original Message-----
>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>> Sent: Wednesday, April 11, 2012 11:48 AM
>> To: Pazzula, Dominic J [ICG-IT]
>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>
>> Thanks a lot!
>>
>> I will take a look at it.
>>
>> Fran.
>>
>> On 11/04/2012, at 18:44, Pazzula, Dominic J wrote:
>>
>>> atomicAdd() is the answer to the race condition question.
>>>
>>> http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
>>>
>>> -----Original Message-----
>>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>>> Sent: Wednesday, April 11, 2012 11:01 AM
>>> To: Pazzula, Dominic J [ICG-IT]
>>> Cc: pycuda@tiker.net
>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>
>>> Thanks a lot for the detailed answer!
>>>
>>> So you are suggesting creating a histogram per block, right? My
>>> (probably stupid) question is how you manage to account for all the
>>> elements properly. Imagine that thread 0 is analyzing an array
>>> element whose bin is zero; it will then do something like A[0]++.
>>> But if thread 25 is analyzing another element whose bin is also
>>> zero, how can this bin be summed correctly, given that thread 0 has
>>> already found an element for bin 0?
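For that race-condition question, here is a minimal sketch (not benchmarked here) that combines the per-block shared histogram suggested in the thread with the atomicAdd() answer: atomics serialize only the threads that collide on the same bin, instead of every write. It assumes shared-memory atomics, i.e. compute capability 1.2 or newer (the 8500GT mentioned later in the thread is 1.1 and would need the slower global-memory atomicAdd instead); the names atomic_hist and block_hists are illustrative, not from the thread.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda import compiler

atomic_hist_template = """
__global__ void atomic_hist(float *values, int size,
                            unsigned int *block_hists)
{
    const uint interv = %(interv)s;
    __shared__ unsigned int hist[interv];

    /* Zero the block's shared histogram cooperatively. */
    for (int i = threadIdx.x; i < interv; i += blockDim.x)
        hist[i] = 0;
    __syncthreads();

    /* Grid-stride loop over the input; atomicAdd resolves the race
       when two threads of the block hit the same bin. */
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < size;
         i += blockDim.x*gridDim.x) {
        int bin = (int)(values[i]*interv);
        if (bin == interv)
            bin = interv - 1;
        atomicAdd(&hist[bin], 1u);
    }
    __syncthreads();

    /* One partial histogram per block goes out to global memory. */
    for (int i = threadIdx.x; i < interv; i += blockDim.x)
        block_hists[blockIdx.x*interv + i] = hist[i];
}
"""

number_of_points = 10000000
interv = 1024
threads = 256
blocks = 64

mod = compiler.SourceModule(atomic_hist_template % {'interv': interv})
atomic_hist = mod.get_function("atomic_hist")

values_gpu = gpuarray.to_gpu(
    np.random.random(number_of_points).astype(np.float32))
block_hists_gpu = gpuarray.zeros((blocks, interv), dtype=np.uint32)

atomic_hist(values_gpu, np.int32(number_of_points), block_hists_gpu,
            grid=(blocks, 1), block=(threads, 1, 1))

# Summing blocks*interv partial counts is cheap; do it on the host.
hist = block_hists_gpu.get().sum(axis=0)

With 1024 bins of unsigned int, each block uses 4 KB of shared memory, well within the 16 KB per multiprocessor of that hardware generation.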
>>>
>>> I have tried the following modification to the kernel:
>>>
>>> grid_gpu_template = """
>>> __global__ void grid(float *values, int size, float *temp_grid)
>>> {
>>>     unsigned int id = threadIdx.x;
>>>     int i, bin;
>>>     const uint interv = %(interv)s;
>>>
>>>     /* Per-thread histogram in a "local" array. */
>>>     float A[interv];
>>>     for (i = 0; i < interv; i++) {
>>>         A[i] = 0.0;
>>>     }
>>>
>>>     for (i = id; i < size; i += blockDim.x) {
>>>         bin = (int)(values[i]*interv);
>>>         if (bin == interv) {
>>>             bin = interv - 1;
>>>         }
>>>         A[bin] += 1.0;
>>>     }
>>>
>>>     /* Copy the per-thread histogram out to global memory. */
>>>     for (i = 0; i < interv; i++) {
>>>         temp_grid[id*interv+i] = A[i];
>>>     }
>>> }
>>> """
>>>
>>> As you can see, the "local variable" A is used to build the
>>> histogram, and when the kernel finishes, its values are transferred
>>> to global memory. The time to build the histogram in the variable A
>>> is very short (under 1 ms), whereas the time to transfer it to
>>> global memory (the last loop) becomes very large. At the end of the
>>> day, with this kernel, the total time is even worse than with the
>>> previous one.
>>>
>>> Thanks again,
>>>
>>> Fran.
>>>
>>> On 11/04/2012, at 16:59, Pazzula, Dominic J wrote:
>>>
>>>> Yes and no. Any variable you declare inside the kernel is "local."
>>>> However, registers are limited to only a few KB. These registers
>>>> are the memory locations located on the chip, next to the
>>>> processors. If the compiler detects that you are over your limit,
>>>> it will move that memory into the slower global bank (i.e. those
>>>> big RAM chips you see stuck on the card).
>>>>
>>>> What I suggested about using shared memory is basically what you
>>>> are suggesting. Shared memory is pseudo-local to the processor. It
>>>> is shared across the threads in a block. Physically, it is located
>>>> on the chip. It is very fast, much faster than accessing global
>>>> memory.
>>>>
>>>> Writing into shared memory is where you get the memory contention
>>>> I was talking about. Sorry about the confusion.
>>>>
>>>> So what you should do is have each block tally up its bin counts
>>>> into a shared array. Your output matrix should be (Blocks x Bins)
>>>> instead of (Threads x Bins). At the end of the kernel, call
>>>> __syncthreads() and then write the values from the block's shared
>>>> memory into the output matrix. You only need to do the write once,
>>>> so only one thread needs to do it.
>>>>
>>>> This should give you a factor of 2 or more improvement.
>>>>
>>>> For now, don't worry about optimizing the shared memory access. I
>>>> don't think you will be able to optimize this because of the random
>>>> nature of the input data. You could possibly sort the data in a way
>>>> that minimizes the contention, but the sort probably takes more
>>>> time than it saves.
>>>>
>>>> HTH
>>>> Dom
>>>>
>>>> -----Original Message-----
>>>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>>>> Sent: Wednesday, April 11, 2012 9:39 AM
>>>> To: Pazzula, Dominic J [ICG-IT]
>>>> Cc: 'Francisco Villaescusa Navarro'; 'Thomas Wiecki'; 'pycuda@tiker.net'
>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>
>>>> OK thanks,
>>>>
>>>> I understand. Is there any way to tell PyCUDA which parts of an
>>>> array (or matrix) should go into each memory bank?
>>>>
>>>> I guess that if I were able to place each thread's histogram into a
>>>> different memory bank, these problems would disappear and the
>>>> calculation would be much faster?
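A quick aside on the register limit described above: one way to confirm whether a per-thread array like float A[1024] has spilled out of registers into (slow) local memory is to ask ptxas for its resource report. A minimal sketch, assuming nvcc accepts the usual --ptxas-options=-v flag via PyCUDA's SourceModule options and reusing the grid_gpu source string from the listing further down the thread:

from pycuda import compiler

# Recompile the kernel with ptxas verbose output. The build then
# reports per-thread resource usage, e.g. "used NN registers,
# NNNN bytes lmem"; a multi-KB lmem figure means A[] has spilled.
mod_grid = compiler.SourceModule(grid_gpu,
                                 options=['--ptxas-options=-v'])

The report is written to the compiler's error stream during the build, which PyCUDA should surface as a warning.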
>>>>
>>>> Is it possible to declare a "local" variable for each thread, such
>>>> that each thread works with "its own memory" instead of using
>>>> global memory (and transfers the data at the end, when the threads
>>>> finish)?
>>>>
>>>> Thanks a lot!
>>>>
>>>> Fran.
>>>>
>>>> On 11/04/2012, at 16:13, Pazzula, Dominic J wrote:
>>>>
>>>>> Looking at it, memory bank conflicts may only exist in shared
>>>>> memory structures. Think of the memory as an NxM matrix. Memory
>>>>> addresses in column i are controlled by the same memory
>>>>> controller. The column is said to be a bank. If two processes are
>>>>> trying to write to addresses (0,i) and (1,i), it will take 2
>>>>> memory cycles to complete those writes. The controller can only
>>>>> handle one write into the bank at a time.
>>>>>
>>>>> If I comment out the line:
>>>>>
>>>>>     /*temp_grid[id*interv+bin]+=1.0;*/
>>>>>
>>>>> processing time drops from 140.7 ms to 1 ms. The time spent in the
>>>>> binning kernel goes from 139.9 ms to 0.17 ms. That write into
>>>>> global memory is taking all the time in the process.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>>>>> Sent: Wednesday, April 11, 2012 3:33 AM
>>>>> To: Pazzula, Dominic J [ICG-IT]
>>>>> Cc: 'Francisco Villaescusa Navarro'; 'Thomas Wiecki'; 'pycuda@tiker.net'
>>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for checking it.
>>>>>
>>>>> I'm not sure I fully understand your comment. The matrix temp_grid
>>>>> is a matrix of size (threads, intervals). Thread 0 will build a
>>>>> histogram and save it in the first row of that matrix, thread 1
>>>>> will build a histogram and save it in the second row, and so on.
>>>>>
>>>>> Once the matrix is filled, the reduction part just sums all the
>>>>> histograms to produce the final result.
>>>>>
>>>>> So I think there is no memory conflict between different threads
>>>>> (the elements filled by one thread are never touched by other
>>>>> threads).
>>>>>
>>>>> Are you saying that accessing those memory positions every time,
>>>>> within the same thread, is a slow step?
>>>>>
>>>>> Thanks a lot,
>>>>>
>>>>> Fran.
>>>>>
>>>>> On 10/04/2012, at 22:04, Pazzula, Dominic J wrote:
>>>>>
>>>>>> I'm seeing a speed-up of around 7.5x with my 8500GT against a
>>>>>> one-year-old Xeon: 140 ms GPU vs 1047 ms CPU on 10,000,000
>>>>>> points.
>>>>>>
>>>>>> For me, the slower memory throughput kills the idea of using MKL
>>>>>> and numpy for the reduction. MKL can do the reduction step in
>>>>>> less than 0.01 ms and the GPU is doing it in 0.9 ms. But the
>>>>>> transfer time of that 2 MB (512x1024 floats) is about 2 ms,
>>>>>> giving the GPU the edge.
>>>>>>
>>>>>> I am guessing here, but my thought is that the slower-than-
>>>>>> expected performance comes from the writing of the bins:
>>>>>>
>>>>>>     temp_grid[id*interv+bin]+=1.0;
>>>>>>
>>>>>> If I am remembering correctly, two GPU threads cannot write to
>>>>>> the same memory slot simultaneously. If the memory is aligned
>>>>>> such that bin=0 for each core is in the same slot, then if two
>>>>>> cores were attempting to adjust the value in bin=0, it would take
>>>>>> 2 cycles instead of 1. Ten writes mean ten cycles, etc. Even if
>>>>>> the memory is not aligned, you will be attempting to write to the
>>>>>> same block a large number of times.
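A note on the timings quoted in this thread: the scripts below measure with time.clock() around kernel launches, which are asynchronous, so the numbers can depend on where the implicit synchronization happens. Also, commenting out the global store may let the compiler dead-code-eliminate the whole binning loop, which would explain figures like the 0.17 ms above. A hedged sketch of event-based timing instead, reusing grid, values_gpu, number_of_points, temp_grid_gpu and max_number_of_threads from Fran's listing below:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

# CUDA events bracket only the kernel; synchronizing on the stop
# event guarantees the kernel has finished before we read the clock.
start, stop = cuda.Event(), cuda.Event()

start.record()
grid(values_gpu, np.int32(number_of_points), temp_grid_gpu,
     grid=(1, 1), block=(max_number_of_threads, 1, 1))
stop.record()
stop.synchronize()

print 'kernel time: %.3f ms' % start.time_till(stop)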
>>>>>>
>>>>>> Utilizing shared memory to temporarily hold the bin counts and
>>>>>> then putting them into the global output matrix SHOULD cut your
>>>>>> time. If I have a chance, I'll work on it and post an updated
>>>>>> kernel.
>>>>>>
>>>>>> Dominic
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: pycuda-boun...@tiker.net [mailto:pycuda-boun...@tiker.net]
>>>>>> On Behalf Of Francisco Villaescusa Navarro
>>>>>> Sent: Friday, April 06, 2012 11:26 AM
>>>>>> To: Thomas Wiecki
>>>>>> Cc: pycuda@tiker.net
>>>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>>>
>>>>>> Thanks for all the suggestions!
>>>>>>
>>>>>> Regarding removing sqrt: it seems that the code only gains about
>>>>>> ~1%, and you lose the ability to easily define linear
>>>>>> intervals...
>>>>>>
>>>>>> I have tried both sqrt and sqrtf, but there is no difference in
>>>>>> the total time (or it is very small).
>>>>>>
>>>>>> The code to compute the histogram of an array with values between
>>>>>> 0 and 1 should be something like:
>>>>>>
>>>>>> import numpy as np
>>>>>> import time
>>>>>> import pycuda.driver as cuda
>>>>>> import pycuda.autoinit
>>>>>> import pycuda.gpuarray as gpuarray
>>>>>> import pycuda.cumath as cumath
>>>>>> from pycuda.compiler import SourceModule
>>>>>> from pycuda import compiler
>>>>>>
>>>>>> grid_gpu_template = """
>>>>>> __global__ void grid(float *values, int size, float *temp_grid)
>>>>>> {
>>>>>>     unsigned int id = threadIdx.x;
>>>>>>     int i, bin;
>>>>>>     const uint interv = %(interv)s;
>>>>>>
>>>>>>     /* Each thread accumulates its own row of temp_grid. */
>>>>>>     for (i = id; i < size; i += blockDim.x) {
>>>>>>         bin = (int)(values[i]*interv);
>>>>>>         if (bin == interv) {
>>>>>>             bin = interv - 1;
>>>>>>         }
>>>>>>         temp_grid[id*interv+bin] += 1.0;
>>>>>>     }
>>>>>> }
>>>>>> """
>>>>>>
>>>>>> reduction_gpu_template = """
>>>>>> __global__ void reduction(float *temp_grid, float *his)
>>>>>> {
>>>>>>     unsigned int id = blockIdx.x*blockDim.x + threadIdx.x;
>>>>>>     const uint interv = %(interv)s;
>>>>>>     const uint threads = %(max_number_of_threads)s;
>>>>>>
>>>>>>     /* One thread per bin sums that bin over all partials. */
>>>>>>     if (id < interv) {
>>>>>>         for (int i = 0; i < threads; i++) {
>>>>>>             his[id] += temp_grid[id+interv*i];
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>> """
>>>>>>
>>>>>> number_of_points = 100000000
>>>>>> max_number_of_threads = 512
>>>>>> interv = 1024
>>>>>>
>>>>>> blocks = interv/max_number_of_threads
>>>>>> if interv%max_number_of_threads != 0:
>>>>>>     blocks += 1
>>>>>>
>>>>>> values = np.random.random(number_of_points).astype(np.float32)
>>>>>>
>>>>>> grid_gpu = grid_gpu_template % {
>>>>>>     'interv': interv,
>>>>>> }
>>>>>> mod_grid = compiler.SourceModule(grid_gpu)
>>>>>> grid = mod_grid.get_function("grid")
>>>>>>
>>>>>> reduction_gpu = reduction_gpu_template % {
>>>>>>     'interv': interv,
>>>>>>     'max_number_of_threads': max_number_of_threads,
>>>>>> }
>>>>>> mod_redt = compiler.SourceModule(reduction_gpu)
>>>>>> redt = mod_redt.get_function("reduction")
>>>>>>
>>>>>> values_gpu = gpuarray.to_gpu(values)
>>>>>> temp_grid_gpu = gpuarray.zeros((max_number_of_threads, interv),
>>>>>>                                dtype=np.float32)
>>>>>> hist = np.zeros(interv, dtype=np.float32)
>>>>>> hist_gpu = gpuarray.to_gpu(hist)
>>>>>>
>>>>>> start = time.clock()*1e3
>>>>>> grid(values_gpu, np.int32(number_of_points), temp_grid_gpu,
>>>>>>      grid=(1,1), block=(max_number_of_threads,1,1))
>>>>>> redt(temp_grid_gpu, hist_gpu,
>>>>>>      grid=(blocks,1), block=(max_number_of_threads,1,1))
>>>>>> hist = hist_gpu.get()
>>>>>> print 'Time used to grid with GPU:', time.clock()*1e3-start, ' ms'
>>>>>>
>>>>>> start = time.clock()*1e3
>>>>>> bins_histo = np.linspace(0.0, 1.0, interv+1)
>>>>>> hist_CPU = np.histogram(values, bins=bins_histo)[0]
>>>>>> print 'Time used to grid with CPU:', time.clock()*1e3-start, ' ms'
>>>>>>
>>>>>> print 'max difference between methods=', np.max(hist_CPU-hist)
>>>>>>
>>>>>> ################
>>>>>>
>>>>>> Results:
>>>>>>
>>>>>> Time used to grid with GPU: 680.0 ms
>>>>>> Time used to grid with CPU: 9320.0 ms
>>>>>> max difference between methods= 0.0
>>>>>>
>>>>>> So it seems that with this algorithm we can't achieve speed-up
>>>>>> factors larger than ~15.
>>>>>>
>>>>>> Fran.

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
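A closing aside on that ~15x figure: np.histogram with explicit bin edges is a fairly slow CPU baseline. For the uniform bins used in this thread, a bincount-based version (a hedged sketch, untimed here, assuming numpy >= 1.6 for the minlength argument) is usually several times faster and would shrink the quoted speed-up accordingly:

import numpy as np

def hist_uniform(values, interv):
    # CPU histogram for values in [0, 1] with interv uniform bins,
    # matching the kernel's rule that values == 1.0 land in the
    # last bin.
    bins = (values*interv).astype(np.int32)
    np.clip(bins, 0, interv-1, out=bins)
    return np.bincount(bins, minlength=interv)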