In the case of what you have, A[] is local to the thread.  To make it a
shared array, you would declare it as
        __shared__ float A[interv];

You are still binning by thread and not by block.

Good question on the simultaneous addition.  I will have to think about it and 
get back to you.
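
My first thought, though (which I would want to verify before
recommending it), is CUDA's atomicAdd: if every thread in a block calls
atomicAdd on the shared bin, colliding updates to the same bin get
serialized by the hardware but the count stays correct.  A rough,
untested sketch of a per-block version (INTERV stands in for your
%(interv)s substitution; integer atomicAdd in shared memory needs
compute capability 1.2 or later, and the write-back here is spread
across the threads rather than done by a single one):

#define INTERV 1024

__global__ void grid_shared(float *values, int size, float *temp_grid)
{
    __shared__ unsigned int bins[INTERV];    /* one histogram per block */

    /* cooperatively zero the shared histogram */
    for (int i = threadIdx.x; i < INTERV; i += blockDim.x)
        bins[i] = 0;
    __syncthreads();

    /* grid-stride loop over the input; atomicAdd makes simultaneous
       additions to the same bin safe (they are serialized in hardware) */
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < size;
         i += blockDim.x*gridDim.x) {
        int bin = (int)(values[i]*INTERV);
        if (bin == INTERV)
            bin = INTERV - 1;
        atomicAdd(&bins[bin], 1u);
    }
    __syncthreads();

    /* write this block's histogram into its own row of the
       (blocks x bins) output matrix */
    for (int i = threadIdx.x; i < INTERV; i += blockDim.x)
        temp_grid[blockIdx.x*INTERV + i] = (float)bins[i];
}

On the host side, temp_grid would then be allocated as (blocks, interv)
instead of (threads, interv), the kernel launched with more than one
block, and your reduction kernel would loop over blocks instead of
threads.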

-----Original Message-----
From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
Sent: Wednesday, April 11, 2012 11:01 AM
To: Pazzula, Dominic J [ICG-IT]
Cc: pycuda@tiker.net
Subject: Re: [PyCUDA] Histograms with PyCUDA

Thanks a lot for the detailed answer!

So you are suggesting creating one histogram per block, right? My
(probably stupid) question is: how do you properly account for all the
elements? Imagine that thread 0 is processing an array element whose
bin is zero, so it does something like A[0]++. If thread 25 is
processing another element whose bin is also zero, how can this bin be
summed properly, given that thread 0 has also found an element for
bin 0?

I have tried the following modification to the kernel:

grid_gpu_template = """
__global__ void grid(float *values, int size, float *temp_grid)
{
     unsigned int id = threadIdx.x;
     int i,bin;
     const uint interv = %(interv)s;

     float A[interv];   /* per-thread private histogram */
     for(i=0;i<interv;i++){
        A[i]=0.0;
     }

     /* each thread bins a strided subset of the input values */
     for(i=id;i<size;i+=blockDim.x){
         bin=(int)(values[i]*interv);
         if (bin==interv){
            bin=interv-1;
         }
         A[bin]+=1.0;
     }
     /* copy the per-thread histogram to row id of the output matrix */
     for(i=0;i<interv;i++){
        temp_grid[id*interv+i]=A[i];
     }

}
"""

As you can see, the "local" variable A is used to build the histogram,
and when the binning finishes, its values are transferred to global
memory. The time to build the histogram in A is very short (less than
1 ms), whereas the time to transfer it to global memory (the last loop)
becomes very large. At the end of the day, the total time with this
kernel is even worse than with the previous one.

Thanks again,

Fran.

On 11/04/2012, at 16:59, Pazzula, Dominic J wrote:

> Yes and no.  Any variable you declare inside the kernel is "local."
> However, register space is limited to only a few KB.  The registers
> are the memory locations on the chip, next to the processors.  If the
> compiler detects that you are over the limit, it will move that
> memory into the slower global bank (i.e. those big RAM chips you see
> stuck on the card).
>
> What I suggested about using shared memory is basically what you are
> suggesting.  Shared memory is pseudo-local to the processor.  It is
> shared across the threads in a block.  Physically, it is located on
> the chip, and it is very fast, much faster than accessing global
> memory.
>
> Writing into shared memory is where you get the memory contentions I
> was talking about.  Sorry about the confusion.
>
> So what you should do is have each block tally up its bin counts
> into a shared array.  Your output matrix should be (Blocks x Bins)
> instead of (Threads x Bins).  At the end of the kernel, call
> __syncthreads() and then write the values from the block's shared
> memory into the output matrix.  You only need to do that write once,
> so only one thread has to do it.
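>
> In code, that last step might look roughly like this (a sketch only,
> assuming the block's shared bin array is called bins and temp_grid is
> laid out as blocks x bins):
>
>      __syncthreads();
>      if (threadIdx.x == 0) {
>          /* one thread copies the block's bins to its output row */
>          for (int i = 0; i < interv; i++) {
>              temp_grid[blockIdx.x*interv + i] = bins[i];
>          }
>      }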
>
> This should give you a factor of 2 or more improvement.
>
> For now, don't worry about optimizing the shared memory access.  I
> don't think you will be able to optimize this because of the random
> nature of the input data.  You could possibly sort the data in a way
> that could minimize the contention, but the sort probably takes more
> time than it saves.
>
> HTH
> Dom
>
> -----Original Message-----
> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com
> ]
> Sent: Wednesday, April 11, 2012 9:39 AM
> To: Pazzula, Dominic J [ICG-IT]
> Cc: 'Francisco Villaescusa Navarro'; 'Thomas Wiecki'; 'pycuda@tiker.net
> '
> Subject: Re: [PyCUDA] Histograms with PyCUDA
>
> OK thanks,
>
> I understand. Is there any way to tell PyCUDA which parts of an array
> (or matrix) should go into each memory bank?
>
> I guess that if I were able to place each thread's histogram in a
> different memory bank, these problems would not arise and the
> calculation would be much faster?
>
> Is it possible to declare a "local" variable for each thread, so
> that each thread works with "its own memory" instead of using global
> memory (and transfers the data at the end, when the threads finish)?
>
> Thanks a lot!
>
> Fran.
>
>
> On 11/04/2012, at 16:13, Pazzula, Dominic J wrote:
>
>> Looking at it, memory bank conflicts may only exist in shared memory
>> structures.  Think of the memory as an NxM matrix: memory addresses
>> in column i are controlled by the same memory controller, and that
>> column is said to be a bank.  If two threads are trying to write to
>> addresses (0,i) and (1,i), it will take 2 memory cycles to put those
>> in, because the controller can only handle 1 write into the bank at
>> a time.
>>
>> If I comment out the line:
>> /*temp_grid[id*interv+bin]+=1.0;*/
>>
>> Processing time drops from 140.7 ms to 1 ms.  The time spent in the
>> binning kernel goes from 139.9 ms to 0.17 ms.  That write into global
>> memory is taking all the time in the process.
>>
>> -----Original Message-----
>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com
>> ]
>> Sent: Wednesday, April 11, 2012 3:33 AM
>> To: Pazzula, Dominic J [ICG-IT]
>> Cc: 'Francisco Villaescusa Navarro'; 'Thomas Wiecki'; 'pycuda@tiker.net
>> '
>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>
>> Hi,
>>
>> Thanks for checking it.
>>
>> I'm not sure I fully understand your comment.  The matrix temp_grid
>> is a matrix of size (threads, intervals).  Thread 0 will create a
>> histogram and save it in the first row of that matrix, thread 1 will
>> create a histogram and save it in the second row of that matrix....
>>
>> Once you have the matrix filled, the reduction part will just sum all
>> the histograms to produce the final result.
>>
>> So I think there is no memory conflict between different threads (the
>> elements filled by one thread are never touched by other threads).
>>
>> Are you saying that accessing those memory positions every time,
>> within the same thread, is a slow step?
>>
>> Thanks a lot,
>>
>> Fran.
>>
>> On 10/04/2012, at 22:04, Pazzula, Dominic J wrote:
>>
>>> I'm seeing performance of around a 7.5x speed up on my 8500GT and 1
>>> year old Xeon.  140ms GPU vs 1047ms CPU on 10,000,000 points.
>>>
>>> For me, the slower memory throughput kills the idea of using MKL and
>>> numpy for the reduction.  MKL can do the reduction step in less
>>> than .01ms and the GPU is doing it in .9ms.  But the transfer time
>>> of that 2MB (512x1024 floats) is about 2ms, giving the GPU the edge.
>>>
>>> I am guessing here, but my thought is that the slower-than-expected
>>> performance comes from the writing of the bins.
>>>
>>>      temp_grid[id*interv+bin]+=1.0;
>>>
>>> If I am remembering correctly, the GPU cannot write to the same
>>> memory slot simultaneously.  If the memory is aligned such that
>>> bin=0 for each core is in the same slot, then if two cores were
>>> attempting to adjust the value in bin=0, it would take 2 cycles
>>> instead of 1.  10 writes mean 10 cycles.  Etc.  Even if the memory
>>> is not aligned, you will be attempting to write to the same block a
>>> large number of times.
>>>
>>> Utilizing shared memory to temporarily hold the bin counts and then
>>> putting them into the global output matrix SHOULD improve your
>>> time.  If I have a chance, I'll work on it and post an updated
>>> kernel.
>>>
>>> Dominic
>>>
>>>
>>> -----Original Message-----
>>> From: pycuda-boun...@tiker.net [mailto:pycuda-boun...@tiker.net] On
>>> Behalf Of Francisco Villaescusa Navarro
>>> Sent: Friday, April 06, 2012 11:26 AM
>>> To: Thomas Wiecki
>>> Cc: pycuda@tiker.net
>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>
>>> Thanks for all the suggestions!
>>>
>>> Regarding removing sqrt: it seems that the code only gains about
>>> ~1%,
>>> and you lose the capacity to easily define linear intervals...
>>>
>>> I have tried with sqrt and sqrtf, but there is no difference in the
>>> total time (or it is very small).
>>>
>>> The code to find the histogram of an array with values between 0
>>> and 1 should be something like this:
>>>
>>> import numpy as np
>>> import time
>>> import pycuda.driver as cuda
>>> import pycuda.autoinit
>>> import pycuda.gpuarray as gpuarray
>>> import pycuda.cumath as cumath
>>> from pycuda.compiler import SourceModule
>>> from pycuda import compiler
>>>
>>> grid_gpu_template = """
>>> __global__ void grid(float *values, int size, float *temp_grid)
>>> {
>>>   unsigned int id = threadIdx.x;
>>>   int i,bin;
>>>   const uint interv = %(interv)s;
>>>
>>>   for(i=id;i<size;i+=blockDim.x){
>>>       bin=(int)(values[i]*interv);
>>>       if (bin==interv){
>>>          bin=interv-1;
>>>       }
>>>       temp_grid[id*interv+bin]+=1.0;
>>>   }
>>> }
>>> """
>>>
>>> reduction_gpu_template = """
>>> __global__ void reduction(float *temp_grid, float *his)
>>> {
>>>   unsigned int id = blockIdx.x*blockDim.x+threadIdx.x;
>>>   const uint interv = %(interv)s;
>>>   const uint threads = %(max_number_of_threads)s;
>>>
>>>   if(id<interv){
>>>       for(int i=0;i<threads;i++){
>>>           his[id]+=temp_grid[id+interv*i];
>>>       }
>>>   }
>>> }
>>> """
>>>
>>> number_of_points=100000000
>>> max_number_of_threads=512
>>> interv=1024
>>>
>>> blocks=interv/max_number_of_threads
>>> if interv%max_number_of_threads!=0:
>>>   blocks+=1
>>>
>>> values=np.random.random(number_of_points).astype(np.float32)
>>>
>>> grid_gpu = grid_gpu_template % {
>>>   'interv': interv,
>>> }
>>> mod_grid = compiler.SourceModule(grid_gpu)
>>> grid = mod_grid.get_function("grid")
>>>
>>> reduction_gpu = reduction_gpu_template % {
>>>   'interv': interv,
>>>   'max_number_of_threads': max_number_of_threads,
>>> }
>>> mod_redt = compiler.SourceModule(reduction_gpu)
>>> redt = mod_redt.get_function("reduction")
>>>
>>> values_gpu=gpuarray.to_gpu(values)
>>> temp_grid_gpu=gpuarray.zeros((max_number_of_threads,interv),dtype=np.float32)
>>> hist=np.zeros(interv,dtype=np.float32)
>>> hist_gpu=gpuarray.to_gpu(hist)
>>>
>>> start=time.clock()*1e3
>>> grid(values_gpu,np.int32(number_of_points),temp_grid_gpu,
>>>      grid=(1,1),block=(max_number_of_threads,1,1))
>>> redt(temp_grid_gpu,hist_gpu,
>>>      grid=(blocks,1),block=(max_number_of_threads,1,1))
>>> hist=hist_gpu.get()
>>> print 'Time used to grid with GPU:',time.clock()*1e3-start,' ms'
>>>
>>>
>>> start=time.clock()*1e3
>>> bins_histo=np.linspace(0.0,1.0,interv+1)
>>> hist_CPU=np.histogram(values,bins=bins_histo)[0]
>>> print 'Time used to grid with CPU:',time.clock()*1e3-start,' ms'
>>>
>>> print 'max difference between methods=',np.max(hist_CPU-hist)
>>>
>>>
>>> ################
>>>
>>> Results:
>>>
>>> Time used to grid with GPU: 680.0  ms
>>> Time used to grid with CPU: 9320.0  ms
>>> max difference between methods= 0.0
>>>
>>> So it seems that with this algorithm we can't achieve factors larger
>>> than ~15
>>>
>>> Fran.
>>>
>>>
>>>
>>
>


_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
