Thanks a lot, I see... It is amazing how a "trivial" thing for a serial code becomes a "nightmare" on a GPU.
Thanks again,

Fran.

On Apr 11, 2012, at 20:23, "Pazzula, Dominic J" <dominic.j.pazz...@citi.com> wrote:

> So far, I cannot improve upon what you originally had.
>
> I made this change to see what happened:
>
> grid_gpu_template = """
> __global__ void grid(float *values, int size, float *temp_grid)
> {
>     unsigned int id = threadIdx.x;
>     unsigned int bid = blockIdx.x;
>     const uint threads = %(max_number_of_threads)s;
>
>     int i, bin;
>     const uint interv = %(interv)s;
>     __shared__ float grd[interv];
>     unsigned int row = bid*interv;
>     unsigned int idx;
>
>     /* Zero this block's shared histogram. */
>     for (i = 0; i < interv; i++) {
>         grd[i] = 0;
>     }
>
>     /* With block=(1,1,1), one thread per block walks its slice. */
>     for (i = 0; i < threads; i++) {
>         idx = bid*threads + i;
>         if (idx < size) {
>             bin = (int)(values[idx]*interv);
>             if (bin == interv) {
>                 bin = interv - 1;
>             }
>             grd[bin] += 1;
>         }
>     }
>
>     /* Copy the block's histogram out to global memory. */
>     for (i = 0; i < interv; i++) {
>         temp_grid[row+i] = grd[i];
>     }
> }
> """
>
> (called like grid(values_gpu, np.int32(number_of_points), temp_grid_gpu,
> grid=(blocks,1), block=(1,1,1)))
>
> This was about 3.5 times slower.
>
> I commented out the write to grd[bin] (/*grd[bin] += 1;*/) and it ran in
> less than 10 ms. What this tells me is that the contention of writing to
> the shared memory far outweighs the slow write to global memory.
>
> Maybe someone else can find a faster way...
>
> -----Original Message-----
> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
> Sent: Wednesday, April 11, 2012 11:54 AM
> To: Pazzula, Dominic J [ICG-IT]
> Subject: Re: [PyCUDA] Histograms with PyCUDA
>
> That would be great!
>
> Thanks a lot for your help, I really appreciate it.
>
> Best,
>
> Fran.
>
> On 11/04/2012, at 18:51, Pazzula, Dominic J wrote:
>
>> n/p. I am going to have some time later today. I'm going to look
>> harder at this and make sure we are not going down a rabbit hole.
>>
>> -----Original Message-----
>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>> Sent: Wednesday, April 11, 2012 11:48 AM
>> To: Pazzula, Dominic J [ICG-IT]
>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>
>> Thanks a lot!
>>
>> I will take a look at it.
>>
>> Fran.
>>
>> On 11/04/2012, at 18:44, Pazzula, Dominic J wrote:
>>
>>> atomicAdd() is the answer to the race condition question.
>>>
>>> http://supercomputingblog.com/cuda/cuda-tutorial-4-atomic-operations/
>>>
>>> -----Original Message-----
>>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>>> Sent: Wednesday, April 11, 2012 11:01 AM
>>> To: Pazzula, Dominic J [ICG-IT]
>>> Cc: pycuda@tiker.net
>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>
>>> Thanks a lot for the detailed answer!
>>>
>>> So you are suggesting creating a histogram per block, right? My
>>> (probably stupid) question is how you manage to account for all the
>>> elements properly. Imagine that thread 0 is analyzing an array
>>> element whose bin is zero; it will then do something like A[0]++.
>>> But if thread 25 is analyzing another element whose bin is also
>>> zero, how can this bin be summed correctly, given that thread 0 has
>>> already found an element for bin 0?
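For that race-condition question, here is a minimal sketch (not benchmarked here) that combines the per-block shared histogram suggested in the thread with the atomicAdd() answer: atomics serialize only the threads that collide on the same bin, instead of every write. It assumes shared-memory atomics, i.e. compute capability 1.2 or newer (the 8500GT mentioned later in the thread is 1.1 and would need the slower global-memory atomicAdd instead); the names atomic_hist and block_hists are illustrative, not from the thread.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda import compiler

atomic_hist_template = """
__global__ void atomic_hist(float *values, int size,
                            unsigned int *block_hists)
{
    const uint interv = %(interv)s;
    __shared__ unsigned int hist[interv];

    /* Zero the block's shared histogram cooperatively. */
    for (int i = threadIdx.x; i < interv; i += blockDim.x)
        hist[i] = 0;
    __syncthreads();

    /* Grid-stride loop over the input; atomicAdd resolves the race
       when two threads of the block hit the same bin. */
    for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < size;
         i += blockDim.x*gridDim.x) {
        int bin = (int)(values[i]*interv);
        if (bin == interv)
            bin = interv - 1;
        atomicAdd(&hist[bin], 1u);
    }
    __syncthreads();

    /* One partial histogram per block goes out to global memory. */
    for (int i = threadIdx.x; i < interv; i += blockDim.x)
        block_hists[blockIdx.x*interv + i] = hist[i];
}
"""

number_of_points = 10000000
interv = 1024
threads = 256
blocks = 64

mod = compiler.SourceModule(atomic_hist_template % {'interv': interv})
atomic_hist = mod.get_function("atomic_hist")

values_gpu = gpuarray.to_gpu(
    np.random.random(number_of_points).astype(np.float32))
block_hists_gpu = gpuarray.zeros((blocks, interv), dtype=np.uint32)

atomic_hist(values_gpu, np.int32(number_of_points), block_hists_gpu,
            grid=(blocks, 1), block=(threads, 1, 1))

# Summing blocks*interv partial counts is cheap; do it on the host.
hist = block_hists_gpu.get().sum(axis=0)

With 1024 bins of unsigned int, each block uses 4 KB of shared memory, well within the 16 KB per multiprocessor of that hardware generation.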
>>>
>>> I have tried the following modification to the kernel:
>>>
>>> grid_gpu_template = """
>>> __global__ void grid(float *values, int size, float *temp_grid)
>>> {
>>>     unsigned int id = threadIdx.x;
>>>     int i, bin;
>>>     const uint interv = %(interv)s;
>>>
>>>     /* Per-thread histogram in a "local" array. */
>>>     float A[interv];
>>>     for (i = 0; i < interv; i++) {
>>>         A[i] = 0.0;
>>>     }
>>>
>>>     for (i = id; i < size; i += blockDim.x) {
>>>         bin = (int)(values[i]*interv);
>>>         if (bin == interv) {
>>>             bin = interv - 1;
>>>         }
>>>         A[bin] += 1.0;
>>>     }
>>>
>>>     /* Copy the per-thread histogram out to global memory. */
>>>     for (i = 0; i < interv; i++) {
>>>         temp_grid[id*interv+i] = A[i];
>>>     }
>>> }
>>> """
>>>
>>> As you can see, the "local variable" A is used to build the
>>> histogram, and when the kernel finishes, its values are transferred
>>> to global memory. The time to build the histogram in the variable A
>>> is very short (under 1 ms), whereas the time to transfer it to
>>> global memory (the last loop) becomes very large. At the end of the
>>> day, with this kernel, the total time is even worse than with the
>>> previous one.
>>>
>>> Thanks again,
>>>
>>> Fran.
>>>
>>> On 11/04/2012, at 16:59, Pazzula, Dominic J wrote:
>>>
>>>> Yes and no. Any variable you declare inside the kernel is "local."
>>>> However, registers are limited to only a few KB. These registers
>>>> are the memory locations located on the chip, next to the
>>>> processors. If the compiler detects that you are over your limit,
>>>> it will move that memory into the slower global bank (i.e. those
>>>> big RAM chips you see stuck on the card).
>>>>
>>>> What I suggested about using shared memory is basically what you
>>>> are suggesting. Shared memory is pseudo-local to the processor. It
>>>> is shared across the threads in a block. Physically, it is located
>>>> on the chip. It is very fast, much faster than accessing global
>>>> memory.
>>>>
>>>> Writing into shared memory is where you get the memory contention
>>>> I was talking about. Sorry about the confusion.
>>>>
>>>> So what you should do is have each block tally up its bin counts
>>>> into a shared array. Your output matrix should be (Blocks x Bins)
>>>> instead of (Threads x Bins). At the end of the kernel, call
>>>> __syncthreads() and then write the values from the block's shared
>>>> memory into the output matrix. You only need to do the write once,
>>>> so only one thread needs to do it.
>>>>
>>>> This should give you a factor of 2 or more improvement.
>>>>
>>>> For now, don't worry about optimizing the shared memory access. I
>>>> don't think you will be able to optimize this because of the random
>>>> nature of the input data. You could possibly sort the data in a way
>>>> that minimizes the contention, but the sort probably takes more
>>>> time than it saves.
>>>>
>>>> HTH
>>>> Dom
>>>>
>>>> -----Original Message-----
>>>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>>>> Sent: Wednesday, April 11, 2012 9:39 AM
>>>> To: Pazzula, Dominic J [ICG-IT]
>>>> Cc: 'Francisco Villaescusa Navarro'; 'Thomas Wiecki'; 'pycuda@tiker.net'
>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>
>>>> OK thanks,
>>>>
>>>> I understand. Is there any way to tell PyCUDA which parts of an
>>>> array (or matrix) should go into each memory bank?
>>>>
>>>> I guess that if I were able to place each thread's histogram into a
>>>> different memory bank, these problems would disappear and the
>>>> calculation would be much faster?
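A quick aside on the register limit described above: one way to confirm whether a per-thread array like float A[1024] has spilled out of registers into (slow) local memory is to ask ptxas for its resource report. A minimal sketch, assuming nvcc accepts the usual --ptxas-options=-v flag via PyCUDA's SourceModule options and reusing the grid_gpu source string from the listing further down the thread:

from pycuda import compiler

# Recompile the kernel with ptxas verbose output. The build then
# reports per-thread resource usage, e.g. "used NN registers,
# NNNN bytes lmem"; a multi-KB lmem figure means A[] has spilled.
mod_grid = compiler.SourceModule(grid_gpu,
                                 options=['--ptxas-options=-v'])

The report is written to the compiler's error stream during the build, which PyCUDA should surface as a warning.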
>>>>
>>>> Is it possible to declare a "local" variable for each thread, such
>>>> that each thread works with "its own memory" instead of using
>>>> global memory (and transfers the data at the end, when the threads
>>>> finish)?
>>>>
>>>> Thanks a lot!
>>>>
>>>> Fran.
>>>>
>>>> On 11/04/2012, at 16:13, Pazzula, Dominic J wrote:
>>>>
>>>>> Looking at it, memory bank conflicts may only exist in shared
>>>>> memory structures. Think of the memory as an NxM matrix. Memory
>>>>> addresses in column i are controlled by the same memory
>>>>> controller. The column is said to be a bank. If two processes are
>>>>> trying to write to addresses (0,i) and (1,i), it will take 2
>>>>> memory cycles to complete those writes. The controller can only
>>>>> handle one write into the bank at a time.
>>>>>
>>>>> If I comment out the line:
>>>>>
>>>>>     /*temp_grid[id*interv+bin]+=1.0;*/
>>>>>
>>>>> processing time drops from 140.7 ms to 1 ms. The time spent in the
>>>>> binning kernel goes from 139.9 ms to 0.17 ms. That write into
>>>>> global memory is taking all the time in the process.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Francisco Villaescusa Navarro [mailto:villaescusa.franci...@gmail.com]
>>>>> Sent: Wednesday, April 11, 2012 3:33 AM
>>>>> To: Pazzula, Dominic J [ICG-IT]
>>>>> Cc: 'Francisco Villaescusa Navarro'; 'Thomas Wiecki'; 'pycuda@tiker.net'
>>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for checking it.
>>>>>
>>>>> I'm not sure I fully understand your comment. The matrix temp_grid
>>>>> is a matrix of size (threads, intervals). Thread 0 will build a
>>>>> histogram and save it in the first row of that matrix, thread 1
>>>>> will build a histogram and save it in the second row, and so on.
>>>>>
>>>>> Once the matrix is filled, the reduction part just sums all the
>>>>> histograms to produce the final result.
>>>>>
>>>>> So I think there is no memory conflict between different threads
>>>>> (the elements filled by one thread are never touched by other
>>>>> threads).
>>>>>
>>>>> Are you saying that accessing those memory positions every time,
>>>>> within the same thread, is a slow step?
>>>>>
>>>>> Thanks a lot,
>>>>>
>>>>> Fran.
>>>>>
>>>>> On 10/04/2012, at 22:04, Pazzula, Dominic J wrote:
>>>>>
>>>>>> I'm seeing a speed-up of around 7.5x with my 8500GT against a
>>>>>> one-year-old Xeon: 140 ms GPU vs 1047 ms CPU on 10,000,000
>>>>>> points.
>>>>>>
>>>>>> For me, the slower memory throughput kills the idea of using MKL
>>>>>> and numpy for the reduction. MKL can do the reduction step in
>>>>>> less than 0.01 ms and the GPU is doing it in 0.9 ms. But the
>>>>>> transfer time of that 2 MB (512x1024 floats) is about 2 ms,
>>>>>> giving the GPU the edge.
>>>>>>
>>>>>> I am guessing here, but my thought is that the slower-than-
>>>>>> expected performance comes from the writing of the bins:
>>>>>>
>>>>>>     temp_grid[id*interv+bin]+=1.0;
>>>>>>
>>>>>> If I am remembering correctly, two GPU threads cannot write to
>>>>>> the same memory slot simultaneously. If the memory is aligned
>>>>>> such that bin=0 for each core is in the same slot, then if two
>>>>>> cores were attempting to adjust the value in bin=0, it would take
>>>>>> 2 cycles instead of 1. Ten writes mean ten cycles, etc. Even if
>>>>>> the memory is not aligned, you will be attempting to write to the
>>>>>> same block a large number of times.
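A note on the timings quoted in this thread: the scripts below measure with time.clock() around kernel launches, which are asynchronous, so the numbers can depend on where the implicit synchronization happens. Also, commenting out the global store may let the compiler dead-code-eliminate the whole binning loop, which would explain figures like the 0.17 ms above. A hedged sketch of event-based timing instead, reusing grid, values_gpu, number_of_points, temp_grid_gpu and max_number_of_threads from Fran's listing below:

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit

# CUDA events bracket only the kernel; synchronizing on the stop
# event guarantees the kernel has finished before we read the clock.
start, stop = cuda.Event(), cuda.Event()

start.record()
grid(values_gpu, np.int32(number_of_points), temp_grid_gpu,
     grid=(1, 1), block=(max_number_of_threads, 1, 1))
stop.record()
stop.synchronize()

print 'kernel time: %.3f ms' % start.time_till(stop)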
>>>>>>
>>>>>> Utilizing shared memory to temporarily hold the bin counts and
>>>>>> then putting them into the global output matrix SHOULD cut your
>>>>>> time. If I have a chance, I'll work on it and post an updated
>>>>>> kernel.
>>>>>>
>>>>>> Dominic
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: pycuda-boun...@tiker.net [mailto:pycuda-boun...@tiker.net]
>>>>>> On Behalf Of Francisco Villaescusa Navarro
>>>>>> Sent: Friday, April 06, 2012 11:26 AM
>>>>>> To: Thomas Wiecki
>>>>>> Cc: pycuda@tiker.net
>>>>>> Subject: Re: [PyCUDA] Histograms with PyCUDA
>>>>>>
>>>>>> Thanks for all the suggestions!
>>>>>>
>>>>>> Regarding removing sqrt: it seems that the code only gains about
>>>>>> ~1%, and you lose the ability to easily define linear
>>>>>> intervals...
>>>>>>
>>>>>> I have tried both sqrt and sqrtf, but there is no difference in
>>>>>> the total time (or it is very small).
>>>>>>
>>>>>> The code to compute the histogram of an array with values between
>>>>>> 0 and 1 should be something like:
>>>>>>
>>>>>> import numpy as np
>>>>>> import time
>>>>>> import pycuda.driver as cuda
>>>>>> import pycuda.autoinit
>>>>>> import pycuda.gpuarray as gpuarray
>>>>>> import pycuda.cumath as cumath
>>>>>> from pycuda.compiler import SourceModule
>>>>>> from pycuda import compiler
>>>>>>
>>>>>> grid_gpu_template = """
>>>>>> __global__ void grid(float *values, int size, float *temp_grid)
>>>>>> {
>>>>>>     unsigned int id = threadIdx.x;
>>>>>>     int i, bin;
>>>>>>     const uint interv = %(interv)s;
>>>>>>
>>>>>>     /* Each thread accumulates its own row of temp_grid. */
>>>>>>     for (i = id; i < size; i += blockDim.x) {
>>>>>>         bin = (int)(values[i]*interv);
>>>>>>         if (bin == interv) {
>>>>>>             bin = interv - 1;
>>>>>>         }
>>>>>>         temp_grid[id*interv+bin] += 1.0;
>>>>>>     }
>>>>>> }
>>>>>> """
>>>>>>
>>>>>> reduction_gpu_template = """
>>>>>> __global__ void reduction(float *temp_grid, float *his)
>>>>>> {
>>>>>>     unsigned int id = blockIdx.x*blockDim.x + threadIdx.x;
>>>>>>     const uint interv = %(interv)s;
>>>>>>     const uint threads = %(max_number_of_threads)s;
>>>>>>
>>>>>>     /* One thread per bin sums that bin over all partials. */
>>>>>>     if (id < interv) {
>>>>>>         for (int i = 0; i < threads; i++) {
>>>>>>             his[id] += temp_grid[id+interv*i];
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>> """
>>>>>>
>>>>>> number_of_points = 100000000
>>>>>> max_number_of_threads = 512
>>>>>> interv = 1024
>>>>>>
>>>>>> blocks = interv/max_number_of_threads
>>>>>> if interv%max_number_of_threads != 0:
>>>>>>     blocks += 1
>>>>>>
>>>>>> values = np.random.random(number_of_points).astype(np.float32)
>>>>>>
>>>>>> grid_gpu = grid_gpu_template % {
>>>>>>     'interv': interv,
>>>>>> }
>>>>>> mod_grid = compiler.SourceModule(grid_gpu)
>>>>>> grid = mod_grid.get_function("grid")
>>>>>>
>>>>>> reduction_gpu = reduction_gpu_template % {
>>>>>>     'interv': interv,
>>>>>>     'max_number_of_threads': max_number_of_threads,
>>>>>> }
>>>>>> mod_redt = compiler.SourceModule(reduction_gpu)
>>>>>> redt = mod_redt.get_function("reduction")
>>>>>>
>>>>>> values_gpu = gpuarray.to_gpu(values)
>>>>>> temp_grid_gpu = gpuarray.zeros((max_number_of_threads, interv),
>>>>>>                                dtype=np.float32)
>>>>>> hist = np.zeros(interv, dtype=np.float32)
>>>>>> hist_gpu = gpuarray.to_gpu(hist)
>>>>>>
>>>>>> start = time.clock()*1e3
>>>>>> grid(values_gpu, np.int32(number_of_points), temp_grid_gpu,
>>>>>>      grid=(1,1), block=(max_number_of_threads,1,1))
>>>>>> redt(temp_grid_gpu, hist_gpu,
>>>>>>      grid=(blocks,1), block=(max_number_of_threads,1,1))
>>>>>> hist = hist_gpu.get()
>>>>>> print 'Time used to grid with GPU:', time.clock()*1e3-start, ' ms'
>>>>>>
>>>>>> start = time.clock()*1e3
>>>>>> bins_histo = np.linspace(0.0, 1.0, interv+1)
>>>>>> hist_CPU = np.histogram(values, bins=bins_histo)[0]
>>>>>> print 'Time used to grid with CPU:', time.clock()*1e3-start, ' ms'
>>>>>>
>>>>>> print 'max difference between methods=', np.max(hist_CPU-hist)
>>>>>>
>>>>>> ################
>>>>>>
>>>>>> Results:
>>>>>>
>>>>>> Time used to grid with GPU: 680.0 ms
>>>>>> Time used to grid with CPU: 9320.0 ms
>>>>>> max difference between methods= 0.0
>>>>>>
>>>>>> So it seems that with this algorithm we can't achieve speed-up
>>>>>> factors larger than ~15.
>>>>>>
>>>>>> Fran.

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
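A closing aside on that ~15x figure: np.histogram with explicit bin edges is a fairly slow CPU baseline. For the uniform bins used in this thread, a bincount-based version (a hedged sketch, untimed here, assuming numpy >= 1.6 for the minlength argument) is usually several times faster and would shrink the quoted speed-up accordingly:

import numpy as np

def hist_uniform(values, interv):
    # CPU histogram for values in [0, 1] with interv uniform bins,
    # matching the kernel's rule that values == 1.0 land in the
    # last bin.
    bins = (values*interv).astype(np.int32)
    np.clip(bins, 0, interv-1, out=bins)
    return np.bincount(bins, minlength=interv)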