On Wed, 2009-07-01 at 05:27 -0400, Yu-Hua Yang wrote: > Thank you very much for your reply. I have some more questions. I > understand you wont be looking/updating this branch anymore due to the > release of OpenCL, but thanks for helping right now! > > Before you dive in, one thing I am having concerns is this output that > happens every time I run CUDA-enable code: > > gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 = > 3 initialization error > gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000 > size=32768 *2 > If I remember well, this error is not a problem if you get it once per instantiated CUDA block.
It is a problem if you get it multiple times for the same block. If I remember well, this has to do with CUDA not being happy with multithreaded applications and access to the same device buffer from different host-threads. Gnuradio using cuda should always use the single-threaded scheduler. (This is not the default) I don't remember if my code does this automatically or if you have to force it using an environment variable. ( you can select the scheduler used at runtime by setting the GR_SCHEDULER environment variable. E.g., multithreaded Thread-Per-Block scheduler $ GR_SCHEDULER=TPB ./my-gnuradio-application.py ... or Single-Threaded-Scheduler $ GR_SCHEDULER=STS ./my-gnuradio-application.py ... ) If you have to use the environment variable, then this should go into the gnuradio-CUDA docs (if it is not already there). Greeting, Martin > > Are these 2 outputs normal? Are they suppose to happen? If not, how do > I fix this? > > 2009/6/30 Martin DvH <gnuradiom...@olifantasia.com> > On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Yang wrote: > > Upon a closer look into cuda_muiltiply_const_ff_kernel.cu, > there > > exists 5 different kernel functions to do the > multiplication, where > > the default one, > > > > __global__ void > > cuda_multiply_const_ff_kernel(const float* g_idata, > float* > > g_odata,const int noutput_items,const float konst) > > > > is completely blank. But regardless, nobody calls these > kernel > > functions. > > cuda_multiply_const_ff was never finished or I accidently > removed some > code here. > Either way, this is clearly a bug. > The empty kernel should call one of the implemented kernels. > > > > > Then, in the same file, which is called by > cuda_multiply_const_ff.cc, > > in this function > > > > int > > get_cuda_multiply_const_ff_kernel_params > > ( cuda_multiply_const_ff_kernel_params *params ) > > { > > int result=0; > > //const unsigned int max_num_threads_per_block = > > MAX_NUM_THREADS_ALL; //can use the maximum number > of threads > > if wanted > > //unsigned int max_num_blocks = > MAX_NUM_BLOCKS_ALL; > > > > unsigned int num_blocks=4096 ;// = > > gridDim.x; > //NUM_CUDABLOCKS > > unsigned int num_threads_per_block=512;// = > > blockDim.x; //NUM_THREADS; > > unsigned int > num_outputs_per_block=num_threads_per_block; > > > > const unsigned int num_outputs_per_grid= > > num_outputs_per_block*num_blocks; > //(blockDim.x)*gridDim.x > > > > size_t dynamic_shared_mem_size = > > > 0;//256*sizeof(float);//0;//num_threads_per_block*sizeof(gr_complex); > > dim3 griddim( num_blocks, 1, 1); > > dim3 threaddim( num_threads_per_block, 1, 1); > > > > params->griddim=griddim; > > params->threaddim=threaddim; > > > params->dynamic_shared_mem_size=dynamic_shared_mem_size; > > params->num_outputs_padded=num_outputs_per_grid; > > params->num_inputs_padded=num_outputs_per_grid; > > > params->num_inputs=0;//num_outputs_per_grid;//num_outputs; > > > params->num_outputs=0;//num_outputs_per_grid;//num_outputs; > > > > //Now you can do the kernel invocation like this: > > //cuda_multiply_const_ff_filter_kernel<<< > params->griddim, > > params->threaddim, params->dynamic_shared_mem_size > > >>>(g_idata, g_odata, > params->num_outputs_padded*X,konst); > > return result; > > } > > > > The kernel invocation is completely commented out! The > result is > > initialized as 0 at the top and returns it. All the work in > between to > > specify and allocate thread, block sizes does not seem to > matter. Not > > sure why this code exists this way, did someone make an edit > or did > > Martin specifically commented out the kernel invocation? > > Yes the kernel invocation is specifically commented out. > The commented out kernel invocation is only here as > documentation on > how to do the actual kernel invocation. > > There is still a small typo. > It should refer to cuda_multiply_const_ff_kernel and not > cuda_multiply_const_ff_filter_kernel > > > > > This methods name is get_cuda_multiply_const_ff_kernel_params > It does just that, determine the needed params for the kernel > invocation. > > The actual kernel invocation is done in: > cuda_multiply_const_ff_work_device > > > Both get_cuda_multiply_const_ff_kernel_params and > cuda_multiply_const_ff_work_device are called from normal C++ > code in > cuda_multiply_const_ff_kernel.cu > > Here you mean called by cuda_multiply_const_ff.cc right? the kernel > itself seems to just define the kernel functions and everything is > called by cuda_multiply_const_ff.cc, maybe I am wrong because this > leads me to my question which is, where in cuda_multiply_const_ff > calls cuda_multiply_const_ff_work_device? doesn't seem like it does > but maybe its some OOP process that I missed...anyways...not that > important at the moment.... > > > Better examples to look at and benchmark would be > cuda_quadrature_demod_cuda_cf > cuda_fir_filter_fff > > These are both used in > testbed/wfm/cuda_wfm_rcv.py > Which calls the complete cuda implementation of a WFM receiver > in > cuda_wfm_rcv_cuda.py > > You will notice a cuda_multiply_const_ff block is > instantiated as > volume_control but not used in the final connect because it > didn't work. > Now I know this is because of the bug you found. > > > > Is this suppose to be this way? I don't see how this can be > a proper > > benchmarking if it seems that we just test about allocating > threads > > and blocks on the device and memory access times, but really > don't do > > any computation. > > That is not a good benchmark indeed. > cuda_multiply_const_ff should be fixed first. > > The fir-filter is however the thing you really want to > benchmark. > (Which is also in there) > The more taps the fir-filter has, the more computations it > will have to > make. > Note that the cuda implementation is at the moment limited in > the number > of taps it can support at a certain decimation factor. > If I remember well decimationfactor*numberoftaps should stay > below 2048 > (or was it 512) > > Ways of improving the resuls include having gnuradio do > calculations in > bigger chunks. > This can be done in several ways. > One way is to have a block at the end of the processing chain > which has > a large output_multiple requirement which needs 1 input for > every > output. > In other words. Connect a block to the end of the processing > chain which > has set_output_multiple(large_number) in its constructor. > > You can use gr_test for this. > > I decided to abandon and comment out all the cuda.multiply_const_ff > function calls and concentrate on cuda.fir_filter_fff as suggested. > Things I got questions/concerns > > 1. I increased output_multiple by doing "options.output_multiple = > xxx" and this has no effect on the computing time of either CUDA or > CPU. Did I do something wrong? > 2. I increased the taps by doing "taps = range(1,256)" and also > increasing number of blocks of fir_filter in the code and voila, I am > now able to get CUDA to be faster than just CPU. However, if I > implement something like "taps = range(1,512)" the CUDA part would be > extremely slow (~20 seconds) while the CPU is still cool (~ 2 sec). > Why? But this maybe related to what you were saying about max number > of taps...although why is CPU able to still compute? > 3. I had to increase the number of fir_filter blocks to 14 blocks > before I can start seeing CUDA out-perform CPU. Experimentally its > fine, I achieved my objective, but how is this "increased computation" > justified in a normal GNURadio operation? I mean, when would a normal > GNURadio operation require a chain of 14 fir_filters? I guess this is > going beyond just "benchmarking" and asking where else can I take > advantage of CUDA's computation power in GNURadio in a "normal" > operation? > 4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of > cuda_fir_filter, it seems you are using shared memory right? Just > making sure we are not using global or local memory which would > disastrously slow down the CUDA computations. > > > > There are other ways which include changing the scheduler and > the buffer > instantiating code. > > I rather not pursue these ways because of the increased complexity. I > rather increase computation, seems fastest/easiest to me, just not > sure where. > > > > I hope this helps, > > Martin > > It has been most helpful! Thank you again! > > > > > I am probably way off here, doesnt make any > sense......someone please > > clarify! > > > > > _______________________________________________ > > Discuss-gnuradio mailing list > > Discuss-gnuradio@gnu.org > > http://lists.gnu.org/mailman/listinfo/discuss-gnuradio > > _______________________________________________ Discuss-gnuradio mailing list Discuss-gnuradio@gnu.org http://lists.gnu.org/mailman/listinfo/discuss-gnuradio