Taking a closer look at cuda_multiply_const_ff_kernel.cu, there are five
different kernel functions for the multiplication. The default one,

> __global__ void
> cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
>                               const int noutput_items, const float konst)

has a completely empty body. Regardless, nothing calls any of these kernel
functions. The same file also contains this function, which is called from
cuda_multiply_const_ff.cc:

> int
> get_cuda_multiply_const_ff_kernel_params (cuda_multiply_const_ff_kernel_params *params)
> {
>   int result=0;
>   //const unsigned int max_num_threads_per_block  = MAX_NUM_THREADS_ALL; //can use the maximum number of threads if wanted
>   //unsigned int max_num_blocks         = MAX_NUM_BLOCKS_ALL;
>
>   unsigned int num_blocks=4096; // = gridDim.x; //NUM_CUDABLOCKS
>   unsigned int num_threads_per_block=512; // = blockDim.x; //NUM_THREADS;
>   unsigned int num_outputs_per_block=num_threads_per_block;
>
>   const unsigned int num_outputs_per_grid=num_outputs_per_block*num_blocks; //(blockDim.x)*gridDim.x
>
>   size_t dynamic_shared_mem_size = 0; //256*sizeof(float);//0;//num_threads_per_block*sizeof(gr_complex);
>   dim3  griddim( num_blocks, 1, 1);
>   dim3  threaddim( num_threads_per_block, 1, 1);
>
>   params->griddim=griddim;
>   params->threaddim=threaddim;
>   params->dynamic_shared_mem_size=dynamic_shared_mem_size;
>   params->num_outputs_padded=num_outputs_per_grid;
>   params->num_inputs_padded=num_outputs_per_grid;
>   params->num_inputs=0;//num_outputs_per_grid;//num_outputs;
>   params->num_outputs=0;//num_outputs_per_grid;//num_outputs;
>
>   //Now you can do the kernel invocation like this:
>   //cuda_multiply_const_ff_filter_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>(g_idata, g_odata, params->num_outputs_padded*X,konst);
>   return result;
> }


The kernel invocation is completely commented out! result is initialized to
0 at the top and returned unchanged. All the work in between, which sets up
the thread and block sizes, does not seem to matter. I am not sure why the
code exists in this form: did someone edit it, or did Martin deliberately
comment out the kernel invocation? Is it supposed to be this way? I don't
see how this can be a proper benchmark if we only test setting up threads
and blocks on the device and memory access times, but never do any actual
computation.
I am probably way off here, but this doesn't make sense to me. Could someone
please clarify?
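
For reference, here is a minimal host-only sketch of what the function above
appears to compute, assuming nothing else in the file touches params. Dim3,
KernelParams and get_params_sketch are hypothetical stand-ins of mine (not
names from the GNU Radio source) so that it builds with a plain C++ compiler
and no CUDA toolkit. The field types are guesses as well.

```cpp
// Host-only sketch mirroring the arithmetic in
// get_cuda_multiply_const_ff_kernel_params. No CUDA required.
#include <cstddef>

// Hypothetical stand-in for CUDA's dim3.
struct Dim3 {
  unsigned int x, y, z;
};

// Hypothetical stand-in for cuda_multiply_const_ff_kernel_params;
// only the fields assigned in the quoted function are included,
// and their types are assumed.
struct KernelParams {
  Dim3 griddim;
  Dim3 threaddim;
  std::size_t dynamic_shared_mem_size;
  unsigned int num_outputs_padded;
  unsigned int num_inputs_padded;
  unsigned int num_inputs;
  unsigned int num_outputs;
};

// Same constants and assignments as the quoted function.
// Note that nothing here launches a kernel or touches any sample data.
int get_params_sketch(KernelParams *params) {
  int result = 0;

  const unsigned int num_blocks = 4096;             // gridDim.x
  const unsigned int num_threads_per_block = 512;   // blockDim.x
  const unsigned int num_outputs_per_block = num_threads_per_block;
  const unsigned int num_outputs_per_grid =
      num_outputs_per_block * num_blocks;           // 512 * 4096

  params->griddim = Dim3{num_blocks, 1, 1};
  params->threaddim = Dim3{num_threads_per_block, 1, 1};
  params->dynamic_shared_mem_size = 0;
  params->num_outputs_padded = num_outputs_per_grid;
  params->num_inputs_padded = num_outputs_per_grid;
  params->num_inputs = 0;
  params->num_outputs = 0;

  return result;  // never modified, so always 0
}
```

Calling get_params_sketch on a zero-initialized KernelParams returns 0 and
sets num_outputs_padded and num_inputs_padded to 4096 * 512 = 2097152, while
num_inputs and num_outputs stay 0. So the function only describes a launch
geometry; whether a kernel is ever launched with it depends on the caller.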