Hi,

> I prefer option 3. This would allow for something like:
>
> if (size(x) > 1e5 && stride == 1 && start == 0) {

Here we also need to check that the internal_size fits the vector width.

>   // The following steps are costly for small vectors
>   NumericT cpu_alpha = alpha; // copy back to host when the scalar is in global device memory
>   if (alpha_flip) cpu_alpha *= -1;
>   if (reciprocal) cpu_alpha = 1 / cpu_alpha;
>   // ... same for beta
>
>   // Optimized routines
>   if (external_blas)
>     call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
>   else
>     generate_execute(x = cpu_alpha*y + cpu_beta*z);
> }
> else {
>   // fallback
> }
>
> This way, we generate at most two kernels: one for small vectors,
> designed to optimize latency, and one for big vectors, designed to
> optimize bandwidth. Are we converging? :)

Convergence depends on what is inside generate_execute() ;-)

How is the problem of alpha and beta residing on the GPU addressed? And how will the batch compilation look? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually.

Note to self: Collect some numbers on the cost of jit-compilation for different OpenCL SDKs.

Best regards,
Karli
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
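[Editor's note] The host-side flag folding quoted above (flip, then reciprocal, then dispatch) can be sketched in plain C++. This is only an illustration of the intended semantics; the names `fold_scalar` and `avbv` are invented here and are not ViennaCL API:

```cpp
#include <cstddef>
#include <vector>

// Fold the flip/reciprocal flags into a plain host scalar, mirroring the
// cpu_alpha steps in the quoted pseudocode (flip is applied first).
template <typename NumericT>
NumericT fold_scalar(NumericT alpha, bool flip, bool reciprocal)
{
    if (flip)       alpha *= -1;
    if (reciprocal) alpha = NumericT(1) / alpha;
    return alpha;
}

// Reference semantics of x = alpha*y + beta*z after folding. A real
// implementation would dispatch to an external BLAS (two axpy calls) or
// to a generated kernel, depending on size(x), stride and start.
template <typename NumericT>
void avbv(std::vector<NumericT> & x,
          NumericT alpha, std::vector<NumericT> const & y,
          NumericT beta,  std::vector<NumericT> const & z)
{
    for (std::size_t i = 0; i < x.size(); ++i)
        x[i] = alpha * y[i] + beta * z[i];
}
```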
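[Editor's note] On the batch-compilation question: one common way to avoid one jit pass per kernel is to concatenate the sources of all default kernels into a single program, build it once, and cache the result keyed by the source string. A minimal host-side sketch of such a cache (the `Program`/`ProgramCache` names are invented for illustration; in practice the build step would wrap `clCreateProgramWithSource` plus `clBuildProgram`):

```cpp
#include <map>
#include <string>

// Toy stand-in for a compiled OpenCL program object.
struct Program { std::string source; };

// Cache keyed by the concatenated source of all kernels in a program, so a
// whole batch of default axpy kernels goes through the jit-compiler once.
class ProgramCache {
public:
    Program const & get(std::string const & source)
    {
        auto it = cache_.find(source);
        if (it == cache_.end()) {
            ++builds_;  // a real implementation would invoke the jit-compiler here
            it = cache_.emplace(source, Program{source}).first;
        }
        return it->second;
    }
    int builds() const { return builds_; }
private:
    std::map<std::string, Program> cache_;
    int builds_ = 0;
};
```

With this scheme the second request for the same batched source is a pure lookup, which is exactly the property we want for the default axpy kernels.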