Hey,

> (x programs/y kernels each)   Execution time (s)
> (1/128)                        1.4
> (2/64)                         2.0
> (4/32)                         3.2
> (8/16)                         5.6
> (16/8)                        10.5
> (32/4)                        20.0
> (64/2)                        39.5
> (128/1)                       80.6
>
> Thus, jit launch overhead is on the order of a second!
>
>
> Okay, it seems like 1 program for all the kernels is the way to go. From
> your hard facts, though, it seems like generating 16 kernels inside the
> same program would have practically the same cost as generating only
> one, since the execution time is largely dominated by the kernel launch
> overhead. The jit launch overhead seems to be roughly 80/128 =~ 0.63s,
> which leads to a kernel compilation time of roughly
> (1.4 - 0.63)/128 =~ 6ms.

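As a quick sanity check of the timings quoted above, here is a minimal sketch that fits a two-parameter cost model to two rows of the table. The model (total time = programs * per-program overhead + total kernels * per-kernel cost), the struct JitCostEstimate and the function estimate_jit_cost are assumptions made for illustration; none of them is part of ViennaCL.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical cost model (an assumption, not ViennaCL code):
//   total_time = programs * per_program_overhead
//              + kernels_total * per_kernel_cost
struct JitCostEstimate
{
  double per_program_overhead; // seconds per OpenCL program build
  double per_kernel_cost;      // seconds per compiled kernel
};

// Fit the model to two measurements that share the same total number of
// kernels but split them over a different number of programs.
inline JitCostEstimate estimate_jit_cost(double programs_a, double time_a,
                                         double programs_b, double time_b,
                                         double kernels_total)
{
  assert(programs_a != programs_b && kernels_total > 0);

  JitCostEstimate e;
  // The per-kernel term cancels in the difference of the two measurements.
  e.per_program_overhead = (time_b - time_a) / (programs_b - programs_a);
  // Whatever is left of measurement a is spread over all kernels.
  e.per_kernel_cost = (time_a - programs_a * e.per_program_overhead) / kernels_total;
  return e;
}
```

Calling estimate_jit_cost(1, 1.4, 128, 80.6, 128) with the first and last row of the table gives about 0.62 s per program and about 6 ms per kernel. The intermediate rows are roughly consistent with these two values.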
Considering that the flip_sign and reciprocal trick cannot be applied to
unsigned integers, this is the way to go then. The increase in the number
of kernels should be somewhat compensated by the fact that each of the
kernels is shorter.

> All we need to do is to have an interface to the generator where we
> can just extract the axpy-kernels. The generator should not do any
> OpenCL program and kernel management.
>
>
> I don't see any problem with extracting the source code from the
> generator in order to create this program (it is already done for GEMM),
> but the generator doesn't handle reciprocal and flip_sign. As I said
> earlier, this feature is cool because it may prevent the transfer of
> several GPU scalars in order to invert/reverse the value. On the other
> hand, though, it is incompatible with the clBlas interface and the
> kernel generator (both of which are fed with cl_float and cl_double).
> Modifying the generator to handle "x = y/a - w/b - z*c" internally as
> "x = y*a + w*b + z*c + option_a + option_b + option_c" sounds like a
> very dangerous idea to me. It could have a lot of undesirable side
> effects if made general, and making an axpy-specific tree parsing would
> lead to a huge amount of code bloat. This is actually the reason why I
> am so reluctant to integrate reciprocal and flip_sign within the
> generator...

Okay, let's not propagate reciprocal and flip_sign into the generator
then. Also, feel free to eliminate the second reduction stage for
scalars, which is encoded into the option value. It is currently unused
and makes the generator integration harder than necessary. We can revisit
that later if all other optimizations are exhausted ;-)

> if(size(x)>1e5 && stride==1 && start==0){ //Vectors are padded, wouldn't
>   //it be confounding/unnecessary to check for the internal size to fit
>   //the width?
>
>   //The following steps are costly for small vectors
>   cl_type<NumericT> cpu_alpha = alpha; //copy back to host when the
>   //scalar is on global device memory

Never copy device scalars back unless requested by the user. Such reads
block the command queue, preventing overlap of host and device
computations.

>   if(alpha_flip) cpu_alpha*=-1;
>   if(reciprocal) cpu_alpha = 1/cpu_alpha;
>   //... same for beta

Let's just generate all the needed kernels and only dispatch into the
correct kernel.

>   //Optimized routines
>   if(external_blas)
>     call_axpy_twice(x,cpu_alpha,y,cpu_beta,z);
>   else{
>     dynamically_generated_program::init();
>     ambm_kernel(x,cpu_alpha,y,cpu_beta,z);
>   }
> }
> else{
>   statically_generated_program::init();
>   ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha, y, beta,
>               reciprocal_beta, flip_beta, z);
> }

What is the difference between dynamically_generated_program::init();
and statically_generated_program::init(); ? Why aren't they the same?
Also, mind the coding style regarding the placement of curly braces and
spaces ;-)

> Wouldn't this solve all of our issues?
>
> I (really) hope we're converging now! :)

I think we can safely use dynamically_generated_program::init() in both
cases, which contains all the kernels which are currently in the
statically generated program.

> I don't believe it is our task to implement such a cache. This is
> way too much of a source of errors and messing with the filesystem for
> ViennaCL, which is supposed to run with user permissions. An OpenCL
> SDK is installed into the system and thus has much better options to
> deal with the location of the cache, etc. Also, why is only NVIDIA able
> to provide such a cache, even though they don't even seem to care
> about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for
> an extended amount of time.
>
>
> Agreed. I was just suggesting this because PyOpenCL already provides
> this, but Python comes with a set of dynamic libraries, so this is
> probably not the same context.

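Coming back to the dispatch point further up (generate all the needed kernels and only dispatch into the correct one), here is a minimal sketch of how the option flags could select one of the 16 kernels without ever reading a device scalar back. The struct ScalarOptions, the functions ambm_kernel_index and ambm_kernel_name, and the "ambm_xxxx" naming scheme are hypothetical and only serve as an illustration; they are not existing ViennaCL names.

```cpp
#include <string>

// Hypothetical host-side options for one scalar in x = alpha*y + beta*z.
// The scalar itself may stay on the device; only these flags live on the host.
struct ScalarOptions
{
  bool flip_sign;  // kernel uses -s instead of s
  bool reciprocal; // kernel uses 1/s instead of s
};

// Pack the four flags into an index in [0, 16), one per generated kernel.
inline unsigned ambm_kernel_index(ScalarOptions alpha, ScalarOptions beta)
{
  return (alpha.flip_sign  ? 1u : 0u)
       | (alpha.reciprocal ? 2u : 0u)
       | (beta.flip_sign   ? 4u : 0u)
       | (beta.reciprocal  ? 8u : 0u);
}

// Assumed naming scheme: "ambm_" followed by one digit per flag, in the
// order alpha.flip_sign, alpha.reciprocal, beta.flip_sign, beta.reciprocal.
inline std::string ambm_kernel_name(ScalarOptions alpha, ScalarOptions beta)
{
  std::string name = "ambm_";
  name += alpha.flip_sign  ? '1' : '0';
  name += alpha.reciprocal ? '1' : '0';
  name += beta.flip_sign   ? '1' : '0';
  name += beta.reciprocal  ? '1' : '0';
  return name;
}
```

With such a scheme, all 16 variants can sit in a single program, and the host only picks the kernel name (or index) from the flags before enqueueing.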
Python has a whole bunch of functionality for abstracting the file
system. Boost.Filesystem is somewhat similar, but this is way too painful
for what we get from it.

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel