Hey,

>     (x programs/y kernels each)  Execution time
>     (1/128)         1.4
>     (2/64)          2.0
>     (4/32)          3.2
>     (8/16)          5.6
>     (16/8)         10.5
>     (32/4)         20.0
>     (64/2)         39.5
>     (128/1)        80.6
>
>     Thus, jit launch overhead is in the order of a second!
>
>
> Okay, it seems like 1 program for all the kernels is the way to go. From
> your hard facts, though, it seems like generating 16 kernels inside the
> same program would have practically the same cost as generating only
> one, since the execution time is largely dominated by the kernel launch
> overhead. The jit launch overhead seems to be roughly 80.6/128 =~ 0.63s,
> which leads to a kernel compilation time of roughly (1.4 - 0.63)/128 =~ 6ms.

Considering that the flip_sign and reciprocal trick cannot be applied 
to unsigned integers, this is the way to go then. The increase in the 
number of kernels should be somewhat compensated by the fact that each 
kernel is shorter.


>     All we need to do is to have an interface to the generator where we
>     can just extract the axpy-kernels. The generator should not do any
>     OpenCL program and kernel management.
>
>
> I don't see any problem with extracting the source code from the
> generator in order to create this program (it is already done for GEMM),
> but the generator doesn't handle reciprocal and flip_sign. As I said
> earlier this feature is cool because it may prevent the transfer of
> several GPU-scalars in order to invert/reverse the value. On the other
> hand, though, it is incompatible with the clBlas interface and the
> kernel generator (both of which are fed with cl_float and cl_double).
> Modifying the generator to handle "x = y/a - w/b - z*c" internally as "x
> = y*a + w*b + z*c + option_a + option_b + option_c" sounds like a very
> dangerous idea to me. It could have a lot of undesirable side effects if
> made general, and making an axpy-specific tree parsing would lead to a
> huge amount of code bloat. This is actually the reason why I am so
> reluctant to integrate reciprocal and flip_sign into the generator...

Okay, let's not propagate reciprocal and flip_sign into the generator 
then. Also, feel free to eliminate the second reduction stage for 
scalars, which is encoded into the option value. It is currently unused 
and makes the generator integration harder than necessary. We can 
revisit that later if all other optimizations are exhausted ;-)


> if(size(x)>1e5 && stride==1 && start==0){ //Vectors are padded, wouldn't
> it be confounding/unnecessary to check for the internal size to fit the
> width?
>
> //The following steps are costly for small vectors
>   cl_type<NumericT> cpu_alpha = alpha //copy back to host when the
> scalar is on global device memory)

Never copy device scalars back unless requested by the user. These 
reads block the command queue, preventing any overlap of host and 
device computation.

>   if(alpha_flip) cpu_alpha*=-1;
>   if(reciprocal) cpu_alpha = 1/cpu_alpha;
>   //... same for beta

Let's just generate all the needed kernels and only dispatch into the 
correct kernel.


> //Optimized routines
>   if(external_blas)
>     call_axpy_twice(x,cpu_alpha,y,cpu_beta,z)
>   else{
>     dynamically_generated_program::init();
>     ambm_kernel(x,cpu_alpha,y,cpu_beta,z)
>   }
> else{
>    statically_generated_program::init();
>    ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha y, beta,
> reciprocal_beta, flip_beta, z)
> }

What is the difference between
   dynamically_generated_program::init();
and
   statically_generated_program::init();
? Why aren't they the same?

Also, mind the coding style regarding the placement of curly braces and 
spaces ;-)


> Wouldn't this solve all of our issues?
>
> I (really) hope we're converging now! :)

I think we can safely use
   dynamically_generated_program::init();
in both cases, which then contains all the kernels currently found in 
the statically generated program.


>     I don't believe it is our task to implement such a cache. This is
>     way too much a source of error and messing with the filesystem for
>     ViennaCL which is supposed to run with user permissions. An OpenCL
>     SDK is installed into the system and thus has much better options to
>     deal with the location of cache, etc. Also, why is only NVIDIA able
>     to provide such a cache, even though they don't even seem to care
>     about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for
>     an extended amount of time.
>
>
> Agreed. I was just suggesting this because PyOpenCL already provides
> this, but python comes with a set of dynamic libraries, so this is
> probably not the same context.

Python has a whole bunch of functionality for abstracting the file 
system. Boost.filesystem is somewhat similar, but it is way too 
painful for what we get from it.

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
