Hey,

I think we agree on everything now! Okay, I will generate all the kernels.
This will actually lead to 16 kernels for each CPU-GPU scalar combination,
so 64 small kernels in total. This took time, but it was a fruitful
discussion :)

Anyways, my ideas are much clearer now, thanks!

Best regards,
Philippe
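
A minimal sketch of where the count of 64 comes from, assuming that the 16 variants per CPU-GPU scalar combination are the 2^4 choices of flip_sign and reciprocal for alpha and beta (an assumption, the thread does not spell this out). The struct `ScalarVariant`, the function names, and the `ambm_...` naming scheme are hypothetical and not taken from ViennaCL.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical description of how one scalar (alpha or beta) enters a kernel.
struct ScalarVariant
{
  bool on_gpu;      // true: scalar lives in device memory, false: passed from the host
  bool flip_sign;   // true: the kernel uses the negated scalar
  bool reciprocal;  // true: the kernel divides by the scalar instead of multiplying
};

// Hypothetical naming scheme, e.g. "ambm_gpu_cpu_0011".
inline std::string kernel_name(ScalarVariant const & a, ScalarVariant const & b)
{
  std::string name = "ambm_";
  name += a.on_gpu ? "gpu_" : "cpu_";
  name += b.on_gpu ? "gpu_" : "cpu_";
  name += a.flip_sign  ? '1' : '0';
  name += a.reciprocal ? '1' : '0';
  name += b.flip_sign  ? '1' : '0';
  name += b.reciprocal ? '1' : '0';
  return name;
}

// Enumerates every combination of the six boolean choices (2^6 = 64).
inline std::set<std::string> all_kernel_names()
{
  std::set<std::string> names;
  for (unsigned int bits = 0; bits < 64; ++bits)
  {
    ScalarVariant a = { (bits & 32u) != 0, (bits & 8u) != 0, (bits & 4u) != 0 };
    ScalarVariant b = { (bits & 16u) != 0, (bits & 2u) != 0, (bits & 1u) != 0 };
    names.insert(kernel_name(a, b));
  }
  return names;
}
```

Under this assumption, the two location bits give the 4 CPU-GPU combinations and the remaining four bits give the 16 variants within each combination.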


2014-01-26 Karl Rupp <r...@iue.tuwien.ac.at>

> Hey,
>
>
>>     (x programs/y kernels each)  Execution time (s)
>>
>>     (1/128)         1.4
>>     (2/64)          2.0
>>     (4/32)          3.2
>>     (8/16)          5.6
>>     (16/8)         10.5
>>     (32/4)         20.0
>>     (64/2)         39.5
>>     (128/1)        80.6
>>
>>     Thus, jit launch overhead is in the order of a second!
>>
>>
>> Okay, it seems like 1 program for all the kernels is the way to go. From
>> your hard facts, though, it seems like generating 16 kernels inside the
>> same program would have practically the same cost as generating only
>> one, since the execution time is largely dominated by the jit launch
>> overhead. The jit launch overhead seems to be roughly 80/128 = 0.6s,
>> which leads to a kernel compilation time of roughly (1.4 - 0.6)/128 =~
>> 6ms.
>>
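
The estimate above can be reproduced with a small two-parameter fit. This is only a sketch, assuming that the total time is linear in the number of programs and in the number of kernels (total = programs * per_program + kernels * per_kernel); the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical linear cost model for JIT compilation (times in seconds).
struct JitCostModel
{
  double per_program;  // fixed overhead for each program that is built
  double per_kernel;   // incremental cost for each kernel inside a program
};

// Fits the model through two measurements that share the same total kernel count.
inline JitCostModel fit_jit_cost(double programs_a, double time_a,
                                 double programs_b, double time_b,
                                 double total_kernels)
{
  JitCostModel m;
  m.per_program = (time_b - time_a) / (programs_b - programs_a);
  m.per_kernel  = (time_a - programs_a * m.per_program) / total_kernels;
  return m;
}

// Predicted build time for a given split of the kernels into programs.
inline double predict(JitCostModel const & m, double programs, double total_kernels)
{
  return programs * m.per_program + total_kernels * m.per_kernel;
}
```

Fitting through the (1/128) and (128/1) rows of the table, i.e. `fit_jit_cost(1, 1.4, 128, 80.6, 128)`, gives about 0.62 s per program and about 6 ms per kernel, and the intermediate rows are reproduced to within a few percent.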
>
> Considering that the flip_sign and reciprocal trick cannot be applied for
> unsigned integers, this is the way to go then. The increase in the number
> of kernels should be somewhat compensated by the fact that each of the
> kernels is shorter.
>
>
>
>>     All we need to do is to have an interface to the generator where we
>>     can just extract the axpy-kernels. The generator should not do any
>>     OpenCL program and kernel management.
>>
>>
>> I don't see any problem with extracting the source code from the
>> generator in order to create this program (it is already done for GEMM),
>> but the generator doesn't handle reciprocal and flip_sign. As I said
>> earlier, this feature is cool because it may prevent the transfer of
>> several GPU scalars in order to invert/reverse the value. On the other
>> hand, though, it is incompatible with the clBlas interface and the
>> kernel generator (both of which are fed with cl_float and cl_double).
>> Modifying the generator to handle "x = y/a - w/b - z*c" internally as "x
>> = y*a + w*b + z*c + option_a + option_b + option_c" sounds like a very
>> dangerous idea to me. It could have a lot of undesirable side effects if
>> made general, and making an axpy-specific tree parsing would lead to a
>> huge amount of code bloat. This is actually the reason why I am so
>> reluctant to integrate reciprocal and flip_sign within the generator...
>>
>
> Okay, let's not propagate reciprocal and flip_sign into the generator
> then. Also, feel free to eliminate the second reduction stage for scalars,
> which is encoded into the option value. It is currently unused and makes
> the generator integration harder than necessary. We can revisit that later
> if all other optimizations are exhausted ;-)
>
>
>
>> if(size(x)>1e5 && stride==1 && start==0){ //Vectors are padded, wouldn't
>> it be confounding/unnecessary to check for the internal size to fit the
>> width?
>>
>> //The following steps are costly for small vectors
>>   cl_type<NumericT> cpu_alpha = alpha; //copy back to host (when the
>> scalar is in global device memory)
>>
>
> Never copy device scalars back unless requested by the user. Such reads
> block the command queue, preventing overlaps of host and device
> computations.
>
>
>>   if(flip_alpha) cpu_alpha*=-1;
>>   if(reciprocal_alpha) cpu_alpha = 1/cpu_alpha;
>>   //... same for beta
>>
>
> Let's just generate all the needed kernels and only dispatch into the
> correct kernel.
>
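
A host-side analogue of "generate all the needed kernels and only dispatch into the correct kernel" might look like the following sketch. It is plain C++ rather than OpenCL, the names `ambm`, `ambm_fn`, and `select_ambm` are hypothetical, and it assumes the operation is x = (+/-) y (* or /) alpha + (+/-) z (* or /) beta. Each template instantiation stands in for one generated kernel, so no flag is tested inside the loop at run time and the scalar never has to be modified beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One instantiation per flag combination, mirroring one generated kernel each.
template <bool FlipA, bool RecipA, bool FlipB, bool RecipB>
void ambm(std::vector<double> & x,
          double alpha, std::vector<double> const & y,
          double beta,  std::vector<double> const & z)
{
  for (std::size_t i = 0; i < x.size(); ++i)
  {
    double ya = RecipA ? y[i] / alpha : y[i] * alpha;
    double zb = RecipB ? z[i] / beta  : z[i] * beta;
    x[i] = (FlipA ? -ya : ya) + (FlipB ? -zb : zb);
  }
}

typedef void (*ambm_fn)(std::vector<double> &,
                        double, std::vector<double> const &,
                        double, std::vector<double> const &);

// Picks the matching variant; index bits are flip_a=8, recip_a=4, flip_b=2, recip_b=1.
inline ambm_fn select_ambm(bool flip_a, bool recip_a, bool flip_b, bool recip_b)
{
  static ambm_fn const table[16] = {
    &ambm<false, false, false, false>, &ambm<false, false, false, true>,
    &ambm<false, false, true,  false>, &ambm<false, false, true,  true>,
    &ambm<false, true,  false, false>, &ambm<false, true,  false, true>,
    &ambm<false, true,  true,  false>, &ambm<false, true,  true,  true>,
    &ambm<true,  false, false, false>, &ambm<true,  false, false, true>,
    &ambm<true,  false, true,  false>, &ambm<true,  false, true,  true>,
    &ambm<true,  true,  false, false>, &ambm<true,  true,  false, true>,
    &ambm<true,  true,  true,  false>, &ambm<true,  true,  true,  true>
  };
  std::size_t index = (flip_a ? 8u : 0u) | (recip_a ? 4u : 0u)
                    | (flip_b ? 2u : 0u) | (recip_b ? 1u : 0u);
  return table[index];
}
```

A call such as `select_ambm(true, true, false, false)(x, alpha, y, beta, z)` then computes x = -y/alpha + z*beta without touching alpha on the host.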
>
>
>>   //Optimized routines
>>   if(external_blas)
>>     call_axpy_twice(x,cpu_alpha,y,cpu_beta,z);
>>   else{
>>     dynamically_generated_program::init();
>>     ambm_kernel(x,cpu_alpha,y,cpu_beta,z);
>>   }
>> }
>> else{
>>    statically_generated_program::init();
>>    ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha, y, beta,
>> reciprocal_beta, flip_beta, z);
>> }
>>
>
> What is the difference between
>   dynamically_generated_program::init();
> and
>   statically_generated_program::init();
> ? Why aren't they the same?
>
> Also, mind the coding style regarding the placement of curly braces and
> spaces ;-)
>
>
>
>> Wouldn't this solve all of our issues?
>>
>> I (really) hope we're converging now! :)
>>
>
> I think we can safely use
>   dynamically_generated_program::init();
> in both cases, which contains all the kernels which are currently in the
> statically generated program.
>
>
>
>>     I don't believe it is our task to implement such a cache. This is
>>     way too much a source of error and messing with the filesystem for
>>     ViennaCL which is supposed to run with user permissions. An OpenCL
>>     SDK is installed into the system and thus has much better options to
>>     deal with the location of cache, etc. Also, why is only NVIDIA able
>>     to provide such a cache, even though they don't even seem to care
>>     about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for
>>     an extended amount of time.
>>
>>
>> Agreed. I was just suggesting this because PyOpenCL already provides
>> this, but python comes with a set of dynamic libraries, so this is
>> probably not the same context.
>>
>
> Python has a whole bunch of functionality for abstracting the file
> system. Boost.Filesystem is somewhat similar, but this is way too painful
> for what we get from it.
>
> Best regards,
> Karli
>
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
