Hey,
I think we agree on everything now! Okay, I will generate all the kernels;
this will actually lead to 16 kernels for each CPU-GPU scalar combination,
so 64 small kernels in total. This took time, but it was a fruitful
discussion :)
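Spelled out, the count works like this (the enumeration below is only an
illustration of the combinatorics; it is not the actual naming scheme):

```python
from itertools import product

# Each of the two scalars (alpha, beta) can independently have
# flip_sign and reciprocal set or not: 4 variants per scalar,
# hence 4 x 4 = 16 kernel bodies per CPU/GPU scalar combination.
scalar_variants = list(product([False, True], repeat=2))   # (flip_sign, reciprocal)
kernels_per_combination = len(scalar_variants) ** 2        # 16

# alpha and beta can each live on the host (CPU) or the device (GPU):
placements = list(product(["cpu", "gpu"], repeat=2))       # 4 combinations
total_kernels = kernels_per_combination * len(placements)  # 64

print(kernels_per_combination, total_kernels)  # 16 64
```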
Anyways, my ideas are much clearer now, thanks!
Best regards,
Philippe
2014-01-26 Karl Rupp <r...@iue.tuwien.ac.at>
> Hey,
>
>
>> (x programs / y kernels each)  Execution time [s]
>>
>> (1/128) 1.4
>> (2/64) 2.0
>> (4/32) 3.2
>> (8/16) 5.6
>> (16/8) 10.5
>> (32/4) 20.0
>> (64/2) 39.5
>> (128/1) 80.6
>>
>> Thus, the JIT launch overhead is on the order of a second per program!
>>
>>
>> Okay, it seems like 1 program for all the kernels is the way to go. From
>> your hard facts, though, it seems like generating 16 kernels inside the
>> same program would have practically the same cost as generating only
>> one, since the execution time is largely dominated by the kernel launch
>> overhead. The per-program JIT launch overhead seems to be roughly
>> (80.6 - 1.4)/127 =~ 0.62 s, which leads to a kernel compilation time of
>> roughly (1.4 - 0.62)/128 =~ 6 ms.
>>
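Assuming a linear cost model (total time ≈ programs × JIT overhead +
total kernels × compile time, which is an assumption the table fits well),
the two extreme rows give a quick estimate:

```python
# Measurements quoted above: (programs / kernels each) -> seconds
t_one_program = 1.4      # (1/128): one program holding all 128 kernels
t_many_programs = 80.6   # (128/1): 128 programs with one kernel each
total_kernels = 128      # total kernel count is the same in both runs

# Two-point estimate under the linear cost model:
jit_overhead = (t_many_programs - t_one_program) / (128 - 1)   # per program
compile_time = (t_one_program - jit_overhead) / total_kernels  # per kernel

print(round(jit_overhead, 2))        # ~0.62 s per program
print(round(compile_time * 1e3, 1))  # ~6.1 ms per kernel

# Sanity check against an intermediate row, (16/8), measured at 10.5 s:
predicted = 16 * jit_overhead + total_kernels * compile_time   # ~10.8 s
```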
>
> Considering that the flip_sign and reciprocal trick cannot be applied for
> unsigned integers, this is the way to go then. The increase in the number
> of kernels should be somewhat compensated for by the fact that each
> kernel is shorter.
>
>
>
> All we need to do is to have an interface to the generator where we
>> can just extract the axpy-kernels. The generator should not do any
>> OpenCL program and kernel management.
>>
>>
>> I don't see any problem with extracting the source code from the
>> generator in order to create this program (it is already done for GEMM),
>> but the generator doesn't handle reciprocal and flip_sign. As I said
>> earlier, this feature is nice because it avoids transferring several
>> GPU scalars to the host just to invert/negate their values. On the other
>> hand, though, it is incompatible with the clBlas interface and the
>> kernel generator (both of which are fed with cl_float and cl_double).
>> Modifying the generator to handle "x = y/a - w/b - z*c" internally as "x
>> = y*a + w*b + z*c + option_a + option_b + option_c" sounds like a very
>> dangerous idea to me. It could have a lot of undesirable side effects if
>> made general, and writing an axpy-specific tree parser would lead to a
>> huge amount of code bloat. This is actually the reason why I am so
>> reluctant to integrate reciprocal and flip_sign into the generator...
>>
>
> Okay, let's not propagate reciprocal and flip_sign into the generator
> then. Also, feel free to eliminate the second reduction stage for scalars,
> which is encoded into the option value. It is currently unused and makes
> the generator integration harder than necessary. We can revisit that later
> if all other optimizations are exhausted ;-)
>
>
>
>> if (size(x) > 1e5 && stride == 1 && start == 0) { // Vectors are padded;
>> wouldn't it be confusing/unnecessary to check for the internal size to
>> fit the width?
>>
>>   // The following steps are costly for small vectors
>>   cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the
>>   scalar is in global device memory
>>
>
> Never copy device scalars back to the host unless requested by the user.
> These reads block the command queue, preventing any overlap of host and
> device computation.
>
>
> if(alpha_flip) cpu_alpha*=-1;
>> if(reciprocal) cpu_alpha = 1/cpu_alpha;
>> //... same for beta
>>
>
> Let's just generate all the needed kernels and only dispatch into the
> correct kernel.
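A minimal sketch of such a dispatch; the flag-to-index encoding and the
kernel names below are my own illustration, not ViennaCL's scheme:

```python
# Map the four boolean options to one of the 16 pre-generated kernels.
def kernel_index(flip_alpha, recip_alpha, flip_beta, recip_beta):
    return (int(flip_alpha) << 3) | (int(recip_alpha) << 2) \
         | (int(flip_beta) << 1) | int(recip_beta)

def kernel_name(*flags):
    # Hypothetical naming: one kernel per index in the shared program.
    return "avbv_%02d" % kernel_index(*flags)

# Example: x = y / alpha - beta * z
#   -> reciprocal on alpha, flip_sign on beta:
name = kernel_name(False, True, True, False)  # "avbv_06"
```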
>
>
>
> //Optimized routines
>> if (external_blas)
>>   call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
>> else {
>>   dynamically_generated_program::init();
>>   ambm_kernel(x, cpu_alpha, y, cpu_beta, z);
>> }
>> } else { // small or non-contiguous vectors: skip the host-side steps above
>>   statically_generated_program::init();
>>   ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha, y, beta,
>>               reciprocal_beta, flip_beta, z);
>> }
>>
>
> What is the difference between
> dynamically_generated_program::init();
> and
> statically_generated_program::init();
> ? Why aren't they the same?
>
> Also, mind the coding style regarding the placement of curly braces and
> spaces ;-)
>
>
>
> Wouldn't this solve all of our issues?
>>
>> I (really) hope we're converging now! :)
>>
>
> I think we can safely use
> dynamically_generated_program::init();
> in both cases, which contains all the kernels that are currently in the
> statically generated program.
>
>
>
> I don't believe it is our task to implement such a cache. It is
>> far too error-prone and involves too much filesystem manipulation for
>> ViennaCL, which is supposed to run with user permissions. An OpenCL
>> SDK is installed into the system and thus has much better options to
>> deal with the location of cache, etc. Also, why is only NVIDIA able
>> to provide such a cache, even though they don't even seem to care
>> about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for
>> an extended amount of time.
>>
>>
>> Agreed. I was just suggesting this because PyOpenCL already provides
>> this, but Python comes with a set of dynamic libraries, so this is
>> probably not the same context.
>>
>
> Python has a whole bunch of functionality for abstracting the file
> system. Boost.Filesystem is somewhat similar, but it is way too painful
> for what we get from it.
>
> Best regards,
> Karli
>
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel