Hey hey Karl,




2014/1/25 Karl Rupp <r...@iue.tuwien.ac.at>

> Hi Phil,
>
>
> > Oh, I get it better now. I am not entirely convinced, though ;)
>
>>  From my experience, the overhead of the jit launch is negligible
>>   compared to the compilation of one kernel. I'm not sure whether
>> compiling two kernels in the same program or in two different programs
>> makes a big difference.
>>
>
> Okay, time to feed you some hard facts ;-) Scenario: compilation of
> 128 kernels, split into x programs with y kernels each (x*y = 128).
> Execution times:
>
> (x programs / y kernels each)  Execution time (s)
> (1/128)         1.4
> (2/64)          2.0
> (4/32)          3.2
> (8/16)          5.6
> (16/8)         10.5
> (32/4)         20.0
> (64/2)         39.5
> (128/1)        80.6
>
> Thus, the jit launch overhead is on the order of a second!
>
>
Okay, it seems like one program for all the kernels is the way to go. From
your hard facts, though, it seems like generating 16 kernels inside the
same program would have practically the same cost as generating only one,
since the execution time is largely dominated by the jit launch overhead.
The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s per program,
which leads to a kernel compilation time of roughly (1.4 - 0.63)/128 ≈ 6 ms
per kernel.
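As a sanity check, the table above can be fitted with a simple cost model
T(x, y) = x*L + x*y*C, where L is the per-program jit launch overhead and C
the per-kernel compilation time (a back-of-the-envelope sketch, not a claim
about the actual measurement setup):

```python
# Measured execution times from Karl's table: {number of programs: seconds},
# always 128 kernels in total.
measurements = {1: 1.4, 2: 2.0, 4: 3.2, 8: 5.6,
                16: 10.5, 32: 20.0, 64: 39.5, 128: 80.6}

# 128 programs: 128*L + 128*C = 80.6 ;  1 program: L + 128*C = 1.4
L = (measurements[128] - measurements[1]) / 127   # per-program launch overhead, ~0.62 s
C = (measurements[1] - L) / 128                   # per-kernel compile time, ~6 ms

# Cross-check against an intermediate configuration (16 programs of 8 kernels)
predicted = 16 * L + 128 * C
print(L, C, predicted)  # predicted ~10.8 s vs. measured 10.5 s
```

The fit reproduces the intermediate data points within a few percent, which
supports reading the table as "launch overhead dominates, compilation is cheap".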


>
>  Plus, ideally, in the case of a linear solver,
>> the generator could be used to generate fused kernels, provided that the
>> scheduler is fully operational.
>>
>
> Sure, kernel fusion is a bonus of the micro-scheduler, but we still need
> to have a fast default behavior for scenarios where kernel fusion
> is disabled.
>
>
>
>  I fear that any solution to the
>> aforementioned problem would destroy this precious ability... Ideally,
>> once we enable it, the generate_execute() mentioned above would just be
>> replaced by generate() (or enqueue_for_generation, which is more explicit)
>>
>
> All we need to do is to have an interface to the generator where we can
> just extract the axpy-kernels. The generator should not do any OpenCL
> program and kernel management.
>
>
I don't see any problem with extracting the source code from the generator
in order to create this program (it is already done for GEMM), but the
generator doesn't handle reciprocal and flip_sign. As I said earlier, this
feature is nice because it may avoid transferring several GPU scalars back
to the host just to take their reciprocal or flip their sign. On the other
hand, though, it is incompatible with the clBlas interface and the kernel
generator (both of which are fed with cl_float and cl_double). Modifying
the generator to handle "x = y/a - w/b - z*c" internally as "x = y*a + w*b
+ z*c + option_a + option_b + option_c" sounds like a very dangerous idea
to me. It could have a lot of undesirable side effects if made general, and
an axpy-specific tree parsing would lead to a huge amount of code bloat.
This is actually the reason why I am so reluctant to integrate reciprocal
and flip_sign into the generator...

if (size(x) > 1e5 && stride == 1 && start == 0) {
  // Vectors are padded; wouldn't it be confounding/unnecessary to check
  // whether the internal size fits the width?

  // The following steps are costly for small vectors:
  cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the scalar
                                       // is in global device memory
  if (flip_alpha)       cpu_alpha *= -1;
  if (reciprocal_alpha) cpu_alpha = 1 / cpu_alpha;
  // ... same for beta

  // Optimized routines
  if (external_blas)
    call_axpy_twice(x, cpu_alpha, y, cpu_beta, z);
  else {
    dynamically_generated_program::init();
    ambm_kernel(x, cpu_alpha, y, cpu_beta, z);
  }
}
else {
  statically_generated_program::init();
  ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha, y, beta,
              reciprocal_beta, flip_beta, z);
}

Wouldn't this solve all of our issues?
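To make the host-side preprocessing step concrete, here is a minimal Python
sketch of folding flip_sign/reciprocal into a plain host scalar before the
kernel call (the helper name and signature are illustrative, not actual
ViennaCL API):

```python
def fold_options(alpha, flip_sign=False, reciprocal=False):
    """Fold flip_sign/reciprocal into the host copy of the scalar, so the
    generated kernel only ever sees x = y*a + w*b + ... (illustrative
    helper, not part of ViennaCL)."""
    if flip_sign:
        alpha = -alpha
    if reciprocal:
        alpha = 1.0 / alpha
    return alpha

# "x = y/a - z*b" becomes "x = y*a' + z*b'" with preprocessed scalars:
a_prime = fold_options(4.0, reciprocal=True)   # 0.25
b_prime = fold_options(2.0, flip_sign=True)    # -2.0
```

Note that the order of the two options does not matter here, since
1/(-a) == -(1/a).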

I (really) hope we're converging now! :)



>
>
>
>  This put aside, I'm not sure if we should give that much importance to
>> jit-compilation overhead, since the binaries can be cached. If I
>> remember correctly, Denis Demidov implemented such a caching mechanism
>> for VexCL. What if we replace "distributed vector/matrix" with "optional
>> automatic kernel caching mechanism" for ViennaCL 1.6.0 (we just have a
>> limited amount of time :P)? The drawback is that the filesystem library
>> would have to be dynamically linked, though, but after all OpenCL itself
>> also has to be dynamically linked.
>>
>
> I don't believe it is our task to implement such a cache. It is way too
> much of a source of errors and filesystem meddling for ViennaCL, which
> is supposed to run with user permissions. An OpenCL SDK is installed into
> the system and thus has much better options for dealing with the location
> of the cache, etc. Also, why is only NVIDIA able to provide such a cache,
> even though they don't even seem to care about OpenCL 1.2? I doubt that
> e.g. AMD will go without a cache for an extended amount of time.
>

Agreed. I was just suggesting this because PyOpenCL already provides this,
but Python comes with a set of dynamic libraries, so this is probably not
the same context.
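For reference, the caching mechanism under discussion boils down to keying
compiled binaries by a hash of the kernel source. A hedged, OpenCL-free
sketch of the idea (the `compile_fn` callback stands in for an actual
clBuildProgram/clGetProgramInfo round-trip; none of these names come from a
real library):

```python
import hashlib
import os
import tempfile

def cached_build(source, compile_fn, cache_dir):
    """Look up a compiled binary by the SHA-1 of its source; compile and
    store it on a miss. Illustrative sketch of what a kernel cache does,
    not a proposal for ViennaCL itself."""
    key = hashlib.sha1(source.encode()).hexdigest()
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):            # cache hit: skip compilation
        with open(path, "rb") as f:
            return f.read()
    binary = compile_fn(source)         # cache miss: compile and store
    with open(path, "wb") as f:
        f.write(binary)
    return binary

# Usage: the second call is a cache hit and skips "compilation".
calls = []
def fake_compile(src):
    calls.append(src)
    return src.upper().encode()

with tempfile.TemporaryDirectory() as d:
    b1 = cached_build("__kernel void k() {}", fake_compile, d)
    b2 = cached_build("__kernel void k() {}", fake_compile, d)
    assert b1 == b2 and len(calls) == 1
```

The real complications Karl points at (user permissions, cache location,
invalidation when the driver changes) are exactly what this toy version
ignores.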

Best regards,
Philippe


> Best regards,
> Karli
>
>
_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
