Hi Phil,

> Oh, I get it better now. I am not entirely convinced, though ;)
> From my experience, the overhead of the jit launch is negligible
> compared to the compilation of one kernel. I'm not sure whether
> compiling two kernels in the same program or in two different programs
> makes a big difference.

Okay, time to feed you some hard facts ;-) Scenario: compilation of 
128 kernels, split into x programs with y kernels each (x*y=128).
Execution times:

(x programs/y kernels each)  Execution time
(1/128)         1.4
(2/64)          2.0
(4/32)          3.2
(8/16)          5.6
(16/8)         10.5
(32/4)         20.0
(64/2)         39.5
(128/1)        80.6

Thus, the jit launch overhead is on the order of a second per program!
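
To make the per-program cost explicit, here is a minimal sketch that fits a straight line (time = slope * programs + intercept) to the numbers in the table above. The helper name fit_line is made up for illustration, and the time unit is whatever the table uses (presumably seconds, given the conclusion above).

```python
# Measurements copied from the table above:
# (number of programs x, total execution time for all 128 kernels).
measurements = [
    (1, 1.4), (2, 2.0), (4, 3.2), (8, 5.6),
    (16, 10.5), (32, 20.0), (64, 39.5), (128, 80.6),
]


def fit_line(points):
    """Ordinary least-squares fit of t = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    s_xt = sum((x - mean_x) * (t - mean_t) for x, t in points)
    s_xx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = s_xt / s_xx
    return slope, mean_t - slope * mean_x


# slope: extra time per additional program (the jit launch overhead)
# intercept: time that does not depend on the number of programs
slope, intercept = fit_line(measurements)
print(f"overhead per program: {slope:.2f}, fixed part: {intercept:.2f}")
```

With these numbers the slope comes out at roughly 0.6 time units per additional program, so the total time is dominated by the number of programs rather than by the number of kernels.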

> Plus, ideally, in the case of a linear solver,
> the generator could be used to generate fused kernels, provided that the
> scheduler is fully operational.

Sure, kernel fusion is a bonus of the micro-scheduler, but we still need 
to have a fast default behavior for scenarios where kernel fusion is 
disabled.


> I fear that any solution to the
> aforementioned problem would destroy this precious ability... Ideally,
> once we enable it, the generate_execute() mentioned above would just be
> replaced by generate() (or enqueue_for_generation, which is more explicit)

All we need to do is to have an interface to the generator where we can 
just extract the axpy-kernels. The generator should not do any OpenCL 
program or kernel management.
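
A minimal sketch of that separation, under stated assumptions: generate_axpy_source and ProgramSourceBundle are hypothetical names, not part of ViennaCL or its generator. The generator side only returns OpenCL C source strings. The caller collects them into a single program source, so that only one jit launch is needed for all kernels. No OpenCL API calls are made here.

```python
from dataclasses import dataclass, field


def generate_axpy_source(kernel_name: str, scalar_type: str = "float") -> str:
    """Hypothetical generator entry point: returns kernel source text only.

    It does not create OpenCL programs or kernel objects.
    """
    return (
        f"__kernel void {kernel_name}(__global {scalar_type} *y,\n"
        f"    __global const {scalar_type} *x, {scalar_type} alpha, unsigned int n)\n"
        "{\n"
        "  for (unsigned int i = get_global_id(0); i < n; i += get_global_size(0))\n"
        "    y[i] += alpha * x[i];\n"
        "}\n"
    )


@dataclass
class ProgramSourceBundle:
    """Hypothetical caller-side container: many kernels, one program source."""
    kernel_sources: dict = field(default_factory=dict)

    def add(self, kernel_name: str, source: str) -> None:
        self.kernel_sources[kernel_name] = source

    def combined_source(self) -> str:
        # One string for one program build, instead of one build per kernel.
        return "\n".join(self.kernel_sources.values())


bundle = ProgramSourceBundle()
for name, scalar in [("axpy_float", "float"), ("axpy_int", "int")]:
    bundle.add(name, generate_axpy_source(name, scalar))

program_source = bundle.combined_source()
print(program_source.count("__kernel void"), "kernels in one program source")
```

The point of the split is that whoever owns the OpenCL context decides how kernels are grouped into programs, while the generator stays a pure source-to-string component.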



> This put aside, I'm not sure if we should give that much importance to
> jit-compilation overhead, since the binaries can be cached. If I
> remember well, Denis Demidov implemented such a caching mechanism for
> VexCL. What if we replace "distributed vector/matrix" with "optional
> automatic kernel caching mechanism" for ViennaCL 1.6.0 (we just have a
> limited amount of time :P)? The drawback is that the filesystem library
> would have to be dynamically linked, though, but after all OpenCL itself
> also has to be dynamically linked.

I don't believe it is our task to implement such a cache. It is far too 
error-prone and involves too much messing with the filesystem for 
ViennaCL, which is supposed to run with user permissions. An OpenCL SDK 
is installed into the system and thus has much better options to deal 
with the location of the cache, etc. Also, why is only NVIDIA able to 
provide such a cache, even though they don't even seem to care about 
OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an 
extended amount of time.

Best regards,
Karli


_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
