Hi Phil,

> Oh, I get it better now. I am not entirely convinced, though ;)
> From my experience, the overhead of the jit launch is negligible
> compared to the compilation of one kernel. I'm not sure whether
> compiling two kernels in the same program or two different program
> creates a big difference.
Okay, time to feed you with some hard facts ;-)

Scenario: compilation of 128 kernels.
Configurations: x programs with y kernels each, x*y = 128.

Execution times:

  x programs / y kernels each    Execution time
    1 / 128                         1.4
    2 / 64                          2.0
    4 / 32                          3.2
    8 / 16                          5.6
   16 / 8                          10.5
   32 / 4                          20.0
   64 / 2                          39.5
  128 / 1                          80.6

Thus, the jit launch overhead is on the order of a second!

> Plus, ideally, in the case of linear solver,
> the generator could be used to generate fused kernels, provided that the
> scheduler is fully operationnal.

Sure, kernel fusion is a bonus of the micro-scheduler, but we still need
to have a fast default behavior for scenarios where kernel fusion is
disabled.

> I fear that any solution to the
> aforementioned problem would destroy this precious ability... Ideally,
> once we enable it, the generate_execute() mentioned above would just be
> replaced by generate() (or enqueue_for_generation, which is more explicit)

All we need is an interface to the generator through which we can just
extract the axpy-kernels. The generator should not do any OpenCL program
and kernel management.

> This put aside, I'm not sure if we should give that much importance to
> jit-compilation overhead, since the binaries can be cached. If I
> remember well, Denis Demidov implemented such a caching mechanism for
> VexCL. What if we replace "distributed vector/matrix" with "optionnal
> automatic kernel caching mechanism" for ViennaCL 1.6.0 (we just have a
> limited amount of time :P) ? The drawback is that the filesystem library
> would have to be dynamically linked, though, but afterall OpenCL itself
> also has to be dynamically linked.

I don't believe it is our task to implement such a cache. This is way too
much of a source of errors and of messing with the filesystem for ViennaCL,
which is supposed to run with user permissions. An OpenCL SDK is installed
into the system and thus has much better options to deal with the location
of the cache, etc.
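Coming back to the timings above for a moment: a rough way to put a number on the per-program overhead is a straight-line fit of execution time against the number of programs. The sketch below is only an illustration. It assumes the times are in seconds (the table does not say so, but the "order of a second" remark suggests it), and the names timings and fit_line are made up for this example.

```python
# Timings copied from the table above: (number of programs, execution time).
# Assumption: the times are in seconds; the table itself gives no unit.
timings = [
    (1, 1.4), (2, 2.0), (4, 3.2), (8, 5.6),
    (16, 10.5), (32, 20.0), (64, 39.5), (128, 80.6),
]

def fit_line(points):
    """Ordinary least-squares fit of t = intercept + slope * x."""
    n = len(points)
    sx = sum(x for x, _ in points)
    st = sum(t for _, t in points)
    sxx = sum(x * x for x, _ in points)
    sxt = sum(x * t for x, t in points)
    slope = (n * sxt - sx * st) / (n * sxx - sx * sx)
    intercept = (st - slope * sx) / n
    return slope, intercept

# slope: extra time per additional program (the jit launch overhead)
# intercept: remaining cost that does not depend on the number of programs
slope, intercept = fit_line(timings)
print(f"overhead per program: {slope:.2f}")
print(f"program-independent cost for all 128 kernels: {intercept:.2f}")
```

Under that assumption the slope comes out at roughly 0.6 per program, so the total time is dominated by the number of programs rather than by the number of kernels.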
Also, why is only NVIDIA able to provide such a cache, even though they
don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go
without a cache for an extended amount of time.

Best regards,
Karli

_______________________________________________
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel