Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, I think we agree on everything now! Okay, I will generate all the kernels, this will lead actually to 16 kernels for each cpu-gpu scalar combination, so 64 small kernels in total. This took time but it was a fruitful discussion :) Anyways, my ideas are much clearer now, thanks! Best regards, Philippe 2014-01-26 Karl Rupp r...@iue.tuwien.ac.at Hey, (x programs/y kernels each) Execution time (1/128) 1.4 (2/64) 2.0 (4/32) 3.2 (8/16) 5.6 (16/8) 10.5 (32/4) 20.0 (64/2) 39.5 (128/1)80.6 Thus, jit launch overhead is in the order of a second! Okay, it seems like 1 program for all the kernels is the way to go. From your hard facts, though, it seems like generating 16 kernels inside the same program would have practically the same cost as generating only one, since the execution time is largely dominated by the kernel launch overhead. The jit launch overhead seems to be of roughly 80/128 = 0.8s, which leads to a kernel compilation time of roughly (1.4 - 0.8)/128 =~ 6ms. Considering that the flip_sign and reciprocal trick cannot be applied for unsigned integers, this is the way to go then. The increase in the number of kernels should be somewhat compensated by the fact that each of the kernels is shorter. All we need to do is to have a interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. I don't see any problem with extracting the source code from the generator in order to create this program (it is already done for GEMM), but the generator doesn't handle reciprocal and flip_sign. As I said earlier this feature is cool because it may prevent the transfer of several GPU-scalar in order to invert/reverse the value. On the other hand, though, it is incompatible with the clBlas interface and the kernel generator (both of which are fed with cl_float and cl_double) . Modifying the generator to handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very dangerous idea to me. It could have a lot of undesirable side effects if made general, and making an axpy-specific tree parsing would lead to a huge amount of code bloat. This is actually the reason why I am so reluctant to integrating reciprocal and flip_sign within the generator... Okay, let's not propagate reciprocal and flip_sign into the generator then. Also, feel free to eliminate the second reduction stage for scalars, which is encoded into the option value. It is currently unused and makes the generator integration harder than necessary. We can revisit that later if all other optimizations are exhausted ;-) if(size(x)1e5 stride==1 start==0){ //Vectors are padded, wouldn't it be confounding/unnecessary to check for the internal size to fit the width? //The following steps are costly for small vectors cl_typeNumericT cpu_alpha = alpha //copy back to host when the scalar is on global device memory) Never copy device scalars back unless requested by the user. They reads block the command queue, preventing overlaps of host and device computations. if(alpha_flip) cpu_alpha*=-1; if(reciprocal) cpu_alpha = 1/cpu_alpha; //... same for beta Let's just generate all the needed kernels and only dispatch into the correct kernel. //Optimized routines if(external_blas) call_axpy_twice(x,cpu_alpha,y,cpu_beta,z) else{ dynamically_generated_program::init(); ambm_kernel(x,cpu_alpha,y,cpu_beta,z) } else{ statically_generated_program::init(); ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha y, beta, reciprocal_beta, flip_beta, z) } What is the difference between dynamically_generated_program::init(); and statically_generated_program::init(); ? Why aren't they the same? Also, mind the coding style regarding the placement of curly braces and spaces ;-) Wouldn't this solve all of our issues? I (really) hope we're converging now! :) I think we can safely use dynamically_generated_program::init(); in both cases, which contains all the kernels which are currently in the statically generated program. I don't believe it is our task to implement such a cache. This is way too much a source of error and messing with the filesystem for ViennaCL which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Agreed. I was just suggesting this because PyOpenCL already provides this, but python comes with a set of dynamic libraries, so
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi Phil, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or two different program creates a big difference. Okay, time to feed you with some hard facts ;-) Scenario: compilation of 128 kernels. Configurations (x programs with y kernels each, x*y=128) Execution times: (x programs/y kernels each) Execution time (1/128) 1.4 (2/64) 2.0 (4/32) 3.2 (8/16) 5.6 (16/8) 10.5 (32/4) 20.0 (64/2) 39.5 (128/1)80.6 Thus, jit launch overhead is in the order of a second! Plus, ideally, in the case of linear solver, the generator could be used to generate fused kernels, provided that the scheduler is fully operationnal. Sure, kernel fusion is a bonus of the micro-scheduler, but we still need to have a fast default behavior for scenarios where the the kernel fusion is disabled. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit) All we need to do is to have a interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. This put aside, I'm not sure if we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember well, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with optionnal automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P) ? The drawback is that the filesystem library would have to be dynamically linked, though, but afterall OpenCL itself also has to be dynamically linked. I don't believe it is our task to implement such a cache. This is way too much a source of error and messing with the filesystem for ViennaCL which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey Karl, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi Phil, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or two different program creates a big difference. Okay, time to feed you with some hard facts ;-) Scenario: compilation of 128 kernels. Configurations (x programs with y kernels each, x*y=128) Execution times: (x programs/y kernels each) Execution time (1/128) 1.4 (2/64) 2.0 (4/32) 3.2 (8/16) 5.6 (16/8) 10.5 (32/4) 20.0 (64/2) 39.5 (128/1)80.6 Thus, jit launch overhead is in the order of a second! Okay, it seems like 1 program for all the kernels is the way to go. From your hard facts, though, it seems like generating 16 kernels inside the same program would have practically the same cost as generating only one, since the execution time is largely dominated by the kernel launch overhead. The jit launch overhead seems to be of roughly 80/128 = 0.8s, which leads to a kernel compilation time of roughly (1.4 - 0.8)/128 =~ 6ms. Plus, ideally, in the case of linear solver, the generator could be used to generate fused kernels, provided that the scheduler is fully operationnal. Sure, kernel fusion is a bonus of the micro-scheduler, but we still need to have a fast default behavior for scenarios where the the kernel fusion is disabled. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit) All we need to do is to have a interface to the generator where we can just extract the axpy-kernels. The generator should not do any OpenCL program and kernel management. I don't see any problem with extracting the source code from the generator in order to create this program (it is already done for GEMM), but the generator doesn't handle reciprocal and flip_sign. As I said earlier this feature is cool because it may prevent the transfer of several GPU-scalar in order to invert/reverse the value. On the other hand, though, it is incompatible with the clBlas interface and the kernel generator (both of which are fed with cl_float and cl_double) . Modifying the generator to handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very dangerous idea to me. It could have a lot of undesirable side effects if made general, and making an axpy-specific tree parsing would lead to a huge amount of code bloat. This is actually the reason why I am so reluctant to integrating reciprocal and flip_sign within the generator... if(size(x)1e5 stride==1 start==0){ //Vectors are padded, wouldn't it be confounding/unnecessary to check for the internal size to fit the width? //The following steps are costly for small vectors cl_typeNumericT cpu_alpha = alpha //copy back to host when the scalar is on global device memory) if(alpha_flip) cpu_alpha*=-1; if(reciprocal) cpu_alpha = 1/cpu_alpha; //... same for beta //Optimized routines if(external_blas) call_axpy_twice(x,cpu_alpha,y,cpu_beta,z) else{ dynamically_generated_program::init(); ambm_kernel(x,cpu_alpha,y,cpu_beta,z) } else{ statically_generated_program::init(); ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha y, beta, reciprocal_beta, flip_beta, z) } Wouldn't this solve all of our issues? I (really) hope we're converging now! :) This put aside, I'm not sure if we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember well, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with optionnal automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P) ? The drawback is that the filesystem library would have to be dynamically linked, though, but afterall OpenCL itself also has to be dynamically linked. I don't believe it is our task to implement such a cache. This is way too much a source of error and messing with the filesystem for ViennaCL which is supposed to run with user permissions. An OpenCL SDK is installed into the system and thus has much better options to deal with the location of cache, etc. Also, why is only NVIDIA able to provide such a cache, even though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for an extended amount of time. Agreed. I was just suggesting this because PyOpenCL already provides this, but python comes with a set of dynamic libraries, so this is probably not the same context. Best regards, Philippe Best regards, Karli
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, I am a bit confused, is there any reason for using reciprocal and flip_sign, instead of just changing the scalar accordingly? yes (with a drawback I'll discuss at the end): Consider the family of operations x = +- y OP1 a +- z OP2 b where x, y, and z are vectors, OP1 and OP2 are either multiplication or division, and a,b are host scalars. If I did the math correctly, these are 16 different kernels when coded explicitly. Hence, if you put all these into separate OpenCL kernels, you'll get fairly long compilation times. However, not that you cannot do this if a and b stem from device scalars, because then the manipulation of a and b would result in additional buffer allocations and kernel launches - way too slow. For floating point operations, one can reduce the number of operations a lot when (+- OP1 a) and (+- OP2 b) are computed once in a preprocessing step. Then, only the kernel x = y * a' + z * b' is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a) and (1/a) cannot be computed outside the kernel if a is a GPU scalar, this is always computed in a preprocessing step inside the OpenCL kernel for unification purposes. I think we can even apply some more cleverness here if we delegate all the work to a suitable implementation function. And now for the drawback: When using integers, the operation n/m is no longer the same as n * (1/m). Even worse, for unsigned integers it is also no longer possible to replace n - m by n + (-m). Thus, we certainly have to bite the bullet and generate kernels for all 16 combinations when using unsigned integers. However, I'm reluctant to generate all 16 combinations for floating point arguments if this is not needed... Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi Karl, 2014/1/24 Karl Rupp r...@iue.tuwien.ac.at Hey, I am a bit confused, is there any reason for using reciprocal and flip_sign, instead of just changing the scalar accordingly? yes (with a drawback I'll discuss at the end): Consider the family of operations x = +- y OP1 a +- z OP2 b where x, y, and z are vectors, OP1 and OP2 are either multiplication or division, and a,b are host scalars. If I did the math correctly, these are 16 different kernels when coded explicitly. Hence, if you put all these into separate OpenCL kernels, you'll get fairly long compilation times. However, not that you cannot do this if a and b stem from device scalars, because then the manipulation of a and b would result in additional buffer allocations and kernel launches - way too slow. For floating point operations, one can reduce the number of operations a lot when (+- OP1 a) and (+- OP2 b) are computed once in a preprocessing step. Then, only the kernel x = y * a' + z * b' is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a) and (1/a) cannot be computed outside the kernel if a is a GPU scalar, this is always computed in a preprocessing step inside the OpenCL kernel for unification purposes. I think we can even apply some more cleverness here if we delegate all the work to a suitable implementation function. And now for the drawback: When using integers, the operation n/m is no longer the same as n * (1/m). Even worse, for unsigned integers it is also no longer possible to replace n - m by n + (-m). Thus, we certainly have to bite the bullet and generate kernels for all 16 combinations when using unsigned integers. However, I'm reluctant to generate all 16 combinations for floating point arguments if this is not needed... Thanks for the clarification. I also absolutely don't want to generate the 16 kernels either! I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/multiplication within the kernel, and therefore avoid one some latency penalty / kernel launch overhead when the scalar is pointed out, that's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then. How could ambm beneficiate from the auto-tuning environment? I propose the following solution: check the size of the matrices/vector If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = +- OP a, b' = +- OP b, and either generate the kernel or use a BLAS library. This way, we beneficiate from kernel launch time optimization for small data, and high-bandwidth for large data. Does this sounds good? Best regards, Philippe Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi, I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/multiplication within the kernel, and therefore avoid one some latency penalty / kernel launch overhead when the scalar is pointed out, that's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then. How could ambm beneficiate from the auto-tuning environment? I propose the following solution: check the size of the matrices/vector If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = +- OP a, b' = +- OP b, and either generate the kernel or use a BLAS library. This way, we beneficiate from kernel launch time optimization for small data, and high-bandwidth for large data. Does this sounds good? In terms of execution time, this is probably the best solution. On the other hand, it does not solve the problem of compilation overhead: If we only dispatch into the generator for large data, we still have to generate the respective kernels and go through the OpenCL jit-compiler each time. The compilation overhead of this is even likely to dominate any gains we get from a faster execution. Instead, what about opening up the generator a bit? It is enough if we have some mechanism to access a batch-generation of axpy-like operations, for all other operations the generator can remain as-is. Another option is to move only the axpy-template from the generator over to linalg/opencl/kernels/*, because the generation of these kernels is fairly light-weight. Sure, it is a little bit of code-duplication, but it will keep the generator clean. Another possible improvement is to separate operations on full vectors from operations on ranges and slices. For full vectors we can use the built-in vector-types in OpenCL, which allows further optimizations not possible with ranges and strides, where we cannot use vector types in general. What do you think? Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, 2014/1/24 Karl Rupp r...@iue.tuwien.ac.at Hi, I was in fact wondering why one passed reciprocal_alpha and flip_sign into the kernel. After thinking more about it, I have noticed that this permits us to do the corresponding inversion/multiplication within the kernel, and therefore avoid one some latency penalty / kernel launch overhead when the scalar is pointed out, that's smart! On the other hand, modifying the generator to not actually generate a specific kernel would be absurd imho. This brings another question, then. How could ambm beneficiate from the auto-tuning environment? I propose the following solution: check the size of the matrices/vector If the computation is dominated by the kernel launch time (say, less than 100,000 elements), then we use the current ambm kernel. Otherwise, we transfer the scalars to the CPU, perform the corresponding a' = +- OP a, b' = +- OP b, and either generate the kernel or use a BLAS library. This way, we beneficiate from kernel launch time optimization for small data, and high-bandwidth for large data. Does this sounds good? In terms of execution time, this is probably the best solution. On the other hand, it does not solve the problem of compilation overhead: If we only dispatch into the generator for large data, we still have to generate the respective kernels and go through the OpenCL jit-compiler each time. The compilation overhead of this is even likely to dominate any gains we get from a faster execution. Instead, what about opening up the generator a bit? It is enough if we have some mechanism to access a batch-generation of axpy-like operations, for all other operations the generator can remain as-is. Another option is to move only the axpy-template from the generator over to linalg/opencl/kernels/*, because the generation of these kernels is fairly light-weight. Sure, it is a little bit of code-duplication, but it will keep the generator clean. Another possible improvement is to separate operations on full vectors from operations on ranges and slices. For full vectors we can use the built-in vector-types in OpenCL, which allows further optimizations not possible with ranges and strides, where we cannot use vector types in general. What do you think? I prefer option 3. This would allow for something like : if(size(x)1e5 stride==1 start==0){ //The following steps are costly for small vectors NumericT cpu_alpha = alpha //copy back to host when the scalar is on global device memory) if(alpha_flip) cpu_alpha*=-1; if(reciprocal) cpu_alpha = 1/cpu_alpha; //... same for beta //Optimized routines if(external_blas) call_axpy_twice(x,cpu_alpha,y,cpu_beta,z) else{ generate_execute(x = cpu_alpha*y + cpu_beta*z); } else{ //fallback } This way, we at most generate two kernels, one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Best regards, Philippe Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi, I prefer option 3. This would allow for something like : if(size(x)1e5 stride==1 start==0){ Here we also need to check the internal_size to fit the vector width //The following steps are costly for small vectors NumericT cpu_alpha = alpha //copy back to host when the scalar is on global device memory) if(alpha_flip) cpu_alpha*=-1; if(reciprocal) cpu_alpha = 1/cpu_alpha; //... same for beta //Optimized routines if(external_blas) call_axpy_twice(x,cpu_alpha,y,cpu_beta,z) else{ generate_execute(x = cpu_alpha*y + cpu_beta*z); } else{ //fallback } This way, we at most generate two kernels, one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch-compilation look like? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. Note to self: Collect some numbers on the costs of jit-compilation for different OpenCL SDKs. Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey hey, Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch-compilation look like? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) in this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z; __kernel void kernel(unsigned int N, float4* x, float4* y, float4* z, float alpha, float beta) { for(i = get_global_id(0) ; i N ; i+=get_global_size(0)) x[i] = alpha*y[i] + beta*z[i]; } I'm afraid this is not suitable then. A simple conjugate gradient solver would then go through ~10 OpenCL compilations, making it awfully slow at the first run... With AMD and Intel SDKs, which to my knowledge still do not buffer kernels, this would mean that each time a process is started, this large overhead will be visible. Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey hey, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi, I prefer option 3. This would allow for something like : if(size(x)1e5 stride==1 start==0){ Here we also need to check the internal_size to fit the vector width //The following steps are costly for small vectors NumericT cpu_alpha = alpha //copy back to host when the scalar is on global device memory) if(alpha_flip) cpu_alpha*=-1; if(reciprocal) cpu_alpha = 1/cpu_alpha; //... same for beta //Optimized routines if(external_blas) call_axpy_twice(x,cpu_alpha,y,cpu_beta,z) else{ generate_execute(x = cpu_alpha*y + cpu_beta*z); } else{ //fallback } This way, we at most generate two kernels, one for small vectors, designed to optimize latency, and one for big vectors, designed to optimize bandwidth. Are we converging? :) Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch-compilation look like? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) in this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z; __kernel void kernel(unsigned int N, float4* x, float4* y, float4* z, float alpha, float beta) { for(i = get_global_id(0) ; i N ; i+=get_global_size(0)) x[i] = alpha*y[i] + beta*z[i]; } with of course an appropriate compute profile Note to self: Collect some numbers on the costs of jit-compilation for different OpenCL SDKs. Best regards, Karli Best regards, Philippe -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hey, 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hey hey hey, Convergence depends on what is inside generate_execute() ;-) How is the problem with alpha and beta residing on the GPU addressed? How will the batch-compilation look like? The important point is that for the default axpy kernels we really don't want to go through the jit-compiler for each of them individually. ;) in this case, generate_execute() will just trigger the compilation - on the first call only - of the kernel x = cpu_alpha*y + cpu_beta*z; __kernel void kernel(unsigned int N, float4* x, float4* y, float4* z, float alpha, float beta) { for(i = get_global_id(0) ; i N ; i+=get_global_size(0)) x[i] = alpha*y[i] + beta*z[i]; } I'm afraid this is not suitable then. A simple conjugate gradient solver would then go through ~10 OpenCL compilations, making it awfully slow at the first run... With AMD and Intel SDKs, which to my knowledge still do not buffer kernels, this would mean that each time a process is started, this large overhead will be visible. I don't understand why this would go through more than one compilation... This kernel is compiled only once, the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in: if(alpha_reciprocal) kernel(N,x,y,z,1/alpha,beta) Am I missing something? Best regards, Philippe Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi Philippe, I don't understand why this would go through more than one compilation... This kernel is compiled only once, the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in: if(alpha_reciprocal) kernel(N,x,y,z,1/alpha,beta) Am I missing something? I think so ;-) It's not about a single kernel, it's about the compilation unit (i.e. OpenCL program). For conjugate gradients we roughly have the following vector operations (random variable names) x = y; x += alpha y; x = z + alpha z; x = y - alpha z; x = inner_prod(y,z); BiCGStab and GMRES add a few more of them. If we use the generator as-is now, then each of the operations creates a separate OpenCL program the first time it is encountered and we pay the jit-compiler launch overhead multiple times. With the current non-generator model, all vector kernels are in the same OpenCL program and we pay the jit-overhead only once. I'd like to stick with the current model of having just one OpenCL program for all the basic kernels, but get the target-optimized sources from the generator. Sorry if I wasn't clear enough in my earlier mails. Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk ___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel
Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters
Hi, Oh, I get it better now. I am not entirely convinced, though ;) From my experience, the overhead of the jit launch is negligible compared to the compilation of one kernel. I'm not sure whether compiling two kernels in the same program or two different program creates a big difference. Plus, ideally, in the case of linear solver, the generator could be used to generate fused kernels, provided that the scheduler is fully operationnal. I fear that any solution to the aforementioned problem would destroy this precious ability... Ideally, once we enable it, the generate_execute() mentioned above would just be replaced by generate() (or enqueue_for_generation, which is more explicit) This put aside, I'm not sure if we should give that much importance to jit-compilation overhead, since the binaries can be cached. If I remember well, Denis Demidov implemented such a caching mechanism for VexCL. What if we replace distributed vector/matrix with optionnal automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of time :P) ? The drawback is that the filesystem library would have to be dynamically linked, though, but afterall OpenCL itself also has to be dynamically linked. Best regards, Philippe 2014/1/25 Karl Rupp r...@iue.tuwien.ac.at Hi Philippe, I don't understand why this would go through more than one compilation... This kernel is compiled only once, the value of flip_sign and reciprocal only changes the dynamic value of the argument, not the source code. This would eventually result in: if(alpha_reciprocal) kernel(N,x,y,z,1/alpha,beta) Am I missing something? I think so ;-) It's not about a single kernel, it's about the compilation unit (i.e. OpenCL program). For conjugate gradients we roughly have the following vector operations (random variable names) x = y; x += alpha y; x = z + alpha z; x = y - alpha z; x = inner_prod(y,z); BiCGStab and GMRES add a few more of them. If we use the generator as-is now, then each of the operations creates a separate OpenCL program the first time it is encountered and we pay the jit-compiler launch overhead multiple times. With the current non-generator model, all vector kernels are in the same OpenCL program and we pay the jit-overhead only once. I'd like to stick with the current model of having just one OpenCL program for all the basic kernels, but get the target-optimized sources from the generator. Sorry if I wasn't clear enough in my earlier mails. Best regards, Karli -- CenturyLink Cloud: The Leader in Enterprise Cloud Services. Learn Why More Businesses Are Choosing CenturyLink Cloud For Critical Workloads, Development Environments Everything In Between. Get a Quote or Start a Free Trial Today. http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk___ ViennaCL-devel mailing list ViennaCL-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/viennacl-devel