Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-26 Thread Philippe Tillet
Hey,

I think we agree on everything now! Okay, I will generate all the kernels;
this will actually lead to 16 kernels for each cpu-gpu scalar combination,
so 64 small kernels in total. This took time, but it was a fruitful
discussion :)

Anyways, my ideas are much clearer now, thanks!

Best regards,
Philippe


2014-01-26 Karl Rupp r...@iue.tuwien.ac.at

 Hey,


  (x programs/y kernels each)  Execution time [s]

 (1/128)  1.4
 (2/64)   2.0
 (4/32)   3.2
 (8/16)   5.6
 (16/8)  10.5
 (32/4)  20.0
 (64/2)  39.5
 (128/1) 80.6

 Thus, jit launch overhead is in the order of a second!


 Okay, it seems like 1 program for all the kernels is the way to go. From
 your hard facts, though, it seems like generating 16 kernels inside the
 same program would have practically the same cost as generating only
 one, since the execution time is largely dominated by the kernel launch
 overhead. The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s
 per program, which leads to a kernel compilation time of roughly
 (1.4 - 0.63)/128 ≈ 6 ms.


 Considering that the flip_sign and reciprocal trick cannot be applied for
 unsigned integers, this is the way to go then. The increase in the number
 of kernels should be somewhat compensated by the fact that each of the
 kernels is shorter.



  All we need to do is to have an interface to the generator where we
 can just extract the axpy-kernels. The generator should not do any
 OpenCL program and kernel management.


 I don't see any problem with extracting the source code from the
 generator in order to create this program (it is already done for GEMM),
 but the generator doesn't handle reciprocal and flip_sign. As I said
 earlier, this feature is nice because it avoids transferring several
 GPU scalars just to invert/negate their values. On the other hand,
 though, it is incompatible with the clBlas interface and the
 kernel generator (both of which are fed with cl_float and cl_double).
 Modifying the generator to handle x = y/a - w/b - z*c internally as x
 = y*a + w*b + z*c + option_a + option_b + option_c sounds like a very
 dangerous idea to me. It could have a lot of undesirable side effects if
 made general, and making an axpy-specific tree parsing would lead to a
 huge amount of code bloat. This is actually the reason why I am so
 reluctant to integrate reciprocal and flip_sign into the generator...


 Okay, let's not propagate reciprocal and flip_sign into the generator
 then. Also, feel free to eliminate the second reduction stage for scalars,
 which is encoded into the option value. It is currently unused and makes
 the generator integration harder than necessary. We can revisit that later
 if all other optimizations are exhausted ;-)



  if(size(x) > 1e5 && stride==1 && start==0){ // Vectors are padded; wouldn't
 it be confusing/unnecessary to check whether the internal size fits the
 width?

 //The following steps are costly for small vectors
   cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the
 scalar is in global device memory


 Never copy device scalars back unless requested by the user. These reads
 block the command queue, preventing overlap of host and device
 computations.


if(alpha_flip) cpu_alpha*=-1;
   if(reciprocal) cpu_alpha = 1/cpu_alpha;
   //... same for beta


 Let's just generate all the needed kernels and only dispatch into the
 correct kernel.



  //Optimized routines
   if(external_blas)
 call_axpy_twice(x,cpu_alpha,y,cpu_beta,z)
   else{
 dynamically_generated_program::init();
 ambm_kernel(x,cpu_alpha,y,cpu_beta,z)
   }
 }
 else{
statically_generated_program::init();
ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha, y, beta,
 reciprocal_beta, flip_beta, z)
 }


 What is the difference between
   dynamically_generated_program::init();
 and
   statically_generated_program::init();
 ? Why aren't they the same?

 Also, mind the coding style regarding the placement of curly braces and
 spaces ;-)



  Wouldn't this solve all of our issues?

 I (really) hope we're converging now! :)


 I think we can safely use
   dynamically_generated_program::init();
 in both cases, which contains all the kernels which are currently in the
 statically generated program.



  I don't believe it is our task to implement such a cache. This is
 far too error-prone and involves too much filesystem meddling for
 ViennaCL, which is supposed to run with user permissions. An OpenCL
 SDK is installed into the system and thus has much better options to
 deal with the location of the cache, etc. Also, why is only NVIDIA able
 to provide such a cache, even though they don't even seem to care
 about OpenCL 1.2? I doubt that e.g. AMD will go without a cache for
 an extended amount of time.


 Agreed. I was just suggesting this because PyOpenCL already provides
 this, but Python comes with a set of dynamic libraries, so this is
 probably not the same context.

Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-25 Thread Karl Rupp
Hi Phil,

  Oh, I get it better now. I am not entirely convinced, though ;)
  From my experience, the overhead of the jit launch is negligible
   compared to the compilation of one kernel. I'm not sure whether
 compiling two kernels in the same program or in two different programs
 makes a big difference.

Okay, time to feed you with some hard facts ;-) Scenario: compilation of 
128 kernels. Configurations (x programs with y kernels each, x*y=128)
Execution times:

(x programs/y kernels each)  Execution time [s]
(1/128)  1.4
(2/64)   2.0
(4/32)   3.2
(8/16)   5.6
(16/8)  10.5
(32/4)  20.0
(64/2)  39.5
(128/1) 80.6

Thus, jit launch overhead is in the order of a second!

 Plus, ideally, in the case of a linear solver,
 the generator could be used to generate fused kernels, provided that the
 scheduler is fully operational.

Sure, kernel fusion is a bonus of the micro-scheduler, but we still need 
to have a fast default behavior for scenarios where the kernel 
fusion is disabled.


 I fear that any solution to the
 aforementioned problem would destroy this precious ability... Ideally,
 once we enable it, the generate_execute() mentioned above would just be
 replaced by generate() (or enqueue_for_generation, which is more explicit)

All we need to do is to have an interface to the generator where we can 
just extract the axpy-kernels. The generator should not do any OpenCL 
program and kernel management.



 This aside, I'm not sure if we should give that much importance to
 jit-compilation overhead, since the binaries can be cached. If I
 remember well, Denis Demidov implemented such a caching mechanism for
 VexCL. What if we replace "distributed vector/matrix" with an optional
 automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a
 limited amount of time :P)? The drawback is that the filesystem library
 would have to be dynamically linked, though, but after all OpenCL itself
 also has to be dynamically linked.

I don't believe it is our task to implement such a cache. This is far 
too error-prone and involves too much filesystem meddling for ViennaCL, 
which is supposed to run with user permissions. An OpenCL SDK is 
installed into the system and thus has much better options to deal with 
the location of the cache, etc. Also, why is only NVIDIA able to provide 
such a cache, even though they don't even seem to care about OpenCL 1.2? 
I doubt that e.g. AMD will go without a cache for an extended amount of 
time.

Best regards,
Karli


--
CenturyLink Cloud: The Leader in Enterprise Cloud Services.
Learn Why More Businesses Are Choosing CenturyLink Cloud For
Critical Workloads, Development Environments  Everything In Between.
Get a Quote or Start a Free Trial Today. 
http://pubads.g.doubleclick.net/gampad/clk?id=119420431iu=/4140/ostg.clktrk
___
ViennaCL-devel mailing list
ViennaCL-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/viennacl-devel


Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-25 Thread Philippe Tillet
Hey hey Karl,





2014/1/25 Karl Rupp r...@iue.tuwien.ac.at

 Hi Phil,


  Oh, I get it better now. I am not entirely convinced, though ;)

  From my experience, the overhead of the jit launch is negligible
   compared to the compilation of one kernel. I'm not sure whether
 compiling two kernels in the same program or in two different programs
 makes a big difference.


 Okay, time to feed you with some hard facts ;-) Scenario: compilation of
 128 kernels. Configurations (x programs with y kernels each, x*y=128)
 Execution times:

 (x programs/y kernels each)  Execution time [s]
 (1/128)  1.4
 (2/64)   2.0
 (4/32)   3.2
 (8/16)   5.6
 (16/8)  10.5
 (32/4)  20.0
 (64/2)  39.5
 (128/1) 80.6

 Thus, jit launch overhead is in the order of a second!


Okay, it seems like 1 program for all the kernels is the way to go. From
your hard facts, though, it seems like generating 16 kernels inside the
same program would have practically the same cost as generating only one,
since the execution time is largely dominated by the kernel launch
overhead. The jit launch overhead seems to be roughly 80.6/128 ≈ 0.63 s per
program, which leads to a kernel compilation time of roughly
(1.4 - 0.63)/128 ≈ 6 ms.



  Plus, ideally, in the case of a linear solver,
 the generator could be used to generate fused kernels, provided that the
 scheduler is fully operational.


 Sure, kernel fusion is a bonus of the micro-scheduler, but we still need
 to have a fast default behavior for scenarios where the kernel fusion
 is disabled.



  I fear that any solution to the
 aforementioned problem would destroy this precious ability... Ideally,
 once we enable it, the generate_execute() mentioned above would just be
 replaced by generate() (or enqueue_for_generation, which is more explicit)


 All we need to do is to have an interface to the generator where we can
 just extract the axpy-kernels. The generator should not do any OpenCL
 program and kernel management.


I don't see any problem with extracting the source code from the generator
in order to create this program (it is already done for GEMM), but the
generator doesn't handle reciprocal and flip_sign. As I said earlier, this
feature is nice because it avoids transferring several GPU scalars just to
invert/negate their values. On the other hand, though, it is incompatible
with the clBlas interface and the kernel generator (both of which are fed
with cl_float and cl_double). Modifying the generator to
handle x = y/a - w/b - z*c internally as x = y*a + w*b + z*c + option_a
+ option_b + option_c sounds like a very dangerous idea to me. It could
have a lot of undesirable side effects if made general, and making an
axpy-specific tree parsing would lead to a huge amount of code bloat. This
is actually the reason why I am so reluctant to integrate reciprocal and
flip_sign into the generator...

if(size(x) > 1e5 && stride==1 && start==0){ // Vectors are padded; wouldn't it
be confusing/unnecessary to check whether the internal size fits the width?

 //The following steps are costly for small vectors
 cl_type<NumericT> cpu_alpha = alpha; // copy back to host when the scalar is
in global device memory
 if(alpha_flip) cpu_alpha*=-1;
 if(reciprocal) cpu_alpha = 1/cpu_alpha;
 //... same for beta

 //Optimized routines
 if(external_blas)
   call_axpy_twice(x,cpu_alpha,y,cpu_beta,z)
 else{
   dynamically_generated_program::init();
   ambm_kernel(x,cpu_alpha,y,cpu_beta,z)
 }
}
else{
  statically_generated_program::init();
  ambm_kernel(x, alpha, reciprocal_alpha, flip_alpha, y, beta,
reciprocal_beta, flip_beta, z)
 }

Wouldn't this solve all of our issues?

I (really) hope we're converging now! :)






  This aside, I'm not sure if we should give that much importance to
 jit-compilation overhead, since the binaries can be cached. If I
 remember well, Denis Demidov implemented such a caching mechanism for
 VexCL. What if we replace "distributed vector/matrix" with an optional
 automatic kernel caching mechanism for ViennaCL 1.6.0 (we just have a
 limited amount of time :P)? The drawback is that the filesystem library
 would have to be dynamically linked, though, but after all OpenCL itself
 also has to be dynamically linked.


 I don't believe it is our task to implement such a cache. This is far too
 error-prone and involves too much filesystem meddling for ViennaCL, which
 is supposed to run with user permissions. An OpenCL SDK is installed into
 the system and thus has much better options to deal with the location of
 the cache, etc. Also, why is only NVIDIA able to provide such a cache, even
 though they don't even seem to care about OpenCL 1.2? I doubt that e.g. AMD
 will go without a cache for an extended amount of time.


Agreed. I was just suggesting this because PyOpenCL already provides this,
but python comes with a set of dynamic libraries, so this is probably not
the same context.

Best regards,
Philippe


 Best regards,
 Karli



Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Karl Rupp
Hey,

  I am a bit confused, is there any reason for using reciprocal and
 flip_sign, instead of just changing the scalar accordingly?

yes (with a drawback I'll discuss at the end): Consider the family of 
operations

  x = +- y OP1 a +- z OP2 b

where x, y, and z are vectors, OP1 and OP2 are either multiplication or 
division, and a,b are host scalars. If I did the math correctly, these 
are 16 different kernels when coded explicitly. Hence, if you put all 
these into separate OpenCL kernels, you'll get fairly long compilation 
times. However, note that you cannot do this if a and b stem from device 
scalars, because then the manipulation of a and b would result in 
additional buffer allocations and kernel launches - way too slow.

For floating point operations, one can reduce the number of kernels a 
lot when (+- OP1 a) and (+- OP2 b) are computed once in a preprocessing 
step. Then, only the kernel

  x = y * a' + z * b'

is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a) 
and (1/a) cannot be computed outside the kernel if a is a GPU scalar, 
this is always computed in a preprocessing step inside the OpenCL kernel 
for unification purposes. I think we can even apply some more cleverness 
here if we delegate all the work to a suitable implementation function.

And now for the drawback: When using integers, the operation n/m is no 
longer the same as n * (1/m). Even worse, for unsigned integers it is 
also no longer possible to replace n - m by n + (-m). Thus, we certainly 
have to bite the bullet and generate kernels for all 16 combinations 
when using unsigned integers. However, I'm reluctant to generate all 16 
combinations for floating point arguments if this is not needed...

Best regards,
Karli




Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Philippe Tillet
Hi Karl,



2014/1/24 Karl Rupp r...@iue.tuwien.ac.at

 Hey,

   I am a bit confused, is there any reason for using reciprocal and
  flip_sign, instead of just changing the scalar accordingly?

 yes (with a drawback I'll discuss at the end): Consider the family of
 operations

   x = +- y OP1 a +- z OP2 b

 where x, y, and z are vectors, OP1 and OP2 are either multiplication or
 division, and a,b are host scalars. If I did the math correctly, these
 are 16 different kernels when coded explicitly. Hence, if you put all
 these into separate OpenCL kernels, you'll get fairly long compilation
 times. However, note that you cannot do this if a and b stem from device
 scalars, because then the manipulation of a and b would result in
 additional buffer allocations and kernel launches - way too slow.

 For floating point operations, one can reduce the number of kernels a
 lot when (+- OP1 a) and (+- OP2 b) are computed once in a preprocessing
 step. Then, only the kernel

   x = y * a' + z * b'

 is needed, cutting the number of OpenCL kernels from 16 to 1. Since (-a)
 and (1/a) cannot be computed outside the kernel if a is a GPU scalar,
 this is always computed in a preprocessing step inside the OpenCL kernel
 for unification purposes. I think we can even apply some more cleverness
 here if we delegate all the work to a suitable implementation function.

 And now for the drawback: When using integers, the operation n/m is no
 longer the same as n * (1/m). Even worse, for unsigned integers it is
 also no longer possible to replace n - m by n + (-m). Thus, we certainly
 have to bite the bullet and generate kernels for all 16 combinations
 when using unsigned integers. However, I'm reluctant to generate all 16
 combinations for floating point arguments if this is not needed...


Thanks for the clarification. I absolutely don't want to generate the
16 kernels either!

I was in fact wondering why one passed reciprocal_alpha and flip_sign into
the kernel. After thinking more about it, I have noticed that this permits
us to do the corresponding inversion/multiplication within the kernel, and
therefore avoid some latency penalty / kernel launch overhead when the
scalar resides on the device. That's smart!
On the other hand, modifying the generator to not actually generate a
specific kernel would be absurd imho. This brings another question, then.
How could ambm benefit from the auto-tuning environment?
I propose the following solution:

check the size of the matrices/vector

If the computation is dominated by the kernel launch time (say, less than
100,000 elements), then we use the current ambm kernel. Otherwise, we
transfer the scalars to the CPU, perform the corresponding a' = +- OP a, b'
= +- OP b, and either generate the kernel or use a BLAS library. This way,
we benefit from kernel-launch-time optimization for small data and
high bandwidth for large data. Does this sound good?

Best regards,
Philippe


Best regards,
 Karli






Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Karl Rupp
Hi,

  I was in fact wondering why one passed reciprocal_alpha and flip_sign
 into the kernel. After thinking more about it, I have noticed that this
 permits us to do the corresponding inversion/multiplication within the
 kernel, and therefore avoid some latency penalty / kernel launch
 overhead when the scalar resides on the device. That's smart!
 On the other hand, modifying the generator to not actually generate a
 specific kernel would be absurd imho. This brings another question,
 then. How could ambm benefit from the auto-tuning environment?
 I propose the following solution:

 check the size of the matrices/vector

 If the computation is dominated by the kernel launch time (say, less
 than 100,000 elements), then we use the current ambm kernel. Otherwise,
 we transfer the scalars to the CPU, perform the corresponding a' = +- OP
 a, b' = +- OP b, and either generate the kernel or use a BLAS library.
 This way, we benefit from kernel-launch-time optimization for small
 data and high bandwidth for large data. Does this sound good?

In terms of execution time, this is probably the best solution. On the 
other hand, it does not solve the problem of compilation overhead: If we 
only dispatch into the generator for large data, we still have to 
generate the respective kernels and go through the OpenCL jit-compiler 
each time. The compilation overhead of this is even likely to dominate 
any gains we get from a faster execution.

Instead, what about opening up the generator a bit? It is enough if we 
have some mechanism to access a batch-generation of axpy-like 
operations; for all other operations, the generator can remain as-is.

Another option is to move only the axpy-template from the generator over 
to linalg/opencl/kernels/*, because the generation of these kernels is 
fairly light-weight. Sure, it is a little bit of code-duplication, but 
it will keep the generator clean.

Another possible improvement is to separate operations on full vectors 
from operations on ranges and slices. For full vectors we can use the 
built-in vector-types in OpenCL, which allows further optimizations not 
possible with ranges and strides, where we cannot use vector types in 
general.

What do you think?

Best regards,
Karli




Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Philippe Tillet
Hey,


2014/1/24 Karl Rupp r...@iue.tuwien.ac.at

 Hi,


  I was in fact wondering why one passed reciprocal_alpha and flip_sign

 into the kernel. After thinking more about it, I have noticed that this
 permits us to do the corresponding inversion/multiplication within the
 kernel, and therefore avoid one some latency penalty / kernel launch
 overhead when the scalar is pointed out, that's smart!
 On the other hand, modifying the generator to not actually generate a
 specific kernel would be absurd imho. This brings another question,
 then. How could ambm benefit from the auto-tuning environment?
 I propose the following solution:

 check the size of the matrices/vector

 If the computation is dominated by the kernel launch time (say, less
 than 100,000 elements), then we use the current ambm kernel. Otherwise,
 we transfer the scalars to the CPU, perform the corresponding a' = +- OP
 a, b' = +- OP b, and either generate the kernel or use a BLAS library.
 This way, we benefit from kernel-launch-time optimization for small
 data and high bandwidth for large data. Does this sound good?


 In terms of execution time, this is probably the best solution. On the
 other hand, it does not solve the problem of compilation overhead: If we
 only dispatch into the generator for large data, we still have to generate
 the respective kernels and go through the OpenCL jit-compiler each time.
 The compilation overhead of this is even likely to dominate any gains we
 get from a faster execution.

Instead, what about opening up the generator a bit? It is enough if we have
 some mechanism to access a batch-generation of axpy-like operations; for
 all other operations, the generator can remain as-is.

 Another option is to move only the axpy-template from the generator over
 to linalg/opencl/kernels/*, because the generation of these kernels is
 fairly light-weight. Sure, it is a little bit of code-duplication, but it
 will keep the generator clean.

 Another possible improvement is to separate operations on full vectors
 from operations on ranges and slices. For full vectors we can use the
 built-in vector-types in OpenCL, which allows further optimizations not
 possible with ranges and strides, where we cannot use vector types in
 general.


 What do you think?


I prefer option 3. This would allow for something like:

if(size(x) > 1e5 && stride==1 && start==0){

 //The following steps are costly for small vectors
 NumericT cpu_alpha = alpha; // copy back to host when the scalar is in
global device memory
 if(alpha_flip) cpu_alpha*=-1;
 if(reciprocal) cpu_alpha = 1/cpu_alpha;
 //... same for beta

//Optimized routines
 if(external_blas)
   call_axpy_twice(x,cpu_alpha,y,cpu_beta,z)
 else{
   generate_execute(x = cpu_alpha*y + cpu_beta*z);
  }
}
else{
  //fallback
}

This way, we generate at most two kernels: one for small vectors, designed
to optimize latency, and one for big vectors, designed to optimize
bandwidth. Are we converging? :)


Best regards,
Philippe


Best regards,
 Karli




Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Karl Rupp
Hi,

 I prefer option 3. This would allow for something like:

 if(size(x) > 1e5 && stride==1 && start==0){

Here we also need to check that the internal_size fits the vector width.


   //The following steps are costly for small vectors
   NumericT cpu_alpha = alpha; // copy back to host when the scalar is in
 global device memory
   if(alpha_flip) cpu_alpha*=-1;
   if(reciprocal) cpu_alpha = 1/cpu_alpha;
   //... same for beta

 //Optimized routines
   if(external_blas)
 call_axpy_twice(x,cpu_alpha,y,cpu_beta,z)
   else{
 generate_execute(x = cpu_alpha*y + cpu_beta*z);
 }
 }
 else{
//fallback
 }

 This way, we generate at most two kernels: one for small vectors,
 designed to optimize latency, and one for big vectors, designed to
 optimize bandwidth. Are we converging? :)

Convergence depends on what is inside generate_execute() ;-) How is the 
problem with alpha and beta residing on the GPU addressed? How will the 
batch-compilation look like? The important point is that for the default 
axpy kernels we really don't want to go through the jit-compiler for 
each of them individually.

Note to self: Collect some numbers on the costs of jit-compilation for 
different OpenCL SDKs.

Best regards,
Karli





Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Karl Rupp
Hey hey hey,

  Convergence depends on what is inside generate_execute() ;-) How is
 the problem with alpha and beta residing on the GPU addressed? How
 will the batch-compilation look like? The important point is that
 for the default axpy kernels we really don't want to go through the
 jit-compiler for each of them individually.


 ;)
 in this case, generate_execute() will just trigger the compilation - on
 the first call only - of the kernel
 x = cpu_alpha*y + cpu_beta*z;

 __kernel void ambm(unsigned int N, __global float4* x, __global float4* y,
 __global float4* z, float alpha, float beta)
 {
    for(unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
      x[i] = alpha*y[i] + beta*z[i];
 }

I'm afraid this is not suitable then. A simple conjugate gradient solver 
would then go through ~10 OpenCL compilations, making it awfully slow at 
the first run... With AMD and Intel SDKs, which to my knowledge still do 
not buffer kernels, this would mean that each time a process is started, 
this large overhead will be visible.

Best regards,
Karli





Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Philippe Tillet
Hey hey,


2014/1/25 Karl Rupp r...@iue.tuwien.ac.at

 Hi,


  I prefer option 3. This would allow for something like :

 if(size(x) > 1e5 && stride==1 && start==0){


 Here we also need to check that the internal_size fits the vector width.



   //The following steps are costly for small vectors
   NumericT cpu_alpha = alpha; // copy back to host when the scalar is in
 global device memory
   if(alpha_flip) cpu_alpha*=-1;
   if(reciprocal) cpu_alpha = 1/cpu_alpha;
   //... same for beta

 //Optimized routines
   if(external_blas)
 call_axpy_twice(x,cpu_alpha,y,cpu_beta,z)
   else{
 generate_execute(x = cpu_alpha*y + cpu_beta*z);
 }
 }
 else{
//fallback
 }

 This way, we generate at most two kernels: one for small vectors,
 designed to optimize latency, and one for big vectors, designed to
 optimize bandwidth. Are we converging? :)


 Convergence depends on what is inside generate_execute() ;-) How is the
 problem with alpha and beta residing on the GPU addressed? How will the
 batch-compilation look like? The important point is that for the default
 axpy kernels we really don't want to go through the jit-compiler for each
 of them individually.


;)
in this case, generate_execute() will just trigger the compilation - on the
first call only - of the kernel
x = cpu_alpha*y + cpu_beta*z;

__kernel void ambm(unsigned int N, __global float4* x, __global float4* y,
__global float4* z, float alpha, float beta)
{
  for(unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
    x[i] = alpha*y[i] + beta*z[i];
}

with of course an appropriate compute profile


 Note to self: Collect some numbers on the costs of jit-compilation for
 different OpenCL SDKs.

 Best regards,
 Karli



Best regards,
Philippe


Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Philippe Tillet
Hey,


2014/1/25 Karl Rupp r...@iue.tuwien.ac.at

 Hey hey hey,


  Convergence depends on what is inside generate_execute() ;-) How is

 the problem with alpha and beta residing on the GPU addressed? How
 will the batch-compilation look like? The important point is that
 for the default axpy kernels we really don't want to go through the
 jit-compiler for each of them individually.


 ;)
 in this case, generate_execute() will just trigger the compilation - on
 the first call only - of the kernel
 x = cpu_alpha*y + cpu_beta*z;

 __kernel void ambm(unsigned int N, __global float4* x, __global float4* y,
 __global float4* z, float alpha, float beta)
 {
    for(unsigned int i = get_global_id(0); i < N; i += get_global_size(0))
      x[i] = alpha*y[i] + beta*z[i];
 }


 I'm afraid this is not suitable then. A simple conjugate gradient solver
 would then go through ~10 OpenCL compilations, making it awfully slow on
 the first run... With the AMD and Intel SDKs, which to my knowledge still
 do not cache compiled kernels, this means that each time a process is
 started, this large overhead will be visible.


I don't understand why this would go through more than one compilation...
This kernel is compiled only once, the value of flip_sign and reciprocal
only changes the dynamic value of the argument, not the source code.

This would eventually result in:

if (alpha_reciprocal)
  kernel(N, x, y, z, 1/alpha, beta);

Am I missing something?

Best regards,
Philippe

Best regards,
 Karli





Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Karl Rupp
Hi Philippe,


 I don't understand why this would go through more than one compilation...
 This kernel is compiled only once, the value of flip_sign and reciprocal
 only changes the dynamic value of the argument, not the source code.

 This would eventually result in:

 if(alpha_reciprocal)
 kernel(N,x,y,z,1/alpha,beta)

 Am I missing something?

I think so ;-) It's not about a single kernel, it's about the 
compilation unit (i.e. OpenCL program). For conjugate gradients we 
roughly have the following vector operations (random variable names)

x = y;
x += alpha y;
x = y + alpha z;
x = y - alpha z;
x = inner_prod(y,z);

BiCGStab and GMRES add a few more of them. If we use the generator as-is 
now, then each of the operations creates a separate OpenCL program the 
first time it is encountered and we pay the jit-compiler launch overhead 
multiple times. With the current non-generator model, all vector kernels 
are in the same OpenCL program and we pay the jit-overhead only once. 
I'd like to stick with the current model of having just one OpenCL 
program for all the basic kernels, but get the target-optimized sources 
from the generator.

Sorry if I wasn't clear enough in my earlier mails.

Best regards,
Karli




Re: [ViennaCL-devel] AXPY and reciprocal, flip_sign parameters

2014-01-24 Thread Philippe Tillet
Hi,

Oh, I see it better now. I am not entirely convinced, though ;)
From my experience, the overhead of launching the jit-compiler is negligible
compared to the compilation of a single kernel, so I'm not sure whether
compiling two kernels in the same program or in two different programs makes
a big difference. Plus, ideally, in the case of a linear solver, the
generator could be used to generate fused kernels, provided that the
scheduler is fully operational. I fear that any solution to the
aforementioned problem would destroy this precious ability... Ideally, once
we enable it, the generate_execute() mentioned above would just be replaced
by generate() (or enqueue_for_generation, which is more explicit).

This put aside, I'm not sure we should give that much importance to the
jit-compilation overhead, since the binaries can be cached. If I remember
well, Denis Demidov implemented such a caching mechanism for VexCL. What if
we replaced distributed vector/matrix with an optional automatic kernel
caching mechanism for ViennaCL 1.6.0 (we just have a limited amount of
time :P)? The drawback is that the filesystem library would have to be
dynamically linked, though, but after all OpenCL itself also has to be
dynamically linked.

Best regards,
Philippe

2014/1/25 Karl Rupp r...@iue.tuwien.ac.at

 Hi Philippe,



  I don't understand why this would go through more than one compilation...
 This kernel is compiled only once, the value of flip_sign and reciprocal
 only changes the dynamic value of the argument, not the source code.

 This would eventually result in:

 if(alpha_reciprocal)
 kernel(N,x,y,z,1/alpha,beta)

 Am I missing something?


 I think so ;-) It's not about a single kernel, it's about the compilation
 unit (i.e. OpenCL program). For conjugate gradients we roughly have the
 following vector operations (random variable names)

 x = y;
 x += alpha y;
 x = y + alpha z;
 x = y - alpha z;
 x = inner_prod(y,z);

 BiCGStab and GMRES add a few more of them. If we use the generator as-is
 now, then each of the operations creates a separate OpenCL program the
 first time it is encountered and we pay the jit-compiler launch overhead
 multiple times. With the current non-generator model, all vector kernels
 are in the same OpenCL program and we pay the jit-overhead only once. I'd
 like to stick with the current model of having just one OpenCL program for
 all the basic kernels, but get the target-optimized sources from the
 generator.

 Sorry if I wasn't clear enough in my earlier mails.

 Best regards,
 Karli

