Hi,

in addition to what Philippe said, let me give you a short code snippet 
used as a prototype for a couple of reductions in ViennaCL. It certainly 
takes a little while to get your head around it, but once you have 
figured it out, it's like a Swiss army knife ;-)

The overall workflow is always the same: You decompose the full data 
into large chunks on which the work groups operate (e.g. individual 
rows for a matrix-vector product). Within each work group you further 
split the work among the threads. Then you need to sum (alternatives: 
min/max/xor/etc.) the values computed by the threads:

  // place thread results in __local array:
  shared_array[get_local_id(0)] = value_computed_by_respective_thread();

  // reduction loop (assumes get_local_size(0) is a power of two):
  for (uint stride = get_local_size(0)/2; stride > 0; stride /= 2)
  {
    // wait until all writes of the previous pass are visible
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) < stride)
      shared_array[get_local_id(0)] += shared_array[get_local_id(0) + stride];
  }

  // process result in shared_array[0], e.g. write to global memory:
  if (get_local_id(0) == 0)
    do_something_with_result(shared_array[0]);

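The barrier in the body of the for-loop is required to avoid data races: 
each pass reads values that other threads wrote in the previous pass (or 
in the initial store to shared_array). The last if-statement is merely 
for processing the result and may also involve all threads rather than 
just the first thread in the work group; in that case, put one more 
barrier after the loop so that every thread sees the final 
shared_array[0].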
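
If it helps to see the whole workflow in one piece, here is a minimal sketch in plain C that emulates it on the CPU for row sums. This is not ViennaCL code: row_sums, reduce_work_group and LOCAL_SIZE are names made up for illustration, the matrix is assumed to be dense and row-major, and the work-group size is assumed to be a power of two. The loop over lid stands in for the threads of a work group, and finishing one pass before starting the next stands in for the barrier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-group size for this sketch; must be a power of two. */
#define LOCAL_SIZE 8u

/* Emulates the reduction loop of one work group. Within a pass, writes go
   to indices < stride and reads of other entries come from indices >= stride,
   so running the "threads" one after another gives the same result as
   running them in parallel between two barriers. */
static float reduce_work_group(float shared_array[LOCAL_SIZE])
{
  for (unsigned stride = LOCAL_SIZE / 2; stride > 0; stride /= 2)
    for (unsigned lid = 0; lid < stride; ++lid)   /* get_local_id(0) < stride */
      shared_array[lid] += shared_array[lid + stride];
  return shared_array[0];
}

/* Row sums of a row-major rows x cols matrix A, one "work group" per row. */
static void row_sums(const float *A, size_t rows, size_t cols, float *result)
{
  assert((LOCAL_SIZE & (LOCAL_SIZE - 1u)) == 0u);  /* power-of-two check */

  for (size_t row = 0; row < rows; ++row) {          /* get_group_id(0) */
    float shared_array[LOCAL_SIZE];

    /* Each "thread" sums the entries lid, lid + LOCAL_SIZE, ... of the row. */
    for (unsigned lid = 0; lid < LOCAL_SIZE; ++lid) {
      float value = 0.0f;
      for (size_t col = lid; col < cols; col += LOCAL_SIZE)
        value += A[row * cols + col];
      shared_array[lid] = value;
    }

    /* Combine the per-thread values and store the result for this row. */
    result[row] = reduce_work_group(shared_array);
  }
}
```

Threads that have no entry to process (cols smaller than LOCAL_SIZE) simply contribute 0, which is why the reduction loop needs no special cases. For a column sum the same pattern applies with the roles of rows and columns swapped, at the price of strided memory accesses for a row-major matrix.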

Hope this helps :-)

Best regards,
Karli


On 07/27/2015 07:07 PM, Charles Determan wrote:
> Philippe,
>
> I definitely understand and support such a solution for ViennaCL.  I
> don't mean to say it should be included, I was just curious what the
> current approach was.  However, I am interested in additional OpenCL
> development outside of the framework.  Do you have any recommendations
> in learning more about coding OpenCL reductions?  As I mentioned above,
> I have only found very basic reduction approaches so far, nothing for
> slightly more complex scenarios like the column and row sum examples.
>
> If this is transitioning too far from the mailing list focus I would
> certainly appreciate a reply off list.
>
> Thank you,
> Charles
>
> On Mon, Jul 27, 2015 at 11:46 AM, Philippe Tillet <[email protected]> wrote:
>
>     Hi,
>
>     Such row-wise / column-wise reductions could be generated by the
>     OpenCL backend, but this won't work on the Host or CUDA backend.
>     Plus, this is not really maintained at the moment. I would recommend
>     Karl's solution, even though it won't be optimal when the vector
>     does not fit in the L2 cache of the OpenCL device (Maxwell for
>     example has 2MB of L2 cache), as the current algorithm for GEMV
>     accesses the entire vector get_num_groups(0) times.
>
>     Philippe
>
>     2015-07-27 9:40 GMT-07:00 Karl Rupp <[email protected]>:
>
>
>         > Excellent, thank you.  I thought that would be the way to go
>         > initially but I hesitated because of concerns about having
>         > additional temporary objects taking up memory when matrices
>         > begin to get larger but it certainly is simpler this way.
>
>         Just pushed:
>         https://github.com/viennacl/viennacl-dev/commit/4063c941235d46804cd448db7ddecf0c3238548f
>
>         Yeah, it's a bit of a trade-off: Sure, one could optimize the
>         summation kernel, but this also implies more code to maintain.
>         On the other hand, I'm not aware (which, of course, does not
>         deny a possible existence) of a scenario where such summation
>         routines are the performance bottleneck.
>
>         > Glad to hear that 1.7.0 is nearly completed.  Does that mean
>         > we should expect a formal release soon?
>
>         Yep. Expect the release on Wednesday.
>
>         Best regards,
>         Karli
>
>
>
>         > On Mon, Jul 27, 2015 at 9:57 AM, Karl Rupp <[email protected]> wrote:
>         >
>         >     Hi Charles,
>         >
>         >     > I am working on writing some additional opencl kernels
>         >     > (potentially to incorporate in to viennacl) which involve
>         >     > column-wise reductions.  A simple case would simply be the
>         >     > sum of each column of a matrix.  However, I am having an
>         >     > extremely difficult time getting my kernel correct
>         >     > (reductions are tricky to me).  That said, after searching
>         >     > for some resources I came across an old post on sourceforge
>         >     > referring to column-wise kernels
>         >     > (http://sourceforge.net/p/viennacl/mailman/message/27542552/)
>         >     > with viennacl.  This leads me to my primary question.
>         >     >
>         >     > Are there such kernels already in ViennaCL that I have
>         >     > overlooked?
>         >
>         >     Yes ;-) Have a look here at how row-wise sums reduce to a
>         >     standard matrix-vector product:
>         >     https://sourceforge.net/p/viennacl/discussion/1143678/thread/38e942a0/
>         >
>         >     That is, in order to compute a row-sum and a column-sum you
>         >     can use
>         >        row_sum = prod(A, ones);
>         >        col_sum = prod(trans(A), ones);
>         >
>         >     In an hour or two I will push convenience functions for
>         >     summation fixing the only remaining issue for the 1.7.0
>         >     release:
>         >     https://github.com/viennacl/viennacl-dev/issues/127
>         >
>         >
>         >     > If not, are there any examples or resources you would
>         >     > recommend to help learn this topic?  I have tried searching
>         >     > further but the only thing I can really find is a reduction
>         >     > of an entire matrix (which is relatively simple) as opposed
>         >     > to by column or row.
>         >
>         >     At this point I can only recommend to think about how such
>         >     operations can be recast in terms of (standard) linear
>         >     algebra. For example, row- and column-wise updates to a
>         >     matrix are special cases of the more general
>         >        A += outer_prod(u, v);
>         >     operation (rank-1 updates). I'll improve the documentation
>         >     in that direction.
>         >
>         >     Best regards,
>         >     Karli
>         >
>         >
>         >
>
>
>
>
>


------------------------------------------------------------------------------
_______________________________________________
ViennaCL-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/viennacl-devel
