Hi Erik,

On 02/02/2013 12:08 AM, Erik Schnetter wrote:
> I have implemented vecmathlib <https://bitbucket.org/eschnett/vecmathlib>, a
> generic library that implements these math functions directly. Benchmarks show
> that this can speed up these intrinsics significantly, i.e. by a factor of
> several in some cases. I would like to use vecmathlib with pocl.

Sounds and looks good. From a quick look, you have managed to avoid branching
in the implementations quite well, which helps with static parallelization
(SIMD, VLIW, ...).

> vecmathlib is implemented in C++, since using templates greatly simplifies the
> code. It builds with both gcc and clang. vecmathlib is still incomplete in
> several respects, e.g. some intrinsics should have a higher precision, could
> be further optimised, or inf and nan are not yet handled correctly.

How autovectorization of the math library calls should work with pocl still
needs to be thought through.

Your idea is probably to call the vector versions of these directly in pocl's
math library, using your template library? A question is how to make this
work efficiently with autovectorization, which in OpenCL's data-parallel
model is even more important to optimize than "intra-kernel vector
computation".

If your library specifically optimizes the vector versions (i.e., multiple
calls to the scalar versions are slower than the corresponding vector
alternative), then, to get the best benefit, we need to implement automatic
conversion of (at least) the scalar math library calls to their vector
counterparts (this has been discussed in passing). Or is your goal to produce
scalar versions that are easily autovectorizable? I think that is the more
"performance portable" approach, unless the vector-optimized ones are
much faster.
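To make the trade-off concrete, here is a minimal sketch of a templated
vector type with an elementwise exp built from N scalar calls. The names and
layout are my own illustration, not vecmathlib's actual API; this naive
version is exactly the slow baseline that an automatic scalar-to-vector call
conversion (or a real SIMD-wide implementation) would want to beat:

```cpp
#include <cmath>

// Illustrative only: a minimal templated vector type in the spirit of a
// vecmathlib-style library (names and layout are assumptions, not its API).
template <typename T, int N>
struct vec {
  T d[N];
};

// Naive "vector" exp built from N scalar calls -- the slow baseline.
// A genuinely vector-optimized implementation would instead use
// SIMD-wide, branch-free code across all N lanes at once.
template <typename T, int N>
vec<T, N> exp(vec<T, N> x) {
  vec<T, N> r;
  for (int i = 0; i < N; ++i)
    r.d[i] = std::exp(x.d[i]);
  return r;
}
```

If the per-lane scalar calls here are noticeably slower than one SIMD-wide
implementation, that argues for the automatic call-conversion route.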

All in all, there are two cases. If the kernel is not autovectorizable (or
not profitably so), e.g., it is called with only one work-item or contains
very ugly control flow, then one wants to use hand-optimized vector library
calls. If it is autovectorizable, then one wants either to emit the
corresponding SIMD instructions directly for the library calls, or to call a
flattened scalar version of the math function which, after inlining, can be
autovectorized across work-items.
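The second case can be sketched as follows. The polynomial below is a crude
stand-in, not vecmathlib's actual exp implementation; the point is only that
a branch-free scalar function, once inlined into the per-work-item loop,
leaves a loop body the compiler can vectorize:

```cpp
#include <cstddef>

// Hypothetical branch-free scalar math function (a crude 4th-order
// Taylor polynomial around 0 -- illustration only, low precision).
static inline float fast_exp(float x) {
  return 1.0f + x * (1.0f + x * (0.5f +
         x * (1.0f / 6.0f + x * (1.0f / 24.0f))));
}

// Stand-in for the flattened work-item loop: after fast_exp is inlined,
// the loop body is straight-line code and a candidate for
// autovectorization across work-items.
void exp_kernel(const float *in, float *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = fast_exp(in[i]);
}
```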

In any case, using these implementations should speed things up for now, as
it enables inlining, vectorizing, and context-optimizing some of the
previously non-inlined calls.

BTW, when you start using them in pocl: as the library is C++, please make
the dependency optional, as I prefer to keep most of the code base
C-compilable for embedded/standalone targets. The LLVM passes (and
LLVM/Clang itself) are C++, but that is OK as the standalone use case likely
doesn't support online kernel compilation anyway.

It shouldn't be a problem for the general-purpose CPU targets to use your
library unconditionally, but let's allow compiling without it (perhaps
simply via the selective target compilation). So even shipping this template
library with pocl shouldn't do harm: when we are compiling for x86_64, for
example, we can probably assume a C++-capable compiler is available, and it
can use your lib freely in the kernel library override directory. If
everything is in the headers, it should work easily.

-- 
--Pekka


_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
