On Sat, Feb 2, 2013 at 5:49 AM, Pekka Jääskeläinen <[email protected]> wrote:
> Hi Erik,
>
> On 02/02/2013 12:08 AM, Erik Schnetter wrote:
> > I have implemented vecmathlib
> > <https://bitbucket.org/eschnett/vecmathlib>, a generic library that
> > implements these math functions directly. Benchmarks show that this
> > can speed up these intrinsics significantly, i.e. by a factor of
> > several in some cases. I would like to use vecmathlib with pocl.
>
> Sounds and looks good. At a quick look, you have managed to avoid
> branching in the implementations quite well, which helps in static
> parallelization (SIMD, VLIW, ...) of the implementations.
>
Yes, that is one of the main points of this library.
> > vecmathlib is implemented in C++, since using templates greatly
> > simplifies the code. It builds with both gcc and clang. vecmathlib is
> > still incomplete in several respects, e.g. some intrinsics should
> > have a higher precision, could be further optimised, or inf and nan
> > are not yet handled correctly.
>
> How the autovectorization of the math lib calls should work with pocl
> needs to be thought through.
>
> Your idea is probably to call the vector versions of these directly in
> the math library of pocl using your template library? A question is
> how to make this work efficiently with autovectorization, which in the
> OpenCL data-parallel orientation is even more important to optimize
> than "intra-kernel vector computation".
>
Yes, this is the idea I had. The kernel library would provide e.g.
doubleN sqrt(doubleN) for all allowed N, and the scalarizer/vectorizer
would know about these different versions (either because the OpenCL
standard mandates them, or because the compiler knows that its run-time
library provides them).
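
For illustration, a forwarding stub in pocl's kernel library could look
roughly like the sketch below. This is only a sketch: the glue is mine,
and the exact vecmathlib entry points should be checked against its
headers.

  #include <cstring>
  #include <vecmathlib.h>

  // OpenCL C defines double2 as a builtin type; this typedef merely
  // stands in for it in plain C++.
  typedef double double2 __attribute__((ext_vector_type(2)));

  // Forward the OpenCL builtin to vecmathlib's vectorized sqrt.
  double2 sqrt(double2 x)
  {
    vecmathlib::realvec<double, 2> v;
    std::memcpy(&v, &x, sizeof v);  // reinterpret as a vecmathlib vector
    v = vecmathlib::sqrt(v);        // the library's vector implementation
    double2 r;
    std::memcpy(&r, &v, sizeof r);
    return r;
  }

The same pattern would repeat for double4, double8, and double16.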
> If your library specifically optimizes for vector versions (i.e.
> multiple calls to the scalar version are slower than the corresponding
> vector alternative) then, to get the best benefit, we need to
> implement automatic conversion of (at least) the scalar math lib calls
> to the vector counterparts (this has been discussed in passing). Or,
> is your goal to produce scalar versions that are easily
> autovectorizable? I think that is the more "performance portable"
> approach, unless the vector-optimized ones are much faster.
>
As is, the library provides different implementations for different
vector sizes and for different systems. This seems to be a simple way
to obtain the best performance.

You can view the library as providing scalar-but-vectorizable
functions; to do so, you would simply ignore the vectorized function
versions it provides. However, if a particular loop containing e.g. a
call to sqrt has been auto-vectorized, it would be very difficult to
replace the generated code for sqrt with e.g. a single, faster machine
instruction where one exists: detecting that a particular code sequence
corresponds to an approximation of vsqrtpd seems impossible to me.
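
To make the scalar-but-vectorizable route concrete, one could wrap the
library's size-1 vectors to obtain plain scalar entry points. Again a
sketch, assuming a realvec<double, 1> specialization with a broadcast
constructor and operator[] exists:

  #include <vecmathlib.h>

  // Scalar sqrt implemented via the library's size-1 vector type;
  // loops around it are left to the autovectorizer.
  double scalar_sqrt(double x)
  {
    vecmathlib::realvec<double, 1> v(x);
    return vecmathlib::sqrt(v)[0];
  }

But once the autovectorizer has inlined and vectorized such a body,
mapping the result back onto vsqrtpd is exactly the pattern-matching
problem described above.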
> All in all there are two cases: the kernel is not autovectorizable
> (or not profitably so), e.g., it is called with only 1 work item or
> contains very ugly control flow. Then one wants to use hand-optimized
> vector library calls. If it is autovectorizable, then one wants to
> either call the corresponding SIMD instructions automatically for the
> library calls, or call a flattened scalar version of the math function
> which, after inlining, can be autovectorized across work items.
>
In my mind, the vectorizer would never look into sqrt() or any other
function defined in the language standard, but would simply expect
efficient vector implementations of these to exist. Instead of relying
on the language standard, we could also add a respective attribute to
the function definitions. This attribute would then assert that e.g.
double2 sqrt(double2) is element-wise equivalent to double sqrt(double).
__attribute__((__vector_equivalence__)) could be a name.
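
As a sketch, the declarations could then look as follows; the attribute
name is only a proposal and is not implemented anywhere yet:

  double   sqrt(double x);
  // Each line below asserts element-wise equivalence with the scalar
  // version above.
  double2  sqrt(double2 x)  __attribute__((__vector_equivalence__));
  double4  sqrt(double4 x)  __attribute__((__vector_equivalence__));
  double16 sqrt(double16 x) __attribute__((__vector_equivalence__));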
> In any case, using these implementations should speed things up for
> now as it enables inlining, vectorizing, and context-optimizing some of
> the previously noninlined calls.
>
> BTW, when you start using them in pocl, as it's C++, please make the
> dependency optional, as I prefer to keep most of the code base
> C-compilable for embedded/standalone targets. The LLVM passes (and
> LLVM/Clang itself) are C++, but that is OK as the standalone use case
> likely doesn't support online kernel compilation anyway.
>
If C++ availability is your main concern, then one could simply enable this
library when C++ support is present.
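
Concretely, the kernel library sources could guard the dependency with
something like the following (the macro name is made up):

  #ifdef POCL_USE_VECMATHLIB
    /* C++ translation unit: builtins forward to vecmathlib */
  #else
    /* plain C fallback implementations, as today */
  #endif

so that embedded/standalone targets keep building without a C++
compiler.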
> It shouldn't be a problem for the general-purpose CPU targets to use
> your library unconditionally, but let's allow compiling without it
> (perhaps simply using selective target compilation). So even shipping
> this template library in pocl shouldn't do any harm: when we are
> compiling for x86_64, for example, we can probably assume a C++
> capable compiler is available, and it can use your lib freely in the
> kernel library override directory. If everything is in the headers it
> should work easily.
>
Okay.
-erik
--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/