Hi Erik,

On 02/02/2013 12:08 AM, Erik Schnetter wrote:
> I have implemented vecmathlib <https://bitbucket.org/eschnett/vecmathlib>, a
> generic library that implements these math functions directly. Benchmarks show
> that this can speed up these intrinsics significantly, i.e. by a factor of
> several in some cases. I would like to use vecmathlib with pocl.
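[For concreteness, a toy sketch of the kind of lane-wise entry point such a library exposes. The `double4` type and the `vml_exp` name are illustrative inventions, not vecmathlib's actual API; a real implementation evaluates a branch-free polynomial on all lanes at once with SIMD instructions rather than looping over them.]

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Toy 4-wide vector type -- a hypothetical stand-in, not vecmathlib's
// actual vector type.
struct double4 {
  double v[4];
};

// Lane-wise exp: the general shape of a vectorized math-library entry
// point.  Here each lane just calls the scalar libm exp; a real SIMD
// version would compute all four lanes in one instruction sequence.
inline double4 vml_exp(const double4 &x) {
  double4 r;
  for (std::size_t i = 0; i < 4; ++i)
    r.v[i] = std::exp(x.v[i]);
  return r;
}
```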
Sounds and looks good. From a quick look, you have managed to avoid branching in the implementations quite well, which helps with static parallelization (SIMD, VLIW, ...) of the implementations.

> vecmathlib is implemented in C++, since using templates greatly simplifies the
> code. It builds with both gcc and clang. vecmathlib is still incomplete in
> several respects, e.g. some intrinsics should have a higher precision, could be
> further optimised, or inf and nan are not yet handled correctly.

How the autovectorization of the math library calls should work with pocl needs to be thought through. Your idea is presumably to call the vector versions of these directly in pocl's math library, using your template library?

A question is how to make this work efficiently with autovectorization, which, given OpenCL's data-parallel orientation, is even more important to optimize than "intra-kernel vector computation". If your library specifically optimizes the vector versions (i.e. multiple calls to the scalar version are slower than the corresponding vector alternative), then, to get the best benefit, we need to implement automatic conversion of (at least) the scalar math library calls to their vector counterparts (this has been discussed in passing). Or is your goal to produce scalar versions that are easily autovectorizable? I think that is the more "performance portable" approach, unless the vector-optimized ones are much faster.

All in all, there are two cases. If the kernel is not autovectorizable (or not profitably so), e.g. it is called with only one work item or contains very ugly control flow, then one wants to use hand-optimized vector library calls. If it is autovectorizable, then one wants either to automatically emit the corresponding SIMD instructions for the library calls, or to call a flattened scalar version of the math function which, after inlining, can be autovectorized across work items.
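[To make the second case concrete, a minimal sketch of the "flattened scalar" path. The `fast_sigmoid` helper is a made-up rational approximation, not a vecmathlib routine; the point is only that a branch-free scalar function, once inlined into the per-work-item loop, leaves a straight-line loop body that the compiler can autovectorize across work items.]

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical branch-free scalar math helper (a cheap sigmoid-like
// rational approximation, for illustration only).  The conditional
// below typically lowers to an abs/blend, not a branch.
inline double fast_sigmoid(double x) {
  double ax = x < 0 ? -x : x;
  return 0.5 + 0.5 * x / (1.0 + ax);
}

// "Work-item loop": each iteration stands in for one OpenCL work item.
// After fast_sigmoid is inlined, the loop body contains no control flow,
// so the loop vectorizer can process several work items per iteration.
inline void kernel_body(const double *in, double *out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    out[i] = fast_sigmoid(in[i]);
}
```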
In any case, using these implementations should speed things up already, as it enables inlining, vectorizing, and context-optimizing some of the previously non-inlined calls.

BTW, when you start using vecmathlib in pocl, since it's C++, please make the dependency optional, as I prefer to keep most of the code base C-compilable for embedded/standalone targets. The LLVM passes (and LLVM/Clang itself) are C++, but that is OK as the standalone use case likely doesn't support online kernel compilation anyway. It shouldn't be a problem for the general-purpose CPU targets to use your library unconditionally, but let's allow compiling without it (perhaps simply via selective target compilation).

So even shipping this template library with pocl shouldn't hurt: when we are compiling for x86_64, for example, we can probably assume a C++-capable compiler is available, and it can use your library freely in the kernel library override directory. If everything is in the headers, it should work easily.

--
--Pekka

_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
