On 03/20/2013 07:48 PM, Erik Schnetter wrote:
> I think I found the problems. The C++ compiler does not know that long and
> double are to be supported, since the C++ code does not include types.h.
> Therefore, only round(float) is generated, and not round(double). Presumably,
> round(double) is then taken from somewhere else. Also, the C++ compiler
> doesn't seem to see the optimization settings, so it produces unoptimized
> code, so that the calls to memcpy remain, and the call chain within VML is
> not inlined.

Hmm. I wonder whether the "merging" of a float2 arg into a double in the
calling convention could mess this up somehow. What if it ends up calling
round(double) when it should call round(float2), and that round(double) is
actually a libm scalar round instead of a vector round? Just shooting in the
dark here...

You should look at the final parallel.bc in the kernel temp dir if you want
to see whether the memcpys are optimized away. It has all the optimizations
applied after fully linking and aggressively inlining everything. The clang++
per-module optimizations should not matter so much here.

--
--Pekka

_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel
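
A rough sketch of what the float2/double speculation above would mean in practice, if it turned out to be the cause. This is not pocl or VML code: float2_like and every function name below are hypothetical stand-ins, and the sketch only assumes that a float2 and a double both occupy 8 bytes. It models the suspected mis-dispatch by reinterpreting the two float lanes as one double, rounding that with the scalar round, and reinterpreting the result back.

```cpp
#include <cassert>
#include <cmath>
#include <cstring>

// Hypothetical stand-in for OpenCL's float2; not a pocl type.
struct float2_like { float x, y; };

static_assert(sizeof(float2_like) == sizeof(double),
              "sketch assumes float2 and double are both 8 bytes");

// Reinterpret the 8 bytes of a float2 as a double. This models an ABI
// that passes a float2 argument in a double-sized register slot.
inline double as_double_bits(float2_like v) {
    double d;
    std::memcpy(&d, &v, sizeof d);
    return d;
}

// Inverse of as_double_bits: view the 8 bytes of a double as two floats.
inline float2_like as_float2_bits(double d) {
    float2_like v;
    std::memcpy(&v, &d, sizeof v);
    return v;
}

// Intended behaviour: round each lane separately.
inline float2_like round_lanes(float2_like v) {
    return { std::round(v.x), std::round(v.y) };
}

// Suspected mis-dispatch: the packed bits are treated as one double and
// rounded by the scalar round(double), then handed back as a float2.
inline float2_like round_as_scalar_double(float2_like v) {
    return as_float2_bits(std::round(as_double_bits(v)));
}
```

Calling both rounding functions on the same input, for example {1.5f, 2.5f}, and comparing the lanes shows whether the scalar path corrupts the vector. If a kernel's output looks like this kind of garbage rather than being merely slow, that would point toward the calling-convention theory and away from a pure optimization problem.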
