On 03/20/2013 07:48 PM, Erik Schnetter wrote:
> I think I found the problems. The C++ compiler does not know that long and
> double are to be supported, since the C++ code does not include types.h.
> Therefore, only round(float) is generated, and not round(double). Presumably,
> round(double) is then taken from somewhere else. Also, the C++ compiler doesn't
> seem to see the optimization settings, so it produces unoptimized code, so that
> the calls to memcpy remain, and the call chain within VML is not inlined.

Hmm. I wonder whether the "merging" of a float2 argument into a double in the
calling convention could mess this up somehow, so that it ends up calling
round(double) when it should call round(float2), and that round(double) is
actually the scalar libm round instead of a vector round. Just shooting in
the dark here...
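
To illustrate the part about round(double) being "taken from somewhere
else", here is a minimal C++ sketch of how overload resolution can quietly
pick up a scalar fallback when the vector library only provides a float
overload. All names in it (sketch, scalar_fallback, vector_lib,
round_origin) are hypothetical stand-ins, not the actual pocl or VML code,
and it does not model the float2 calling-convention guess at all.

```cpp
namespace sketch {

// Tags that report which implementation a call resolved to.
enum class Origin { VectorLibFloat, ScalarFallback };

// Hypothetical stand-in for a libm-style scalar round(double).
namespace scalar_fallback {
inline Origin round_origin(double) { return Origin::ScalarFallback; }
}  // namespace scalar_fallback

// Hypothetical stand-in for the generated vector library. The double
// overload is deliberately absent, mimicking a build where it was never
// generated.
namespace vector_lib {
inline Origin round_origin(float) { return Origin::VectorLibFloat; }
}  // namespace vector_lib

// Make both overload sets visible to the callers below.
using namespace scalar_fallback;
using namespace vector_lib;

// A double argument is an exact match for the fallback, so the fallback
// wins over a double -> float conversion into the vector library.
inline Origin origin_for_double() { return round_origin(2.5); }

// A float argument is an exact match for the vector library overload.
inline Origin origin_for_float() { return round_origin(2.5f); }

}  // namespace sketch
```

In this toy setup the double call compiles without any diagnostic and
silently lands in the fallback, which is the kind of thing that would only
show up when inspecting the generated code.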

If you want to see whether the memcpys are optimized away, look at the final
parallel.bc in the kernel temp dir. It has all the optimizations applied
after fully linking and aggressively inlining everything, so the clang++
per-module optimizations should not matter much here.

-- 
--Pekka


