I just realized that running pocl from its build directory is not
independent of a previous install -- pocl will look into its install
directory first, potentially using outdated libraries.
There are no calls to memcpy left in parallel.bc. Some functions are not
inlined, but this seems to be a reasonable decision by the compiler, since
the respective functions are not trivial.
For example, here is the function round(float8):
__Z9_cl_roundDv8_f:
        pushq   %rbp
        movq    %rsp, %rbp
        vmovaps LCPI260_0(%rip), %ymm1
        vandps  %ymm1, %ymm0, %ymm2
        vaddps  LCPI260_1(%rip), %ymm2, %ymm2
        vroundps $1, %ymm2, %ymm2
        vandps  %ymm1, %ymm2, %ymm1
        vandps  LCPI260_2(%rip), %ymm0, %ymm0
        vorps   %ymm1, %ymm0, %ymm0
        popq    %rbp
        ret
This code has very little overhead.
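For reference, this is how I read the instruction sequence above, written
as a scalar C sketch for a single vector lane. The constant pool entries
LCPI260_0, LCPI260_1 and LCPI260_2 are not shown in the listing, so their
values here (absolute-value mask, 0.5, sign mask) are assumptions, and the
names round_sketch, float_bits, bits_float and floor_nonneg are made up
for illustration; none of this is taken from the pocl sources.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a float as its 32-bit pattern and back. */
static uint32_t float_bits(float x) { uint32_t u; memcpy(&u, &x, sizeof u); return u; }
static float bits_float(uint32_t u) { float x; memcpy(&x, &u, sizeof x); return x; }

/* Round toward -infinity for non-negative input, standing in for
   "vroundps $1" (immediate 1 selects round-down). */
static float floor_nonneg(float a)
{
    if (!(a < 8388608.0f))      /* >= 2^23, infinity or NaN: nothing to do */
        return a;
    return (float)(int32_t)a;   /* truncation equals floor for a >= 0 */
}

/* One lane of the sequence: floor(|x| + 0.5) with the sign of x restored. */
static float round_sketch(float x)
{
    const uint32_t abs_mask  = 0x7fffffffu;  /* assumed value of LCPI260_0 */
    const uint32_t sign_mask = 0x80000000u;  /* assumed value of LCPI260_2 */

    float mag = bits_float(float_bits(x) & abs_mask);   /* vandps: |x| */
    mag = floor_nonneg(mag + 0.5f);                      /* vaddps (assumed 0.5), vroundps */

    /* vandps, vandps, vorps: combine the rounded magnitude with the sign of x */
    return bits_float((float_bits(mag) & abs_mask) | (float_bits(x) & sign_mask));
}
```

Under these assumptions, round_sketch(2.5f) gives 3.0f and
round_sketch(-2.5f) gives -3.0f, i.e. halfway cases round away from zero.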
Can you send me your "parallel.s" file (it should live next to
parallel.bc) so that I can look at it? Please send it privately, since
the file is large.
-erik
On Wed, Mar 20, 2013 at 3:27 PM, Erik Schnetter <
[email protected]> wrote:
> On Wed, Mar 20, 2013 at 3:18 PM, Pekka Jääskeläinen <
> [email protected]> wrote:
>
>> On 03/20/2013 07:48 PM, Erik Schnetter wrote:
>>
>>> I think I found the problems. The C++ compiler does not know that long
>>> and double are to be supported, since the C++ code does not include
>>> types.h. Therefore, only round(float) is generated, and not
>>> round(double). Presumably, round(double) is then taken from somewhere
>>> else. Also, the C++ compiler doesn't seem to see the optimization
>>> settings, so it produces unoptimized code, so that the calls to memcpy
>>> remain, and the call chain within VML is not inlined.
>>>
>>
>> Hmm. I wonder could the "merging" of a float2 arg to a double in the
>> calling convention mess this up somehow. If it ends up calling
>> round(double) when it should call round(float2)? And the round(double)
>> is actually a libm scalar round instead of a vector round. Just shooting
>> in the dark here...
>>
>
>> You should look at the final parallel.bc in the kernel temp dir
>> if you want to see if the memcpys are optimized away. It has all the
>> optimizations applied after fully linking and aggressively inlining
>> everything. The clang++ per module optimizations should not matter here
>> so much.
>
>
> I'll have a look at the parallel.bc file then.
>
> -erik
--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
AIM: eschnett247, Skype: eschnett, Google Talk: [email protected]
_______________________________________________
pocl-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pocl-devel