On Wed, Mar 20, 2013 at 11:09 AM, Pekka Jääskeläinen <
[email protected]> wrote:
> On 03/19/2013 08:24 PM, Erik Schnetter wrote:
>
>> I can reproduce test suite failures on my system (OSX, x86_64), but no
>> hangs. Also, vecmathlib does not have loops that could lead to a hang.
>>
>
> Also stack corruption can lead to hangs. E.g., if a memcpy/memset goes
> over the bounds of a stack object.
>
> But that doesn't seem to be the case here. There is actually an
> unconditional jmp to itself where it gets stuck:
>
> => 0x00007ffff5dc7136 <_test_convert_type+263318>: jmp 0x7ffff5dc7136
> <_test_convert_type+263318>
>
> I can see it in the final parallel.bc:
>
> ...
> @.str1047 = private unnamed_addr constant [27 x i8]
> c"convert_char_sat((double))\00", align 1
> ...
>
> ...
> if.then.i422340.wi_0_0_0.i: ; preds =
> %for.body.i422334.wi_0_0_0.i
> %conv.i422335.i = sext i8 %11703 to i32, !wi !1, !wi_counter !47768
>
> %conv2.i422336.i = sext i8 %11704 to i32, !wi !1, !wi_counter !47769
>
> %call.i422339.i = call i32 (i8*, ...)* @printf(i8* getelementptr
> inbounds ([71 x i8]* @.str, i64 0, i64 0), i8* getelementptr inbounds ([27
> x i8]* @.str1047, i64 0, i64 0), i32 0, i32 0, i32 %conv.i422335.i, i32$
> br label %tailrecurse.i, !wi !1, !wi_counter !47771
>
>
>
>
> tailrecurse.i: ; preds =
> %tailrecurse.i, %if.then.i422340.wi_0_0_0.i, %for.cond.i422329.wi_0_0_0.i
> br label %tailrecurse.i
>
> }
>
>
> So probably something accidentally ends up calling itself, leading to
> infinite recursion that is optimized into an empty infinite loop. It
> happens after the printf which prints "convert_char_sat((double))".
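>
> As a tiny illustration (not the actual kernel code), a self tail-call like
>
>     // hypothetical reduction of the suspected recursion:
>     static int spin(int x) { return spin(x); }
>
> is exactly what tail-call elimination plus optimization collapses into
> this kind of br-to-itself block.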
>
> In the kernel source:
>
> ...
> compare_char_elements("convert_char_sat((double))", i, expected.raw,
> actual.raw, 1);
> expected.value = ((char)((char)double_rounded_values_rte[i]));
>
>
> actual.value = convert_char_rte((double)double_values[i]);
>
>
> Looking up convert_char_rte in kernel_linked.bc shows it calls
> _cl_round(double), which calls VML routines in a longer chain that might
> cause the recursion (I didn't track the calls to the end).
>
Thanks! I looked for such explicit calls to other kernel functions, but
looked in the wrong place -- I looked in the conversions corresponding to
the last successful output, which (of course) didn't fail...
VML should not call anything outside VML, neither libm nor any pocl kernel
routines. All of VML's functions are in a namespace, and it introduces its
own vector types, so that e.g. no libm functions (nor pocl's functions)
should be available during name resolution. I assume that either removing
the __vml_ prefix confused the linker (unlikely because all of VML's
functions are in a namespace), or there is a circular dependency in VML.
I haven't encountered the latter, but there are #ifdefs in the code, so you
may be encountering a code path that I never tested. What system are you
using? What compiler flags? Do you have SSE2 / SSE3 / SSE4.1 / SSE4a / AVX
available on your CPU?
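If it helps, here is a quick way to check from code (a sketch using
GCC/Clang's <cpuid.h>; the feature bits are from the Intel/AMD CPUID
documentation):

  #include <cpuid.h>
  #include <cstdio>

  int main() {
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
      std::printf("SSE2:   %d\n", !!(edx & (1u << 26)));
      std::printf("SSE3:   %d\n", !!(ecx & (1u << 0)));
      std::printf("SSE4.1: %d\n", !!(ecx & (1u << 19)));
      std::printf("AVX:    %d\n", !!(ecx & (1u << 28)));
    }
    if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
      std::printf("SSE4a:  %d\n", !!(ecx & (1u << 6)));  // AMD-only leaf
    return 0;
  }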
I just added some code to VML's test.cc to output its configuration at
startup; for me this string is e.g. "conf-DEBUG-SSE2-SSE3-SSE4.1-AVX".
> Indeed: I tried to use pocl's round.cl instead of VML's and it didn't
> hang anymore. It prints the verification errors and finishes.
>
> I saw you have a version that uses the __ext_vector_type__ attribute for
> the internal vector types instead of __m128 (the SSE-optimized version).
> It didn't compile when I tried to enable it, but maybe that's something to
> try next: use the exact same internal data type for the storage
> in VML as pocl does.
>
I began to implement this, but encountered some non-trivial obstacles. It
seems that LLVM is not yet advanced enough to produce code that is near the
quality one can achieve by calling the __m128 functions directly, and in
many cases there are things that Clang just cannot do yet, so one has to
fall back to __m128 anyway.
I expect that __m128 maps to the same internal representation as the one
one gets via ext_vector_type or vector_size. In fact, the __m128 family is
defined via vector_size, which should be a trivial variant of
ext_vector_type in Clang:
typedef double __m128d __attribute__ ((__vector_size__ (16),
__may_alias__));
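The ext_vector_type spelling should produce the same 16-byte type; the main
difference is that it takes the element count rather than the byte size (a
sketch, assuming Clang's extension, with illustrative type names):

  typedef double double2_vs __attribute__((__vector_size__(16)));    // 2 doubles, by bytes
  typedef double double2_ev __attribute__((__ext_vector_type__(2))); // 2 doubles, by count
  static_assert(sizeof(double2_vs) == sizeof(double2_ev),
                "same layout expected");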
With gcc, memcpy is completely optimized away. With clang as well -- I am
using memcpy internally, e.g. to convert doubles into integers to access
certain bits, and this translates to no instruction at all; the values are
just kept in the same register. I would therefore hope that the
pocl->vecmathlib transition would be similarly ideal.
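For illustration, the kind of memcpy-based bit access I mean looks like
this (the names are mine, not VML's actual API):

  #include <cstdint>
  #include <cstring>

  // Reinterpret a double's bits as a 64-bit integer without violating
  // strict aliasing; at -O2 the memcpy itself generates no instructions.
  static inline std::uint64_t as_bits(double x) {
    std::uint64_t i;
    std::memcpy(&i, &x, sizeof i);
    return i;
  }

  // Example: extract the sign bit.
  static inline bool sign_bit(double x) {
    return (as_bits(x) >> 63) != 0;
  }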
> Even if the issue we are seeing wasn't caused by the memset/memcpy
> conversion directly, it seems a waste of cycles to convert back and
> forth this way just due to the VML interfacing, if that can be avoided.
This conversion should not take any cycles. If memcpy doesn't work, then
we'll have to do something else instead, maybe an asm statement, or a
custom-crafted .bc file.
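Concretely, the conversion at the pocl/VML boundary could be just this
(type names here are assumptions for illustration, not the actual
interfacing code):

  #include <emmintrin.h>
  #include <cstring>

  typedef double double2 __attribute__((__ext_vector_type__(2))); // pocl-style type

  // Should compile to no instructions: the value stays in one xmm register.
  static inline __m128d to_vml(double2 v) {
    __m128d r;
    std::memcpy(&r, &v, sizeof r);
    return r;
  }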
-erik
--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
AIM: eschnett247, Skype: eschnett, Google Talk: [email protected]