On 15 October 2012 17:07, jerro <a...@a.com> wrote:

> On Monday, 15 October 2012 at 13:43:28 UTC, Manu wrote:
>> On 15 October 2012 16:34, jerro <a...@a.com> wrote:
>>> On Monday, 15 October 2012 at 12:19:47 UTC, Manu wrote:
>>>> On 15 October 2012 02:50, jerro <a...@a.com> wrote:
>>>>>> Speaking of tests – are they available somewhere? Now that LDC at
>>>>>> least theoretically supports most of the GCC builtins, I'd like to
>>>>>> throw some tests at it to see what happens.
>>>>>>
>>>>>> David
>>>>>
>>>>> I have a fork of std.simd with LDC support at
>>>>> https://github.com/jerro/phobos/tree/std.simd and some tests for it
>>>>> at https://github.com/jerro/std.simd-tests.
>>>>
>>>> Awesome. Pull request plz! :)
>>>
>>> I did change the API for a few functions like loadUnaligned, though.
>>> In those cases the signatures needed to be changed, because the
>>> functions used T or T* for scalar parameters and return types and
>>> Vector!T for the vector parameters and return types. That only
>>> compiles if T is a static array, which I don't think makes much sense.
>>> I changed those to take the vector type as a template parameter. The
>>> vector type cannot be inferred from the scalar type, because you can
>>> use vector registers of different sizes simultaneously (with AVX, for
>>> example). Because of that the vector type must be passed explicitly
>>> for some functions, so I made it the first template parameter in those
>>> cases, so that Ver doesn't always need to be specified.
>>>
>>> There is one more issue that I need to solve (and that may be a
>>> problem in some cases with GDC too) - the pure, @safe and nothrow
>>> attributes. Currently the GCC builtin declarations in LDC have none of
>>> those attributes (I have to look into which of them can be added, and
>>> whether it can be done automatically). I've just commented out the
>>> attributes in my std.simd fork for now, but this isn't a proper
>>> solution.
>>>
>>>> That said, how did you come up with a lot of these implementations?
>>>> Some don't look particularly efficient, and others don't even look
>>>> right. xor, for instance:
>>>>
>>>>     return cast(T) (cast(int4) v1 ^ cast(int4) v2);
>>>>
>>>> This is wrong for float types. x86 has separate instructions for
>>>> doing this to floats, which make sure to do the right thing by the
>>>> flags registers. Most of the LDC blocks assume that it could be any
>>>> architecture... I don't think this will produce good portable code.
>>>> It needs to be much more carefully hand-crafted, but it's a nice
>>>> working start.
>>>
>>> The problem is that LLVM doesn't provide intrinsics for those
>>> operations. The xor function does compile to a single xorps
>>> instruction when compiling with -O1 or higher, though.
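(For reference, here is the pattern under discussion as a standalone,
compilable sketch; the free function and its name are mine, not the actual
std.simd code. As jerro notes above for LDC, at -O1 and higher this is
expected to lower to a single xorps on x86.)

import core.simd;

// Illustrative only: bitwise xor on float4 done through integer casts,
// the same pattern as the std.simd snippet quoted above.
float4 xorFloat4(float4 v1, float4 v2)
{
    return cast(float4)(cast(int4) v1 ^ cast(int4) v2);
}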
>>> I have looked at the code generated for many (most, I think, but not
>>> for all possible types) of those LDC blocks, and most of them compile
>>> to the appropriate single instruction when compiled with -O2 or -O3.
>>> Even the ones for which the D source code looks horribly inefficient,
>>> like loadUnaligned, for example.
>>>
>>> By the way, clang does those in a similar way. For example, here is
>>> what clang emits for a wrapper around _mm_xor_ps when compiled with
>>> -O1 -emit-llvm:
>>>
>>> define <4 x float> @foo(<4 x float> %a, <4 x float> %b) nounwind uwtable readnone {
>>>   %1 = bitcast <4 x float> %a to <4 x i32>
>>>   %2 = bitcast <4 x float> %b to <4 x i32>
>>>   %3 = xor <4 x i32> %1, %2
>>>   %4 = bitcast <4 x i32> %3 to <4 x float>
>>>   ret <4 x float> %4
>>> }
>>>
>>> AFAICT, the only way to ensure that a certain instruction will be used
>>> with LDC when there is no LLVM intrinsic for it is to use inline
>>> assembly expressions. I remember having some problems with those in
>>> the past, but it could be that I was doing something wrong. Maybe we
>>> should look into that option too.
>>
>> Inline assembly usually ruins optimising (code reordering around inline
>> asm blocks is usually considered impossible).
>
> I don't see a reason why the compiler couldn't reorder code around GCC
> style inline assembly blocks. You are supposed to specify which
> registers are changed in the block. Doesn't that give the compiler
> enough information to reorder code?
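(For context, a GCC-style extended asm block spells out its inputs,
outputs and clobbers. A rough sketch of the xorps case in D, assuming
GDC's extended assembler syntax; the exact constraint spelling is my
assumption, not verified code. If the block did touch the flags or
memory, the clobber list would have to say so, e.g. "cc" or "memory",
which is what the reply below is getting at.)

import core.simd;

// Sketch: the operand constraints tell the compiler exactly what the
// block reads ("x": b in an SSE register) and writes ("+x": a is read
// and modified), so in principle it can still schedule around it.
float4 xorpsAsm(float4 a, float4 b)
{
    // No clobber list is given here: xorps affects neither flags nor memory.
    asm
    {
        "xorps %1, %0" : "+x" (a) : "x" (b);
    }
    return a;
}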
Not necessarily. If you affect various flags registers or whatever, or
perform direct memory access, it might violate the compiler's assumptions
about the state of memory/stack. I don't think I've come in contact with
any compilers that aren't super-conservative about this sort of thing.

>> It's interesting that the x86 codegen makes such good sense of those
>> sequences, but I'm rather more concerned about other platforms. I
>> wonder if other platforms have a similarly incomplete subset of
>> intrinsics? :/
>
> It looks to me like LLVM does provide intrinsics for those operations
> that can't be expressed in other ways. So my guess is that if some
> intrinsics are absolutely needed for some platform, they will probably
> be there. If an intrinsic is needed, I also don't see a reason why they
> wouldn't accept a patch that adds it.

Fair enough. Interesting to know. This means that cross-platform LDC SIMD
code will need to be thoroughly scrutinised for codegen quality on all
targets.
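(Coming back to the loadUnaligned signature change jerro describes earlier
in the thread: a hypothetical sketch of the shape involved, with the
vector type as the leading template parameter and the scalar type derived
from it. The name matches std.simd's loadUnaligned, but the signature and
body here are illustrative only, not the actual code.)

import core.simd;

// Hypothetical sketch. The point is the signature: the vector type V is
// an explicit (first) template parameter, so 128-bit and 256-bit vectors
// can coexist under AVX, and the scalar type is derived from V instead of
// the vector type being built up from a scalar T.
V loadUnaligned(V)(in typeof(V.init.array[0])* ptr)
{
    V v;
    foreach (i; 0 .. V.init.array.length)
        v.array[i] = ptr[i]; // naive element copy; real code would rely on
                             // an intrinsic or the optimiser to produce a
                             // single unaligned vector load
    return v;
}

// usage: float4 v = loadUnaligned!float4(someFloatPtr);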