On 15 January 2012 19:01, bearophile <bearophileh...@lycos.com> wrote:
> Iain Buclaw:
>
>> Correction, 1.5x speed up without, 20x speed up with -O1, 30x speed up
>> with -O2 and above. My oh my...
>
> Please, show me the assembly code produced, with its relative D source :-)
>
> Bye,
> bearophile
D code:
----
import core.simd;

void test2a(float4 a) { }

float4 test2()
{
    float4 a = 1.2;
    a = a * 3 + 7;
    test2a(a);
    return a;
}
----

Relevant assembly:
----
.LC5:
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .long   1067030938
        .section        .rodata.cst4,"aM",@progbits,4
        .align 4
_D4test5test2FZNhG4f:
        .cfi_startproc
        movl    $3, %eax
        cvtsi2ss        %eax, %xmm0
        movb    $7, %al
        cvtsi2ss        %eax, %xmm1
        unpcklps        %xmm0, %xmm0
        unpcklps        %xmm1, %xmm1
        movlhps %xmm0, %xmm0
        movlhps %xmm1, %xmm1
        mulps   .LC5(%rip), %xmm0
        addps   %xmm1, %xmm0
        ret
        .cfi_endproc
----

As someone pointed out to me, the only optimisation missing was constant propagation, but that doesn't matter too much for now.

Regards
--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
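Worked note on the listing: the four `.long 1067030938` words at `.LC5` appear to be the splatted initialiser `1.2`, stored as raw IEEE 754 single-precision bits. The arithmetic below decodes that word and shows what each lane of `a * 3 + 7` evaluates to. It is a sketch that assumes binary32 with round-to-nearest. It does not show the code any particular compiler emits once the expression is folded.

```latex
\[
1067030938 = \mathtt{0x3F99999A}
  = \underbrace{0}_{\text{sign}}\;
    \underbrace{01111111}_{\text{exponent } 127}\;
    \underbrace{00110011001100110011010}_{\text{fraction } \mathtt{0x19999A}}
\]
\[
\left(1 + \mathtt{0x19999A}\cdot 2^{-23}\right)\cdot 2^{127-127}
  = 1 + \frac{1677722}{8388608} \approx 1.2000000477
\]
\[
a_i = 1.2 \times 3 + 7 \approx 10.6, \qquad i = 0,\dots,3
\]
```

Because every lane works out to the same known value, constant propagation could in principle replace the whole `cvtsi2ss`/`unpcklps`/`movlhps`/`mulps`/`addps` sequence with a single load of a constant vector holding roughly 10.6 in each lane. The exact last bit depends on the precision the folding is done in.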