On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:
On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
Here's the assembly code for my alpha-blending routine:

Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.

-Johan

ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);
*p = dest2;

The main problem with this version is that it's much slower, even if I calculated the alpha blending values only once. The assembly code doesn't seem to have a higher performance impact than the "replace if alpha = 255" algorithm:

if(src[0] == 255){
*p = src;
}
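For comparison, here is a minimal sketch of the scalar blend with the two weights computed once per pixel. The function name blendPixel is hypothetical and not part of the original code; the channel layout (alpha in element 0) is taken from the snippet above. It uses a plain cast instead of to!ubyte, on the assumption that the range check to!ubyte performs is part of the cost; the cast is safe here because the two weights always sum to 257, so the sum never exceeds 255 * 257 = 65535.

```d
// Hypothetical helper, not from the original post.
// Pixel layout assumed from the snippet above: [alpha, c1, c2, c3].
ubyte[4] blendPixel(ubyte[4] src, ubyte[4] dest) pure nothrow @nogc @safe
{
    // Compute both weights once instead of once per channel.
    immutable uint srcWeight = src[0] + 1;    // 1 .. 256
    immutable uint destWeight = 256 - src[0]; // 1 .. 256

    foreach (i; 1 .. 4)
    {
        // srcWeight + destWeight == 257, so the sum is at most 255 * 257 = 65535
        // and the shifted value always fits in a ubyte.
        dest[i] = cast(ubyte)((src[i] * srcWeight + dest[i] * destWeight) >> 8);
    }
    return dest; // dest[0] is left untouched, as in the original snippet
}
```

With alpha = 255 this returns the source colors unchanged, and with alpha = 0 it returns the destination colors unchanged, so it agrees with the "replace if alpha = 255" shortcut at the opaque end.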

It also seems I have quite a few problems with the assembly code, mostly with the pmulhuw instruction (it returns the upper 16 bits of the result, but I need the lower 16 bits as unsigned), and also with the pointers: the reads and write-backs don't land in their correct places, sometimes resulting in a flickering screen or wrong colors affecting neighboring pixels. Current assembly code:

//ushort[4] alpha = [src[0],src[0],src[0],src[0]]; //replace it if there's a faster method for this
ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{
                                                                        
//moving the values to their destinations
mov             ESI, p2[EBP];
mov             EDI, p[EBP];
movd    MM0, [ESI];
movd    MM1, [EDI];
mov             ESI, p3[EBP];
movq    MM5, [ESI];
mov             ESI, pc_256[EBP];
movq    MM7, [ESI];
mov             ESI, pc_1[EBP];
movq    MM6, [ESI];
punpcklbw       MM2, MM0;
punpcklbw       MM3, MM1;

paddw   MM6, MM5;       //1 + alpha
psubw   MM7, MM5;       //256 - alpha

//psllw MM2, 2;
//psllw MM3, 2;
psrlw   MM6, 1;
psrlw   MM7, 1;
pmullw  MM2, MM6;       //src * (1 + alpha)
pmullw  MM3, MM7;       //dest * (256 - alpha)
paddw   MM3, MM2;       //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw   MM3, 8;         //((src * (1 + alpha)) + (dest * (256 - alpha))) / 256
                                                                        
//moving the result to its place;
packuswb        MM4, MM3;
movd    [EDI-3], MM4;

emms;
}
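To make it easier to check each step by hand, here is a rough scalar model of three of the instructions involved, one 16-bit lane at a time. The function names are hypothetical and only for illustration; the behaviour follows the documented semantics of pmullw, pmulhuw and punpcklbw as I understand them, so treat it as a sketch rather than a reference.

```d
// Scalar models of single MMX operations; all names here are hypothetical.

// pmullw keeps the LOW 16 bits of each 16x16-bit product.
ushort mulLow(ushort a, ushort b) pure nothrow @nogc @safe
{
    return cast(ushort)((cast(uint) a * b) & 0xFFFF);
}

// pmulhuw keeps the HIGH 16 bits of each unsigned 16x16-bit product.
ushort mulHighUnsigned(ushort a, ushort b) pure nothrow @nogc @safe
{
    return cast(ushort)((cast(uint) a * b) >> 16);
}

// punpcklbw dst, src interleaves the low four bytes of both operands:
// each result word has the dst byte in its low half and the src byte
// in its high half.
ushort[4] unpackLowBytes(ubyte[4] dstLow, ubyte[4] srcLow) pure nothrow @nogc @safe
{
    ushort[4] words;
    foreach (i; 0 .. 4)
        words[i] = cast(ushort)((srcLow[i] << 8) | dstLow[i]);
    return words;
}
```

In this model, a pixel byte is widened to a clean 16-bit value only when the pixel is the first (dst) operand and the second operand is zero; with the operands the other way round, the pixel byte ends up in the high half of each word. Likewise, for a channel value of 200 and a weight of 256, mulLow gives 51200 while mulHighUnsigned gives 0.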

I tried to get the correct result by trial and error, but there was no real improvement.
