Re: Can I get a more in-depth guide about the inline assembler?

ZILtoid1991 via Digitalmars-d-learn Thu, 02 Jun 2016 06:38:47 -0700

On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
Here's the assembly code for my alpha-blending routine:
Could you also paste the D version of your code? Perhaps the compiler (LDC, GDC) will generate similarly vectorized code that is inlinable, etc.
-Johan


ubyte[4] dest2 = *p;

dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);

*p = dest2;

The main problem with this is that it's much slower, even if I calculate the alpha blending values only once. The assembly code, on the other hand, does not seem to cost noticeably more than the "replace if alpha = 255" algorithm:


if(src[0] == 255){
*p = src;
}
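For reference, the scalar blend above can be wrapped into a self-contained function. This is only a sketch: the name blendPixel and the by-value ubyte[4] parameters are assumptions, not part of the original code. It computes the two weights once and uses a plain cast instead of to!ubyte. std.conv.to performs a range check on narrowing conversions, while the shifted sum here can never exceed 255. Whether that check accounts for the slowdown was not measured.

```d
// Hypothetical helper, not from the original post: blends one pixel
// (alpha in channel 0) onto dest with the same formula as above.
ubyte[4] blendPixel(ubyte[4] src, ubyte[4] dest) pure nothrow @nogc @safe
{
    // Fast path from the post: a fully opaque source replaces the destination.
    if (src[0] == 255)
        return src;

    // Compute both weights once instead of once per channel.
    immutable uint wSrc = src[0] + 1;
    immutable uint wDest = 256 - src[0];

    ubyte[4] result = dest; // channel 0 of dest is kept, as in the original
    foreach (i; 1 .. 4)
    {
        // The sum is at most 255 * 257 = 65535, so after >> 8 it always
        // fits in a ubyte and the cast cannot truncate.
        result[i] = cast(ubyte)((src[i] * wSrc + dest[i] * wDest) >> 8);
    }
    return result;
}
```

With such a helper the per-pixel call would be *p = blendPixel(src, *p); and a compiler such as LDC or GDC is free to inline it.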

It also seems I have quite a few problems with the assembly code, mostly with the pmulhuw instruction (it returns the high 16 bits of the result, while I need the low 16 bits as unsigned), and also with the pointers, as the reads and write-backs don't land in their correct places, sometimes resulting in a flickering screen or wrong colors affecting neighboring pixels. Current assembly code:

//ushort[4] alpha = [src[0],src[0],src[0],src[0]]; //replace it if there's a faster method for this

ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{
                                                                        
//moving the values to their destinations
mov             ESI, p2[EBP];
mov             EDI, p[EBP];
movd    MM0, [ESI];
movd    MM1, [EDI];
mov             ESI, p3[EBP];
movq    MM5, [ESI];
mov             ESI, pc_256[EBP];
movq    MM7, [ESI];
mov             ESI, pc_1[EBP];
movq    MM6, [ESI];
punpcklbw       MM2, MM0;
punpcklbw       MM3, MM1;

paddw   MM6, MM5;       //1 + alpha
psubw   MM7, MM5;       //256 - alpha

//psllw MM2, 2;
//psllw MM3, 2;
psrlw   MM6, 1;
psrlw   MM7, 1;
pmullw  MM2, MM6;       //src * (1 + alpha)
pmullw  MM3, MM7;       //dest * (256 - alpha)
paddw   MM3, MM2;       //(src * (1 + alpha)) + (dest * (256 - alpha))

psrlw MM3, 8; //(src * (1 + alpha)) + (dest * (256 - alpha)) /256

                                                                        
//moving the result to its place;
packuswb        MM4, MM3;
movd    [EDI-3], MM4;

emms;
}
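When checking where the bytes end up, it helps to remember that both punpcklbw and packuswb combine two registers: punpcklbw interleaves the low bytes of the destination with those of the source operand, and packuswb writes the packed destination words to the low four bytes and the packed source words to the high four bytes. The sketch below is a scalar D model of the intended lane arithmetic, assuming each pixel byte is interleaved with a zeroed register so that it becomes a zero-extended 16-bit word. All function names are illustrative, and this is not a drop-in replacement for the asm block.

```d
// Scalar model of the MMX word lanes; all names here are illustrative.

// Models punpcklbw with a zeroed source register: each pixel byte becomes
// a 16-bit word with the byte in the low half and zero in the high half.
ushort[4] unpackLowBytes(ubyte[4] px) pure nothrow @nogc @safe
{
    ushort[4] w;
    foreach (i; 0 .. 4)
        w[i] = px[i];
    return w;
}

// Models packuswb for four words: each word is read as signed and
// saturated to the range 0 .. 255.
ubyte[4] packSaturate(ushort[4] w) pure nothrow @nogc @safe
{
    ubyte[4] px;
    foreach (i; 0 .. 4)
    {
        immutable short s = cast(short) w[i];
        if (s < 0)
            px[i] = 0;
        else if (s > 255)
            px[i] = 255;
        else
            px[i] = cast(ubyte) s;
    }
    return px;
}

// Blends all four lanes uniformly, the way the packed instructions do.
ubyte[4] blendLanes(ubyte[4] src, ubyte[4] dest, ubyte alpha) pure nothrow @nogc @safe
{
    immutable ushort[4] s = unpackLowBytes(src);
    immutable ushort[4] d = unpackLowBytes(dest);
    ushort[4] sum;
    foreach (i; 0 .. 4)
    {
        // pmullw keeps the low 16 bits of each product.
        immutable ushort ps = cast(ushort)(s[i] * (alpha + 1));
        immutable ushort pd = cast(ushort)(d[i] * (256 - alpha));
        sum[i] = cast(ushort)(ps + pd); // paddw wraps at 16 bits
        sum[i] >>= 8;                   // psrlw by 8
    }
    return packSaturate(sum);
}
```

Comparing the output of such a model with the bytes the asm block actually writes is one way to see which step puts a value in the wrong lane.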

I tried to get the correct result by trial and error, but there's no real improvement.
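On the question of high versus low 16 bits: for byte inputs the blend sum src * (alpha + 1) + dest * (256 - alpha) is at most 255 * 257 = 65535, so it fits in an unsigned 16-bit word, and the low half of each product is already the complete value. The small D check below confirms the bound exhaustively. The function name is made up for illustration, and the check says nothing about the other issues in the asm block.

```d
// Illustrative exhaustive check: returns the largest value that
// s * (a + 1) + d * (256 - a) can take over all byte inputs.
uint maxBlendSum() pure nothrow @nogc @safe
{
    uint maxSum = 0;
    foreach (a; 0 .. 256)
        foreach (s; 0 .. 256)
            foreach (d; 0 .. 256)
            {
                immutable uint sum = s * (a + 1) + d * (256 - a);
                if (sum > maxSum)
                    maxSum = sum;
            }
    return maxSum;
}
```

Since the result never exceeds ushort.max, a low-word multiply such as pmullw followed by paddw should not lose any bits, provided the weights are used unscaled.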
