Re: Can I get a more in-depth guide about the inline assembler?

2016-06-03 Thread ZILtoid1991 via Digitalmars-d-learn

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:

Here's the assembly code for my alpha-blending routine:
ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);

asm{//moving the values to their destinations
movd    MM0, p;
movd    MM1, src;
movq    MM5, alpha;
movq    MM7, alphaMMXmul_const1;
movq    MM6, alphaMMXmul_const2;

punpcklbw   MM2, MM0;
punpcklbw   MM3, MM1;

paddw   MM6, MM5;   //1 + alpha
psubw   MM7, MM5;   //256 - alpha

pmulhuw MM2, MM6;   //src * (1 + alpha)
pmulhuw MM3, MM7;   //dest * (256 - alpha)
paddw   MM3, MM2;   //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw   MM3, 8;     //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256


//moving the result to its place;

packuswb    MM4, MM3;
movd    p, MM4;
emms;
}

The two constants being referred here:
static immutable ushort[4] alphaMMXmul_const1 = [256,256,256,256];

static immutable ushort[4] alphaMMXmul_const2 = [1,1,1,1];

alpha is a ushort[4] containing the alpha value four times.

After some debugging, I found out that the p pointer becomes 
null at the end instead of pointing to a value. I have no 
experience with using in-line assemblers (although I made a few 
Hello World programs for MS-DOS with a stand-alone assembler), 
so I don't know when and how the compiler will interpret the 
types from D.


Problem solved. Current assembly code:

asm{

//moving the values to their destinations
mov     EBX, p[EBP];
movd    MM0, src;
movd    MM1, [EBX];

movq    MM5, alpha;
movq    MM7, alphaMMXmul_const256;
movq    MM6, alphaMMXmul_const1;
pxor    MM2, MM2;
punpcklbw   MM0, MM2;
punpcklbw   MM1, MM2;

paddusw MM6, MM5;   //1 + alpha
psubusw MM7, MM5;   //256 - alpha

pmullw  MM0, MM6;   //src * (1 + alpha)
pmullw  MM1, MM7;   //dest * (256 - alpha)
paddusw MM0, MM1;   //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw   MM0, 8;     //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256


//moving the result to its place;
//pxor  MM2, MM2;
packuswb    MM0, MM2;

movd    [EBX], MM0;

emms;
}
The actual problem was the poor documentation of the MMX 
instructions, since MMX never really caught on, combined with the 
disappearance of assembly programming from the mainstream. The 
end result is a quick alpha-blending routine that has barely any 
extra performance penalty compared to just copying the pixels. I 
currently have no plans to translate the whole sprite displaying 
algorithm to assembly; instead I'll work on the editor for the 
game engine.
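
For anyone who wants to check what the working MMX sequence above 
computes, here is a minimal plain-D sketch that models it lane by 
lane. The names unpackLowBytes and blendLanes are made up for this 
illustration and are not part of the original code, and alpha is 
passed as a single byte instead of a ushort[4]. Like the assembly, 
it blends all four bytes of the pixel the same way.

```d
import std.algorithm.comparison : min;

/// Models punpcklbw with a zeroed second operand: each of the four
/// low bytes is widened to its own 16-bit lane. (Hypothetical helper.)
ushort[4] unpackLowBytes(const ubyte[4] px)
{
    ushort[4] lanes;
    foreach (i, b; px)
        lanes[i] = b;
    return lanes;
}

/// Models the pmullw / paddusw / psrlw / packuswb steps on four lanes.
/// (Hypothetical helper, assuming alpha is in the range 0..255.)
ubyte[4] blendLanes(const ubyte[4] src, const ubyte[4] dest, ubyte alpha)
{
    const ushort[4] s = unpackLowBytes(src);
    const ushort[4] d = unpackLowBytes(dest);
    ubyte[4] result;
    foreach (i; 0 .. 4)
    {
        // pmullw keeps only the low 16 bits of each product
        const uint a = (s[i] * (1u + alpha)) & 0xFFFF;
        const uint b = (d[i] * (256u - alpha)) & 0xFFFF;
        // paddusw saturates at 0xFFFF instead of wrapping around
        const uint sum = min(a + b, 0xFFFFu);
        // psrlw shifts right by 8, packuswb saturates to 0..255
        result[i] = cast(ubyte) min(sum >> 8, 255u);
    }
    return result;
}
```

With 8-bit colours the largest possible sum is 255 * (1 + alpha) + 
255 * (256 - alpha) = 255 * 257 = 65535, which still fits in 16 
bits, so neither the masking nor the saturation ever changes the 
result in this model.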


Re: Can I get a more in-depth guide about the inline assembler?

2016-06-02 Thread ZILtoid1991 via Digitalmars-d-learn

On Thursday, 2 June 2016 at 07:17:23 UTC, Johan Engelen wrote:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:

Here's the assembly code for my alpha-blending routine:


Could you also paste the D version of your code? Perhaps the 
compiler (LDC, GDC) will generate similarly vectorized code 
that is inlinable, etc.


-Johan


ubyte[4] dest2 = *p;
dest2[1] = to!ubyte((src[1] * (src[0] + 1) + dest2[1] * (256 - src[0]))>>8);
dest2[2] = to!ubyte((src[2] * (src[0] + 1) + dest2[2] * (256 - src[0]))>>8);
dest2[3] = to!ubyte((src[3] * (src[0] + 1) + dest2[3] * (256 - src[0]))>>8);

*p = dest2;

The main problem with this is that it's much slower, even if I 
calculated the alpha-blending values only once. The assembly code 
does not seem to cost more than the "replace if alpha = 255" 
algorithm:


if(src[0] == 255){
*p = src;
}

It also seems I have quite a few problems with the assembly code, 
mostly with the pmulhuw instruction (it returns the high 16 bits 
of the result, whereas I need the low 16 bits as unsigned), and 
also with the pointers, as the reads and write-backs don't land 
in their correct places, sometimes resulting in a flickering 
screen or wrong colors affecting neighboring pixels. Current 
assembly code:


//ushort[4] alpha = [src[0],src[0],src[0],src[0]];
//replace it if there's a faster method for this

ushort[4] alpha = [100,100,100,100];
//src[3] = 255;
ubyte[4] *p2 = cast(ubyte[4]*)src2.ptr;
ushort[4] *p3 = cast(ushort[4]*)alpha.ptr;
ushort[4] *pc_1 = cast(ushort[4]*)alphaMMXmul_const1.ptr;
ushort[4] *pc_256 = cast(ushort[4]*)alphaMMXmul_const256.ptr;
asm{

//moving the values to their destinations
mov     ESI, p2[EBP];
mov     EDI, p[EBP];
movd    MM0, [ESI];
movd    MM1, [EDI];
mov     ESI, p3[EBP];
movq    MM5, [ESI];
mov     ESI, pc_256[EBP];
movq    MM7, [ESI];
mov     ESI, pc_1[EBP];
movq    MM6, [ESI];
punpcklbw   MM2, MM0;
punpcklbw   MM3, MM1;

paddw   MM6, MM5;   //1 + alpha
psubw   MM7, MM5;   //256 - alpha

//psllw MM2, 2;
//psllw MM3, 2;
psrlw   MM6, 1;
psrlw   MM7, 1;
pmullw  MM2, MM6;   //src * (1 + alpha)
pmullw  MM3, MM7;   //dest * (256 - alpha)
paddw   MM3, MM2;   //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw   MM3, 8;     //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256


//moving the result to its place;
packuswb    MM4, MM3;
movd    [EDI-3], MM4;

emms;
}

Tried to get the correct result with trial and error, but there's 
no real improvement.
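
To illustrate the pmulhuw / pmullw difference mentioned above, here 
is a small plain-D sketch of what each instruction does to a single 
16-bit lane. mulLow and mulHigh are hypothetical helper names used 
only for this illustration; they are not real intrinsics.

```d
/// Models one lane of pmullw: the low 16 bits of the 32-bit product.
ushort mulLow(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) & 0xFFFF);
}

/// Models one lane of pmulhuw: the high 16 bits of the unsigned
/// 32-bit product.
ushort mulHigh(ushort a, ushort b)
{
    return cast(ushort)((cast(uint) a * b) >> 16);
}
```

For a colour value of 200 and a weight of 101 the product is 20200, 
so mulLow gives 20200 while mulHigh gives 0, because the product 
never reaches bit 16. Note also that punpcklbw puts the bytes of 
its second operand into the high byte of each word, so "punpcklbw 
MM2, MM0" leaves each colour shifted left by 8 bits on top of 
whatever MM2 held before. A lane such as 200 << 8 = 51200 times 101 
no longer fits in 16 bits, which is one way to get the kind of 
overflow described in this thread.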


Re: Can I get a more in-depth guide about the inline assembler?

2016-06-01 Thread ZILtoid1991 via Digitalmars-d-learn

On Thursday, 2 June 2016 at 00:51:15 UTC, ZILtoid1991 wrote:

On Wednesday, 1 June 2016 at 23:35:40 UTC, Era Scarecrow wrote:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:


I got the code working, although still with a bug, after 
replacing pmulhuw with pmullw, but due to integer overflow I get 
a glitched image. I'm trying to work around the fact that 
pmulhuw stores only the high bits of the result, either with 
multiplication or with bit shifting.


I forgot to mention that I had to make pointers for the arrays I 
used in order to be able to load them.


Re: Can I get a more in-depth guide about the inline assembler?

2016-06-01 Thread ZILtoid1991 via Digitalmars-d-learn

On Wednesday, 1 June 2016 at 23:35:40 UTC, Era Scarecrow wrote:

On Wednesday, 1 June 2016 at 23:23:49 UTC, ZILtoid1991 wrote:
After some debugging, I found out that the p pointer becomes 
null at the end instead of pointing to a value. I have no 
experience with using in-line assemblers (although I made a 
few Hello World programs for MS-DOS with a stand-alone 
assembler), so I don't know when and how the compiler will 
interpret the types from D.


 In the assembler the variable names actually become just the 
offset to where they are on the stack in relation to BP. So if 
you want the full pointer you actually need to load it into 
a register first and then just use that register instead. 
So this should be correct.


//unless you are going to actually use ubyte[4] here, just 
//making a pointer will work instead, so cast(uint*) probably

ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);

asm{//moving the values to their destinations

 lea    ESI, src[EBP]; //get source address (src is an array, not a pointer)
 mov    EDI,   p[EBP]; //get destination pointer (mov, as movd needs an MMX register)
 movd   MM0,[EDI]; //use directly
 movd   MM1,[ESI];

 movq   MM5, alpha;
 movq   MM7, alphaMMXmul_const1;
 movq   MM6, alphaMMXmul_const2;






 movd   [EDI], MM4;
}
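
As a minimal, self-contained sketch of the "load the pointer into a 
register first" idea described above: the function below reads a 
value through a pointer held in a D local variable. readThrough is 
a hypothetical name, and the sketch assumes a compiler with the 
DMD-style inline assembler (DMD or LDC on x86 / x86-64); on 
anything else it falls back to plain D.

```d
/// Reads a 32-bit value through a pointer stored in a D local.
/// (Hypothetical helper for illustration only.)
uint readThrough(uint* p)
{
    uint* ptr = p;   // local variable, addressed relative to the frame
    uint result;
    version (D_InlineAsm_X86)
    {
        asm
        {
            mov EAX, ptr;      // load the pointer value stored in ptr
            mov EAX, [EAX];    // dereference it
            mov result, EAX;   // store into the D variable
        }
    }
    else version (D_InlineAsm_X86_64)
    {
        asm
        {
            mov RAX, ptr;      // pointers are 64 bits wide here
            mov EAX, [RAX];
            mov result, EAX;
        }
    }
    else
    {
        result = *ptr;         // no DMD-style inline assembler available
    }
    return result;
}
```

Using the variable name directly as an operand accesses the 
variable's own storage, so "movd MM0, p" would load the bytes of 
the pointer itself rather than the pixel it points to.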


I got the code working, although still with a bug, after 
replacing pmulhuw with pmullw, but due to integer overflow I get 
a glitched image. I'm trying to work around the fact that 
pmulhuw stores only the high bits of the result, either with 
multiplication or with bit shifting.




Can I get a more in-depth guide about the inline assembler?

2016-06-01 Thread ZILtoid1991 via Digitalmars-d-learn

Here's the assembly code for my alpha-blending routine:
ubyte[4] src = *cast(ubyte[4]*)(palette.ptr + 4 * *c);
ubyte[4] *p = cast(ubyte[4]*)(workpad + (offsetX + x)*4 + offsetY);

asm{//moving the values to their destinations
movd    MM0, p;
movd    MM1, src;
movq    MM5, alpha;
movq    MM7, alphaMMXmul_const1;
movq    MM6, alphaMMXmul_const2;

punpcklbw   MM2, MM0;
punpcklbw   MM3, MM1;

paddw   MM6, MM5;   //1 + alpha
psubw   MM7, MM5;   //256 - alpha

pmulhuw MM2, MM6;   //src * (1 + alpha)
pmulhuw MM3, MM7;   //dest * (256 - alpha)
paddw   MM3, MM2;   //(src * (1 + alpha)) + (dest * (256 - alpha))
psrlw   MM3, 8;     //(src * (1 + alpha)) + (dest * (256 - alpha)) / 256


//moving the result to its place;

packuswb    MM4, MM3;
movd    p, MM4;
emms;
}

The two constants being referred here:
static immutable ushort[4] alphaMMXmul_const1 = [256,256,256,256];
static immutable ushort[4] alphaMMXmul_const2 = [1,1,1,1];

alpha is a ushort[4] containing the alpha value four times.

After some debugging, I found out that the p pointer becomes null 
at the end instead of pointing to a value. I have no experience 
with using in-line assemblers (although I made a few Hello World 
programs for MS-DOS with a stand-alone assembler), so I don't 
know when and how the compiler will interpret the types from D.


Is it any faster to read array slices than just elements of an array?

2015-04-08 Thread ZILtoid1991 via Digitalmars-d-learn
I have technically finished version 0.2 of my graphics engine, 
which runs at a reasonable speed at low internal resolutions and 
with only a couple of sprites, but it still gets bottlenecked a 
lot. First I'll throw out the top-down determination algorithm, 
as it requires constant memory paging (although it made much more 
sense when the engine was fully OO and even slower).


Instead I'll use an overwriting (bottom-up) method. It still 
needs constant updates, and I have to remove the per-sprite 
transparency key and use a per-layer key, but it requires much 
less paging and still allows an unbounded number of layers and 
sprites of unbounded size.


I also came up with the idea of reading slices out of the 
graphical elements to potentially speed up the process a bit, 
especially as the custom bitmaps it uses are 16-bit for palette 
operations, so per-pixel reads would waste a portion of the 
memory bus. So should I write a method for the bitmap class that 
returns a line from it? (It would be an array slice, since the 
class keeps its data in a single 1D array to avoid jagged arrays 
in a future expansion for a scaler.) And can I write an array 
slice at a given position of an array (to reduce writeback calls)?
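
Both operations asked about here are possible with D's built-in 
slices. Below is a minimal sketch under stated assumptions: 
Bitmap16, line and writeLine are hypothetical names invented for 
this example, not the engine's actual API, and whether this is 
really faster than per-pixel access would still have to be 
measured.

```d
/// Hypothetical 16-bit indexed bitmap stored as one flat 1D array.
final class Bitmap16
{
    private ushort[] pixels;
    private size_t width, height;

    this(size_t width, size_t height)
    {
        this.width = width;
        this.height = height;
        pixels = new ushort[width * height];
    }

    /// Returns row y as a slice; no pixel data is copied.
    ushort[] line(size_t y)
    {
        return pixels[y * width .. (y + 1) * width];
    }

    /// Copies src into row y starting at column x in one operation.
    void writeLine(size_t x, size_t y, const(ushort)[] src)
    {
        assert(x + src.length <= width, "source does not fit in the row");
        const start = y * width + x;
        pixels[start .. start + src.length] = src[];
    }
}
```

The slice returned by line refers to the bitmap's own memory, so 
writing through it changes the bitmap. The slice assignment in 
writeLine requires both sides to have the same length.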