Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-23 Thread Rico Schüller

On 20.04.2013 20:49, Nozomi Kodama wrote:

These bunch of define are pretty ugly anyway. I prefer a pretty duplicated code 
rather than these tons of defines.
Define should be avoided as much as possible.


Well, the only thing why I proposed something like this, was to reduce 
the code size a bit. As it looks like, the same constants and the same 
steps are used multiple times. This doesn't have to be defines, static 
inline functions would also do the job. I'm fine with either solution 
(this includes your solution), it was only a suggestion. :-)


Cheers
Rico




Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-20 Thread Nozomi Kodama


I agrre with you. I should have keep at most 7 floats.

Nozomi



 
 

2013/4/15 Nozomi Kodama nozomi.kod...@yahoo.com:
 Hello

 thanks for the review.
 I don't think that calling defines is the way to go. Indeed, I tested my
 patch and yours. Yours is about 12% slower than mine in my computer.
 And now, we try to take care of efficiency of this dll. So, it is not the
 time  to increase latency.


 I used 10 digits since there are a lot of computation, I want to avoid as
 much as possible big rounding errors. If we want to uniformize, then we
 should uniformize d3dxshmultiply 2,3,4 with 10 digits.
 But that is for an another patch.


Just to chip in about float precision. As already discussed, 32-bit
float constants have about 7 decimals digits of precision, so
extending these constants from 8 to 10 digits should not change
anything. You would have to use doubles if you want more accurate
values (but I don't think we should use doubles in d3dx9 dlls anyway).
What you can do to ensure numeric accuracy is reviewing computations:
the order of the operations (e.g. how you resolve operators
associativity, etc) and storing/reusing intermediate results can make
a difference. I wouldn't care about this stuff unless we get
significantly different results compared to native.

 Nozomi.



Looks pretty much ok, but isn't there a way to reduce the size a bit?
Just see the dirty hack which is attached. D3DXSHMultiply6 will add a
lot of lines too...

Also is there a reason why we use constants with different accuracy
(e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)?

Cheers
Rico







Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-20 Thread Nozomi Kodama
These bunch of define are pretty ugly anyway. I prefer a pretty duplicated code 
rather than these tons of defines.
Define should be avoided as much as possible.





On 15.04.2013 02:10, Nozomi Kodama wrote:
 Hello
 
 thanks for the review.
 I don't think that calling defines is the way to go. Indeed, I tested my 
 patch and yours. Yours is about 12% slower than mine in my computer.
 And now, we try to take care of efficiency of this dll. So, it is not the 
 time  to increase latency.
 
You are right. I'm not really sure why the code should be slower though. The 
#defines shouldn't have an impact on the performance, but it might be because 
it is translated to:

ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8];
tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8];
out[1] = 0.0f + ta * b[1] + tb * a[1];
t = a[1] * b[1];
out[0] = out[0] + 0.28209479f * t;
out[6] = 0.0f + -0.12615662f * t;
out[8] = 0.0f + -0.21850968f * t;

instead of:
ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
out[1] = ta * b[1] + tb * a[1];
t = a[1] * b[1];
out[0] += 0.28209479f * t;
out[6] = -0.12615662f * t;
out[8] = -0.21850968f * t;

May be due to out[8] = -0.21850968f * t; vs out[8] = 0.0f + -0.21850968f * 
t;:
1. the extra 0.0f - Shouldn't the 0.0f get optimized away? Imho the macro could 
be fixed (e.g. use an if / separate macro / don't use a macro for this cases).
2. + -0.21850968f * t; should be as fast as -0.21850968f * t isn't it?

Does anyone else have an objection about this? I just feel that the code size 
and the always reused constants could be written a little bit nicer. I'm not 
sure if the macro usage is really better, it was just a quick idea to avoid 
that much code duplication. Also changes like the suggested below are really 
error prone, because they change a lot of code mostly using copy and paste. 
Though I vote for the same style and precision usage in all cases.

Hope this helps a bit
Rico

 
 I used 10 digits since there are a lot of computation, I want to avoid as 
 much as possible big rounding errors. If we want to uniformize, then we 
 should uniformize d3dxshmultiply 2,3,4 with 10 digits.
 But that is for an another patch.
 
 Nozomi.
 
 


Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-16 Thread Matteo Bruni
2013/4/15 Nozomi Kodama nozomi.kod...@yahoo.com:
 Hello

 thanks for the review.
 I don't think that calling defines is the way to go. Indeed, I tested my
 patch and yours. Yours is about 12% slower than mine in my computer.
 And now, we try to take care of efficiency of this dll. So, it is not the
 time  to increase latency.


 I used 10 digits since there are a lot of computation, I want to avoid as
 much as possible big rounding errors. If we want to uniformize, then we
 should uniformize d3dxshmultiply 2,3,4 with 10 digits.
 But that is for an another patch.


Just to chip in about float precision. As already discussed, 32-bit
float constants have about 7 decimals digits of precision, so
extending these constants from 8 to 10 digits should not change
anything. You would have to use doubles if you want more accurate
values (but I don't think we should use doubles in d3dx9 dlls anyway).
What you can do to ensure numeric accuracy is reviewing computations:
the order of the operations (e.g. how you resolve operators
associativity, etc) and storing/reusing intermediate results can make
a difference. I wouldn't care about this stuff unless we get
significantly different results compared to native.

 Nozomi.



Looks pretty much ok, but isn't there a way to reduce the size a bit?
Just see the dirty hack which is attached. D3DXSHMultiply6 will add a
lot of lines too...

Also is there a reason why we use constants with different accuracy
(e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)?

Cheers
Rico









Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-15 Thread Rico Schüller

On 15.04.2013 02:10, Nozomi Kodama wrote:

Hello

thanks for the review.
I don't think that calling defines is the way to go. Indeed, I tested my patch 
and yours. Yours is about 12% slower than mine in my computer.
And now, we try to take care of efficiency of this dll. So, it is not the time  
to increase latency.

You are right. I'm not really sure why the code should be slower though. 
The #defines shouldn't have an impact on the performance, but it might 
be because it is translated to:


ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8];
tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8];
out[1] = 0.0f + ta * b[1] + tb * a[1];
t = a[1] * b[1];
out[0] = out[0] + 0.28209479f * t;
out[6] = 0.0f + -0.12615662f * t;
out[8] = 0.0f + -0.21850968f * t;

instead of:
ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
out[1] = ta * b[1] + tb * a[1];
t = a[1] * b[1];
out[0] += 0.28209479f * t;
out[6] = -0.12615662f * t;
out[8] = -0.21850968f * t;

May be due to out[8] = -0.21850968f * t; vs out[8] = 0.0f + 
-0.21850968f * t;:
1. the extra 0.0f - Shouldn't the 0.0f get optimized away? Imho the 
macro could be fixed (e.g. use an if / separate macro / don't use a 
macro for this cases).

2. + -0.21850968f * t; should be as fast as -0.21850968f * t isn't it?

Does anyone else have an objection about this? I just feel that the code 
size and the always reused constants could be written a little bit 
nicer. I'm not sure if the macro usage is really better, it was just a 
quick idea to avoid that much code duplication. Also changes like the 
suggested below are really error prone, because they change a lot of 
code mostly using copy and paste. Though I vote for the same style and 
precision usage in all cases.


Hope this helps a bit
Rico



I used 10 digits since there are a lot of computation, I want to avoid as much 
as possible big rounding errors. If we want to uniformize, then we should 
uniformize d3dxshmultiply 2,3,4 with 10 digits.
But that is for an another patch.

Nozomi.








Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-15 Thread David Laight
On Mon, Apr 15, 2013 at 10:24:28AM +0200, Rico Sch?ller wrote:
 You are right. I'm not really sure why the code should be slower though. 
 The #defines shouldn't have an impact on the performance, but it might 
 be because it is translated to:
 
 ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8];
 tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8];
 out[1] = 0.0f + ta * b[1] + tb * a[1];
 t = a[1] * b[1];
 out[0] = out[0] + 0.28209479f * t;
 out[6] = 0.0f + -0.12615662f * t;
 out[8] = 0.0f + -0.21850968f * t;
 
 instead of:
 ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
 tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
 out[1] = ta * b[1] + tb * a[1];
 t = a[1] * b[1];
 out[0] += 0.28209479f * t;
 out[6] = -0.12615662f * t;
 out[8] = -0.21850968f * t;

If everything is 'float' (no doubles anywhere) then I can't see
why the above would compile to different code at any sane
optimisation level - best to look at the object code.

Unless the compiler knows that out[] can't overlap a[] or b[] the
generated code is likely to be better if 't' is evaluated before
the write to out[1].

David

-- 
David Laight: da...@l8s.co.uk




Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-14 Thread Rico Schüller

On 13.04.2013 01:55, Nozomi Kodama wrote:

Is there a problem with this patch?
http://source.winehq.org/patches/data/9

Nozomi




Looks pretty much ok, but isn't there a way to reduce the size a bit? 
Just see the dirty hack which is attached. D3DXSHMultiply6 will add a 
lot of lines too...


Also is there a reason why we use constants with different accuracy 
(e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)?


Cheers
Rico
commit 5939b27e87acbd1cac30faa975e4c3d246ff5f9d
Author: Rico Schüller kgbric...@web.de
Date:   Sun Apr 14 16:49:16 2013 +0200

d3dx9

diff --git a/dlls/d3dx9_36/math.c b/dlls/d3dx9_36/math.c
index 207d152..d675ad0 100644
--- a/dlls/d3dx9_36/math.c
+++ b/dlls/d3dx9_36/math.c
@@ -2580,6 +2580,63 @@ FLOAT * WINAPI D3DXSHMultiply2(FLOAT *out, const FLOAT *a, const FLOAT *b)
 return out;
 }
 
+#define c1(i1, m1)\
+ta = m1 * a[i1];\
+tb = m1 * b[i1];
+
+#define c2(i1, m1, i2, m2)\
+ta = m1 * a[i1] + m2 * a[i2];\
+tb = m1 * b[i1] + m2 * b[i2];
+
+#define c3(i1, m1, i2, m2, i3, m3)\
+ta = m1 * a[i1] + m2 * a[i2] + m3 * a[i3];\
+tb = m1 * b[i1] + m2 * b[i2] + m3 * b[i3];
+
+#define t1(i1, v1)\
+out[i1] = v1 + ta * b[i1] + tb * a[i1];\
+t = a[i1] * b[i1];
+
+#define t2(i1, v1, i2, v2)\
+out[i1] = v1 + ta * b[i2] + tb * a[i2];\
+out[i2] = v2 + ta * b[i1] + tb * a[i1];\
+t = a[i1] * b[i2] + a[i2] * b[i1];
+
+#define c1_t1(i1, m1, v1, i2, v2)\
+c1(i1, m1)\
+t1(i2, v2)\
+out[i1] = v1 + m1 * t;
+
+#define c1_t2(i1, m1, v1, i2, v2, i3, v3)\
+c1(i1, m1)\
+t2(i2, v2, i3, v3)\
+out[i1] = v1 + m1 * t;
+
+#define c2_t1(i1, m1, v1, i2, m2, v2, i3, v3)\
+c2(i1, m1, i2, m2)\
+t1(i3, v3)\
+out[i1] = v1 + m1 * t;\
+out[i2] = v2 + m2 * t;
+
+#define c2_t2(i1, m1, v1, i2, m2, v2, i3, v3, i4, v4)\
+c2(i1, m1, i2, m2)\
+t2(i3, v3, i4, v4)\
+out[i1] = v1 + m1 * t;\
+out[i2] = v2 + m2 * t;
+
+#define c3_t1(i1, m1, v1, i2, m2, v2, i3, m3, v3, i4, v4)\
+c3(i1, m1, i2, m2, i3, m3)\
+t1(i4, v4)\
+out[i1] = v1 + m1 * t;\
+out[i2] = v2 + m2 * t;\
+out[i3] = v3 + m3 * t;
+
+#define c3_t2(i1, m1, v1, i2, m2, v2, i3, m3, v3, i4, v4, i5, v5)\
+c3(i1, m1, i2, m2, i3, m3)\
+t2(i4, v4, i5, v5)\
+out[i1] = v1 + m1 * t;\
+out[i2] = v2 + m2 * t;\
+out[i3] = v3 + m3 * t;
+
 FLOAT * WINAPI D3DXSHMultiply3(FLOAT *out, const FLOAT *a, const FLOAT *b)
 {
 FLOAT t, ta, tb;
@@ -2587,94 +2644,19 @@ FLOAT * WINAPI D3DXSHMultiply3(FLOAT *out, const FLOAT *a, const FLOAT *b)
 TRACE(out %p, a %p, b %p\n, out, a, b);
 
 out[0] = 0.28209479f * a[0] * b[0];
-
-ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8];
-tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8];
-out[1] = ta * b[1] + tb * a[1];
-t = a[1] * b[1];
-out[0] += 0.28209479f * t;
-out[6] = -0.12615662f * t;
-out[8] = -0.21850968f * t;
-
-ta = 0.21850968f * a[5];
-tb = 0.21850968f * b[5];
-out[1] += ta * b[2] + tb * a[2];
-out[2] = ta * b[1] + tb * a[1];
-t = a[1] * b[2] +a[2] * b[1];
-out[5] = 0.21850968f * t;
-
-ta = 0.21850968f * a[4];
-tb = 0.21850968f * b[4];
-out[1] += ta * b[3] + tb * a[3];
-out[3]  = ta * b[1] + tb * a[1];
-t = a[1] * b[3] + a[3] * b[1];
-out[4] = 0.21850968f * t;
-
-ta = 0.28209480f * a[0] + 0.25231326f * a[6];
-tb = 0.28209480f * b[0] + 0.25231326f * b[6];
-out[2] += ta * b[2] + tb * a[2];
-t = a[2] * b[2];
-out[0] += 0.28209480f * t;
-out[6] += 0.25231326f * t;
-
-ta = 0.21850969f * a[7];
-tb = 0.21850969f * b[7];
-out[2] += ta * b[3] + tb * a[3];
-out[3] += ta * b[2] + tb * a[2];
-t = a[2] * b[3] + a[3] * b[2];
-out[7] = 0.21850969f * t;
-
-ta = 0.28209479f * a[0] - 0.12615663f * a[6] + 0.21850969f * a[8];
-tb = 0.28209479f * b[0] - 0.12615663f * b[6] + 0.21850969f * b[8];
-out[3] += ta * b[3] + tb * a[3];
-t = a[3] * b[3];
-out[0] += 0.28209479f * t;
-out[6] -= 0.12615663f * t;
-out[8] += 0.21850969f * t;
-
-ta = 0.28209479f * a[0] - 0.18022375f * a[6];
-tb = 0.28209479f * b[0] - 0.18022375f * b[6];
-out[4] += ta * b[4] + tb * a[4];
-t = a[4] * b[4];
-out[0] += 0.28209479f * t;
-out[6] -= 0.18022375f * t;
-
-ta = 0.15607835f * a[7];
-tb = 0.15607835f * b[7];
-out[4] += ta * b[5] + tb * a[5];
-out[5] += ta * b[4] + tb * a[4];
-t = a[4] * b[5] + a[5] * b[4];
-out[7] += 0.15607834f * t;
-
-ta = 0.28209479f * a[0] + 0.09011186 * a[6] - 0.15607835f * a[8];
-tb = 0.28209479f * b[0] + 0.09011186 * b[6] - 0.15607835f * b[8];
-out[5] += ta * b[5] + tb * a[5];
-t = a[5] * b[5];
-out[0] += 0.28209479f * t;
-out[6] += 0.09011186f * t;
-out[8] -= 0.15607835f * t;
-
-ta = 0.28209480f * a[0];
-tb = 0.28209480f * b[0];
-out[6] += ta * b[6] + tb * a[6];
-t = a[6] * b[6];
-out[0] += 0.28209480f * t;
+c3_t1(0, 0.28209479f, out[0], 

Re: d3dx9: Implement D3DXSHMultilpy5

2013-04-14 Thread Nozomi Kodama
Hello

thanks for the review.
I don't think that calling defines is the way to go. Indeed, I tested my patch 
and yours. Yours is about 12% slower than mine in my computer.
And now, we try to take care of efficiency of this dll. So, it is not the time  
to increase latency.



I used 10 digits since there are a lot of computation, I want to avoid as much 
as possible big rounding errors. If we want to uniformize, then we should 
uniformize d3dxshmultiply 2,3,4 with 10 digits.
But that is for an another patch.

Nozomi.





Looks pretty much ok, but isn't there a way to reduce the size a bit? 
Just see the dirty hack which is attached. D3DXSHMultiply6 will add a 
lot of lines too...

Also is there a reason why we use constants with different accuracy 
(e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)?

Cheers
Rico


d3dx9: Implement D3DXSHMultilpy5

2013-04-12 Thread Nozomi Kodama
Is there a problem with this patch?
http://source.winehq.org/patches/data/9

Nozomi