Re: d3dx9: Implement D3DXSHMultilpy5
On 20.04.2013 20:49, Nozomi Kodama wrote: These bunch of define are pretty ugly anyway. I prefer a pretty duplicated code rather than these tons of defines. Define should be avoided as much as possible. Well, the only thing why I proposed something like this, was to reduce the code size a bit. As it looks like, the same constants and the same steps are used multiple times. This doesn't have to be defines, static inline functions would also do the job. I'm fine with either solution (this includes your solution), it was only a suggestion. :-) Cheers Rico
Re: d3dx9: Implement D3DXSHMultilpy5
I agrre with you. I should have keep at most 7 floats. Nozomi 2013/4/15 Nozomi Kodama nozomi.kod...@yahoo.com: Hello thanks for the review. I don't think that calling defines is the way to go. Indeed, I tested my patch and yours. Yours is about 12% slower than mine in my computer. And now, we try to take care of efficiency of this dll. So, it is not the time to increase latency. I used 10 digits since there are a lot of computation, I want to avoid as much as possible big rounding errors. If we want to uniformize, then we should uniformize d3dxshmultiply 2,3,4 with 10 digits. But that is for an another patch. Just to chip in about float precision. As already discussed, 32-bit float constants have about 7 decimals digits of precision, so extending these constants from 8 to 10 digits should not change anything. You would have to use doubles if you want more accurate values (but I don't think we should use doubles in d3dx9 dlls anyway). What you can do to ensure numeric accuracy is reviewing computations: the order of the operations (e.g. how you resolve operators associativity, etc) and storing/reusing intermediate results can make a difference. I wouldn't care about this stuff unless we get significantly different results compared to native. Nozomi. Looks pretty much ok, but isn't there a way to reduce the size a bit? Just see the dirty hack which is attached. D3DXSHMultiply6 will add a lot of lines too... Also is there a reason why we use constants with different accuracy (e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)? Cheers Rico
Re: d3dx9: Implement D3DXSHMultilpy5
These bunch of define are pretty ugly anyway. I prefer a pretty duplicated code rather than these tons of defines. Define should be avoided as much as possible. On 15.04.2013 02:10, Nozomi Kodama wrote: Hello thanks for the review. I don't think that calling defines is the way to go. Indeed, I tested my patch and yours. Yours is about 12% slower than mine in my computer. And now, we try to take care of efficiency of this dll. So, it is not the time to increase latency. You are right. I'm not really sure why the code should be slower though. The #defines shouldn't have an impact on the performance, but it might be because it is translated to: ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8]; tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8]; out[1] = 0.0f + ta * b[1] + tb * a[1]; t = a[1] * b[1]; out[0] = out[0] + 0.28209479f * t; out[6] = 0.0f + -0.12615662f * t; out[8] = 0.0f + -0.21850968f * t; instead of: ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8]; tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8]; out[1] = ta * b[1] + tb * a[1]; t = a[1] * b[1]; out[0] += 0.28209479f * t; out[6] = -0.12615662f * t; out[8] = -0.21850968f * t; May be due to out[8] = -0.21850968f * t; vs out[8] = 0.0f + -0.21850968f * t;: 1. the extra 0.0f - Shouldn't the 0.0f get optimized away? Imho the macro could be fixed (e.g. use an if / separate macro / don't use a macro for this cases). 2. + -0.21850968f * t; should be as fast as -0.21850968f * t isn't it? Does anyone else have an objection about this? I just feel that the code size and the always reused constants could be written a little bit nicer. I'm not sure if the macro usage is really better, it was just a quick idea to avoid that much code duplication. Also changes like the suggested below are really error prone, because they change a lot of code mostly using copy and paste. Though I vote for the same style and precision usage in all cases. Hope this helps a bit Rico I used 10 digits since there are a lot of computation, I want to avoid as much as possible big rounding errors. If we want to uniformize, then we should uniformize d3dxshmultiply 2,3,4 with 10 digits. But that is for an another patch. Nozomi.
Re: d3dx9: Implement D3DXSHMultilpy5
2013/4/15 Nozomi Kodama nozomi.kod...@yahoo.com: Hello thanks for the review. I don't think that calling defines is the way to go. Indeed, I tested my patch and yours. Yours is about 12% slower than mine in my computer. And now, we try to take care of efficiency of this dll. So, it is not the time to increase latency. I used 10 digits since there are a lot of computation, I want to avoid as much as possible big rounding errors. If we want to uniformize, then we should uniformize d3dxshmultiply 2,3,4 with 10 digits. But that is for an another patch. Just to chip in about float precision. As already discussed, 32-bit float constants have about 7 decimals digits of precision, so extending these constants from 8 to 10 digits should not change anything. You would have to use doubles if you want more accurate values (but I don't think we should use doubles in d3dx9 dlls anyway). What you can do to ensure numeric accuracy is reviewing computations: the order of the operations (e.g. how you resolve operators associativity, etc) and storing/reusing intermediate results can make a difference. I wouldn't care about this stuff unless we get significantly different results compared to native. Nozomi. Looks pretty much ok, but isn't there a way to reduce the size a bit? Just see the dirty hack which is attached. D3DXSHMultiply6 will add a lot of lines too... Also is there a reason why we use constants with different accuracy (e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)? Cheers Rico
Re: d3dx9: Implement D3DXSHMultilpy5
On 15.04.2013 02:10, Nozomi Kodama wrote: Hello thanks for the review. I don't think that calling defines is the way to go. Indeed, I tested my patch and yours. Yours is about 12% slower than mine in my computer. And now, we try to take care of efficiency of this dll. So, it is not the time to increase latency. You are right. I'm not really sure why the code should be slower though. The #defines shouldn't have an impact on the performance, but it might be because it is translated to: ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8]; tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8]; out[1] = 0.0f + ta * b[1] + tb * a[1]; t = a[1] * b[1]; out[0] = out[0] + 0.28209479f * t; out[6] = 0.0f + -0.12615662f * t; out[8] = 0.0f + -0.21850968f * t; instead of: ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8]; tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8]; out[1] = ta * b[1] + tb * a[1]; t = a[1] * b[1]; out[0] += 0.28209479f * t; out[6] = -0.12615662f * t; out[8] = -0.21850968f * t; May be due to out[8] = -0.21850968f * t; vs out[8] = 0.0f + -0.21850968f * t;: 1. the extra 0.0f - Shouldn't the 0.0f get optimized away? Imho the macro could be fixed (e.g. use an if / separate macro / don't use a macro for this cases). 2. + -0.21850968f * t; should be as fast as -0.21850968f * t isn't it? Does anyone else have an objection about this? I just feel that the code size and the always reused constants could be written a little bit nicer. I'm not sure if the macro usage is really better, it was just a quick idea to avoid that much code duplication. Also changes like the suggested below are really error prone, because they change a lot of code mostly using copy and paste. Though I vote for the same style and precision usage in all cases. Hope this helps a bit Rico I used 10 digits since there are a lot of computation, I want to avoid as much as possible big rounding errors. If we want to uniformize, then we should uniformize d3dxshmultiply 2,3,4 with 10 digits. But that is for an another patch. Nozomi.
Re: d3dx9: Implement D3DXSHMultilpy5
On Mon, Apr 15, 2013 at 10:24:28AM +0200, Rico Sch?ller wrote: You are right. I'm not really sure why the code should be slower though. The #defines shouldn't have an impact on the performance, but it might be because it is translated to: ta = 0.28209479f * a[0] + -0.12615662f * a[6] + -0.21850968f * a[8]; tb = 0.28209479f * b[0] + -0.12615662f * b[6] + -0.21850968f * b[8]; out[1] = 0.0f + ta * b[1] + tb * a[1]; t = a[1] * b[1]; out[0] = out[0] + 0.28209479f * t; out[6] = 0.0f + -0.12615662f * t; out[8] = 0.0f + -0.21850968f * t; instead of: ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8]; tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8]; out[1] = ta * b[1] + tb * a[1]; t = a[1] * b[1]; out[0] += 0.28209479f * t; out[6] = -0.12615662f * t; out[8] = -0.21850968f * t; If everything is 'float' (no doubles anywhere) then I can't see why the above would compile to different code at any sane optimisation level - best to look at the object code. Unless the compiler knows that out[] can't overlap a[] or b[] the generated code is likely to be better if 't' is evaluated before the write to out[1]. David -- David Laight: da...@l8s.co.uk
Re: d3dx9: Implement D3DXSHMultilpy5
On 13.04.2013 01:55, Nozomi Kodama wrote: Is there a problem with this patch? http://source.winehq.org/patches/data/9 Nozomi Looks pretty much ok, but isn't there a way to reduce the size a bit? Just see the dirty hack which is attached. D3DXSHMultiply6 will add a lot of lines too... Also is there a reason why we use constants with different accuracy (e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)? Cheers Rico commit 5939b27e87acbd1cac30faa975e4c3d246ff5f9d Author: Rico Schüller kgbric...@web.de Date: Sun Apr 14 16:49:16 2013 +0200 d3dx9 diff --git a/dlls/d3dx9_36/math.c b/dlls/d3dx9_36/math.c index 207d152..d675ad0 100644 --- a/dlls/d3dx9_36/math.c +++ b/dlls/d3dx9_36/math.c @@ -2580,6 +2580,63 @@ FLOAT * WINAPI D3DXSHMultiply2(FLOAT *out, const FLOAT *a, const FLOAT *b) return out; } +#define c1(i1, m1)\ +ta = m1 * a[i1];\ +tb = m1 * b[i1]; + +#define c2(i1, m1, i2, m2)\ +ta = m1 * a[i1] + m2 * a[i2];\ +tb = m1 * b[i1] + m2 * b[i2]; + +#define c3(i1, m1, i2, m2, i3, m3)\ +ta = m1 * a[i1] + m2 * a[i2] + m3 * a[i3];\ +tb = m1 * b[i1] + m2 * b[i2] + m3 * b[i3]; + +#define t1(i1, v1)\ +out[i1] = v1 + ta * b[i1] + tb * a[i1];\ +t = a[i1] * b[i1]; + +#define t2(i1, v1, i2, v2)\ +out[i1] = v1 + ta * b[i2] + tb * a[i2];\ +out[i2] = v2 + ta * b[i1] + tb * a[i1];\ +t = a[i1] * b[i2] + a[i2] * b[i1]; + +#define c1_t1(i1, m1, v1, i2, v2)\ +c1(i1, m1)\ +t1(i2, v2)\ +out[i1] = v1 + m1 * t; + +#define c1_t2(i1, m1, v1, i2, v2, i3, v3)\ +c1(i1, m1)\ +t2(i2, v2, i3, v3)\ +out[i1] = v1 + m1 * t; + +#define c2_t1(i1, m1, v1, i2, m2, v2, i3, v3)\ +c2(i1, m1, i2, m2)\ +t1(i3, v3)\ +out[i1] = v1 + m1 * t;\ +out[i2] = v2 + m2 * t; + +#define c2_t2(i1, m1, v1, i2, m2, v2, i3, v3, i4, v4)\ +c2(i1, m1, i2, m2)\ +t2(i3, v3, i4, v4)\ +out[i1] = v1 + m1 * t;\ +out[i2] = v2 + m2 * t; + +#define c3_t1(i1, m1, v1, i2, m2, v2, i3, m3, v3, i4, v4)\ +c3(i1, m1, i2, m2, i3, m3)\ +t1(i4, v4)\ +out[i1] = v1 + m1 * t;\ +out[i2] = v2 + m2 * t;\ +out[i3] = v3 + m3 * t; + +#define c3_t2(i1, m1, v1, i2, m2, v2, i3, m3, v3, i4, v4, i5, v5)\ +c3(i1, m1, i2, m2, i3, m3)\ +t2(i4, v4, i5, v5)\ +out[i1] = v1 + m1 * t;\ +out[i2] = v2 + m2 * t;\ +out[i3] = v3 + m3 * t; + FLOAT * WINAPI D3DXSHMultiply3(FLOAT *out, const FLOAT *a, const FLOAT *b) { FLOAT t, ta, tb; @@ -2587,94 +2644,19 @@ FLOAT * WINAPI D3DXSHMultiply3(FLOAT *out, const FLOAT *a, const FLOAT *b) TRACE(out %p, a %p, b %p\n, out, a, b); out[0] = 0.28209479f * a[0] * b[0]; - -ta = 0.28209479f * a[0] - 0.12615662f * a[6] - 0.21850968f * a[8]; -tb = 0.28209479f * b[0] - 0.12615662f * b[6] - 0.21850968f * b[8]; -out[1] = ta * b[1] + tb * a[1]; -t = a[1] * b[1]; -out[0] += 0.28209479f * t; -out[6] = -0.12615662f * t; -out[8] = -0.21850968f * t; - -ta = 0.21850968f * a[5]; -tb = 0.21850968f * b[5]; -out[1] += ta * b[2] + tb * a[2]; -out[2] = ta * b[1] + tb * a[1]; -t = a[1] * b[2] +a[2] * b[1]; -out[5] = 0.21850968f * t; - -ta = 0.21850968f * a[4]; -tb = 0.21850968f * b[4]; -out[1] += ta * b[3] + tb * a[3]; -out[3] = ta * b[1] + tb * a[1]; -t = a[1] * b[3] + a[3] * b[1]; -out[4] = 0.21850968f * t; - -ta = 0.28209480f * a[0] + 0.25231326f * a[6]; -tb = 0.28209480f * b[0] + 0.25231326f * b[6]; -out[2] += ta * b[2] + tb * a[2]; -t = a[2] * b[2]; -out[0] += 0.28209480f * t; -out[6] += 0.25231326f * t; - -ta = 0.21850969f * a[7]; -tb = 0.21850969f * b[7]; -out[2] += ta * b[3] + tb * a[3]; -out[3] += ta * b[2] + tb * a[2]; -t = a[2] * b[3] + a[3] * b[2]; -out[7] = 0.21850969f * t; - -ta = 0.28209479f * a[0] - 0.12615663f * a[6] + 0.21850969f * a[8]; -tb = 0.28209479f * b[0] - 0.12615663f * b[6] + 0.21850969f * b[8]; -out[3] += ta * b[3] + tb * a[3]; -t = a[3] * b[3]; -out[0] += 0.28209479f * t; -out[6] -= 0.12615663f * t; -out[8] += 0.21850969f * t; - -ta = 0.28209479f * a[0] - 0.18022375f * a[6]; -tb = 0.28209479f * b[0] - 0.18022375f * b[6]; -out[4] += ta * b[4] + tb * a[4]; -t = a[4] * b[4]; -out[0] += 0.28209479f * t; -out[6] -= 0.18022375f * t; - -ta = 0.15607835f * a[7]; -tb = 0.15607835f * b[7]; -out[4] += ta * b[5] + tb * a[5]; -out[5] += ta * b[4] + tb * a[4]; -t = a[4] * b[5] + a[5] * b[4]; -out[7] += 0.15607834f * t; - -ta = 0.28209479f * a[0] + 0.09011186 * a[6] - 0.15607835f * a[8]; -tb = 0.28209479f * b[0] + 0.09011186 * b[6] - 0.15607835f * b[8]; -out[5] += ta * b[5] + tb * a[5]; -t = a[5] * b[5]; -out[0] += 0.28209479f * t; -out[6] += 0.09011186f * t; -out[8] -= 0.15607835f * t; - -ta = 0.28209480f * a[0]; -tb = 0.28209480f * b[0]; -out[6] += ta * b[6] + tb * a[6]; -t = a[6] * b[6]; -out[0] += 0.28209480f * t; +c3_t1(0, 0.28209479f, out[0],
Re: d3dx9: Implement D3DXSHMultilpy5
Hello thanks for the review. I don't think that calling defines is the way to go. Indeed, I tested my patch and yours. Yours is about 12% slower than mine in my computer. And now, we try to take care of efficiency of this dll. So, it is not the time to increase latency. I used 10 digits since there are a lot of computation, I want to avoid as much as possible big rounding errors. If we want to uniformize, then we should uniformize d3dxshmultiply 2,3,4 with 10 digits. But that is for an another patch. Nozomi. Looks pretty much ok, but isn't there a way to reduce the size a bit? Just see the dirty hack which is attached. D3DXSHMultiply6 will add a lot of lines too... Also is there a reason why we use constants with different accuracy (e.g. 0.28209479f in D3DXSHMultiply4 and 0.2820948064f)? Cheers Rico
d3dx9: Implement D3DXSHMultilpy5
Is there a problem with this patch? http://source.winehq.org/patches/data/9 Nozomi