Re: [Mono-dev] Mono.Simd - slower than the normal implementation
Hi Alan, There a couple of issues with your code, let me get on them: -Until recently (last night), getters were not accelerated, which causes a significant slowdown. I fixed this in r118899. The generated code is not as good as it could be, but this will be fixed eventually. -Setters are still not accelerated, I'll work on this next week, so until then your code will suffer. -Once you use a single non-accelerated method on a Vector variable all operations on it will be slower due to how our JIT works - they still use sse instructions, but with a performance penalty. -Getters and setter are a hint of ill vectorized code. The last part of your unsafe code should use temps for the intermediate results. -In the unsafe case you should use a Vector4ui store instead of extracting each element. -For the safe case we still miss proper integration with arrays, in the form of methods to extract and store vectors from them. Your code looks a bit strange, the Vector4ui constructor indexes in particular. Have you checked that the output of the 3 methods are the same? I'll work on the Mono.Simd issues next week, getting setters to be accelerated, some methods to better integrate with arrays and other things like element extractors. Rodrigo On Sat, Nov 15, 2008 at 12:13 AM, Alan McGovern [EMAIL PROTECTED]wrote: I found a bit of code in the SHA1 implementation which i thought was ideal for SIMD optimisations. However, unless i resort to unsafe code, it's actually substantially slower! I've attached three implementations of the method here. The original, the safe SIMD and the unsafe SIMD. The runtimes are as follows: Original: 600ms Unsafe Simd: 450ms Safe Simd: 1700ms Also, the method is always called with a uint[] of length 80. Is this just the wrong place to be using simd? It seemed ideal because i need 75% less XOR's. If anyone has an ideas on whether SIMD could actually be useful for this case or not, let me know. Thanks, Alan. The original code is: private static void FillBuff(uint[] buff) { uint val; for (int i = 16; i 80; i += 8) { val = buff[i - 3] ^ buff[i - 8] ^ buff[i - 14] ^ buff[i - 16]; buff[i] = (val 1) | (val 31); val = buff[i - 2] ^ buff[i - 7] ^ buff[i - 13] ^ buff[i - 15]; buff[i + 1] = (val 1) | (val 31); val = buff[i - 1] ^ buff[i - 6] ^ buff[i - 12] ^ buff[i - 14]; buff[i + 2] = (val 1) | (val 31); val = buff[i + 0] ^ buff[i - 5] ^ buff[i - 11] ^ buff[i - 13]; buff[i + 3] = (val 1) | (val 31); val = buff[i + 1] ^ buff[i - 4] ^ buff[i - 10] ^ buff[i - 12]; buff[i + 4] = (val 1) | (val 31); val = buff[i + 2] ^ buff[i - 3] ^ buff[i - 9] ^ buff[i - 11]; buff[i + 5] = (val 1) | (val 31); val = buff[i + 3] ^ buff[i - 2] ^ buff[i - 8] ^ buff[i - 10]; buff[i + 6] = (val 1) | (val 31); val = buff[i + 4] ^ buff[i - 1] ^ buff[i - 7] ^ buff[i - 9]; buff[i + 7] = (val 1) | (val 31); } } The unsafe SIMD code is: public unsafe static void FillBuff(uint[] buffb) { fixed (uint* buff = buffb) { Vector4ui e; for (int t = 16; t buffb.Length; t += 4) { e = *((Vector4ui*)(buff [t-16])) ^ *((Vector4ui*)(buff [t-14])) ^ *((Vector4ui*)(buff [t- 8])) ^ *((Vector4ui*)(buff [t- 3])); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = (e.W 1) | (e.W 31) ^ ((e.X 2) | (e.X 30)); } } } The safe simd code is: public static void FillBuff(uint[] buff) { Vector4ui e; for (int t = 16; t buff.Length; t += 4) { e = new Vector4ui (buff [t-16],buff [t-15],buff [t-14],buff [t-13]) ^ new Vector4ui (buff [t-14],buff [t-13],buff [t-12],buff [t-11]) ^ new Vector4ui (buff [t-8], buff [t-7], buff [t-6], buff [t-5]) ^ new Vector4ui (buff [t-3], buff [t-2], buff [t-1], buff [t-0]); e.W ^= buff[t]; buff[t] =(e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = (e.W 1) | (e.W 31) ^ ((e.X 2) | (e.X 30)); } } ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com
Re: [Mono-dev] Mono.Simd - slower than the normal implementation
Hey, On Sat, Nov 15, 2008 at 3:50 PM, Rodrigo Kumpera [EMAIL PROTECTED] wrote: Hi Alan, -Getters and setter are a hint of ill vectorized code. In this particular scenario, I'm not sure how i can get rid of the use of getters/setters unless I use even more unsafe code. I don't know whether it's feasible or not, but it'd be great to be able to use this API without having to use unsafe code. At the moment, I don't think it's really possible to use this API without getters and setters. The last part of your unsafe code should use temps for the intermediate results. Do you mean that I should copy the vector 'e', which i got from XOR'ing my values, into another Vector4ui using the store operation? Then I should do my bitshifting/storing into uint[] from that one? -For the safe case we still miss proper integration with arrays, in the form of methods to extract and store vectors from them. I was thinking that the API could expose something like: Vector4ui.Create (uint[] array, int offset, ref Vector4ui result) which could be changed into: result = *((Vector4ui*)array [offset]); Though I'm sure you have ideas already on this ;) A similar method for storing the result into a uint[] would be great too. Your code looks a bit strange, the Vector4ui constructor indexes in particular. Have you checked that the output of the 3 methods are the same? Yes, there is a bug in my implementation there, I left out a bracket when setting the value of buff[t+3]. There should be an additional bracket around (e.W 1) | (e.W 31). Other than that, the implementation is correct. I've pasted the correct implementation of the unsafe and safe SIMD versions below. Just for reference purposes. I'll work on the Mono.Simd issues next week, getting setters to be accelerated, some methods to better integrate with arrays and other things like element extractors. Great stuff. Give me a shout when you've done that and I'll try to improve the above implementation. Though if you have time to spare while writing the SIMD code, you could take a look at it yourself ;) Thanks, Alan. Reference implementations (non buggy ;) ): public static void FillBuffSafe(uint[] buff) { for (int t = 16; t buff.Length; t += 4) { Vector4ui e = new Vector4ui(buff[t - 3], buff[t - 2], buff[t - 1], buff[t - 0]) ^ new Vector4ui(buff[t - 8], buff[t - 7], buff[t - 6], buff[t - 5]) ^ new Vector4ui(buff[t - 14], buff[t - 13], buff[t - 12], buff[t - 11]) ^ new Vector4ui(buff[t - 16], buff[t - 15], buff[t - 14], buff[t - 13]); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = ((e.W 1) | (e.W 31)) ^ ((e.X 2) | (e.X 30)); } } public unsafe static void FillBuffUnsafe(uint[] buffb) { fixed (uint* buff = buffb) { for (int t = 16; t buffb.Length; t += 4) { Vector4ui e = *((Vector4ui*)buff[t - 3]) ^ *((Vector4ui*)buff[t - 8]) ^ *((Vector4ui*)buff[t - 14]) ^ *((Vector4ui*)buff[t - 16]); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = ((e.W 1) | (e.W 31)) ^ ((e.X 2) | (e.X 30)); } } } Rodrigo On Sat, Nov 15, 2008 at 12:13 AM, Alan McGovern [EMAIL PROTECTED] wrote: I found a bit of code in the SHA1 implementation which i thought was ideal for SIMD optimisations. However, unless i resort to unsafe code, it's actually substantially slower! I've attached three implementations of the method here. The original, the safe SIMD and the unsafe SIMD. The runtimes are as follows: Original: 600ms Unsafe Simd: 450ms Safe Simd: 1700ms Also, the method is always called with a uint[] of length 80. Is this just the wrong place to be using simd? It seemed ideal because i need 75% less XOR's. If anyone has an ideas on whether SIMD could actually be useful for this case or not, let me know. Thanks, Alan. The original code is: private static void FillBuff(uint[] buff) { uint val; for (int i = 16; i 80; i += 8) { val = buff[i - 3] ^ buff[i - 8] ^ buff[i - 14] ^ buff[i - 16]; buff[i] = (val 1) | (val 31); val = buff[i - 2] ^ buff[i - 7] ^ buff[i - 13] ^ buff[i - 15]; buff[i + 1] = (val 1) | (val 31); val = buff[i - 1] ^ buff[i - 6] ^ buff[i - 12] ^ buff[i - 14]; buff[i + 2] = (val
Re: [Mono-dev] Mono.Simd - slower than the normal implementation
Here's my benchmarking file anyway, it may prove useful. Alan. On Sun, Nov 16, 2008 at 2:37 AM, Alan McGovern [EMAIL PROTECTED] wrote: Hey, On Sat, Nov 15, 2008 at 3:50 PM, Rodrigo Kumpera [EMAIL PROTECTED] wrote: Hi Alan, -Getters and setter are a hint of ill vectorized code. In this particular scenario, I'm not sure how i can get rid of the use of getters/setters unless I use even more unsafe code. I don't know whether it's feasible or not, but it'd be great to be able to use this API without having to use unsafe code. At the moment, I don't think it's really possible to use this API without getters and setters. The last part of your unsafe code should use temps for the intermediate results. Do you mean that I should copy the vector 'e', which i got from XOR'ing my values, into another Vector4ui using the store operation? Then I should do my bitshifting/storing into uint[] from that one? -For the safe case we still miss proper integration with arrays, in the form of methods to extract and store vectors from them. I was thinking that the API could expose something like: Vector4ui.Create (uint[] array, int offset, ref Vector4ui result) which could be changed into: result = *((Vector4ui*)array [offset]); Though I'm sure you have ideas already on this ;) A similar method for storing the result into a uint[] would be great too. Your code looks a bit strange, the Vector4ui constructor indexes in particular. Have you checked that the output of the 3 methods are the same? Yes, there is a bug in my implementation there, I left out a bracket when setting the value of buff[t+3]. There should be an additional bracket around (e.W 1) | (e.W 31). Other than that, the implementation is correct. I've pasted the correct implementation of the unsafe and safe SIMD versions below. Just for reference purposes. I'll work on the Mono.Simd issues next week, getting setters to be accelerated, some methods to better integrate with arrays and other things like element extractors. Great stuff. Give me a shout when you've done that and I'll try to improve the above implementation. Though if you have time to spare while writing the SIMD code, you could take a look at it yourself ;) Thanks, Alan. Reference implementations (non buggy ;) ): public static void FillBuffSafe(uint[] buff) { for (int t = 16; t buff.Length; t += 4) { Vector4ui e = new Vector4ui(buff[t - 3], buff[t - 2], buff[t - 1], buff[t - 0]) ^ new Vector4ui(buff[t - 8], buff[t - 7], buff[t - 6], buff[t - 5]) ^ new Vector4ui(buff[t - 14], buff[t - 13], buff[t - 12], buff[t - 11]) ^ new Vector4ui(buff[t - 16], buff[t - 15], buff[t - 14], buff[t - 13]); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = ((e.W 1) | (e.W 31)) ^ ((e.X 2) | (e.X 30)); } } public unsafe static void FillBuffUnsafe(uint[] buffb) { fixed (uint* buff = buffb) { for (int t = 16; t buffb.Length; t += 4) { Vector4ui e = *((Vector4ui*)buff[t - 3]) ^ *((Vector4ui*)buff[t - 8]) ^ *((Vector4ui*)buff[t - 14]) ^ *((Vector4ui*)buff[t - 16]); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = ((e.W 1) | (e.W 31)) ^ ((e.X 2) | (e.X 30)); } } } Rodrigo On Sat, Nov 15, 2008 at 12:13 AM, Alan McGovern [EMAIL PROTECTED] wrote: I found a bit of code in the SHA1 implementation which i thought was ideal for SIMD optimisations. However, unless i resort to unsafe code, it's actually substantially slower! I've attached three implementations of the method here. The original, the safe SIMD and the unsafe SIMD. The runtimes are as follows: Original: 600ms Unsafe Simd: 450ms Safe Simd: 1700ms Also, the method is always called with a uint[] of length 80. Is this just the wrong place to be using simd? It seemed ideal because i need 75% less XOR's. If anyone has an ideas on whether SIMD could actually be useful for this case or not, let me know. Thanks, Alan. The original code is: private static void FillBuff(uint[] buff) { uint val; for (int i = 16; i 80; i += 8) { val = buff[i - 3] ^ buff[i - 8] ^ buff[i - 14] ^ buff[i - 16]; buff[i] = (val 1) | (val 31); val = buff[i - 2] ^ buff[i - 7] ^ buff[i - 13] ^ buff[i -
[Mono-dev] Mono.Simd - slower than the normal implementation
I found a bit of code in the SHA1 implementation which i thought was ideal for SIMD optimisations. However, unless i resort to unsafe code, it's actually substantially slower! I've attached three implementations of the method here. The original, the safe SIMD and the unsafe SIMD. The runtimes are as follows: Original: 600ms Unsafe Simd: 450ms Safe Simd: 1700ms Also, the method is always called with a uint[] of length 80. Is this just the wrong place to be using simd? It seemed ideal because i need 75% less XOR's. If anyone has an ideas on whether SIMD could actually be useful for this case or not, let me know. Thanks, Alan. The original code is: private static void FillBuff(uint[] buff) { uint val; for (int i = 16; i 80; i += 8) { val = buff[i - 3] ^ buff[i - 8] ^ buff[i - 14] ^ buff[i - 16]; buff[i] = (val 1) | (val 31); val = buff[i - 2] ^ buff[i - 7] ^ buff[i - 13] ^ buff[i - 15]; buff[i + 1] = (val 1) | (val 31); val = buff[i - 1] ^ buff[i - 6] ^ buff[i - 12] ^ buff[i - 14]; buff[i + 2] = (val 1) | (val 31); val = buff[i + 0] ^ buff[i - 5] ^ buff[i - 11] ^ buff[i - 13]; buff[i + 3] = (val 1) | (val 31); val = buff[i + 1] ^ buff[i - 4] ^ buff[i - 10] ^ buff[i - 12]; buff[i + 4] = (val 1) | (val 31); val = buff[i + 2] ^ buff[i - 3] ^ buff[i - 9] ^ buff[i - 11]; buff[i + 5] = (val 1) | (val 31); val = buff[i + 3] ^ buff[i - 2] ^ buff[i - 8] ^ buff[i - 10]; buff[i + 6] = (val 1) | (val 31); val = buff[i + 4] ^ buff[i - 1] ^ buff[i - 7] ^ buff[i - 9]; buff[i + 7] = (val 1) | (val 31); } } The unsafe SIMD code is: public unsafe static void FillBuff(uint[] buffb) { fixed (uint* buff = buffb) { Vector4ui e; for (int t = 16; t buffb.Length; t += 4) { e = *((Vector4ui*)(buff [t-16])) ^ *((Vector4ui*)(buff [t-14])) ^ *((Vector4ui*)(buff [t- 8])) ^ *((Vector4ui*)(buff [t- 3])); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = (e.W 1) | (e.W 31) ^ ((e.X 2) | (e.X 30)); } } } The safe simd code is: public static void FillBuff(uint[] buff) { Vector4ui e; for (int t = 16; t buff.Length; t += 4) { e = new Vector4ui (buff [t-16],buff [t-15],buff [t-14],buff [t-13]) ^ new Vector4ui (buff [t-14],buff [t-13],buff [t-12],buff [t-11]) ^ new Vector4ui (buff [t-8], buff [t-7], buff [t-6], buff [t-5]) ^ new Vector4ui (buff [t-3], buff [t-2], buff [t-1], buff [t-0]); e.W ^= buff[t]; buff[t] =(e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = (e.W 1) | (e.W 31) ^ ((e.X 2) | (e.X 30)); } } ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list
Re: [Mono-dev] Mono.Simd - slower than the normal implementation
I forgot to mention that I'm on a 1.86GHZ core2duo and i was running with --optimize=simd. Alan. On Sat, Nov 15, 2008 at 2:13 AM, Alan McGovern [EMAIL PROTECTED] wrote: I found a bit of code in the SHA1 implementation which i thought was ideal for SIMD optimisations. However, unless i resort to unsafe code, it's actually substantially slower! I've attached three implementations of the method here. The original, the safe SIMD and the unsafe SIMD. The runtimes are as follows: Original: 600ms Unsafe Simd: 450ms Safe Simd: 1700ms Also, the method is always called with a uint[] of length 80. Is this just the wrong place to be using simd? It seemed ideal because i need 75% less XOR's. If anyone has an ideas on whether SIMD could actually be useful for this case or not, let me know. Thanks, Alan. The original code is: private static void FillBuff(uint[] buff) { uint val; for (int i = 16; i 80; i += 8) { val = buff[i - 3] ^ buff[i - 8] ^ buff[i - 14] ^ buff[i - 16]; buff[i] = (val 1) | (val 31); val = buff[i - 2] ^ buff[i - 7] ^ buff[i - 13] ^ buff[i - 15]; buff[i + 1] = (val 1) | (val 31); val = buff[i - 1] ^ buff[i - 6] ^ buff[i - 12] ^ buff[i - 14]; buff[i + 2] = (val 1) | (val 31); val = buff[i + 0] ^ buff[i - 5] ^ buff[i - 11] ^ buff[i - 13]; buff[i + 3] = (val 1) | (val 31); val = buff[i + 1] ^ buff[i - 4] ^ buff[i - 10] ^ buff[i - 12]; buff[i + 4] = (val 1) | (val 31); val = buff[i + 2] ^ buff[i - 3] ^ buff[i - 9] ^ buff[i - 11]; buff[i + 5] = (val 1) | (val 31); val = buff[i + 3] ^ buff[i - 2] ^ buff[i - 8] ^ buff[i - 10]; buff[i + 6] = (val 1) | (val 31); val = buff[i + 4] ^ buff[i - 1] ^ buff[i - 7] ^ buff[i - 9]; buff[i + 7] = (val 1) | (val 31); } } The unsafe SIMD code is: public unsafe static void FillBuff(uint[] buffb) { fixed (uint* buff = buffb) { Vector4ui e; for (int t = 16; t buffb.Length; t += 4) { e = *((Vector4ui*)(buff [t-16])) ^ *((Vector4ui*)(buff [t-14])) ^ *((Vector4ui*)(buff [t- 8])) ^ *((Vector4ui*)(buff [t- 3])); e.W ^= buff[t]; buff[t] = (e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = (e.W 1) | (e.W 31) ^ ((e.X 2) | (e.X 30)); } } } The safe simd code is: public static void FillBuff(uint[] buff) { Vector4ui e; for (int t = 16; t buff.Length; t += 4) { e = new Vector4ui (buff [t-16],buff [t-15],buff [t-14],buff [t-13]) ^ new Vector4ui (buff [t-14],buff [t-13],buff [t-12],buff [t-11]) ^ new Vector4ui (buff [t-8], buff [t-7], buff [t-6], buff [t-5]) ^ new Vector4ui (buff [t-3], buff [t-2], buff [t-1], buff [t-0]); e.W ^= buff[t]; buff[t] =(e.X 1) | (e.X 31); buff[t + 1] = (e.Y 1) | (e.Y 31); buff[t + 2] = (e.Z 1) | (e.Z 31); buff[t + 3] = (e.W 1) | (e.W 31) ^ ((e.X 2) | (e.X 30)); } } ___ Mono-devel-list mailing list Mono-devel-list@lists.ximian.com http://lists.ximian.com/mailman/listinfo/mono-devel-list