Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
Hi,

> When using on armv8 architecture, does this mont mul ASM code have any optimization with linux-aarch64 configuration?

There is ambiguity in the question as posed. In the direct context of this discussion, "this" as in "this mont mul ASM code" ought to refer to armv4-mont. In that case the answer would have to be no: ARMv4 code cannot be used in an aarch64 build. But it's also possible to generalize and read the question as "this thing, the Montgomery multiplication module you are talking about, is there an aarch64 equivalent?" The answer to that question would be yes: there is a corresponding armv8-mont module in the development branch.

___ openssl-dev mailing list To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
Hi Andy, When using on armv8 architecture, does this mont mul ASM code have any optimization with linux-aarch64 configuration? Thanks Ravichandra On Wed, Jun 17, 2015 at 3:06 PM, Andy Polyakov ap...@openssl.org wrote: Hi, With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically. Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years? [snip] I found the cause - although OPENSSL_BN_ASM_MONT was defined, I hadn't noticed that a colleague had put a #define OPENSSL_NO_ASM somewhere else (this isn't linux but a port to our own OS). It turns out that (surprisingly) this combination changes behaviour rather than barfing - it's even explicitly catered for in bn_asm.c. In other words sanity restored. Phew! Incidentally, as next step I was going to ask for copy of your bn_asm.o (yes, binary .o, yes, bn_asm.o, not armv4-mont.o), and bn_mul_mont should have shown up and presumably noticed as unexpected... Regardless, the effect is that a different bn_mul_mont implementation gets used, and the armv4-mont.pl implementation gets ignored entirely. Right. And as mentioned in commentary bn_mul_mont in bn_asm.c is just a template with no performance promises attached. Note that it still exhibits previously mentioned breaking point... With that fixed, I now have greatly improved performance as expected. So that with armv4-mont actually in the loop, the breaking point is still beyond practical key lengths, even on ARM9. An unfortunate waste of time for us both, but thanks for the assistance. Given that presented timings for ARM9 are kind of astronomic you might want to consider if it's possible to use other algorithms. 
EC is getting wider adoption now and can perform better. Not to mention that optimized NIST P-256 EC was recently added... Cheers.
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
Hi,

>>> With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically. Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years?
> [snip]
> I found the cause - although OPENSSL_BN_ASM_MONT was defined, I hadn't noticed that a colleague had put a #define OPENSSL_NO_ASM somewhere else (this isn't linux but a port to our own OS). It turns out that (surprisingly) this combination changes behaviour rather than barfing - it's even explicitly catered for in bn_asm.c.

In other words, sanity restored. Phew! Incidentally, as the next step I was going to ask for a copy of your bn_asm.o (yes, binary .o, yes, bn_asm.o, not armv4-mont.o), and bn_mul_mont should have shown up there and presumably been noticed as unexpected...

> Regardless, the effect is that a different bn_mul_mont implementation gets used, and the armv4-mont.pl implementation gets ignored entirely.

Right. And as mentioned in the commentary, bn_mul_mont in bn_asm.c is just a template with no performance promises attached. Note that it still exhibits the previously mentioned breaking point...

> With that fixed, I now have greatly improved performance as expected.

So with armv4-mont actually in the loop, the breaking point is still beyond practical key lengths, even on ARM9.

> An unfortunate waste of time for us both, but thanks for the assistance.

Given that the presented timings for ARM9 are kind of astronomic, you might want to consider whether it's possible to use other algorithms. EC is getting wider adoption now and can perform better. Not to mention that optimized NIST P-256 EC was recently added...

Cheers.
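The object-file check mentioned above (looking for bn_mul_mont inside bn_asm.o) can be sketched as a small script. This is only an illustration, assuming a toolchain `nm` that prints the usual "address type name" lines; the helper names `defines_symbol` and `nm_output_for` are hypothetical and not part of OpenSSL.

```python
import subprocess


def defines_symbol(nm_output, symbol):
    """Return True if nm-style text lists `symbol` as defined in the object.

    Defined symbols appear as "<address> <type> <name>" (three fields).
    Undefined references appear as "U <name>" (two fields) and are ignored.
    """
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2] == symbol and parts[1] != "U":
            return True
    return False


def nm_output_for(path, nm="nm"):
    """Run nm on an object file and return its text output.

    `nm` may need to be a cross tool (for example an arm-*-nm binary);
    which one applies is an assumption about the local toolchain.
    """
    result = subprocess.run([nm, path], capture_output=True, text=True, check=True)
    return result.stdout


# Usage, with paths that depend on the local build tree:
#   if defines_symbol(nm_output_for("crypto/bn/bn_asm.o"), "bn_mul_mont"):
#       print("bn_asm.o provides bn_mul_mont: the C template is in use")
```

If bn_mul_mont turns out to be defined in bn_asm.o, that suggests the C template from bn_asm.c is the implementation being linked, which is the situation described in this message.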
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
I meant the armv4-mont code. As the answer to that question is no, I would then have asked whether there is support for armv8-mont. Your response answers both of my questions.

Thanks
Ravichandra

On Wed, Jun 17, 2015 at 5:41 PM, Andy Polyakov ap...@openssl.org wrote:
> Hi,
>> When using on armv8 architecture, does this mont mul ASM code have any optimization with linux-aarch64 configuration?
> There is ambiguity in posed question. In direct context of this discussion this as in this mont mul ASM code ought to refer to armv4-mont. And then the answer would have to be no, ARMv4 code can not used in aarch64 build. But it's also possible to generalize and consider this more like this thing, Montgomery multiplication module, you are talking about, is there aarch64 equivalent? And answer to such question would be yes, there is corresponding armv8-mont module in development branch.
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
Hi,

> After the changes to DH requiring longer key lengths, I switched to 2048-bit keys, but was finding this was now making my test runs on an embedded ARM9 target annoyingly slow; so thought I'd investigate to see if there was anything to improve. With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically. Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years?

Yes, of course. For reference, here are speed rsa2048 dsa2048 results from Cortex-A8. Numbers are operations per second, so that higher is better.

Without armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
dsa 2048 bits 0.014576s 0.017526s     68.6     57.1

With armv4-mont.pl but without NEON (ARM SIMD extension):
rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
dsa 2048 bits 0.011630s 0.013900s     86.0     71.9

With armv4-mont.pl and NEON on:
rsa 2048 bits 0.021053s 0.000606s     47.5   1650.2
dsa 2048 bits 0.006084s 0.006985s    164.4    143.2

Well, RSA/DSA are not DH, but they are very representative when it comes to sheer BIGNUM performance. And of course Cortex-A8 is not ARM9, but at least it shows that statement about armv4-mont.pl being bad for performance does not hold universally true. It's rather contrary, as similar picture can be observed on most ARM processors (well, all I tested).

> Is it just that compilers have become better (I'm only using gcc 4.7.3, so not bleeding edge even).

I don't think so. BIGNUM performance can be delicate balance between multiple factors and it's not impossible to end up on the other side of breaking point. What breaking point?
If you examine the performance improvement with and without the Montgomery multiplication module, you'll notice that there are processors on which the improvement coefficient declines with key length. I mean that the longer the key is, the lower the improvement you'll observe. This indicates that there ought to be a point past which you can just as well observe worse performance, not better. So far such points fell outside practical key lengths on tested systems, ARM or not. Well, except for the s390x-mont module [which by the way even discusses the reasons why such a breaking point exists, see commentary in bn/asm/s390x-mont.pl]. In other words, I argue that your case is a case of finding yourself on the other side of said breaking point on a specific CPU, not a case of armv4-mont.pl being universally inferior. It does come a little bit unexpected, in the sense that I wouldn't expect it to hit the point at 2048-bit key length on any specific ARM processor, but on the other hand it's not impossible (all it takes is the multiplication instruction stalling the pipeline for long enough to tip the balance).

> Anyway, it's uncertain to me whether armv4-mont.pl should remain.

Assuming that the majority of ARM users are not ARM9 users, most would have to disagree :-) So what does that leave us with? One can argue that OpenSSL could detect the breaking point at run-time and act accordingly, but it's tricky and is likely to have too narrow a use. One can argue that OpenSSL can be further optimized so that the breaking point is moved further out (if not eliminated), which is more practical, because it should improve performance on all processors, but this is not something that happens overnight. Meanwhile, just documenting the case and providing instructions on how to disengage the module is probably a reasonable compromise. Would you agree? One can make arrangements so that said instructions would be super-simple...

> FYI, I couldn't discern any difference whether using armv4-gf2m or not, but that doesn't mean it's bad.
armv4-gf2m is involved in Elliptic Curve, and of a specific kind. Your problem description doesn't sound like it should affect you. But even if it did, it's unlikely that you'll notice a regression, because there are no breaking points in that case.
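The "improvement coefficient" argument can be made concrete with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, the coefficient is taken to be the ratio of seconds per operation without the module to seconds per operation with it, and the only figures used are the Cortex-A8 RSA 2048-bit sign timings quoted in this message. Timings for other key lengths would have to come from one's own `speed rsa dsa` run.

```python
def improvement(baseline_s, with_module_s):
    """Speed-up factor of the module; above 1.0 means the module is faster."""
    return baseline_s / with_module_s


def improvement_by_length(timings):
    """Map key length in bits to its improvement coefficient.

    `timings` has the form {key_bits: (baseline_seconds, with_module_seconds)}.
    """
    return {bits: improvement(b, m) for bits, (b, m) in sorted(timings.items())}


def is_declining(coefficients):
    """True if the coefficient strictly drops as the key gets longer.

    At least two key lengths are needed for the answer to mean anything.
    """
    values = [coefficients[bits] for bits in sorted(coefficients)]
    return len(values) > 1 and all(
        later < earlier for earlier, later in zip(values, values[1:])
    )


def past_breaking_point(coefficients):
    """Key lengths at which the module is slower than the baseline."""
    return [bits for bits in sorted(coefficients) if coefficients[bits] < 1.0]


# Cortex-A8, RSA 2048-bit sign: without armv4-mont.pl vs. with it (no NEON).
cortex_a8_rsa_sign = {2048: (0.052684, 0.039255)}
print(improvement_by_length(cortex_a8_rsa_sign))
```

A declining series that still stays above 1.0 at the longest practical key length corresponds to the case where the breaking point falls outside practical key lengths.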
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
> What's more, I dug out a Cortex-A9 target (Atmel CycloneV board, operating with single core only) and got this without armv4-mont.pl:
>
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
> dsa 2048 bits 0.035971s 0.042778s     27.8     23.4
>
> and this with armv4-mont.pl:
>
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
> dsa 2048 bits 0.052565s 0.061350s     19.0     16.3

For reference, here is what I get on 1GHz Cortex-A9.

Without armv4-mont:
rsa 2048 bits 0.041590s 0.001116s     24.0    896.0
dsa 2048 bits 0.011574s 0.013831s     86.4     72.3

With armv4-mont, no NEON:
rsa 2048 bits 0.033003s 0.000954s     30.3   1048.4
dsa 2048 bits 0.009794s 0.011211s    102.1     89.2

NEON (recall that A9 is an odd-ball):
rsa 2048 bits 0.034281s 0.000987s     29.2   1012.8
dsa 2048 bits 0.010163s 0.012027s     98.4     83.1

Here is 600MHz ARM11xx, an ARMv6 processor.

Without armv4-mont:
rsa 2048 bits 0.110889s 0.002923s      9.0    342.1
dsa 2048 bits 0.030182s 0.036533s     33.1     27.4

With armv4-mont:
rsa 2048 bits 0.087895s 0.002569s     11.4    389.2
dsa 2048 bits 0.026412s 0.031384s     37.9     31.9
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
>>> With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically. Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years?
>> Yes, of course. For reference, here are speed rsa2048 dsa2048 results from Cortex-A8. Numbers are operations per second, so that higher is better.
>>
>> Without armv4-mont.pl:
>>                   sign    verify    sign/s verify/s
>> rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
>> dsa 2048 bits 0.014576s 0.017526s     68.6     57.1
>>
>> With armv4-mont.pl but without NEON (ARM SIMD extension):
>> rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
>> dsa 2048 bits 0.011630s 0.013900s     86.0     71.9
>
> Wow, I get very different results on my ARM9 target.
>
> Without armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 2.567500s 0.072826s      0.4     13.7
> dsa 2048 bits 0.722857s 0.865833s      1.4      1.2
>
> With armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 3.43s     0.104896s      0.3      9.5
> dsa 2048 bits 1.058000s 1.253750s      0.9      0.8

Can you provide data for speed rsa dsa, which tests variety of length? As mentioned earlier, we should observe decreasing improvement coefficient, be it positive or negative...

> What's more, I dug out a Cortex-A9 target (Atmel CycloneV board, operating with single core only) and got this without armv4-mont.pl:
>
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
> dsa 2048 bits 0.035971s 0.042778s     27.8     23.4
>
> and this with armv4-mont.pl:
>
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
> dsa 2048 bits 0.052565s 0.061350s     19.0     16.3
>
> As you can see, in both cases using armv4-mont.pl makes it 30% slower. So whatever is going on, it isn't down to the CPU. I think there must be something else going on. I'll get back to you.

This is odd. Two questions.
As far as I understand, Cyclone V is an FPGA, so what does "Cortex-A9 target" mean in this context? Is it an actual Cortex-A9 with an FPGA beside it, or is it an ARM processor loaded into the FPGA? I don't think one can give any performance guarantees in the latter case. Two, can you show /proc/cpuinfo?

On a side note: Cortex-A9 specifically has turned out to be an odd-ball. As mentioned in the commentary section, for some reason NEON doesn't give any improvement on A9 at longer key lengths, but the losses are considered acceptable because NEON improves performance on other NEON-capable processors. Well, this doesn't explain the above discrepancies, which is why it's a side note...
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
Hi,

Thanks for the reply.

On 16/06/15 13:09, Andy Polyakov wrote:
>> With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically. Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years?
> Yes, of course. For reference, here are speed rsa2048 dsa2048 results from Cortex-A8. Numbers are operations per second, so that higher is better.
>
> Without armv4-mont.pl:
>                   sign    verify    sign/s verify/s
> rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
> dsa 2048 bits 0.014576s 0.017526s     68.6     57.1
>
> With armv4-mont.pl but without NEON (ARM SIMD extension):
> rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
> dsa 2048 bits 0.011630s 0.013900s     86.0     71.9

Wow, I get very different results on my ARM9 target.

Without armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits 2.567500s 0.072826s      0.4     13.7
dsa 2048 bits 0.722857s 0.865833s      1.4      1.2

With armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits 3.43s     0.104896s      0.3      9.5
dsa 2048 bits 1.058000s 1.253750s      0.9      0.8

What's more, I dug out a Cortex-A9 target (Atmel CycloneV board, operating with single core only) and got this without armv4-mont.pl:

                  sign    verify    sign/s verify/s
rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
dsa 2048 bits 0.035971s 0.042778s     27.8     23.4

and this with armv4-mont.pl:

                  sign    verify    sign/s verify/s
rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
dsa 2048 bits 0.052565s 0.061350s     19.0     16.3

As you can see, in both cases using armv4-mont.pl makes it 30% slower. So whatever is going on, it isn't down to the CPU. I think there must be something else going on. I'll get back to you.

Jifl
--
eCosCentric Limited http://www.eCosCentric.com/ The eCos experts
Barnwell House, Barnwell Drive, Cambridge, UK.
Tel: +44 1223 245571 Registered in England and Wales: Reg No 4422071.
--[Si fractum non sit, noli id reficere]-- Opinions==mine
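The "30% slower" figure can be checked directly against the timings in this message. The sketch below uses hypothetical helper names and only the RSA 2048-bit sign timings quoted above; it reports the slowdown two ways, since extra time per operation and lost operations per second are different percentages.

```python
def extra_time_percent(before_s, after_s):
    """Percentage increase in seconds per operation (positive means slower)."""
    return (after_s / before_s - 1.0) * 100.0


def throughput_loss_percent(before_s, after_s):
    """Percentage drop in operations per second (positive means slower)."""
    return (1.0 - before_s / after_s) * 100.0


# RSA 2048-bit sign, seconds per operation: (without, with) armv4-mont.pl.
rsa_sign = {
    "ARM9": (2.567500, 3.43),
    "Cortex-A9 (CycloneV)": (0.127342, 0.172931),
}

for target, (before_s, after_s) in rsa_sign.items():
    print(
        f"{target}: {extra_time_percent(before_s, after_s):.1f}% more time per op, "
        f"{throughput_loss_percent(before_s, after_s):.1f}% fewer ops/s"
    )
```

Both targets land in roughly the same range on this measure, which is consistent with the observation above that the slowdown does not seem to depend on the CPU.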
Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
On 16/06/15 22:12, Andy Polyakov wrote:
>> With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically. Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years?
[snip]

Hi Andy,

I found the cause - although OPENSSL_BN_ASM_MONT was defined, I hadn't noticed that a colleague had put a #define OPENSSL_NO_ASM somewhere else (this isn't linux but a port to our own OS). It turns out that (surprisingly) this combination changes behaviour rather than barfing - it's even explicitly catered for in bn_asm.c.

Regardless, the effect is that a different bn_mul_mont implementation gets used, and the armv4-mont.pl implementation gets ignored entirely.

With that fixed, I now have greatly improved performance as expected.

An unfortunate waste of time for us both, but thanks for the assistance.

Jifl
--
--[Si fractum non sit, noli id reficere]-- Opinions==mine
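Since this combination of macros changes behaviour silently, a port with hand-maintained configuration headers might want a build-time sanity check. The following is a minimal sketch, not anything OpenSSL provides: the function names are hypothetical, and it only looks at literal `#define` lines in the text it is given. It does not see `-D` compiler flags, `#undef`, or conditional compilation.

```python
import re

# Matches "#define NAME", allowing whitespace before and after the '#'.
DEFINE_RE = re.compile(r"^\s*#\s*define\s+(\w+)", re.MULTILINE)

CONFLICTING = {"OPENSSL_NO_ASM", "OPENSSL_BN_ASM_MONT"}


def defined_macros(header_text):
    """Return the set of macro names defined by literal #define lines."""
    return set(DEFINE_RE.findall(header_text))


def conflicting_asm_config(header_texts):
    """True if both OPENSSL_NO_ASM and OPENSSL_BN_ASM_MONT are defined
    somewhere across the given header contents."""
    macros = set()
    for text in header_texts:
        macros |= defined_macros(text)
    return CONFLICTING <= macros


# Usage, with file names that depend on the port:
#   texts = [open(p).read() for p in ("opensslconf.h", "port_config.h")]
#   if conflicting_asm_config(texts):
#       raise SystemExit("OPENSSL_NO_ASM and OPENSSL_BN_ASM_MONT both defined")
```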
[openssl-dev] ARM optimised montgomery multiplication (armv4-mont)
Hi,

After the changes to DH requiring longer key lengths, I switched to 2048-bit keys, but was finding this was now making my test runs on an embedded ARM9 target annoyingly slow; so thought I'd investigate to see if there was anything to improve.

With some experimentation, it turns out that if I *stop* using the crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for a simplish test to establish and close a simple SSL connection went from 28 seconds to 18. (It's quite a slow target at any time). In other words, this optimised version has slowed things down dramatically.

Has anyone queried the value of the asm of armv4-mont.pl any time in the last few years? Is it just that compilers have become better (I'm only using gcc 4.7.3, so not bleeding edge even).

Anyway, it's uncertain to me whether armv4-mont.pl should remain. Does anyone care to try it on other ARM cores?

FYI, I couldn't discern any difference whether using armv4-gf2m or not, but that doesn't mean it's bad.

Jifl
--
eCosCentric Limited http://www.eCosCentric.com/
Barnwell House, Barnwell Drive, Cambridge, UK. Tel: +44 1223 245571
Registered in England and Wales: Reg No 4422071.
--[Si fractum non sit, noli id reficere]-- Opinions==mine