Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-17 Thread Andy Polyakov
Hi,

 When using on armv8 architecture, does this mont mul ASM code have
 any optimization with linux-aarch64 configuration?

There is ambiguity in the question as posed. In the direct context of this
discussion, "this" as in "this mont mul ASM code" ought to refer to
armv4-mont. And then the answer would have to be no, ARMv4 code cannot be
used in an aarch64 build. But it's also possible to generalize and
read it more like "this thing, the Montgomery multiplication module
you are talking about, is there an aarch64 equivalent?" And the answer to
that question would be yes, there is a corresponding armv8-mont module in
the development branch.

___
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev


Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-17 Thread Ravichandra
Hi Andy,
When using on armv8 architecture, does this mont mul ASM code have any
optimization with linux-aarch64 configuration?

Thanks
Ravichandra



Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-17 Thread Andy Polyakov
Hi,

 With some experimentation, it turns out that if I *stop* using the
 crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for
 a simplish test to establish and close a simple SSL connection went from 28
 seconds to 18. (It's quite a slow target at any time).

 In other words, this optimised version has slowed things down dramatically.
 Has anyone queried the value of the asm of armv4-mont.pl any time in the last
 few years?
 [snip]
 
 I found the cause - although OPENSSL_BN_ASM_MONT was defined, I hadn't noticed
 that a colleague had put a #define OPENSSL_NO_ASM somewhere else (this isn't
 linux but a port to our own OS). It turns out that (surprisingly) this
 combination changes behaviour rather than barfing - it's even explicitly
 catered for in bn_asm.c.

In other words, sanity restored. Phew! Incidentally, as the next step I was
going to ask for a copy of your bn_asm.o (yes, the binary .o, and yes,
bn_asm.o, not armv4-mont.o): bn_mul_mont would have shown up in it and
presumably been noticed as unexpected...

 Regardless, the effect is that a different bn_mul_mont implementation gets
 used, and the armv4-mont.pl implementation gets ignored entirely.

Right. And as mentioned in its commentary, bn_mul_mont in bn_asm.c is just a
template with no performance promises attached. Note that it still
exhibits the previously mentioned breaking point...
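
For readers unfamiliar with the operation under discussion, here is a minimal single-word sketch of Montgomery multiplication in C. It is not the code from bn_asm.c or armv4-mont.pl: the names mont_n0, to_mont and mont_mul are made up for illustration, R is fixed at 2^32, and the modulus must be odd and fit in one 32-bit word. The real bn_mul_mont works on multi-word arrays, which is where the performance questions in this thread come from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns -n^{-1} mod 2^32 for odd n. */
static uint32_t mont_n0(uint32_t n)
{
    uint32_t inv = n;   /* n*n == 1 (mod 8), so inv is correct to 3 bits */
    int i;

    assert(n & 1u);     /* Montgomery reduction needs an odd modulus */
    for (i = 0; i < 4; i++)
        inv *= 2u - n * inv;    /* each Newton step doubles the correct bits */
    return 0u - inv;
}

/* Hypothetical helper: converts a (< n) to Montgomery form a*R mod n. */
static uint32_t to_mont(uint32_t a, uint32_t n)
{
    return (uint32_t)(((uint64_t)a << 32) % n);
}

/* Returns a*b*R^{-1} mod n, with R = 2^32; requires a, b < n and n odd. */
static uint32_t mont_mul(uint32_t a, uint32_t b, uint32_t n, uint32_t n0)
{
    uint64_t t = (uint64_t)a * b;
    uint32_t m = (uint32_t)t * n0;      /* makes t + m*n divisible by R */
    uint64_t mn = (uint64_t)m * n;
    uint64_t sum = t + mn;              /* may wrap past 64 bits */
    uint64_t carry = sum < t;           /* 1 if it wrapped */
    uint64_t u = (sum >> 32) + (carry << 32);   /* exact (t + m*n) / R */

    if (u >= n)                         /* u < 2n, one subtraction suffices */
        u -= n;
    return (uint32_t)u;
}
```

Multiplying a Montgomery-form value by plain 1 with mont_mul converts it back to an ordinary residue, so mont_mul(mont_mul(to_mont(a, n), to_mont(b, n), n, n0), 1, n, n0) yields a*b mod n.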

 With that fixed, I now have greatly improved performance as expected.

So with armv4-mont actually in the loop, the breaking point is
still beyond practical key lengths, even on ARM9.

 An
 unfortunate waste of time for us both, but thanks for the assistance.

Given that the timings presented for ARM9 are kind of astronomical, you might
want to consider whether it's possible to use other algorithms. EC is getting
wider adoption now and can perform better. Not to mention that an optimized
NIST P-256 EC implementation was recently added...

Cheers.



Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-17 Thread Ravichandra
I meant the armv4-mont code. As the answer to that question is no, I would
then have asked whether there is support for armv8-mont. Your response answers
both of my questions.

Thanks
Ravichandra



Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-16 Thread Andy Polyakov
Hi,

 After the changes to DH requiring longer key lengths, I switched to 2048-bit
 keys, but was finding this was now making my test runs on an embedded ARM9
 target annoyingly slow; so thought I'd investigate to see if there was
 anything to improve.
 
 With some experimentation, it turns out that if I *stop* using the
 crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for
 a simplish test to establish and close a simple SSL connection went from 28
 seconds to 18. (It's quite a slow target at any time).
 
 In other words, this optimised version has slowed things down dramatically.
 Has anyone queried the value of the asm of armv4-mont.pl any time in the last
 few years?

Yes, of course. For reference, here are speed rsa2048 dsa2048 results
from Cortex-A8. Numbers are operations per second, so that higher is better.

Without armv4-mont.pl:

                  sign    verify    sign/s verify/s
rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
dsa 2048 bits 0.014576s 0.017526s     68.6     57.1

With armv4-mont.pl but without NEON (ARM SIMD extension):

rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
dsa 2048 bits 0.011630s 0.013900s     86.0     71.9

With armv4-mont.pl and NEON on:

rsa 2048 bits 0.021053s 0.000606s     47.5   1650.2
dsa 2048 bits 0.006084s 0.006985s    164.4    143.2

Well, RSA/DSA are not DH, but they are very representative when it comes
to sheer BIGNUM performance. And of course Cortex-A8 is not ARM9, but at
least it shows that the statement about armv4-mont.pl being bad for
performance does not hold universally true. Rather the contrary, as a
similar picture can be observed on most ARM processors (well, all I tested).

 Is it just that compilers have become better (I'm only using gcc
 4.7.3, so not bleeding edge even).

I don't think so. BIGNUM performance can be a delicate balance between
multiple factors, and it's not impossible to end up on the other side of
the breaking point. What breaking point? If you examine the performance
improvement with and without the Montgomery multiplication module, you'll
notice that there are processors on which the improvement coefficient
declines with key length. I mean you'll observe lower improvement the
longer the key is. This indicates that there ought to be a point past which
you can just as well observe worse performance, not better. So far such
points fell outside practical key lengths on tested systems, ARM or not.
Well, except for the s390x-mont module [which by the way even discusses the
reasons why such a breaking point exists, see commentary in
bn/asm/s390x-mont.pl]. In other words I argue that your case is a case of
finding yourself on the other side of said breaking point on a specific
CPU, not a case of armv4-mont.pl being universally inferior. It does come
as a little unexpected, in the sense that I wouldn't expect it to hit the
point at 2048-bit key length on any specific ARM processor, but on the
other hand it's not impossible (all it takes is a multiplication
instruction stalling the pipeline for long enough to tip the balance).
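
To make the "improvement coefficient" concrete, here is a small sketch in C that computes it as throughput with the module divided by throughput without it. The function names are hypothetical, and the only data used are the Cortex-A8 rsa 2048 sign/s figures quoted in this message. Seeing the declining trend would need the same ratio at several key lengths, which this message does not provide.

```c
#include <assert.h>

/*
 * Improvement coefficient: operations per second with the module divided
 * by operations per second without it. Above 1.0 the module helps, below
 * 1.0 the configuration is past the breaking point.
 */
static double improvement(double ops_with, double ops_without)
{
    assert(ops_without > 0.0);
    return ops_with / ops_without;
}

/* Returns 1 if a coefficient indicates a slowdown, 0 otherwise. */
static int past_breaking_point(double coefficient)
{
    return coefficient < 1.0;
}

/* rsa 2048 sign/s on Cortex-A8, taken from the tables in this message. */
static const double a8_rsa_sign_without = 19.0;
static const double a8_rsa_sign_mont = 25.5;
static const double a8_rsa_sign_neon = 47.5;

/* Coefficient for the Cortex-A8 rsa 2048 sign case, with or without NEON. */
static double a8_rsa_sign_gain(int neon)
{
    return improvement(neon ? a8_rsa_sign_neon : a8_rsa_sign_mont,
                       a8_rsa_sign_without);
}
```

For these figures the coefficient is about 1.34 without NEON and 2.5 with NEON, so Cortex-A8 at 2048 bits is well on the favourable side.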

 Anyway, it's uncertain to me whether armv4-mont.pl should remain.

Assuming that the majority of ARM users are not ARM9 users, most would have
to disagree :-) So where does that leave us? One can argue that OpenSSL
could detect the breaking point at run-time and act accordingly, but
it's tricky and is likely to have too narrow a use. One can argue that
OpenSSL can be further optimized so that the breaking point is moved further
out (if not eliminated), which is more practical, because it should improve
performance on all processors, but this is not something that happens
overnight. Meanwhile just documenting the case and providing
instructions on how to disengage the module is probably a reasonable
compromise. Would you agree? One can make arrangements so that said
instructions would be super-simple...

 FYI, I couldn't discern any difference whether using armv4-gf2m or not, but
 that doesn't mean it's bad.

armv4-gf2m is involved in Elliptic Curve, and only of a specific kind. Your
problem description doesn't sound like it should affect you. But even if
it did, it's unlikely that you'd notice a regression, because there are no
breaking points in that case.



Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-16 Thread Andy Polyakov
 What's more, I dug out a Cortex-A9 target (Atmel CycloneV board, operating
 with single core only) and got this without armv4-mont.pl:
                   sign    verify    sign/s verify/s
 rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
 dsa 2048 bits 0.035971s 0.042778s     27.8     23.4

 and this with armv4-mont.pl:
                   sign    verify    sign/s verify/s
 rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
 dsa 2048 bits 0.052565s 0.061350s     19.0     16.3

For reference, here is what I get on 1GHz Cortex-A9.

Without armv4-mont:

rsa 2048 bits 0.041590s 0.001116s     24.0    896.0
dsa 2048 bits 0.011574s 0.013831s     86.4     72.3

With armv4-mont, no NEON:

rsa 2048 bits 0.033003s 0.000954s     30.3   1048.4
dsa 2048 bits 0.009794s 0.011211s    102.1     89.2

NEON (recall that A9 is an odd-ball):

rsa 2048 bits 0.034281s 0.000987s     29.2   1012.8
dsa 2048 bits 0.010163s 0.012027s     98.4     83.1

Here is 600MHz ARM11xx, an ARMv6 processor.

Without armv4-mont:

rsa 2048 bits 0.110889s 0.002923s      9.0    342.1
dsa 2048 bits 0.030182s 0.036533s     33.1     27.4

With armv4-mont:

rsa 2048 bits 0.087895s 0.002569s     11.4    389.2
dsa 2048 bits 0.026412s 0.031384s     37.9     31.9



Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-16 Thread Andy Polyakov
 With some experimentation, it turns out that if I *stop* using the
 crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for
 a simplish test to establish and close a simple SSL connection went from 28
 seconds to 18. (It's quite a slow target at any time).

 In other words, this optimised version has slowed things down dramatically.
 Has anyone queried the value of the asm of armv4-mont.pl any time in the last
 few years?
 Yes, of course. For reference, here are speed rsa2048 dsa2048 results
 from Cortex-A8. Numbers are operations per second, so that higher is better.

 Without armv4-mont.pl:

                   sign    verify    sign/s verify/s
 rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
 dsa 2048 bits 0.014576s 0.017526s     68.6     57.1

 With armv4-mont.pl but without NEON (ARM SIMD extension):

 rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
 dsa 2048 bits 0.011630s 0.013900s     86.0     71.9


 Wow, I get very different results on my ARM9 target. Without armv4-mont.pl:
                   sign    verify    sign/s verify/s
 rsa 2048 bits 2.567500s 0.072826s      0.4     13.7
 dsa 2048 bits 0.722857s 0.865833s      1.4      1.2

 With armv4-mont.pl:
                   sign    verify    sign/s verify/s
 rsa 2048 bits     3.43s 0.104896s      0.3      9.5
 dsa 2048 bits 1.058000s 1.253750s      0.9      0.8

Can you provide data for "speed rsa dsa", which tests a variety of key
lengths? As mentioned earlier, we should observe a decreasing improvement
coefficient, be it positive or negative...

 What's more, I dug out a Cortex-A9 target (Atmel CycloneV board, operating
 with single core only) and got this without armv4-mont.pl:
                   sign    verify    sign/s verify/s
 rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
 dsa 2048 bits 0.035971s 0.042778s     27.8     23.4

 and this with armv4-mont.pl:
                   sign    verify    sign/s verify/s
 rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
 dsa 2048 bits 0.052565s 0.061350s     19.0     16.3
 
 As you can see, in both cases using armv4-mont.pl makes it 30% slower. So
 whatever is going on, it isn't down to the CPU. I think there must be
 something else going on. I'll get back to you.

This is odd. Two questions. One: as far as I understand, Cyclone V is an
FPGA, so what does "Cortex-A9 target" mean in this context? Is it an actual
Cortex-A9 with an FPGA beside it, or is it an ARM processor loaded into the
FPGA? I don't think one can give any performance guarantees in the latter
case. Two: can you show /proc/cpuinfo?

On a side note: Cortex-A9 specifically has turned out to be an odd-ball. It's
mentioned in the commentary section: for some reason NEON doesn't give any
improvement on A9 at longer key lengths, but the losses are considered
acceptable because it improves performance on other NEON-capable
processors. Well, this doesn't explain the above discrepancies, which is why
it's a side note...



Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-16 Thread Jonathan Larmour
Hi,

Thanks for the reply.

On 16/06/15 13:09, Andy Polyakov wrote:

 With some experimentation, it turns out that if I *stop* using the
 crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for
 a simplish test to establish and close a simple SSL connection went from 28
 seconds to 18. (It's quite a slow target at any time).

 In other words, this optimised version has slowed things down dramatically.
 Has anyone queried the value of the asm of armv4-mont.pl any time in the last
 few years?
 
 Yes, of course. For reference, here are speed rsa2048 dsa2048 results
 from Cortex-A8. Numbers are operations per second, so that higher is better.
 
 Without armv4-mont.pl:
 
                   sign    verify    sign/s verify/s
 rsa 2048 bits 0.052684s 0.001421s     19.0    703.5
 dsa 2048 bits 0.014576s 0.017526s     68.6     57.1

 With armv4-mont.pl but without NEON (ARM SIMD extension):

 rsa 2048 bits 0.039255s 0.001140s     25.5    877.3
 dsa 2048 bits 0.011630s 0.013900s     86.0     71.9


Wow, I get very different results on my ARM9 target. Without armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits 2.567500s 0.072826s      0.4     13.7
dsa 2048 bits 0.722857s 0.865833s      1.4      1.2

With armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits     3.43s 0.104896s      0.3      9.5
dsa 2048 bits 1.058000s 1.253750s      0.9      0.8

What's more, I dug out a Cortex-A9 target (Atmel CycloneV board, operating
with single core only) and got this without armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.127342s 0.003628s      7.9    275.6
dsa 2048 bits 0.035971s 0.042778s     27.8     23.4

and this with armv4-mont.pl:
                  sign    verify    sign/s verify/s
rsa 2048 bits 0.172931s 0.005222s      5.8    191.5
dsa 2048 bits 0.052565s 0.061350s     19.0     16.3

As you can see, in both cases using armv4-mont.pl makes it 30% slower. So
whatever is going on, it isn't down to the CPU. I think there must be
something else going on. I'll get back to you.

Jifl
-- 
eCosCentric Limited  http://www.eCosCentric.com/ The eCos experts
Barnwell House, Barnwell Drive, Cambridge, UK.   Tel: +44 1223 245571
Registered in England and Wales: Reg No 4422071.
--[Si fractum non sit, noli id reficere]--   Opinions==mine


Re: [openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-16 Thread Jonathan Larmour
On 16/06/15 22:12, Andy Polyakov wrote:
 With some experimentation, it turns out that if I *stop* using the
 crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for
 a simplish test to establish and close a simple SSL connection went from 28
 seconds to 18. (It's quite a slow target at any time).

 In other words, this optimised version has slowed things down dramatically.
 Has anyone queried the value of the asm of armv4-mont.pl any time in the last
 few years?
[snip]

Hi Andy,

I found the cause - although OPENSSL_BN_ASM_MONT was defined, I hadn't noticed
that a colleague had put a #define OPENSSL_NO_ASM somewhere else (this isn't
linux but a port to our own OS). It turns out that (surprisingly) this
combination changes behaviour rather than barfing - it's even explicitly
catered for in bn_asm.c.

Regardless, the effect is that a different bn_mul_mont implementation gets
used, and the armv4-mont.pl implementation gets ignored entirely.
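
The macro combination described above can be caught at build time. The following is a minimal sketch in C, assuming a port has its own configuration header to put the guard in. The guard is not part of OpenSSL, and the names mont_source and select_mont_source are hypothetical, used only to spell out which bn_mul_mont ends up in the build. The MONT_NONE case (OPENSSL_BN_ASM_MONT not defined) is an assumption, since this thread does not discuss it.

```c
/*
 * Hypothetical guard for a port's own configuration header: refuse to
 * build when both macros are defined, instead of silently switching to
 * the C template in bn_asm.c.
 */
#if defined(OPENSSL_NO_ASM) && defined(OPENSSL_BN_ASM_MONT)
# error "OPENSSL_NO_ASM and OPENSSL_BN_ASM_MONT are both defined"
#endif

/* Which bn_mul_mont a given macro combination selects. */
enum mont_source {
    MONT_NONE,          /* assumption: no bn_mul_mont without BN_ASM_MONT */
    MONT_ASM_MODULE,    /* e.g. the armv4-mont.pl generated assembly */
    MONT_C_TEMPLATE     /* the C bn_mul_mont in bn_asm.c */
};

/* Arguments are 1 if the corresponding macro is defined, 0 otherwise. */
static enum mont_source select_mont_source(int no_asm, int bn_asm_mont)
{
    if (!bn_asm_mont)
        return MONT_NONE;
    return no_asm ? MONT_C_TEMPLATE : MONT_ASM_MODULE;
}
```

With both macros set, select_mont_source(1, 1) returns MONT_C_TEMPLATE, which is the situation that produced the slow timings earlier in the thread.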

With that fixed, I now have greatly improved performance as expected. An
unfortunate waste of time for us both, but thanks for the assistance.

Jifl
-- 
--[Si fractum non sit, noli id reficere]--   Opinions==mine


[openssl-dev] ARM optimised montgomery multiplication (armv4-mont)

2015-06-15 Thread Jonathan Larmour
Hi,

After the changes to DH requiring longer key lengths, I switched to 2048-bit
keys, but was finding this was now making my test runs on an embedded ARM9
target annoyingly slow; so thought I'd investigate to see if there was
anything to improve.

With some experimentation, it turns out that if I *stop* using the
crypto/bn/asm/bn/armv4-mont.pl generated asm optimised version, the time for
a simplish test to establish and close a simple SSL connection went from 28
seconds to 18. (It's quite a slow target at any time).

In other words, this optimised version has slowed things down dramatically.
Has anyone queried the value of the asm of armv4-mont.pl any time in the last
few years? Is it just that compilers have become better? (I'm only using gcc
4.7.3, so not even bleeding edge.)

Anyway, it's uncertain to me whether armv4-mont.pl should remain. Does anyone
care to try it on other ARM cores?

FYI, I couldn't discern any difference whether using armv4-gf2m or not, but
that doesn't mean it's bad.

Jifl
-- 
eCosCentric Limited  http://www.eCosCentric.com/
Barnwell House, Barnwell Drive, Cambridge, UK.   Tel: +44 1223 245571
Registered in England and Wales: Reg No 4422071.
--[Si fractum non sit, noli id reficere]--   Opinions==mine