Re: [openssl-dev] Usage of assembler code on ARM architectures

2015-03-17 Thread Andy Polyakov
 My mistake, it looks like my memory was wrong on two accounts.  First,
 it was AES, not SHA, where I observed the no-asm was faster.  Second, it
 was on the PowerPC cross-compiled target, not ARM.  The results from
 openssl speed aes-128-cbc are:
 
 type           16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
 w/o no-asm    31010.47k   32988.82k   33549.41k   33693.05k   33825.67k
 no-asm        42431.46k   46485.14k   47479.20k   47874.86k   47829.36k
 
 This is using a Freescale 8548.

This is no mystery at all, and it is kind of intentional. If you examine
the commentary in aes-ppc.pl, you'll notice that it relies on compact
subroutines, those that use 256-byte S-boxes, which require more
computation. It mentions that compact encrypt is ~2 times slower than
traditional encrypt. On the other side of the scales is the insecurity of
the traditional subroutine, which is susceptible to cache-timing attacks.
Well, it's not like compact is not susceptible, but it's *much* more
resistant. Indeed, vulnerability is quantified by the probability of a
cache line not being accessed as a result of a block operation, and in the
compact case it is as low as (1-32/256)^160=5e-10 vs. (1-4/256)^160=0.08
for the processor in question. Note that the C version is even worse than
the non-compact assembly subroutine.
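
For what it's worth, the two figures above can be reproduced with a few
lines of Python. This is only a sketch: the numbers 32, 4, 256 and 160 are
taken from the expressions in the message exactly as given, and the
function and parameter names are made up for illustration.

```python
def miss_probability(numerator: int, denominator: int = 256, accesses: int = 160) -> float:
    """Evaluate (1 - numerator/denominator)^accesses, the expression quoted above.

    The result is the probability that one particular cache line is never
    touched during a block operation, under the message's own parameters.
    """
    return (1 - numerator / denominator) ** accesses


compact = miss_probability(32)      # (1-32/256)^160, the compact case
traditional = miss_probability(4)   # (1-4/256)^160, the other case quoted

print(f"compact:     {compact:.1e}")
print(f"traditional: {traditional:.2f}")
```

The smaller the value, the less an observer can learn from which cache
lines stayed untouched, which is why the compact variant is described as
much more resistant.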

You might argue that there is no room for an adversary in *your*
application and that performance should be favoured. By no room I mean
that it's probably a locked-down embedded system, where an adversary being
able to execute their own code is considered a big enough problem already.
Yes, but you have to *argue* in favour. Maybe it should be a compile
option...

___
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev


Re: [openssl-dev] Usage of assembler code on ARM architectures

2015-03-17 Thread stefan.n...@t-online.de
Hi,

Thanks for the answers to my questions - here come some more.

 Apple assembler uses a little bit different syntax and you can't
 assemble current modules as they are.

... as I found out myself just after asking the original question, but
of course, the following is good to know:

 There is perlasm/arm-xlate.pl that enables assembly for 64-bit
 iOS, and it's being modified to cover even 32-bit iOS.

Is that something that can/will be backported to 1.0.2- (or even 1.0.1-)
branch, once it's working?

 More specifically. Android has two distinct ARM targets, in sense that if
 you build JNI-enabled application, then you'd have to provide two ARM
 shared libraries, right?

Here, you lost me. So far, I'm building only one shared library for ARM,
using the no_asm variant of OpenSSL. And so far, there weren't complaints
about unsupported devices, so what do you mean by two distinct ARM
targets?

  Regards,
Stefan




Re: [openssl-dev] Usage of assembler code on ARM architectures

2015-03-17 Thread Andy Polyakov
Hi,

 There is perlasm/arm-xlate.pl that enables assembly for 64-bit
 iOS, and it's being modified to cover even 32-bit iOS.
 
 Is that something that can/will be backported to 1.0.2- (or even 1.0.1-)
 branch, once it's working?

Well, it would have to be *your* responsibility, because 1.0.2, as well
as 1.0.1, are closed for new features.

 More specifically. Android has two distinct ARM targets, in sense that if
 you build JNI-enabled application, then you'd have to provide two ARM
 shared libraries, right?
 
 Here, you lost me. So far, I'm building only one shared library for ARM,
 using the no_asm variant of OpenSSL. And so far, there weren't complaints
 about unsupported devices, so what do you mean by two distinct ARM
 targets?

On Android you can build kind of fat apps, where the same .apk contains JNI
shared object modules targeting different hardware architectures, right?
For example ARM, x86, MIPS. As far as I understand, contemporary Android
ARM platforms come in two flavours: armhf/armv7-a and traditional
armeabi. This means that along with, say, an x86 module there is room for a
*pair* of ARM shared libraries targeting these two ABIs. Google's idea
is naturally to provide better performance on the former. For OpenSSL
performance the choice of ABI doesn't really matter (because we don't do
much floating point), but it can be part of an application that otherwise
uses a lot of floating point and is therefore sensitive to the ABI choice.
This is how the pair of shared libraries comes into the picture. Does it
mean that we had better have two config lines reflecting this? That's where
we need your support: to help us formulate what is sensible, what the
expectations are, and whether it would actually benefit applications.
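
To make the "pair of shared libraries" concrete, here is a small Python
sketch of how one library could end up twice in a package, once per ARM
ABI. It rests on assumptions: the directory names armeabi and armeabi-v7a
are the NDK ABI names I take to correspond to the two flavours above, the
-march values are the ones mentioned in this thread, and the helper name is
hypothetical.

```python
from pathlib import PurePosixPath

# Assumed mapping from NDK ABI name to the -march flag discussed in the thread.
ARM_ABIS = {
    "armeabi": "-march=armv5te",
    "armeabi-v7a": "-march=armv7-a",
}


def jni_library_paths(lib_name, abis=ARM_ABIS):
    """Return the per-ABI location of lib<name>.so inside the package."""
    return {
        abi: PurePosixPath("lib") / abi / f"lib{lib_name}.so"
        for abi in abis
    }


for abi, path in jni_library_paths("crypto").items():
    print(f"{abi:12s} {ARM_ABIS[abi]:16s} {path}")
```

Whether one unified config line can produce both builds, or two are needed,
is exactly the open question.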



Re: [openssl-dev] Usage of assembler code on ARM architectures

2015-03-16 Thread John Foley
My mistake, it looks like my memory was wrong on two accounts.  First,
it was AES, not SHA, where I observed the no-asm was faster.  Second, it
was on the PowerPC cross-compiled target, not ARM.  The results from
openssl speed aes-128-cbc are:

type           16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
w/o no-asm    31010.47k   32988.82k   33549.41k   33693.05k   33825.67k
no-asm        42431.46k   46485.14k   47479.20k   47874.86k   47829.36k

This is using a Freescale 8548.
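
As a rough sanity check, the ratio between the two rows above can be
computed as below. A minimal sketch: the figures are copied from the table,
the helper name parse_row is hypothetical, and it assumes each value is in
thousands of bytes per second with a trailing "k", as openssl speed prints
them.

```python
# The two result rows, copied from the table in this message.
ROWS = """\
w/o no-asm   31010.47k   32988.82k   33549.41k   33693.05k   33825.67k
no-asm       42431.46k   46485.14k   47479.20k   47874.86k   47829.36k
"""

BLOCK_SIZES = [16, 64, 256, 1024, 8192]


def parse_row(line):
    """Split one row into (label, [throughput values without the 'k'])."""
    fields = line.split()
    values = [float(f[:-1]) for f in fields if f.endswith("k")]
    label = " ".join(f for f in fields if not f.endswith("k"))
    return label, values


table = dict(parse_row(line) for line in ROWS.splitlines())
ratios = [c / asm for asm, c in zip(table["w/o no-asm"], table["no-asm"])]

for size, ratio in zip(BLOCK_SIZES, ratios):
    print(f"{size:5d} bytes: no-asm is {ratio:.2f}x the assembly build")
```

On these numbers the no-asm build comes out roughly 1.4 times faster at
every block size.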


On 03/12/2015 03:37 PM, Andy Polyakov wrote:
 I can't speak directly to your question on the iphone-cross target, but
 can warn you that your mileage will vary when using the ARM assembly
 modules.  We observed that some algorithms actually run slower when
 using the ARM assembly modules.  It's been a couple of years and I don't
 recall the details, but want to say some of the hash algorithms were
 actually faster when using no-asm.
 Well, I can imagine compiler succeeding to generate code better than
 sha1-armv4-large, but I can't imagine compiler beating sha256 or sha512.
 Was it really some of algorithm*s* or just one? Anyway, why
 sha1-armv4-large? Two reasons: a) inner loops are not unrolled; b)
 over-reliance on merged rotate-n-arithmetic. Over-reliance means that
 it uses more such instructions than actually necessary, which can
 negatively affect performance. I realized this after having hard time
 getting sha256/512 to work well on Cortex-A53 (see sha512-armv8.pl, it's
 64-bit module, but principle of merged rotate-n-arithmetic is same). It
 should also be noted that now there are additional code paths in
 sha1-armv4-large, namely NEON and ARMv8.

 The results are likely to vary
 depending on the actual chipset used.
 Right, ARM universe is very diverse. Assembly modules, i.e. all, not
 only ARM, are maintained to provide near-optimal performance across
 range of platforms, but sometimes optimizations conflict. In either case
 prerequisite is access to wide range of platforms and feedback. In other
 words, bring it up.

 You'll probably want to test the
 performance on the target hardware using the openssl speed command. 
 You can do this on a jailbroken iOS device via SSH.
 For the record. I do development on non-jailbroken unit, so it's
 not a hard requirement.



[openssl-dev] Usage of assembler code on ARM architectures

2015-03-12 Thread stefan.n...@t-online.de
   Hi,

While looking at the Configure script, I found that there is the armv4_asm 
variable, which seems to promise a speedup for ARM architectures (and the 4 
in ARMv4 sounds like it should work everywhere?).
However, further looking at that Configure file, I see it's only used for 
linux-armv4 and android-armv7, but not for e.g. iphoneos-cross. 
Does that imply you know/suspect it doesn't work anyway? Or does it imply there 
is no measurable speedup? Or does it just imply you never bothered to actually 
test it? And in the last case, would you expect it's going to work (or 
almost) or would you rather expect it's going to be lots of trouble?
Similar question for Android: You only use the assembler code for the 
android-armv7 configuration. For maximum compatibility, I'm usually compiling 
with -march=armv5te, which still sounds like using armv4 assembler should 
be safe, but for some reason, you're restricting its use to the android-armv7 
configuration which explicitly sets -march=armv7-a. Why?

  Regards,
   Stefan




Re: [openssl-dev] Usage of assembler code on ARM architectures

2015-03-12 Thread Andy Polyakov
 I can't speak directly to your question on the iphone-cross target, but
 can warn you that your mileage will vary when using the ARM assembly
 modules.  We observed that some algorithms actually run slower when
 using the ARM assembly modules.  It's been a couple of years and I don't
 recall the details, but want to say some of the hash algorithms were
 actually faster when using no-asm.

Well, I can imagine compiler succeeding to generate code better than
sha1-armv4-large, but I can't imagine compiler beating sha256 or sha512.
Was it really some of algorithm*s* or just one? Anyway, why
sha1-armv4-large? Two reasons: a) inner loops are not unrolled; b)
over-reliance on merged rotate-n-arithmetic. Over-reliance means that
it uses more such instructions than actually necessary, which can
negatively affect performance. I realized this after having hard time
getting sha256/512 to work well on Cortex-A53 (see sha512-armv8.pl, it's
64-bit module, but principle of merged rotate-n-arithmetic is same). It
should also be noted that now there are additional code paths in
sha1-armv4-large, namely NEON and ARMv8.
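
For readers unfamiliar with the term, here is a small Python model of what
I understand by merged rotate-n-arithmetic: on ARM an arithmetic
instruction can rotate its second operand as part of the same instruction
(something like add r0, r1, r2, ror #27). This is an illustration only,
with made-up helper names, and it is not code taken from the module.

```python
MASK32 = 0xFFFFFFFF


def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def add_ror(a, b, n):
    """Model one merged instruction: a + (b rotated right by n), mod 2^32."""
    return (a + ror32(b, n)) & MASK32


# A SHA-1 round adds a value rotated left by 5 bits, i.e. right by 27,
# so the rotate and the add can be folded into a single operation.
e, a = 0xC3D2E1F0, 0x67452301
print(hex(add_ror(e, a, 27)))
```

Folding the rotate in saves an instruction, but as said above, using more
such instructions than necessary can hurt on some cores.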

 The results are likely to vary
 depending on the actual chipset used.

Right, ARM universe is very diverse. Assembly modules, i.e. all, not
only ARM, are maintained to provide near-optimal performance across
range of platforms, but sometimes optimizations conflict. In either case
prerequisite is access to wide range of platforms and feedback. In other
words, bring it up.

 You'll probably want to test the
 performance on the target hardware using the openssl speed command. 
 You can do this on a jailbroken iOS device via SSH.

For the record. I do development on non-jailbroken unit, so it's
not a hard requirement.



Re: [openssl-dev] Usage of assembler code on ARM architectures

2015-03-12 Thread Andy Polyakov
Hi,

 While looking at the Configure script, I found that there is the
 armv4_asm variable, which seems to promise a speedup for ARM
 architectures (and the 4 in ARMv4 sounds like it should work
 everywhere?).

Yes. v4 denotes only *minimal* requirement. There is conditionally
compiled code that targets v7 and even v8.

 However, further looking at that Configure file, I
 see it's only used for linux-armv4 and android-armv7, but not for
 e.g. iphoneos-cross. Does that imply you know/suspect it doesn't work
 anyway?

Apple assembler uses a little bit different syntax and you can't
assemble current modules as they are. There is perlasm/arm-xlate.pl that
enables assembly for 64-bit iOS, and it's being modified to cover even
32-bit iOS.

 Or does it imply there is no measurable speedup?

You'll observe as much speedup on iOS as on Linux/Android.

 Or does it
 just imply you never bothered to actually test it? And in the last
 case, would you expect it's going to work (or almost) or would you
 rather expect it's going to be lots of trouble?

See above.

 Similar question for
 Android: You only use the assembler code for the android-armv7
 configuration. For maximum compatibility, I'm usually compiling with
 -march=armv5te, which still sounds like using armv4 assembler
 should be safe, but for some reason, you're restricting its use to
 the android-armv7 configuration which explicitly sets
 -march=armv7-a. Why?

Because that target was conceived to solve very specific problem, one
can say too specific. In other words, yes, it's appropriate to extend
support and introduce additional or unified target in linux-armv4 style.
What would be more appropriate? I mean additional or unified? More
specifically. Android has two distinct ARM targets, in sense that if you
build JNI-enabled application, then you'd have to provide two ARM shared
libraries, right? The question is whether both can be built with a unified
config (see linux-armv4 for example) or whether it has to be two config lines?
