Re: [openssl-dev] Usage of assembler code on ARM architectures
> My mistake, it looks like my memory was wrong on two accounts. First, it was AES, not SHA, where I observed the no-asm was faster. Second, it was on the PowerPC cross-compiled target, not ARM. The results from openssl speed aes-128-cbc are:
>
> type          16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> w/o no-asm   31010.47k    32988.82k    33549.41k    33693.05k    33825.67k
> no-asm       42431.46k    46485.14k    47479.20k    47874.86k    47829.36k
>
> This is using a Freescale 8548.

This is no mystery at all, and kind of intentional. If you examine commentary in aes-ppc.pl you'll notice that it relies on compact subroutines, those that are using 256-byte S-boxes, which require more computations. It mentions that compact encrypt is ~2 times slower than traditional encrypt. On the other side of scales is insecurity of traditional subroutine, which is susceptible to cache-timing attacks. Well, it's not like compact is not susceptible, but it's *much* more resistant. Indeed, vulnerability is quantified by probability of a cache line not being accessed as result of block operation, and in compact case it is as low as (1-32/256)^160=5e-10 vs. (1-4/256)^160=0.08 for processor in question. Note that C version is even worse than non-compact assembly subroutine.

You might argue that there is no room for adversary in *your* application and performance should be favoured. By no room I mean that it's probably locked down embedded system and adversary having ability to execute own code is considered big enough problem. Yes, but you have to *argue* in favour.

Maybe it should be a compile option...

___
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
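The two probabilities quoted in the message above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic as written in the message: the function name and the constant name are made up for illustration, and the independence of table accesses is an assumption of the sketch, not something the message states.

```python
# Probability that one particular cache line of the S-box table is never
# touched during a block operation, using the figures from the message.
# Assumption (not from the message): every table access is independent
# and uniformly distributed over the table.

def untouched_probability(line_fraction: float, accesses: int) -> float:
    """Chance that a given cache line is missed by every one of `accesses` lookups."""
    return (1.0 - line_fraction) ** accesses

ACCESSES = 160  # exponent used in the message

# Compact subroutine: the message uses a fraction of 32/256 per access.
compact = untouched_probability(32 / 256, ACCESSES)

# Traditional subroutine: the message uses a fraction of 4/256 per access.
traditional = untouched_probability(4 / 256, ACCESSES)

print(f"compact:     {compact:.1e}")
print(f"traditional: {traditional:.2f}")
```

The smaller the probability, the less often an observer sees an untouched cache line, which is the sense in which the compact code is described as much more resistant.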
Re: [openssl-dev] Usage of assembler code on ARM architectures
Hi,

Thanks for the answers to my questions - here come some more.

> Apple assembler uses a little bit different syntax and you can't assemble current modules as they are.

... as I found out myself just after asking the original question, but of course, the following is good to know:

> There is perlasm/arm-xlate.pl that enables assembly for 64-bit iOS, and it's being modified to cover even 32-bit iOS.

Is that something that can/will be backported to 1.0.2- (or even 1.0.1-) branch, once it's working?

> More specifically. Android has two distinct ARM targets, in sense that if you build JNI-enabled application, then you'd have to provide two ARM shared libraries, right?

Here, you lost me. So far, I'm building only one shared library for ARM, using the no_asm variant of OpenSSL. And so far, there weren't complaints about unsupported devices, so what do you mean by two distinct ARM targets?

Regards,
Stefan
Re: [openssl-dev] Usage of assembler code on ARM architectures
Hi,

>> There is perlasm/arm-xlate.pl that enables assembly for 64-bit iOS, and it's being modified to cover even 32-bit iOS.
>
> Is that something that can/will be backported to 1.0.2- (or even 1.0.1-) branch, once it's working?

Well, it would have to be *your* responsibility, because 1.0.2, as well as 1.0.1, are closed for new features.

>> More specifically. Android has two distinct ARM targets, in sense that if you build JNI-enabled application, then you'd have to provide two ARM shared libraries, right?
>
> Here, you lost me. So far, I'm building only one shared library for ARM, using the no_asm variant of OpenSSL. And so far, there weren't complaints about unsupported devices, so what do you mean by two distinct ARM targets?

On Android you can build kind of fat apps, when same .apk contains JNI shared object modules targeting different hardware architectures, right? For example ARM, x86, MIPS. As far as I understand contemporary Android ARM platforms come in two flavours: armhf/armv7-a and traditional armeabi. This means that along with say x86 module there is room for *pair* of ARM shared libraries targeting these two ABIs. Google's idea is naturally to provide better performance on former. For OpenSSL performance choice of ABI doesn't really matter (because we don't do much floating point), but it can be part of application that otherwise uses a lot of floating point and therefore is sensitive to ABI choice. This is how pair of shared libraries comes into picture.

Does it mean that we better have two config lines reflecting this? That's where we need your support. To help us formulate what is sensible, what are expectations and that it would actually benefit applications.
Re: [openssl-dev] Usage of assembler code on ARM architectures
My mistake, it looks like my memory was wrong on two accounts. First, it was AES, not SHA, where I observed the no-asm was faster. Second, it was on the PowerPC cross-compiled target, not ARM. The results from openssl speed aes-128-cbc are:

type          16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
w/o no-asm   31010.47k    32988.82k    33549.41k    33693.05k    33825.67k
no-asm       42431.46k    46485.14k    47479.20k    47874.86k    47829.36k

This is using a Freescale 8548.

On 03/12/2015 03:37 PM, Andy Polyakov wrote:
>> I can't speak directly to your question on the iphone-cross target, but can warn you that your mileage will vary when using the ARM assembly modules. We observed that some algorithms actually run slower when using the ARM assembly modules. It's been a couple of years and I don't recall the details, but want to say some of the hash algorithms were actually faster when using no-asm.
>
> Well, I can imagine compiler succeeding to generate code better than sha1-armv4-large, but I can't imagine compiler beating sha256 or sha512. Was it really some of algorithm*s* or just one? Anyway, why sha1-armv4-large? Two reasons: a) inner loops are not unrolled; b) over-reliance on merged rotate-n-arithmetic. Over-reliance means that it uses more such instructions than actually necessary, which can negatively affect performance. I realized this after having hard time getting sha256/512 to work well on Cortex-A53 (see sha512-armv8.pl, it's 64-bit module, but principle of merged rotate-n-arithmetic is same). It should also be noted that now there are additional code paths in sha1-armv4-large, namely NEON and ARMv8.
>
>> The results are likely to vary depending on the actual chipset used.
>
> Right, ARM universe is very diverse. Assembly modules, i.e. all, not only ARM, are maintained to provide near-optimal performance across range of platforms, but sometimes optimizations conflict. In either case prerequisite is access to wide range of platforms and feedback. In other words, bring it up.
>
>> You'll probably want to test the performance on the target hardware using the openssl speed command. You can do this on a jailbroken iOS device via SSH.
>
> For the record. I do development on non-jailbroken unit, so that it's not hard requirement.
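To put the benchmark figures reported in this message side by side, here is a small Python sketch that computes how much faster the no-asm build was at each block size. The numbers are copied from the message (openssl speed reports them in thousands of bytes per second); the variable and function names are illustrative only.

```python
# Throughput in thousands of bytes per second, copied from the message.
BLOCK_SIZES = [16, 64, 256, 1024, 8192]
WITH_ASM = [31010.47, 32988.82, 33549.41, 33693.05, 33825.67]  # "w/o no-asm"
NO_ASM = [42431.46, 46485.14, 47479.20, 47874.86, 47829.36]    # "no-asm"

def throughput_ratios(numerator, denominator):
    """Element-wise ratio of two equally long throughput rows."""
    return [n / d for n, d in zip(numerator, denominator)]

ratios = throughput_ratios(NO_ASM, WITH_ASM)

for size, ratio in zip(BLOCK_SIZES, ratios):
    print(f"{size:>5} bytes: no-asm is {ratio:.2f}x the assembly build")
```

The same comparison can be repeated with figures measured on any other target, which is the kind of per-platform feedback the thread asks for.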
[openssl-dev] Usage of assembler code on ARM architectures
Hi,

While looking at the Configure script, I found that there is the armv4_asm variable, which seems to promise a speedup for ARM architectures (and the 4 in ARMv4 sounds like it should work everywhere?).

However, further looking at that Configure file, I see it's only used for linux-armv4 and android-armv7, but not for e.g. iphoneos-cross. Does that imply you know/suspect it doesn't work anyway? Or does it imply there is no measurable speedup? Or does it just imply you never bothered to actually test it? And in the last case, would you expect it's going to work (or almost) or would you rather expect it's going to be lots of trouble?

Similar question for Android: You only use the assembler code for the android-armv7 configuration. For maximum compatibility, I'm usually compiling with -march=armv5te, which still sounds like using armv4 assembler should be safe, but for some reason, you're restricting its use to the android-armv7 configuration which explicitly sets -march=armv7-a. Why?

Regards,
Stefan
Re: [openssl-dev] Usage of assembler code on ARM architectures
> I can't speak directly to your question on the iphone-cross target, but can warn you that your mileage will vary when using the ARM assembly modules. We observed that some algorithms actually run slower when using the ARM assembly modules. It's been a couple of years and I don't recall the details, but want to say some of the hash algorithms were actually faster when using no-asm.

Well, I can imagine compiler succeeding to generate code better than sha1-armv4-large, but I can't imagine compiler beating sha256 or sha512. Was it really some of algorithm*s* or just one? Anyway, why sha1-armv4-large? Two reasons: a) inner loops are not unrolled; b) over-reliance on merged rotate-n-arithmetic. Over-reliance means that it uses more such instructions than actually necessary, which can negatively affect performance. I realized this after having hard time getting sha256/512 to work well on Cortex-A53 (see sha512-armv8.pl, it's 64-bit module, but principle of merged rotate-n-arithmetic is same). It should also be noted that now there are additional code paths in sha1-armv4-large, namely NEON and ARMv8.

> The results are likely to vary depending on the actual chipset used.

Right, ARM universe is very diverse. Assembly modules, i.e. all, not only ARM, are maintained to provide near-optimal performance across range of platforms, but sometimes optimizations conflict. In either case prerequisite is access to wide range of platforms and feedback. In other words, bring it up.

> You'll probably want to test the performance on the target hardware using the openssl speed command. You can do this on a jailbroken iOS device via SSH.

For the record. I do development on non-jailbroken unit, so that it's not hard requirement.
Re: [openssl-dev] Usage of assembler code on ARM architectures
> Hi, While looking at the Configure script, I found that there is the armv4_asm variable, which seems to promise a speedup for ARM architectures (and the 4 in ARMv4 sounds like it should work everywhere?).

Yes. v4 denotes only *minimal* requirement. There is conditionally compiled code that targets v7 and even v8.

> However, further looking at that Configure file, I see it's only used for linux-armv4 and android-armv7, but not for e.g. iphoneos-cross. Does that imply you know/suspect it doesn't work anyway?

Apple assembler uses a little bit different syntax and you can't assemble current modules as they are. There is perlasm/arm-xlate.pl that enables assembly for 64-bit iOS, and it's being modified to cover even 32-bit iOS.

> Or does it imply there is no measurable speedup?

You'll observe as much speedup on iOS as on Linux/Android.

> Or does it just imply you never bothered to actually test it? And in the last case, would you expect it's going to work (or almost) or would you rather expect it's going to be lots of trouble?

See above.

> Similar question for Android: You only use the assembler code for the android-armv7 configuration. For maximum compatibility, I'm usually compiling with -march=armv5te, which still sounds like using armv4 assembler should be safe, but for some reason, you're restricting its use to the android-armv7 configuration which explicitly sets -march=armv7-a. Why?

Because that target was conceived to solve very specific problem, one can say too specific. In other words, yes, it's appropriate to extend support and introduce additional or unified target in linux-armv4 style. What would be more appropriate? I mean additional or unified? More specifically. Android has two distinct ARM targets, in sense that if you build JNI-enabled application, then you'd have to provide two ARM shared libraries, right? Question is whether both can be built with unified config (see linux-armv4 for example) or does it have to be two config lines?