Re: OpenSSL performance woes with ubsec crypto engine (Broadcom BCM5820/BCM5823/BMC5825/BMC582x)
On Thu, Jan 31, 2008, Peter Waltenberg wrote:

> OPENSSL_cleanse() doesn't zero memory regions, it fills them with
> pseudo-random data.
> Edit crypto/mem_clr.c and replace that code with memset(ptr,'\0',len); and
> just clear the region - you'll see a significant performance boost if
> that's your major bottleneck.
>
> Just be aware that some hypothetical compiler could decide to skip the
> memset - I can't remember which compiler that is, but it's the one that
> comes with the free tinfoil hats.

Note also that there is an assembly language version of OPENSSL_cleanse() in 0.9.9-dev which is significantly faster than the C version.

Steve.
--
Dr Stephen N. Henson. Email, S/MIME and PGP keys: see homepage
OpenSSL project core developer and freelance consultant.
Homepage: http://www.drh-consultancy.demon.co.uk
__
OpenSSL Project http://www.openssl.org
Development Mailing List openssl-dev@openssl.org
Automated List Manager [EMAIL PROTECTED]
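The concern about a compiler skipping the memset is real for buffers that are never read again: an optimizer may treat the store as dead. One widely used idiom is to call memset through a volatile function pointer, so the call cannot be proven removable. The sketch below is only an illustration of that idiom; the names zeroize, all_zero and memset_ptr are hypothetical and are not part of OpenSSL.

```c
#include <stddef.h>
#include <string.h>

/* Volatile function pointer: the compiler must load it at run time and
 * cannot assume it still points at memset, so the call is not elided
 * even when the buffer is never read afterwards. */
static void *(*const volatile memset_ptr)(void *, int, size_t) = memset;

/* Hypothetical helper (not an OpenSSL function): clear len bytes at ptr. */
void zeroize(void *ptr, size_t len)
{
    memset_ptr(ptr, 0, len);
}

/* Returns 1 if every byte of buf is zero, 0 otherwise. */
int all_zero(const unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (buf[i] != 0)
            return 0;
    return 1;
}
```

This keeps the speed of a plain memset while addressing the dead-store worry; whether zeroing rather than overwriting with pseudo-random data is acceptable remains the policy question discussed in the thread.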
Re: OpenSSL performance woes with ubsec crypto engine (Broadcom BCM5820/BCM5823/BMC5825/BMC582x)
OPENSSL_cleanse() doesn't zero memory regions, it fills them with pseudo-random data.
Edit crypto/mem_clr.c and replace that code with memset(ptr,'\0',len); and just clear the region - you'll see a significant performance boost if that's your major bottleneck.

Just be aware that some hypothetical compiler could decide to skip the memset - I can't remember which compiler that is, but it's the one that comes with the free tinfoil hats.

Peter

From: Thor Lancelot Simon <[EMAIL PROTECTED]>
To: openssl-dev@openssl.org
Date: 31/01/2008 06:19
Subject: Re: OpenSSL performance woes with ubsec crypto engine (Broadcom BCM5820/BCM5823/BMC5825/BMC582x)

On Wed, Jan 30, 2008 at 09:32:34PM +0200, Paul Sheer wrote:
> Hi,
>
> I have a BMC5825 card from Silicom that is supposed to do over
> 10'000 rsa per second.

Never going to happen. The context switches to talk to the accelerator are too expensive, and OpenSSL doesn't support (nor have any way to support) modern accelerators' SSL-handshake nor SSL-record operations.

2000/sec is a good place to be, on a client. Expect less on a server, unfortunately.

> I replaced OPENSSL_cleanse() {...} with { memset(); } already
> - IT WAS THE TOP FUNCTION IN MY FIRST GPROF RUN!

Yes. This is the OPENSSL_cleanse() of a maximum-sized SSL record, right at the outset of any session. It is amazingly expensive, but I have had trouble ascertaining whether it can be safely arranged for it to zero less data.

If any of the OpenSSL developers are listening, I would really love some feedback on this.

> The card supports hardware SHA1 and MD5 - but it's not used
> because OpenSSL divides each md operation into an init(),
> update() and final() stage. But the card wants a one shot.
> So the crypto card API does not fit the software API

Which version of OpenSSL are you using? It appears that in -current an engine can provide an HMAC method, and the Broadcom hardware does directly support HMAC.

The old way, where engines saw only raw hash operations (MD5 or SHA) even though SSLv3 or TLS was doing HMAC, was completely insane.

I wish there were a way for an engine to provide an SSL-record-encryption or -decryption method. Most modern accelerators do those, too.

Thor
Re: OpenSSL performance woes with ubsec crypto engine (Broadcom BCM5820/BCM5823/BMC5825/BMC582x)
No, I meant that I am already getting 2000/sec on the *server*.

By my calculations I should be able to get 3000/sec on the server with the optimizations I want to do.

> 2000/sec is a good place to be, on a client. Expect less on a
> server, unfortunately.
>
> > I replaced OPENSSL_cleanse() {...} with { memset(); } already
> > - IT WAS THE TOP FUNCTION IN MY FIRST GPROF RUN!
>
> Yes. This is the OPENSSL_cleanse() of a maximum-sized SSL record,
> right at the outset of any session. It is amazingly expensive, but
> I have had trouble ascertaining whether it can be safely arranged
> for it to zero less data.

It's the algorithm that's expensive. What's wrong with memset? *duck*

I memset the full packet length and it drops off the first page of gprof output. I mean, how paranoid do you need to be here?

> Which version of OpenSSL are you using?

openssl-0.9.8g

> It appears that in -current
> an engine can provide an HMAC method,

oh? ok thanks

> and the Broadcom hardware does
> directly support HMAC. The old way, where engines saw only raw hash
> operations (MD5 or SHA) even though SSLv3 or TLS was doing HMAC, was
> completely insane.
>
> I wish there were a way for an engine to provide an SSL-record-encryption
> or -decryption method. Most modern accelerators do those, too.

Yep

-paul
Re: OpenSSL performance woes with ubsec crypto engine (Broadcom BCM5820/BCM5823/BMC5825/BMC582x)
On Wed, Jan 30, 2008 at 09:32:34PM +0200, Paul Sheer wrote:
> Hi,
>
> I have a BMC5825 card from Silicom that is supposed to do over
> 10'000 rsa per second.

Never going to happen. The context switches to talk to the accelerator are too expensive, and OpenSSL doesn't support (nor have any way to support) modern accelerators' SSL-handshake nor SSL-record operations.

2000/sec is a good place to be, on a client. Expect less on a server, unfortunately.

> I replaced OPENSSL_cleanse() {...} with { memset(); } already
> - IT WAS THE TOP FUNCTION IN MY FIRST GPROF RUN!

Yes. This is the OPENSSL_cleanse() of a maximum-sized SSL record, right at the outset of any session. It is amazingly expensive, but I have had trouble ascertaining whether it can be safely arranged for it to zero less data.

If any of the OpenSSL developers are listening, I would really love some feedback on this.

> The card supports hardware SHA1 and MD5 - but it's not used
> because OpenSSL divides each md operation into an init(),
> update() and final() stage. But the card wants a one shot.
> So the crypto card API does not fit the software API

Which version of OpenSSL are you using? It appears that in -current an engine can provide an HMAC method, and the Broadcom hardware does directly support HMAC.

The old way, where engines saw only raw hash operations (MD5 or SHA) even though SSLv3 or TLS was doing HMAC, was completely insane.

I wish there were a way for an engine to provide an SSL-record-encryption or -decryption method. Most modern accelerators do those, too.

Thor
OpenSSL performance woes with ubsec crypto engine (Broadcom BCM5820/BCM5823/BMC5825/BMC582x)
Hi,

I have a BMC5825 card from Silicom that is supposed to do over 10'000 rsa per second.

In practice Proto Balance can do about 1900 fresh SSL connections per second, on an Intel Core2 Duo 2.2 GHz. But I think more work can vastly improve this.

(Without the card I get about 700 per second, so the card brings performance to about 270% of the software-only rate.)

I compiled with -O1 -g -pg and the gprof output is below.

I replaced OPENSSL_cleanse() {...} with { memset(); } already - IT WAS THE TOP FUNCTION IN MY FIRST GPROF RUN!

My test does not use sessions. It downloads a minimal web page, "", with 200 clients concurrently.

The malloc at the top is surprisingly expensive: it is called mostly from EVP_DigestInit_ex(). Refactoring to eliminate this malloc would be worthwhile I think.

The card supports hardware SHA1 and MD5 - but it's not used because OpenSSL divides each md operation into an init(), update() and final() stage. But the card wants a one shot. So the crypto card API does not fit the software API :-(

OpenSSL *really* needs to be fixed to properly support hardware md's.

I see Silicom's BMC586x/BMC5861/BMC5862 OpenSSL patch plugs in code everywhere to directly call their card's SSL signing function - a sorry solution indeed.

By eliminating the top 6 functions listed below, another 30% cpu can be saved at least.

Kind regards

-paul

--=--

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  s/call   s/call  name
15.22      0.28     0.28   861099     0.00     0.00  malloc   << !!
 8.15      0.43     0.15                             md5_block_asm_host_order
 6.52      0.55     0.12   461101     0.00     0.00  sha1_block_host_order
 4.89      0.64     0.09   234451     0.00     0.00  sha1_block_data_order
 3.80      0.71     0.07   340275     0.00     0.00  sslconnection_thread_bas
 2.72      0.76     0.05   725772     0.00     0.00  SHA1_Update
 2.72      0.81     0.05   673096     0.00     0.00  asn1_i2d_ex_primitive
 2.17      0.85     0.04        1     0.04     1.51  _thread_os_thread
 2.17      0.89     0.04                             RC4
 1.63      0.92     0.03   818060     0.00     0.00  asn1_ex_i2c
 1.63      0.95     0.03   438186     0.00     0.00  HMAC_Init_ex
 1.63      0.98     0.03   355077     0.00     0.00  SHA1_Final
 1.63      1.01     0.03    82992     0.00     0.00  ssl3_read_bytes
 1.09      1.03     0.02  1361354     0.00     0.00  EVP_MD_CTX_cleanup
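The mismatch between an init()/update()/final() software API and a one-shot hardware call can be bridged, at some cost, by buffering. The sketch below shows one possible adapter under stated assumptions: oneshot_digest is a toy stand-in for a card's one-shot call (it is not a real hash and not any vendor's API), and the md_adapter names are hypothetical. It is not how OpenSSL or the ubsec engine works.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a hardware one-shot digest (hypothetical). A real
 * engine would hand the whole buffer to the card here. */
unsigned long oneshot_digest(const unsigned char *data, size_t len)
{
    unsigned long h = 5381;
    size_t i;

    for (i = 0; i < len; i++)
        h = h * 33 + data[i];
    return h;
}

/* Accumulates update() data so final() can issue a single one-shot call. */
typedef struct {
    unsigned char *buf;
    size_t len;
    size_t cap;
} md_adapter;

int md_adapter_init(md_adapter *c)
{
    c->buf = NULL;
    c->len = 0;
    c->cap = 0;
    return 1;
}

/* Appends n bytes, growing the buffer as needed. Returns 0 on failure. */
int md_adapter_update(md_adapter *c, const void *data, size_t n)
{
    if (n > c->cap - c->len) {
        size_t ncap = c->cap ? c->cap : 64;
        unsigned char *p;

        while (ncap - c->len < n)
            ncap *= 2;
        p = realloc(c->buf, ncap);
        if (p == NULL)
            return 0;
        c->buf = p;
        c->cap = ncap;
    }
    if (n > 0)
        memcpy(c->buf + c->len, data, n);
    c->len += n;
    return 1;
}

/* Runs the one-shot digest over everything buffered, then releases it. */
unsigned long md_adapter_final(md_adapter *c)
{
    unsigned long d = oneshot_digest(c->buf, c->len);

    free(c->buf);
    md_adapter_init(c);
    return d;
}
```

The trade-off is visible in the profile above: buffering adds allocation and copying, and malloc is already the top entry, so an adapter like this would likely need a preallocated or pooled buffer to be worthwhile.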
RE: Static global - bug? (Re: Two valgrind warnings in OpenSSL -possible bug???)
> > 3) You cannot link to the pthreads library and still use fork, and

> David, you absolutely cannot link with pthreads and still use fork()
> It doesn't work except in a few very simplistic scenarios.
> -paul

What you are saying just doesn't make any sense. I agree that it is difficult to use fork properly in a process that creates multiple threads. But this has nothing whatsoever to do with linking with the pthreads library nor with compiling your code multi-threaded.

You can write code that has multiple internal models, say one that uses 'fork' the way unthreaded processes normally do and one that uses multiple-threads, and compile it using your platform's options for compiling a multi-threaded process. You can then select at run time whether to create threads or to call 'fork'. On no platform that claims POSIX compliance will you have any problems at all.

The problem you are complaining about simply does not exist. Yes, it's difficult to use 'fork' in a process that actually creates multiple threads. But whether or not a process *creates* multiple threads is not a compilation issue, it's a run time issue. So it can't possibly cause you to need to compile two versions.

I defy you to show me any platform where 'fork' breaks just because you specify multi-threaded compiler options or link to the threading library.

You can most certainly do:

    int multi_threaded;

    if(multi_threaded)
    {
        // some code that calls pthread_create
    }
    else
    {
        // some code that calls fork
    }

And compile it multi-threaded and link it to the pthreads library and you will have *NO* issues with 'fork'.

The issues you are talking about with 'fork' are all run-time issues. None of them require compiling two copies of a library.

DS
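The run-time selection described in this message can be made concrete. The following is a minimal sketch, assuming a POSIX system and a build with the platform's threading option (for example -pthread); run_job and thread_worker are hypothetical names chosen for illustration.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: stores a result through the pointer it was given. */
static void *thread_worker(void *arg)
{
    *(int *)arg = 42;
    return NULL;
}

/* Runs a trivial job either in a new thread or in a forked child,
 * chosen at run time. Returns 42 on success, -1 on error. */
int run_job(int multi_threaded)
{
    if (multi_threaded) {
        pthread_t tid;
        int result = 0;

        if (pthread_create(&tid, NULL, thread_worker, &result) != 0)
            return -1;
        if (pthread_join(tid, NULL) != 0)
            return -1;
        return result;
    } else {
        int status;
        pid_t pid = fork();

        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(42);              /* child: report via exit status */
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }
}
```

Both branches live in one binary linked against the threading library; only the branch taken at run time decides whether the process ever becomes multi-threaded.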
Re: memory corruption after using BN_mod_inverse
Hi, Yair Elharrar!

To me it looks bad. :-/

BN_sub doesn't handle this situation (r = b):

1) BN_sub calls BN_uadd(r,a,b), but r = b, then
2) BN_sub changes r->neg, but r = b, then
3) BN_sub calls BN_expand(r), then
4) BN_sub calls BN_ucmp(a,b), but b here is not the b that was passed in at the beginning of BN_sub, then
5) BN_sub calls BN_usub(r,a,b) or BN_usub(r,b,a), but ...

Maybe I used the wrong words, but my point was that calling BN_sub(Y,n,Y) from BN_mod_inverse leads to unpredictable behavior. This is not a question about the C standard but about how the function is used.
RE: memory corruption after using BN_mod_inverse
Hi Eugene,

ISO/IEC 9899 doesn't discuss this directly, but says in section 6.7.5.1:

"...const int *ptr_to_constant; int *const constant_ptr; The contents of any object pointed to by ptr_to_constant shall not be modified through that pointer..."

In BN_sub, "b" is a const BIGNUM *, hence the content referenced by it may not be modified _through b_. The content (*b) cannot be placed in read-only storage as it is referenced, not created, by this declaration. This implies that it's OK to modify it _through r_.

If you were to create a const BIGNUM Z, then attempt to BN_sub(&Z, n, &Z) then you would be violating constness by passing Z as the first (non-const) argument. As it stands, however, the code looks fine to me.

-Yair

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: Wednesday, January 30, 2008 5:16 PM
To: openssl-dev@openssl.org
Subject: Re: memory corruption after usin BN_mod_inverse

Hi, Yair Elharrar!

> Sorry, I don't think that breaks any const rules.
> See explanation and example in ISO/IEC 14882 section 7.1.5.1.

First of all, OpenSSL was written in C, so ISO/IEC 14882 is not a subject to refer to (it is the C++ standard).

Let's see in ISO/IEC 9899 section 6.7.3: "The implementation may place a const object that is not volatile in a read-only region of storage." That's enough.

Then, if you look in BN_sub you'll easily understand that behavior will be undefined if r and b point to the same object.

--
Eugene.
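The distinction drawn here, that const on a pointer parameter restricts writes through that pointer only, can be shown with a toy type. This is a minimal sketch: num and num_sub are hypothetical stand-ins and say nothing about how BN_sub is actually implemented. Whether a real function tolerates r aliasing b depends on the order of its internal reads and writes, which is the separate concern raised earlier in the thread.

```c
#include <stddef.h>

/* Toy stand-in for a big number (hypothetical, not BIGNUM). */
typedef struct {
    int value;
} num;

/* Computes r = a - b. Both inputs are read before r is written, so the
 * result is correct even when r points to the same object as a or b.
 * The const qualifiers only forbid writing through a and b. */
int num_sub(num *r, const num *a, const num *b)
{
    int diff = a->value - b->value;

    r->value = diff;
    return 1;
}
```

Calling num_sub(&y, &n, &y) with n = 10 and y = 3 leaves y holding 7: the object is modified through r, which is legal because y itself was not defined const.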
Re: memory corruption after using BN_mod_inverse
Hi, Yair Elharrar!

> Sorry, I don't think that breaks any const rules.
> See explanation and example in ISO/IEC 14882 section 7.1.5.1.

First of all, OpenSSL was written in C, so ISO/IEC 14882 is not a subject to refer to (it is the C++ standard).

Let's see in ISO/IEC 9899 section 6.7.3: "The implementation may place a const object that is not volatile in a read-only region of storage." That's enough.

Then, if you look in BN_sub you'll easily understand that behavior will be undefined if r and b point to the same object.

--
Eugene.
RE: memory corruption after using BN_mod_inverse
Sorry, I don't think that breaks any const rules.
See explanation and example in ISO/IEC 14882 section 7.1.5.1.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
Sent: Wednesday, January 30, 2008 3:59 PM
To: openssl-dev@openssl.org
Subject: memory corruption after usin BN_mod_inverse

Hello!

During the OpenSSL source investigation I found some strange call in function BN_mod_inverse:

    ...
    if (sign < 0)
    {
        if (!BN_sub(Y,n,Y)) goto err;
    }
    ...

But! Declaration of BN_sub looks like this:

    int BN_sub(BIGNUM *r, const BIGNUM *a, const BIGNUM *b)

In some circumstances r will be expanded in BN_sub, so original call "BN_sub(Y,n,Y)" breaks the rule of const.

--
Eugene.
memory corruption after using BN_mod_inverse
Hello!

During the OpenSSL source investigation I found some strange call in function BN_mod_inverse:

    ...
    if (sign < 0)
    {
        if (!BN_sub(Y,n,Y)) goto err;
    }
    ...

But! Declaration of BN_sub looks like this:

    int BN_sub(BIGNUM *r, const BIGNUM *a, const BIGNUM *b)

In some circumstances r will be expanded in BN_sub, so original call "BN_sub(Y,n,Y)" breaks the rule of const.

--
Eugene.
[openssl.org #1637] Memory leak in SSL_set_tlsext_host_name
Memory allocated in SSL_set_tlsext_host_name() isn't freed in SSL_free(). As a workaround one can do SSL_set_tlsext_host_name(ssl, NULL) before SSL_free(), but I don't imagine this was what was meant to be implemented.

The bug is easy to replicate using the code below, using valgrind or your favorite memory profiler. This was not a problem back in 0.9.8b, but 0.9.8f and 0.9.8g have this problem.

    #include <openssl/ssl.h>

    int main(int argc, char **argv)
    {
        SSL *ssl;
        SSL_CTX *ctx;

        SSL_library_init();
        ctx = SSL_CTX_new(SSLv23_client_method());
        ssl = SSL_new(ctx);
        /* Allocates a copy of the host name inside the SSL object. */
        SSL_set_tlsext_host_name(ssl, "hostname");
        /* The copy is reported as leaked after this call. */
        SSL_free(ssl);
        SSL_CTX_free(ctx);
        CRYPTO_cleanup_all_ex_data();
        return 0;
    }
Re: Static global - bug? (Re: Two valgrind warnings in OpenSSL -possible bug???)
> So you had a bug in your code. So what?

No bug - read this:

http://www.unix.org/version2/whatsnew/threadspaper.ps :

Registration of fork handlers (pthread_atfork( )). The fork handlers are routines that are to be executed in association with calls to the fork( ) function. There are three classes of fork handlers: prepare, parent, and child. Prepare fork handlers are executed prior to fork() processing, in the context of the calling thread. Parent fork handlers are executed upon completion of fork() processing in the parent, again in the context of the calling thread. Child fork handlers are executed upon completion of fork() processing in the child, in the context of the single thread initially existing in the child process.

Fork handlers are envisioned as a mechanism for dealing with the problem of orphaned mutexes that can occur when a multi-threaded process calls fork(). The problem arises when threads other than the calling thread own mutexes at the time of the call to fork( ). Since the non-calling threads are not replicated in the child process, the child process is created with mutexes locked by non-existent threads. These mutexes can therefore never be unlocked.

Fork handlers are intended to resolve the problem of orphaned mutexes in the following way. Prepare fork handlers can be written to lock all mutexes. In this way, orphaned mutexes are avoided, and the resources protected by the mutexes are not left in inconsistent states. This is due to the fact that the calling thread itself, which is replicated in the child process, has locked all mutexes. Thus, both the parent and child processes have all mutexes locked upon completion of fork() processing, at which time the parent and child fork handlers execute. The parent and child fork handlers unlock mutexes locked by the prepare fork handler.

Fork handlers are especially useful in enabling independently-developed libraries and application programs to protect themselves from one another. A multi-threaded library can protect itself from application programs that issue fork( ) operations, possibly without even knowing that the library is multi-threaded, by providing fork handlers. Similarly, an application program can protect itself from fork( ) operations issued by library functions.

> 3) You cannot link to the pthreads library and still use fork, and

David, you absolutely cannot link with pthreads and still use fork()
It doesn't work except in a few very simplistic scenarios.

-paul
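The prepare/parent/child scheme quoted above maps directly onto pthread_atfork(). The following is a minimal sketch of that pattern for a library with a single internal mutex, assuming a POSIX system; lib_lock, lib_init and fork_and_check are hypothetical names used only for illustration.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The library's internal lock (hypothetical). */
static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

/* Prepare handler: take the lock so no other thread holds it at fork(). */
static void prepare(void) { pthread_mutex_lock(&lib_lock); }

/* Parent and child handlers: release the lock taken by prepare(). */
static void parent(void) { pthread_mutex_unlock(&lib_lock); }
static void child(void)  { pthread_mutex_unlock(&lib_lock); }

/* Registers the fork handlers. Returns 0 on success. */
int lib_init(void)
{
    return pthread_atfork(prepare, parent, child);
}

/* Forks and has the child check that lib_lock is usable.
 * Returns the child's exit status (0 if the lock was free), -1 on error. */
int fork_and_check(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        int ok = pthread_mutex_trylock(&lib_lock) == 0;

        if (ok)
            pthread_mutex_unlock(&lib_lock);
        _exit(ok ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A library with several locks would take all of them in prepare() in a fixed order and release them in the other two handlers; this sketch does not cover state other than mutexes that may be inconsistent in the child.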