Hi Matt,

Trying out your patch. Will keep you posted. In the meantime we ran into more valgrind issues on the server end. Can you please comment on them?

==621== 8,680 (1,488 direct, 7,192 indirect) bytes in 62 blocks are definitely lost in loss record 899 of 952
==621==    at 0x4A05F80: malloc (vg_replace_malloc.c:296)
==621==    by 0x5BFCC86: default_malloc_ex (mem.c:79)
==621==    by 0x5BFD315: CRYPTO_malloc (mem.c:308)
==621==    by 0x5D2414D: pitem_new (pqueue.c:73)
==621==    by 0x5958F74: dtls1_buffer_message (d1_both.c:1233)
==621==    by 0x594E3B2: dtls1_send_server_done (d1_srvr.c:1032)
==621==    by 0x594D696: dtls1_accept (d1_srvr.c:564)
==621==    by 0x595C555: SSL_accept (ssl_lib.c:940)
==621==    by 0x59539F7: dtls1_listen (d1_lib.c:491)
==621==    by 0x59533BF: dtls1_ctrl (d1_lib.c:267)
==621==    by 0x595CAF2: SSL_ctrl (ssl_lib.c:1106)
==621==    by 0x416229: server_ssl_event_cb (server.c:3823)
==621==
==621== 67,766 (1,488 direct, 66,278 indirect) bytes in 62 blocks are definitely lost in loss record 933 of 952
==621==    at 0x4A05F80: malloc (vg_replace_malloc.c:296)
==621==    by 0x5BFCC86: default_malloc_ex (mem.c:79)
==621==    by 0x5BFD315: CRYPTO_malloc (mem.c:308)
==621==    by 0x5D2414D: pitem_new (pqueue.c:73)
==621==    by 0x5958F74: dtls1_buffer_message (d1_both.c:1233)
==621==    by 0x594FAD4: dtls1_send_server_certificate (d1_srvr.c:1612)
==621==    by 0x594D367: dtls1_accept (d1_srvr.c:426)
==621==    by 0x595C555: SSL_accept (ssl_lib.c:940)
==621==    by 0x59539F7: dtls1_listen (d1_lib.c:491)
==621==    by 0x59533BF: dtls1_ctrl (d1_lib.c:267)
==621==    by 0x595CAF2: SSL_ctrl (ssl_lib.c:1106)
==621==    by 0x416229: server_ssl_event_cb (server.c:3823)
==621==
==621== LEAK SUMMARY:
==621==    definitely lost: 2,976 bytes in 124 blocks
==621==    indirectly lost: 73,470 bytes in 248 blocks
==621==      possibly lost: 288 bytes in 1 blocks

Thanks
-Praveen

On Tue, Nov 25, 2014 at 6:28 AM, Matt Caswell via RT <r...@openssl.org> wrote:

> On Mon Nov 24 21:52:04 2014, prav...@viptela.com wrote:
> > * state = 4384,*
>
> This is SSL3_ST_CR_SRVR_HELLO_A, i.e. we are trying to read a ServerHello. This
> confirms what we expected.
>
> > > So if s->init_num is 0 then frag_len is 0 and frag->fragment gets set to
> > > NULL.
>
> What I missed in the above is that there are some OPENSSL_assert calls in
> dtls1_buffer_message that check init_num, so it cannot be 0. Something else is
> happening.
>
> > *Agreed. All good points. Just another data point: we ran valgrind
> > on another node and saw a leak in this related code. See if this helps
> > you.*
> >
> > ==697== HEAP SUMMARY:
> > ==697==     in use at exit: 1,282,108 bytes in 20,788 blocks
> > ==697==   total heap usage: 664,349 allocs, 643,561 frees, 105,419,006 bytes allocated
> > ==697==
> > ==697== 120 bytes in 1 blocks are definitely lost in loss record 27 of 96
> > ==697==    at 0x4A05F80: malloc (vg_replace_malloc.c:296)
> > ==697==    by 0x5BFBC86: default_malloc_ex (mem.c:79)
> > ==697==    by 0x5BFC315: CRYPTO_malloc (mem.c:308)
> > ==697==    by 0x5955875: dtls1_hm_fragment_new (d1_both.c:199)
> > ==697==    by 0x5956817: dtls1_reassemble_fragment (d1_both.c:625)
> > ==697==    by 0x595720A: dtls1_get_message_fragment (d1_both.c:852)
> > ==697==    by 0x5956174: dtls1_get_message (d1_both.c:443)
> > ==697==    by 0x59504DA: dtls1_get_hello_verify (d1_clnt.c:918)
> > ==697==    by 0x594F5AB: dtls1_connect (d1_clnt.c:360)
> > ==697==    by 0x595B591: SSL_connect (ssl_lib.c:949)
> > ==697==    by 0x430409: ssl_connect_timer_cb (vdaemon_peer.c:303)
> > ==697==    by 0x48573E: timer_exec_pri (timer.c:612)
> > ==697==
> > ==697== LEAK SUMMARY:
> > ==697==    definitely lost: 120 bytes in 1 blocks
> > ==697==    indirectly lost: 0 bytes in 0 blocks
> > ==697==      possibly lost: 0 bytes in 0 blocks
> > ==697==    still reachable: 1,281,988 bytes in 20,787 blocks
> > ==697==         suppressed: 0 bytes in 0 blocks
> > ==697== Reachable blocks (those to which a pointer was found) are not shown.
> > ==697== To see them, rerun with: --leak-check=full --show-leak-kinds=all
> > ==697==
> > ==697== For counts of detected and suppressed errors, rerun with: -v
> > ==697== Use --track-origins=yes to see where uninitialised values come from
> >
> > ==697== ERROR SUMMARY: 126394 errors from 117 contexts (suppressed: 1 from 1)
>
> That's very interesting. I've tracked that down to a problem in
> dtls1_clear_queues which is failing to correctly free buffered fragments. I've
> attached a patch. Please let me know if you have any problems with it.
> Unfortunately I think this is unconnected to your main problem.
>
> > > If I sent you some instrumented code would you be able to apply it and see
> > > if that helps us narrow down what's going on?
> >
> > *[viptela.com <http://viptela.com>]*
> >
> > *Of course. But as I mentioned earlier, we don't know the likelihood of this
> > happening again. Please send me any instrumented patch. We will keep
> > trying.*
>
> Ok, thanks. I've attached a second patch which adds a number of OPENSSL_assert
> calls at various points to check that frag->fragment is not NULL. I'm hoping it
> will help us track down why it's not being correctly set. If you get another
> crash with this patch applied, then please capture the core and let me know
> what you find out.
>
> Thanks
>
> Matt

--
Regards
-Praveen
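
For reference, a minimal sketch of the ownership pattern the server traces point at. Each of the 62 directly lost blocks is 24 bytes (1,488 / 62) and comes from pitem_new, with far more memory lost indirectly behind it, which is what you would see if queue items became unreachable while still owning a fragment and its payload. Everything below (frag_t, item_t, queue_t, xalloc, xfree, queue_push, queue_clear, live_blocks) is a hypothetical stand-in for illustration, not the OpenSSL structures or Matt's patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; NOT the OpenSSL structures. */
typedef struct frag { unsigned char *data; size_t len; } frag_t;
typedef struct item { frag_t *frag; struct item *next; } item_t;
typedef struct queue { item_t *head; size_t count; } queue_t;

static long live_blocks = 0; /* crude allocation counter for the demo */

static void *xalloc(size_t n)
{
    void *p = malloc(n);
    if (p != NULL)
        live_blocks++;
    return p;
}

static void xfree(void *p)
{
    if (p != NULL) {
        live_blocks--;
        free(p);
    }
}

/* Buffer a copy of msg: three allocations (item, fragment, payload). */
static int queue_push(queue_t *q, const unsigned char *msg, size_t len)
{
    item_t *it;
    frag_t *f;
    unsigned char *buf;

    if (len == 0)
        return 0; /* refuse empty messages so data is never NULL */
    it = xalloc(sizeof *it);
    f = xalloc(sizeof *f);
    buf = xalloc(len);
    if (it == NULL || f == NULL || buf == NULL) {
        xfree(buf);
        xfree(f);
        xfree(it);
        return 0;
    }
    memcpy(buf, msg, len);
    f->data = buf;
    f->len = len;
    it->frag = f;
    it->next = q->head;
    q->head = it;
    q->count++;
    return 1;
}

/* free_items != 0: free payload, fragment and item for every entry.
 * free_items == 0: only reset the head, so every entry is leaked
 * (kept here purely for contrast). */
static void queue_clear(queue_t *q, int free_items)
{
    item_t *it = q->head;

    while (free_items && it != NULL) {
        item_t *next = it->next;
        /* Same idea as the assert-based instrumentation: fail loudly
         * if an entry has no fragment or no payload. */
        assert(it->frag != NULL && it->frag->data != NULL);
        xfree(it->frag->data);
        xfree(it->frag);
        xfree(it);
        it = next;
    }
    q->head = NULL;
    q->count = 0;
}
```

Calling queue_clear(&q, 1) after a few queue_push calls should bring live_blocks back to 0. With queue_clear(&q, 0), valgrind would be expected to report the head item as definitely lost and the blocks reachable only through it as indirectly lost, the same direct/indirect split as in the traces above.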