Update - I have isolated the intermittent issue to the interchangeable functions s_mp_exptmod_fast() and s_mp_exptmod(). By default s_mp_exptmod_fast() is compiled in instead of s_mp_exptmod() [BN_MP_EXPTMOD_FAST_C], but both functions fail intermittently. I decided to focus on s_mp_exptmod() because it's slightly simpler.
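The isolation approach used here, running a deterministic routine several times on identical inputs and comparing the results, can be sketched independently of Dropbear and libtommath. What follows is a minimal sketch under stated assumptions, not the actual debug code: modexp_u32() and count_mismatches() are hypothetical names, and a small 32-bit square-and-multiply stands in for mp_exptmod().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_exptmod(): computes (base^exp) mod m using
 * square-and-multiply. Operands are 32-bit, so every product fits in 64 bits. */
static uint32_t modexp_u32(uint32_t base, uint32_t exp, uint32_t m)
{
    assert(m != 0);
    uint64_t result = 1 % m;
    uint64_t b = base % m;
    while (exp > 0) {
        if (exp & 1u) {
            result = (result * b) % m;
        }
        b = (b * b) % m;
        exp >>= 1;
    }
    return (uint32_t)result;
}

/* Hypothetical helper: repeats the computation `runs` times on identical
 * inputs and returns how many results differ from the first one.
 * The inputs are re-read through volatile copies on every iteration so the
 * compiler cannot fold the repeated calls into a single evaluation. */
static size_t count_mismatches(uint32_t base, uint32_t exp, uint32_t m, size_t runs)
{
    volatile uint32_t vbase = base, vexp = exp, vm = m;
    size_t mismatches = 0;
    uint32_t first = 0;
    for (size_t i = 0; i < runs; i++) {
        uint32_t r = modexp_u32(vbase, vexp, vm);
        if (i == 0) {
            first = r;
        } else if (r != first) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A deterministic routine should always give a mismatch count of 0, so any non-zero count points away from the algorithm and toward the environment (memory, CPU state, scheduling). Comparing against the first run only shows that the runs disagree; it does not say which run is the wrong one.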
s_mp_exptmod() is called indirectly by rsa.c::buf_put_rsa_sign()'s call to mp_exptmod(). In the intermittent failing case, if I immediately call mp_exptmod() / s_mp_exptmod() again with the same source mp_int structures, it yields the correct data. Example, including my debug code:

DEF_MP_INT(rsa_s_backup);
DEF_MP_INT(rsa_s_backup_2);
mp_copy (&rsa_s, &rsa_s_backup);
mp_copy (&rsa_s, &rsa_s_backup_2);
if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s) != MP_OKAY) {
    dropbear_exit("RSA error");
}
if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup) != MP_OKAY) {
    dropbear_exit("RSA error");
}
if (mp_exptmod(&rsa_tmp1, key->d, key->n, &rsa_s_backup_2) != MP_OKAY) {
    dropbear_exit("RSA error");
}
printf("after mp_exptmod\n");
dump_mp_int("rsa_s", &rsa_s);
dump_mp_int("rsa_s_backup", &rsa_s_backup);
dump_mp_int("rsa_s_backup_2", &rsa_s_backup_2);
comp_mp_int("rsa_s", "rsa_s_backup", &rsa_s, &rsa_s_backup);
comp_mp_int("rsa_s_backup", "rsa_s_backup_2", &rsa_s_backup, &rsa_s_backup_2);
mp_clear(&rsa_s_backup);
mp_clear(&rsa_s_backup_2);

Sample output from a failure, showing the first portion of each mp_int->dp. Here rsa_s (the first call) holds the wrong data; the two backups match each other:

after mp_exptmod
rsa_s [0xbef6c358]:
0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 e1 8f 00  J...........0...
rsa_s->dp [0x008fe130]:
0000 05 fb c0 0f 68 91 ff 0a 9f 05 57 0b 35 a2 bd 05  ....h.....W.5...
0010 57 ec a0 0b 34 3c b1 0f fa 8b b5 08 ed aa 9c 04  W...4<..........
0020 7e 88 bb 04 12 42 51 05 9a 6d 7d 0a 98 ef 12 0c  ~....BQ..m}.....
0030 76 e0 f4 0f ea 89 d7 0c 87 b0 76 03 12 a1 2d 0e  v.........v...-.
0040 d7 3c df 06 0f 54 92 04 23 90  .<...T..#.
rsa_s_backup [0xbef6c398]:
0000 4a 00 00 00 c0 00 00 00 00 00 00 00 00 d8 8f 00  J...............
rsa_s_backup->dp [0x008fd800]:
0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06  ............Ea..
0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d  ..Y..I$.P.....\.
0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07  ..<.3...(....i..
0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08  ............q...
0040 bc 4f c7 0d de 73 f9 06 0d bf  .O...s....
rsa_s_backup_2 [0xbef6c3a8]:
0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 d1 8f 00  J...............
rsa_s_backup_2->dp [0x008fd1e0]:
0000 ec 9f a0 01 d4 8e e8 07 c3 ae df 0b 45 61 e6 06  ............Ea..
0010 a1 99 59 03 d7 49 24 02 50 a6 ac 0a de a2 5c 0d  ..Y..I$.P.....\.
0020 cb b7 3c 05 33 cb da 08 28 10 f2 04 14 69 d6 07  ..<.3...(....i..
0030 8c 8e a5 04 f5 fc 92 0c ba 88 d9 04 71 b4 b2 08  ............q...
0040 bc 4f c7 0d de 73 f9 06 0d bf  .O...s....
rsa_s and rsa_s_backup differ

Sometimes it's the second or third call that yields the incorrect data. In this instance it was the second call:

after mp_exptmod
rsa_s [0xbe9a6358]:
0000 4a 00 00 00 c0 00 00 00 00 00 00 00 30 c1 40 02  J...........0.@.
rsa_s->dp [0x0240c130]:
0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06  %....b...-......
0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08  ?....]..-.K.l<r.
0020 5d 52 6a 08 21 4c dd 01 a2 59 1a 03 33 16 97 0f  ]Rj.!L...Y..3...
0030 c7 69 c2 08 0b 61 d6 03 b9 86 fc 01 27 15 c8 0c  .i...a......'...
0040 dd 03 b1 04 78 c7 9f 0f d8 9c  ....x.....
rsa_s_backup [0xbe9a6398]:
0000 4a 00 00 00 c0 00 00 00 00 00 00 00 00 b8 40 02  J.............@.
rsa_s_backup->dp [0x0240b800]:
0000 df 86 0c 0a 6c 2f 68 09 f9 a1 37 01 26 02 e7 0b  ....l/h...7.&...
0010 69 5c b8 0e 0b 95 3a 0d 26 24 00 0e 97 6f dc 0b  i\....:.&$...o..
0020 64 95 ed 0a c0 75 53 03 66 3d ff 0b 26 4b ce 09  d....uS.f=..&K..
0030 89 12 d2 03 9b 9b 0b 09 19 2c 5a 00 2c 99 fc 0b  .........,Z.,...
0040 ea ad 61 09 38 e1 6a 0a 49 a5  ..a.8.j.I.
rsa_s_backup_2 [0xbe9a63a8]:
0000 4a 00 00 00 c0 00 00 00 00 00 00 00 e0 b1 40 02  J.............@.
rsa_s_backup_2->dp [0x0240b1e0]:
0000 25 b9 db 00 ec 62 00 0d 80 2d b0 0d 00 13 d3 06  %....b...-......
0010 3f ec 8b 0a af 5d e9 03 2d f4 4b 0c 6c 3c 72 08  ?....]..-.K.l<r.
0020 5d 52 6a 08 21 4c dd 01 a2 59 1a 03 33 16 97 0f  ]Rj.!L...Y..3...
0030 c7 69 c2 08 0b 61 d6 03 b9 86 fc 01 27 15 c8 0c  .i...a......'...
0040 dd 03 b1 04 78 c7 9f 0f d8 9c  ....x.....
rsa_s and rsa_s_backup differ

I have heavily instrumented s_mp_exptmod(), but due to the complexity of the calculations performed it's proving very difficult to track down the root cause. What I can tell so far is that the failure point within s_mp_exptmod() varies from instance to instance. That is odd, because the only potential variable between my three back-to-back invocations is the memory allocations (buffer locations) triggered by mp_exptmod(), and even then the invocations usually get the same buffer addresses. I tried various scaffolding code on the core memory allocation routines to isolate any buffer overruns/overwrites the logic might be performing, including padding each allocation with a large block of bytes, but the intermittent failure still occurs.

The behavior I'm observing almost looks as if the execution context (i.e., processor registers) is being corrupted, because the failure point moves around the various elements of the logic within the routine from one failure to the next. Sometimes I see an early-stage mp_int structure with the wrong data, sometimes one that has undergone many transformations, all within s_mp_exptmod().

Do you know if OpenWrt has any way to disable SMP at runtime, or a method or technique to provide a critical section around a block of code to prevent any preemptive task switches?

________________________________
From: Horshack <horsh...@live.com>
Sent: Thursday, March 19, 2020 7:11 AM
To: Matt Johnston <m...@ucc.asn.au>
Cc: dropbear@ucc.asn.au <dropbear@ucc.asn.au>
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Thanks Matt, I'll give that a shot when I get a build environment set up for the server-side/openwrt. I also plan to look at the RSA blinding logic in buf_put_rsa_sign().
Considering the intermittency of the issue, I'm thinking it has some correlation with, or dependency on, the random data generated or transformed by that logic. Crypto is well outside my core competency so it'll be slow going.

________________________________
From: Matt Johnston <m...@ucc.asn.au>
Sent: Thursday, March 19, 2020 7:04 AM
To: Horshack <horsh...@live.com>
Cc: dropbear@ucc.asn.au <dropbear@ucc.asn.au>
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Hi,

The first thing I'd try would be to build with -O0 compilation flags to rule out compiler optimisations doing something strange.

Cheers,
Matt

On Thu 19/3/2020, at 3:42 pm, Horshack <horsh...@live.com> wrote:

Update - I cloned and built the dbclient source so I could enable the debug tracing facility to get more information about the 'Bad hostkey signature'. The intermittent failure is detected in recv_msg_kexdh_reply() -> buf_rsa_verify() -> mp_cmp(). If I bypass the buf_rsa_verify() call then the session proceeds normally without issue, which indicates everything else in the key exchange is working 100% of the time. I'll dig deeper to see why the signed host key sent by the server is wrong.

________________________________
From: Horshack
Sent: Wednesday, March 18, 2020 9:36 AM
To: dropbear@ucc.asn.au <dropbear@ucc.asn.au>
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Hi,

I have a strange issue on my Netgear X4S R7800. Running either DD-WRT or OpenWrt, approximately 30-70% of my SSH login attempts fail. For OpenSSH clients the error reported is "error in libcrypto". For the PuTTY client the error is more descriptive - "Signature from server's host key is invalid". The failure occurs even when using the OpenSSH client built in to OpenWrt itself (i.e., SSH'ing into the router from the router via an existing remote SSH session).
The failure appears to be at the tail end of the key exchange, before authentication. I've tried varying the cipher (aes128-ctr / aes256-ctr), the MAC (hmac-sha1 / hmac-sha2-256), and the key exchange algo (curve25519-sha256 / curve25519-sha...@libssh.org / diffie-hellman-group14-sha256 / diffie-hellman-group14-sha1) but the intermittent failure still occurs. The frequency of failure is about the same for all these configuration options except for diffie-hellman-group14-sha256, which fails much more frequently - it sometimes takes hundreds of attempts to succeed. Perhaps that will provide a clue to the underlying cause.

Once an SSH login succeeds the connection is stable. However, if I initiate a manual rekey operation via ~R then the key re-exchange fails. The router is otherwise very stable with no noticeable issues.

I'm an embedded firmware engineer but have never worked on DD-WRT/OpenWrt firmware or dropbear. I have a conceptual understanding of the key exchange algo but haven't looked at the actual code of any implementation, including Dropbear's. I'm seeking ideas on how to troubleshoot this issue. Considering the problem is intermittent, I'm thinking it's some variable in the key generation/exchange algorithm that's failing due to some issue with the router or, a more remote possibility, an issue with the Dropbear implementation.

Here are pastebin links to the PuTTY full debug logs (w/raw data dumps) for both the failure and success cases:

Failure Case: https://pastebin.com/MS2BtFmW
Success Case: https://pastebin.com/c4j66Ga9

The only message I see from dropbear for a failed connection attempt is:

authpriv.info dropbear[15948]: Child connection from 192.168.1.249:54819
authpriv.info dropbear[15948]: Exit before auth: Disconnect received

Thanks!