I excluded context switches as a possible culprit by looping until a corruption 
happened for which no context switches occurred while the test was running 
(i.e., at the start of the test I would save the number of involuntary/voluntary 
context switches from /proc/<pid>/status, then check those counts again after 
the failure; if they were different I restarted the test and kept looping until 
a failure happened in which the context switch counts were the same).
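
The context-switch check described above can be sketched in C roughly as 
follows. This is only an illustration, not the code from the actual test 
utility: the struct and function names (ctx_switches, read_ctx_switches, 
ctx_switches_equal) are made up for this example, and it reads 
/proc/self/status, i.e. the calling process's own status file.

```c
#include <stdio.h>

/* Context-switch counters for one process (hypothetical helper type). */
struct ctx_switches {
    unsigned long voluntary;
    unsigned long involuntary;
};

/* Read both counters from /proc/self/status.
 * Returns 0 on success, -1 if the file or either field is missing. */
static int read_ctx_switches(struct ctx_switches *out)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    int found = 0;

    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        /* The two field names differ in their first character, so each
         * pattern can only match its own line. */
        if (sscanf(line, "voluntary_ctxt_switches: %lu", &out->voluntary) == 1)
            found |= 1;
        else if (sscanf(line, "nonvoluntary_ctxt_switches: %lu",
                        &out->involuntary) == 1)
            found |= 2;
    }
    fclose(f);
    return (found == 3) ? 0 : -1;
}

/* Returns 1 if no context switch of either kind happened between the
 * two snapshots, 0 otherwise. */
static int ctx_switches_equal(const struct ctx_switches *before,
                              const struct ctx_switches *after)
{
    return before->voluntary == after->voluntary &&
           before->involuntary == after->involuntary;
}
```

A test loop would call read_ctx_switches() before starting the test and again 
after a failure, and only treat the failure as switch-free when 
ctx_switches_equal() returns 1.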

________________________________
From: dropbear-bounces+horshack=live....@ucc.asn.au 
<dropbear-bounces+horshack=live....@ucc.asn.au> on behalf of Sebastian 
Gottschall <s.gottsch...@dd-wrt.com>
Sent: Tuesday, March 24, 2020 9:13 PM
To: dropbear@ucc.asn.au <dropbear@ucc.asn.au>
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800


If the corruption is caused by a context switch, the problem may be in the 
kernel.
Try disabling "CONFIG_KERNEL_MODE_NEON" in the kernel config. This will disable 
some of the kernel's crypto assembly code.

Am 24.03.2020 um 16:11 schrieb Matt Johnston:
Good work narrowing down a test case there.
That's an interesting finding - I guess it might be worth posting on OpenWRT 
lists/forum to try to find other testers.
Could it be power related if the tight multiplication loop is stressing it 
somehow? It doesn't seem to be using the Neon instructions for anything apart 
from loads/stores though - is there something that the compiler should be doing 
when mixing Neon and non-Neon operations?

Cheers,
Matt

(Your emails got held up being over 100kB, I've trimmed the reply below and let 
them through. Apologies to everyone for the stale old one that got let through 
with them just now, I wasn't looking closely)

On Tue 24/3/2020, at 11:23 am, Horshack <horsh...@live.com> wrote:

I was able to isolate the issue to just a handful of assembly instructions 
within fast_s_mp_sqr(), related to the squaring loop. I broke that code out 
into a separate utility that reproduces the issue within a few seconds. The 
failure is somewhat sensitive to the data pattern and very sensitive to timing, 
indicating a likely memory/data path issue within my particular router. I'm 
guessing it's the IPQ8065 and not the SDRAM because I can get it to fail with a 
tiny data set that easily fits within DCACHE. I can alter the frequency of the 
failure with a single ARM memory barrier instruction, which at first implied a 
superscalar data ordering condition, but the memory barrier also alters the 
timing through the DCACHE, so that is likely the effect it's having. I was able 
to exclude VFP/Neon register corruption as the cause with some test code. I 
also excluded any context switch-specific issue by measuring the number of 
context switches in /proc/<pid>/status and catching a failure where no switches 
had occurred. I also modified the affinity so the utility runs on just one 
processor to rule out a specific core having the issue.

I put the source and binary of my utility on github - if anyone on this mailing 
list has this model of router, can you give it a try if possible? You only need the 
ipq8065-sqrbug (binary) and run-ipq8065-sqrbug.sh (script). Here's the link to 
the repository: https://github.com/horshack-dpreview/ipq8065-sqrbug
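
For anyone reading along without the router, here is a minimal sketch of the 
general repeat-and-compare idea behind such a reproducer: square the same small 
operand over and over and compare each result against a reference computed 
once. It is not the code from ipq8065-sqrbug. The names (sqr_digits, 
count_mismatches), the 16-digit operand, the 28-bit digit width and the plain C 
column-wise squaring loop are all assumptions made for illustration, whereas 
the real failure was tied to specific assembly instructions.

```c
#include <stdint.h>
#include <string.h>

#define NDIGITS    16                       /* tiny operand, far smaller than a data cache */
#define DIGIT_BIT  28                       /* assumed digit width: 28-bit digits in 32-bit words */
#define DIGIT_MASK ((1u << DIGIT_BIT) - 1u)

/* Column-wise squaring: out[0 .. 2*NDIGITS-1] = a * a.
 * Every input digit must be <= DIGIT_MASK so the 64-bit accumulator
 * (the double-width word) cannot overflow. */
static void sqr_digits(const uint32_t *a, uint32_t *out)
{
    uint64_t carry = 0;

    for (int k = 0; k < 2 * NDIGITS; k++) {
        uint64_t acc = carry;
        int lo = (k < NDIGITS) ? 0 : k - NDIGITS + 1;
        int hi = (k < NDIGITS) ? k : NDIGITS - 1;

        for (int i = lo; i <= hi; i++)
            acc += (uint64_t)a[i] * a[k - i];   /* all products landing in column k */

        out[k] = (uint32_t)(acc & DIGIT_MASK);
        carry = acc >> DIGIT_BIT;
    }
}

/* Square the same input `iterations` times and count the results that
 * differ from the first one. The computation is deterministic, so any
 * nonzero count means a result was corrupted somewhere along the way. */
static unsigned long count_mismatches(const uint32_t *a, unsigned long iterations)
{
    uint32_t reference[2 * NDIGITS];
    uint32_t result[2 * NDIGITS];
    unsigned long mismatches = 0;

    sqr_digits(a, reference);
    for (unsigned long n = 0; n < iterations; n++) {
        sqr_digits(a, result);
        if (memcmp(result, reference, sizeof reference) != 0)
            mismatches++;
    }
    return mismatches;
}
```

On healthy hardware count_mismatches() should return 0 for any input and any 
iteration count. Note that the reference is computed on the same machine, so 
this only detects results that are inconsistent from run to run.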


________________________________
From: Horshack <horsh...@live.com>
Sent: Saturday, March 21, 2020 7:54 AM
To: dropbear@ucc.asn.au <dropbear@ucc.asn.au>
Subject: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

Including mailing list for my last two messages below...

Begin forwarded message:

From: Horshack <horsh...@live.com>
Date: March 21, 2020 at 7:35:18 AM PDT
To: Matt Johnston <m...@ucc.asn.au>
Cc: "dropbear@ucc.asn.au" <dropbear@ucc.asn.au>
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800


Disassembly of fast_s_mp_sqr() and other libtommath functions reveals gcc is 
utilizing the ARM NEON SIMD instructions and registers for calculations 
involved with libtommath's mp_word scalar. Based on the 64-bit word corruption 
I see, I'm guessing the SIMD registers aren't being preserved/restored properly 
somewhere, probably during a context switch, specifically s16–s31 (d8–d15, 
q4–q7), which AAPCS says must be preserved and which I see being used in the 
disassembly of fast_s_mp_sqr(). I'll write some test code later today to see 
if this is the case, and if so, try to track down where and why the registers 
aren't being preserved.

________________________________
From: Horshack <horsh...@live.com>
Sent: Saturday, March 21, 2020 1:11 AM
To: Matt Johnston <m...@ucc.asn.au>
Cc: dropbear@ucc.asn.au <dropbear@ucc.asn.au>
Subject: Re: SSH key exchange fails 30-70% of the time on Netgear X4S R7800

I have one of the failure paths isolated down to a single corrupt 64-bit word 
in memory, which required a significant amount of code instrumentation to 
achieve. I implemented a code execution history buffer that gets filled at 
various checkpoints within s_mp_exptmod() and some of the modules called by it. 
To facilitate this history mechanism I packaged all of s_mp_exptmod()'s local 
variables inside a structure, which consists of the local scalar vars in 
addition to crc32's of all the mp_int data structures, with a separate crc32 
of the mp_int.dp payload (data). When a failure occurs, i.e. one or more of the 
three back-to-back debug invocations of s_mp_exptmod() yields a mismatching 
signed key result, I dump out the history elements for each of the invocations 
to determine the first code checkpoint where the failing invocation departed 
from the known correct invocation.
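
A stripped-down sketch of such a checkpoint history might look like the code 
below. It is an illustration only, not the instrumentation actually added to 
libtommath: the type and function names (checkpoint, history, history_record, 
first_divergence) and the fixed buffer size are hypothetical, and each entry 
stores just a checkpoint id and one CRC-32 of a data buffer rather than the 
full set of local variables.

```c
#include <stddef.h>
#include <stdint.h>

#define HISTORY_MAX 64                /* assumed capacity, chosen arbitrarily */

struct checkpoint {
    int      id;                      /* which code location recorded the entry */
    uint32_t data_crc;                /* CRC-32 of the data seen at that point */
};

struct history {
    struct checkpoint entries[HISTORY_MAX];
    size_t            count;
};

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by zlib). */
static uint32_t crc32_buf(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Append one checkpoint. Returns 0 on success, -1 if the buffer is full. */
static int history_record(struct history *h, int id, const void *data, size_t len)
{
    if (h->count >= HISTORY_MAX)
        return -1;
    h->entries[h->count].id = id;
    h->entries[h->count].data_crc = crc32_buf(data, len);
    h->count++;
    return 0;
}

/* Index of the first entry where the two histories differ, or -1 if they
 * are identical. If one history is a strict prefix of the other, the
 * length of the shorter one is returned. */
static long first_divergence(const struct history *good, const struct history *bad)
{
    size_t n = (good->count < bad->count) ? good->count : bad->count;

    for (size_t i = 0; i < n; i++) {
        if (good->entries[i].id != bad->entries[i].id ||
            good->entries[i].data_crc != bad->entries[i].data_crc)
            return (long)i;
    }
    return (good->count == bad->count) ? -1 : (long)n;
}
```

Each invocation fills its own history, and after a mismatch 
first_divergence() points at the earliest checkpoint where the failing run 
stopped agreeing with a known good run.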

*snipped*

