On 1/9/19 12:08 AM, Gautham R Shenoy wrote:
> I did some testing during the holidays. Here are the observations:
>
> 1) With just your patch (without any additional debug patch), if I run
> DLPAR on/off operations on a system that has SMT=off, I am able to
> see a crash involving RTAS stack corruption within an hour's time.
>
> 2) With the debug patch (appended below) which has additional debug to
> capture the callers of stop-self, start-cpu, set-power-levels, the
> system is able to perform DLPAR on/off operations on a system with
> SMT=off for three days. And then, it crashed with the dead CPU showing
> a "Bad kernel stack pointer". From this log, I can clearly
> see that there were no concurrent calls to stop-self, start-cpu,
> set-power-levels. The only concurrent RTAS calls were the dying CPU
> calling "stop-self", and the CPU running the DLPAR operation calling
> "query-cpu-stopped-state". The crash signature is appended below as
> well.
>
> 3) Modifying your patch to remove the udelay and increase the loop
> count from 25 to 1000 doesn't improve the situation. We are still able
> to see the crash.
>
> 4) With my patch, even without any additional debug, I was able to
> observe the system run the tests successfully for over a week (I
> started the tests before the Christmas weekend, and forgot to turn it
> off!)
So does this mean that the problem is fixed with your patch?

>
> It appears that there is a narrow race window involving rtas-stop-self
> and rtas-query-cpu-stopped-state calls that can be observed with your
> patch. Adding any printk's seems to reduce the probability of hitting
> this race window. It might be worth the while to check with RTAS
> folks, if they suspect something here.

What would the RTAS folks be looking at here? The 'narrow race window'
is with respect to a patch that it sounds like we should not be using.

Thanks.
Michael

-- 
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line  363-5196
External: (512) 286-5196
Cell:     (512) 466-0650
m...@linux.vnet.ibm.com