Note from Scott Mayes on latest crash:

Michael,

Since the partition crashed, I was able to get the last 0.2 seconds' worth
of RTAS call trace leading up to the crash.

Best I could tell from that bit of trace was that the removal of a
processor involved the following steps:
-- Call to stop-self for a given thread
-- Repeated calls to query-cpu-stopped-state (which eventually indicated
   the thread was stopped)
-- Call to get-sensor-state for the thread to check its entity-state (9003)
   sensor which returned 'dr-entity-present'
-- Call to set-indicator to set the isolation-state (9001) indicator to
   ISOLATE state
-- Call to set-indicator to set the allocation-state (9003) indicator to
   UNUSABLE state

I noticed one example of thread x28 getting through all of these steps just
fine, but for thread x20, although the query-cpu-stopped-state call returned
0 status (STOPPED), a subsequent call to set-indicator to ISOLATE failed.
This failure was near the end of the trace, but was not the very last RTAS
call made in the trace. The set-indicator failure reported to Linux was a
-9001 (Valid outstanding translation) which was mapped from a 0x502
(Invalid thread state) return code from PHYP's H_SET_DR_STATE h-call.

On 12/10/2018 02:31 PM, Thiago Jung Bauermann wrote:
>
> Hello Michael,
>
> Michael Bringmann <m...@linux.vnet.ibm.com> writes:
>
>> I have asked Scott Mayes to take a look at one of these crashes from
>> the phyp side. I will let you know if he finds anything notable.
>
> Thanks! It might make sense to test whether booting with
> cede_offline=off makes the bug go away.

Scott is looking at the system. I will try once he is finished.

>
> One suspicion I have is regarding the code handling CPU_STATE_INACTIVE.
> From what I understand, it is a powerpc-specific CPU state and from the
> perspective of the generic CPU hotplug state machine, inactive CPUs are
> already fully offline. Which means that the locking performed by the
> generic code state machine doesn't apply to transitioning CPUs from
> INACTIVE to OFFLINE state. Perhaps the bug is that there is more than
> one CPU making that transition at the same time? That would cause two
> CPUs to call RTAS stop-self.
>
> I haven't checked whether this is really possible or not, though. It's
> just a conjecture.

Michael

>
> --
> Thiago Jung Bauermann
> IBM Linux Technology Center
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
m...@linux.vnet.ibm.com