Public bug reported:

A hung state machine in the chip's NMU (nest MMU) logic can trigger a fatal condition that the hardware flags with a checkstop. As a result, customers with a POWER9 Witherspoon system (equipped with GPUs) will experience a server crash when using NVIDIA's toolkit.

The server will crash with the following hardware failure message:

  Unrecoverable Hardware Failure, (Critical) A system checkstop occurred (AffectedSubsystem: Canister/Appliance, PID: 19703), Resolved: 0

In this case, a `NCUFIR[10] tlbie master timeout` has been observed just by starting the NVIDIA ATS driver.

The issue is triggered because the NMU logic gets stuck when a page is upgraded from RO -> RW without a following tlbie.

This is addressed by the following patches:

  bd5050e38aec3055ff4257ade987d808ac93b582 powerpc/mm/radix: Change pte relax sequence to handle nest MMU hang
  e4c1112c3fc503fc78379fa61450bfda3f0717fe powerpc/mm: Change function prototype
  044003b52a78bcbda7103633c351da16505096cf powerpc/mm/radix: Move function from radix.h to pgtable-radix.c
  f069ff396d657ac7bdb5de866c3ec28b8d08d953 powerpc/mm/hugetlb: Update huge_ptep_set_access_flags to call __ptep_set_access_flags directly

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
     Status: New

** Tags: architecture-ppc64le bugnameltc-170972 severity-critical targetmilestone-inin1804

** Tags added: architecture-ppc64le bugnameltc-170972 severity-critical targetmilestone-inin1804

** Changed in: ubuntu
     Assignee: (unassigned) => Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)

** Package changed: ubuntu => linux (Ubuntu)

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1789772

Title:
  tlbie master timeout checkstop (using NVidia/GPU)

Status in linux package in Ubuntu:
  New
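
Note on the RO -> RW upgrade: the report attributes the hang to a page being upgraded from RO to RW without a following tlbie. As a rough illustration only, the user-space C model below shows one way an access-relax sequence can be ordered so that a flush always sits between the old and the new PTE value when the nest MMU is in use: mark the entry invalid, flush, then install the relaxed permissions. This is not kernel code and is not taken from the patches listed in this report. The bit layout, `flush_tlb_page`, `pte_update`, `relax_access`, and the `nmmu_in_use` flag are all hypothetical names chosen for this sketch; the commits themselves remain the authority on the actual change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit layout for this sketch; NOT the real radix PTE format. */
#define PTE_PRESENT 0x1u
#define PTE_INVALID 0x2u
#define PTE_RW      0x4u

/* Counts simulated flushes so the ordering can be observed. */
static unsigned flush_count;

/* Hypothetical stand-in for a tlbie-based flush of a single page. */
static void flush_tlb_page(uintptr_t addr)
{
    (void)addr;
    flush_count++;
}

/* Clear the 'clr' bits, set the 'set' bits, and return the previous value. */
static uint32_t pte_update(uint32_t *ptep, uint32_t clr, uint32_t set)
{
    uint32_t old = *ptep;
    *ptep = (old & ~clr) | set;
    return old;
}

/*
 * Relax access on a PTE by adding the bits in 'set'.
 * If the write permission changes while the nest MMU may hold a cached
 * translation, the entry is made invalid and flushed before the new
 * value is installed. Otherwise the bits are simply OR-ed in.
 */
static void relax_access(uint32_t *ptep, uint32_t set,
                         uintptr_t addr, int nmmu_in_use)
{
    assert(ptep != NULL);

    uint32_t change = *ptep ^ (*ptep | set);

    if ((change & PTE_RW) && nmmu_in_use) {
        /* Step 1: make the entry invalid so no agent can use it. */
        uint32_t old = pte_update(ptep, PTE_PRESENT, PTE_INVALID);
        /* Step 2: flush any cached RO translation. */
        flush_tlb_page(addr);
        /* Step 3: install the relaxed (RW) entry. */
        pte_update(ptep, PTE_INVALID, old | set);
    } else {
        pte_update(ptep, 0, set);
    }
}
```

In this model, calling `relax_access(&pte, PTE_RW, addr, 1)` on a present read-only entry performs exactly one simulated flush before the entry becomes writable, while the same call with `nmmu_in_use` set to 0 updates the entry without flushing.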