This bug is missing log files that will aid in diagnosing the problem.
From a terminal window please run:
apport-collect 2096860
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that and change the
bug status to 'Confirmed'.
** Changed in: linux (Ubuntu)
Status: New => Incomplete
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2096860
Title:
lvl 5 pagetable system hang
Status in linux package in Ubuntu:
Incomplete
Status in linux-hwe-6.8 package in Ubuntu:
New
Bug description:
A hang occurs with a possible kernel BUG at arch/x86/mm/init_64.c:154
during the memmap_init_zone_device initialization call in the AMDGPU
init sequence.
For comparison, the log below shows the expected good result after the
[drm] JPEG decode line: memmap_init_zone_device executes, followed by
the amdgpum HMM registration. In a failing boot, the kernel BUG happens
at this point.
=========================
Aug 09 00:07:09.659512 host-ruby-942e kernel: [drm] JPEG decode initialized successfully.
Aug 09 00:07:09.659521 host-ruby-942e kernel: memmap_init_zone_device initialised 16777216 pages in 136ms
Aug 09 00:07:09.659531 host-ruby-942e kernel: amdgpum HMM registered 65520MB device memory
Aug 09 00:07:09.659694 host-ruby-942e kernel: kfd kfd: amdgpu: Allocated 3989536 bytes on gart
Aug 09 00:07:09.659838 host-ruby-942e kernel: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
Aug 09 00:07:09.659849 host-ruby-942e kernel: amdgpu: Virtual CRAT table created for GPU
Aug 09 00:07:09.659858 host-ruby-942e kernel: amdgpu: Topology: Add dGPU node [0x740f:0x1002]
Aug 09 00:07:09.659985 host-ruby-942e kernel: kfd kfd: amdgpu: added device 1002:740f
====================
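With up to 1000 boot cycles to review, the expected ordering described
above could be checked mechanically. The following is only a sketch,
assuming the kernel log of each boot has been captured as text (for
example over a serial console). The marker strings are copied from the
good-boot log above; the function name and the idea of counting
markers are illustrative assumptions, not part of the report.

```python
# Marker substrings copied from the good-boot log above, in expected order.
EXPECTED_MARKERS = [
    "[drm] JPEG decode initialized",
    "memmap_init_zone_device",
    "HMM registered",
    "kfd kfd: amdgpu: added device",
]


def last_marker_reached(log_text: str) -> int:
    """Return how many expected markers appear, in order, in log_text.

    A result equal to len(EXPECTED_MARKERS) means the boot got past the
    point where the hang is reported; a smaller value shows how far the
    AMDGPU init sequence progressed before the log stopped.
    """
    pos = 0
    reached = 0
    for marker in EXPECTED_MARKERS:
        idx = log_text.find(marker, pos)
        if idx < 0:
            break  # this marker never appeared after the previous one
        pos = idx + len(marker)
        reached += 1
    return reached
```

A boot whose log stops after the JPEG decode line would return 1,
which would be consistent with the hang described in this bug.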
The issue is a timing-related race condition when setting up the CPU
page tables during AMDGPU driver initialization. This 5-level page
table error potentially falls under Linux memory management.
The issue occurs during a server reboot stress test. The server
should have at least one AMD MI210 GPU with the amdgpu driver installed
and enabled. Use ipmitool to drive a chassis cold boot in a loop with
the loop count set to 1000. We are able to reliably reproduce this issue
beyond 500 boot cycles.
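The reproduction loop could look roughly like the sketch below. It is
not the script used for this report. The BMC host name, credentials,
wait times and the health-check hook are all hypothetical placeholders;
only the ipmitool "chassis power off" and "chassis power on"
subcommands and the loop count of 1000 come from the steps above.

```python
import subprocess
import time
from typing import Callable, List

# Hypothetical BMC connection details; replace with real values.
BMC_ARGS = ["-I", "lanplus", "-H", "bmc.example", "-U", "admin", "-P", "password"]


def ipmi_command(*args: str) -> List[str]:
    """Build an ipmitool command line for the BMC configured above."""
    return ["ipmitool", *BMC_ARGS, *args]


def cold_boot_loop(
    cycles: int = 1000,
    off_wait: float = 30.0,
    boot_wait: float = 600.0,
    run: Callable[[List[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
    sleep: Callable[[float], None] = time.sleep,
    healthy: Callable[[], bool] = lambda: True,
) -> int:
    """Cold boot the chassis up to `cycles` times.

    Returns the number of cycles after which the host was still healthy.
    `healthy` is a hypothetical hook (for example an SSH reachability
    check); the default always reports success.
    """
    done = 0
    for _ in range(cycles):
        run(ipmi_command("chassis", "power", "off"))
        sleep(off_wait)   # assumed time for the chassis to power down
        run(ipmi_command("chassis", "power", "on"))
        sleep(boot_wait)  # assumed time for the host to boot and load amdgpu
        if not healthy():
            break         # leave the host in the hung state for inspection
        done += 1
    return done
```

Stopping at the first unhealthy boot keeps the machine in the failed
state, so the console output around memmap_init_zone_device can still
be collected.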
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2096860/+subscriptions