This bug is missing log files that will aid in diagnosing the problem.
From a terminal window please run:

apport-collect 2096860

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that and change the
bug status to 'Confirmed'.


** Changed in: linux (Ubuntu)
       Status: New => Incomplete

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2096860

Title:
  lvl 5 pagetable system hang

Status in linux package in Ubuntu:
  Incomplete
Status in linux-hwe-6.8 package in Ubuntu:
  New

Bug description:
  A hang occurs with a possible kernel BUG at arch/x86/mm/init_64.c:154
  during the memmap_init_zone_device initialization call in the AMDGPU
  init sequence.

  For reference, this is the expected good result after the [drm] JPEG
  decode line: memmap_init_zone_device executes, then amdgpum HMM is
  registered. In the failing case, the kernel BUG happens at this point.

  =========================

  Aug 09 00:07:09.659512 host-ruby-942e kernel: [drm] JPEG decode initialized successfully.
  Aug 09 00:07:09.659521 host-ruby-942e kernel: memmap_init_zone_device initialised 16777216 pages in 136ms
  Aug 09 00:07:09.659531 host-ruby-942e kernel: amdgpum HMM registered 65520MB device memory
  Aug 09 00:07:09.659694 host-ruby-942e kernel: kfd kfd: amdgpu: Allocated 3989536 bytes on gart
  Aug 09 00:07:09.659838 host-ruby-942e kernel: kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
  Aug 09 00:07:09.659849 host-ruby-942e kernel: amdgpu: Virtual CRAT table created for GPU
  Aug 09 00:07:09.659858 host-ruby-942e kernel: amdgpu: Topology: Add dGPU node [0x740f:0x1002]
  Aug 09 00:07:09.659985 host-ruby-942e kernel: kfd kfd: amdgpu: added device 1002:740f
  ====================
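
  To sort logs collected from many boot cycles, a check along these
  lines may help. It is only a sketch: the marker strings are copied
  from the description and the log above, while the function name and
  the 'good'/'bug'/'incomplete' labels are made up for illustration.

```python
# Marker taken from the bug description; the real message may carry
# extra characters after it, so a substring match is used.
BUG_MARKER = "kernel BUG at arch/x86/mm/init_64.c:154"

# Substrings of the good-boot log lines, in the order they appear.
EXPECTED_SEQUENCE = [
    "JPEG decode initialized",
    "memmap_init_zone_device",
    "HMM registered",
    "added device",
]


def classify_boot_log(text):
    """Return 'bug', 'good', or 'incomplete' for one boot's kernel log."""
    if BUG_MARKER in text:
        return "bug"
    pos = 0
    for marker in EXPECTED_SEQUENCE:
        found = text.find(marker, pos)
        if found < 0:
            # The boot never reached this step (e.g. a silent hang).
            return "incomplete"
        pos = found + len(marker)
    return "good"
```

  The text could come from something like "journalctl -k -b -1" saved
  to a file per cycle; an 'incomplete' result would point to a hang
  that left no BUG line in the log.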

  The issue is a timing-related race condition when setting up the CPU
  page tables during AMDGPU driver initialization. Given the 5-level
  page table error, the root cause could fall under Linux memory
  management.
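
  When triaging, it may be useful to record whether 5-level paging is
  actually in play on the affected host. The sketch below assumes the
  usual x86 conventions: the "la57" CPU flag in /proc/cpuinfo and the
  "no5lvl" kernel command-line parameter. The helper names are
  hypothetical and nothing here is taken from the original report.

```python
def has_la57(cpuinfo_text):
    """True if the first 'flags' line of /proc/cpuinfo lists la57."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "la57" in value.split()
    return False


def no5lvl_requested(cmdline_text):
    """True if the kernel command line contains the no5lvl parameter."""
    return "no5lvl" in cmdline_text.split()


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print("la57 flag present:", has_la57(f.read()))
    with open("/proc/cmdline") as f:
        print("no5lvl on cmdline:", no5lvl_requested(f.read()))
```

  Rerunning the stress loop with no5lvl set would be one way to test
  whether the hang depends on 5-level page tables, though that is a
  suggestion rather than something the reporter has tried.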


  The issue occurs during a server reboot stress test. The server
  should have at least 1 x AMD MI210 GPU with the amdgpu driver
  installed and enabled. Use ipmitool to drive a chassis cold boot in a
  loop with the loop count set to 1000. We are able to reliably
  reproduce this issue beyond 500 boot cycles.
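
  The reboot loop described above could be driven roughly as follows.
  This is a sketch under assumptions: it uses "ipmitool chassis power
  cycle" over the lanplus interface as the cold-boot action, and the
  host, credentials, and settle time are placeholders. The exact
  command used by the reporter is not stated in this bug.

```python
import subprocess
import time


def power_cycle_command(host, user, password):
    """Build the ipmitool command line for one chassis power cycle."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-P", password,
        "chassis", "power", "cycle",
    ]


def run_reboot_stress(host, user, password, cycles=1000,
                      settle_seconds=600,
                      run=subprocess.run, sleep=time.sleep):
    """Power-cycle the server `cycles` times; return cycles completed.

    `run` and `sleep` are injectable so the loop can be dry-run.
    """
    completed = 0
    for _ in range(cycles):
        # check=True raises if ipmitool itself reports a failure.
        run(power_cycle_command(host, user, password), check=True)
        # Give the host time to boot and load amdgpu before the next cycle.
        sleep(settle_seconds)
        completed += 1
    return completed
```

  A fixed settle time cannot tell a hung boot from a healthy one, so
  in practice each cycle would also need a check that the host came
  back up (for example over SSH) before continuing.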

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2096860/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
