Re: [PATCH] drm/amdgpu/atomfirmware: silence UBSAN warning
On Mon, 2024-07-01 at 12:55 -0400, Alex Deucher wrote:
> This is a variably sized array.
>
> Link: https://lists.freedesktop.org/archives/amd-gfx/2024-June/110420.html
> Signed-off-by: Alex Deucher
> ---
>  drivers/gpu/drm/amd/include/atomfirmware.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/include/atomfirmware.h b/drivers/gpu/drm/amd/include/atomfirmware.h
> index 571691837200..09cbc3afd6d8 100644
> --- a/drivers/gpu/drm/amd/include/atomfirmware.h
> +++ b/drivers/gpu/drm/amd/include/atomfirmware.h
> @@ -734,7 +734,7 @@ struct atom_gpio_pin_lut_v2_1
>  {
>  	struct atom_common_table_header table_header;
>  	/*the real number of this included in the structure is calcualted by using the (whole structure size - the header size)/size of atom_gpio_pin_lut */
> -	struct atom_gpio_pin_assignment gpio_pin[8];
> +	struct atom_gpio_pin_assignment gpio_pin[];
>  };
>

Works for me:

Tested-by: Jeff Layton
amdgpu UBSAN warnings in 6.10.0-rc5
I've been testing some vfs patches (multigrain timestamps) on my personal desktop with a 6.10.0-rc5-ish kernel, and have hit a number of warnings in the amdgpu driver, including a UBSAN warning that looks like a potential array overrun:

[8.772608] [ cut here ]
[8.772609] UBSAN: array-index-out-of-bounds in drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:680:23
[8.772612] index 8 is out of range for type 'atom_gpio_pin_assignment [8]'
[8.772614] CPU: 13 PID: 508 Comm: (udev-worker) Not tainted 6.10.0-rc5-00292-gb3efd5c27332 #35
[8.772616] Hardware name: Micro-Star International Co., Ltd. MS-7E27/PRO B650M-P (MS-7E27), BIOS 1.A0 06/07/2024
[8.772618] Call Trace:
[8.772620]
[8.772621] dump_stack_lvl+0x5d/0x80
[8.772629] ubsan_epilogue+0x5/0x30
[8.772633] __ubsan_handle_out_of_bounds.cold+0x46/0x4b
[8.772636] bios_parser_get_gpio_pin_info+0x11c/0x150 [amdgpu]
[8.773016] link_get_hpd_gpio+0x7e/0xd0 [amdgpu]
[8.773205] construct_phy+0x26d/0xd40 [amdgpu]
[8.773355] ? srso_alias_return_thunk+0x5/0xfbef5
[8.773370] ? link_create+0x210/0x250 [amdgpu]
[8.773493] ? srso_alias_return_thunk+0x5/0xfbef5
[8.773495] link_create+0x210/0x250 [amdgpu]
[8.773610] ? srso_alias_return_thunk+0x5/0xfbef5
[8.773612] create_links+0x151/0x530 [amdgpu]
[8.773759] dc_create+0x401/0x7b0 [amdgpu]
[8.773883] ? srso_alias_return_thunk+0x5/0xfbef5
[8.773886] amdgpu_dm_init.isra.0+0x32f/0x22d0 [amdgpu]
[8.774045] ? irq_work_queue+0x2d/0x50
[8.774048] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774050] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774052] ? vprintk_emit+0x176/0x2a0
[8.774056] ? dev_vprintk_emit+0x181/0x1b0
[8.774063] dm_hw_init+0x12/0x30 [amdgpu]
[8.774187] amdgpu_device_init.cold+0x1c43/0x1f90 [amdgpu]
[8.774373] amdgpu_driver_load_kms+0x19/0x70 [amdgpu]
[8.774507] amdgpu_pci_probe+0x1a7/0x4b0 [amdgpu]
[8.774631] local_pci_probe+0x42/0x90
[8.774635] pci_device_probe+0xc1/0x2a0
[8.774638] really_probe+0xdb/0x340
[8.774642] ? pm_runtime_barrier+0x54/0x90
[8.774644] ? __pfx___driver_attach+0x10/0x10
[8.774646] __driver_probe_device+0x78/0x110
[8.774648] driver_probe_device+0x1f/0xa0
[8.774650] __driver_attach+0xba/0x1c0
[8.774652] bus_for_each_dev+0x8c/0xe0
[8.774655] bus_add_driver+0x142/0x220
[8.774657] driver_register+0x72/0xd0
[8.774660] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
[8.774779] do_one_initcall+0x58/0x310
[8.774784] do_init_module+0x90/0x250
[8.774787] init_module_from_file+0x86/0xc0
[8.774791] idempotent_init_module+0x121/0x2b0
[8.774794] __x64_sys_finit_module+0x5e/0xb0
[8.774796] do_syscall_64+0x82/0x160
[8.774799] ? __pfx_page_put_link+0x10/0x10
[8.774804] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774806] ? do_sys_openat2+0x9c/0xe0
[8.774809] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774810] ? syscall_exit_to_user_mode+0x72/0x220
[8.774813] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774815] ? do_syscall_64+0x8e/0x160
[8.774816] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774818] ? __seccomp_filter+0x303/0x520
[8.774820] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774824] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774825] ? syscall_exit_to_user_mode+0x72/0x220
[8.774827] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774829] ? do_syscall_64+0x8e/0x160
[8.774830] ? do_syscall_64+0x8e/0x160
[8.774831] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774833] ? srso_alias_return_thunk+0x5/0xfbef5
[8.774835] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[8.774837] RIP: 0033:0x7fa5f44391bd
[8.774848] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2b cc 0c 00 f7 d8 64 89 01 48
[8.774850] RSP: 002b:7fff5d55a5a8 EFLAGS: 0246 ORIG_RAX: 0139
[8.774852] RAX: ffda RBX: 555b3bfe6a50 RCX: 7fa5f44391bd
[8.774854] RDX: RSI: 7fa5f455507d RDI: 002c
[8.774855] RBP: 7fff5d55a660 R08: 0001 R09: 7fff5d55a5f0
[8.774855] R10: 0050 R11: 0246 R12: 7fa5f455507d
[8.774856] R13: 0002 R14: 555b3bfebb30 R15: 555b3bff63d0
[8.774859]
[8.774864] ---[ end trace ]---

It looks like "count" probably needs to be clamped to ARRAY_SIZE(header->gpio_pin) in bios_parser_get_gpio_pin_info?

dmesg is attached. There are a couple of other warnings in there too after the UBSAN one, but this one looks the most worrisome.

--
Jeff Layton

amd-warnings-dmesg.out.gz
Description: application/gzip
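For readers following along, here is a minimal userspace sketch of the idea discussed in this thread: deriving the entry count from the table size, as the comment in atomfirmware.h describes, and never indexing past that count. The struct layouts, field names and helper functions below are simplified, hypothetical stand-ins and not the real amdgpu definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the atomfirmware.h types. */
struct table_header {
	uint16_t structuresize;   /* total table size in bytes, header included */
	uint8_t format_revision;
	uint8_t content_revision;
};

struct gpio_pin_assignment {
	uint32_t data_a_reg_index;
	uint8_t gpio_bitshift;
	uint8_t gpio_mask_bitshift;
	uint8_t gpio_id;
	uint8_t reserved;
};

struct gpio_pin_lut {
	struct table_header table_header;
	struct gpio_pin_assignment gpio_pin[];   /* flexible array member */
};

/* Entries the table really holds: (whole size - header size) / entry size. */
static size_t gpio_pin_count(const struct gpio_pin_lut *lut)
{
	size_t header = offsetof(struct gpio_pin_lut, gpio_pin);
	size_t total = lut->table_header.structuresize;

	if (total < header)
		return 0;
	return (total - header) / sizeof(struct gpio_pin_assignment);
}

/* Return the entry matching gpio_id, or NULL. Never reads past the count. */
static const struct gpio_pin_assignment *
find_gpio_pin(const struct gpio_pin_lut *lut, uint8_t gpio_id)
{
	size_t i, count;

	assert(lut != NULL);
	count = gpio_pin_count(lut);
	for (i = 0; i < count; i++)
		if (lut->gpio_pin[i].gpio_id == gpio_id)
			return &lut->gpio_pin[i];
	return NULL;
}

/* Demo helper (hypothetical): build a zeroed table that holds n entries. */
static struct gpio_pin_lut *alloc_gpio_pin_lut(size_t n)
{
	size_t bytes = offsetof(struct gpio_pin_lut, gpio_pin) +
		       n * sizeof(struct gpio_pin_assignment);
	struct gpio_pin_lut *lut;

	if (bytes > UINT16_MAX)
		return NULL;
	lut = calloc(1, bytes);
	if (lut)
		lut->table_header.structuresize = (uint16_t)bytes;
	return lut;
}
```

With a fixed `gpio_pin[8]` declaration, a table carrying nine entries makes index 8 look out of bounds to UBSAN even though the size field covers it. With a flexible array member the only bound is the one computed from the header, which is what the loop above respects.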
Re: WARNING in allocate_mst_payload
The monitors are individually connected to the same card. The Dell 34" is connected via DisplayPort, and the Sceptre monitor is connected via HDMI.

Thanks,
Jeff

On Tue, 2023-11-21 at 14:13 +0100, Christian König wrote:
> Hi Jeff,
>
> first of all adding Harry from our display team.
>
> From a quick look the obvious missing information is how are your
> monitors wired up? Are those individually DP or HDMI connected to the PC
> or are they daisy chained through MST?
>
> If it's daisy chained please double check that you don't have a faulty
> connection and maybe individually connect them for a test if possible.
>
> Regards,
> Christian.
>
> On 21.11.23 at 14:01, Jeff Layton wrote:
> > I have a recurring problem where my workstation tries to put the monitor
> > to sleep, which triggers a warning down in the depths of the video card
> > driver. When I return to the machine the monitor is black, but not in
> > powersave mode and all of the windows on my desktop have been shuffled
> > off to the second monitor.
> >
> > I've seen this since at least v6.3 or so (though the problem may predate
> > that). The kernel is stock Fedora kernel. It's occurs fairly reliably,
> > and I'm happy to help test patches.
> >
> > I took a quick look at the sources and the reported line corresponds
> > with this assertion in allocate_mst_payload:
> >
> > ASSERT(proposed_table.stream_count > 0);
> >
> > I've attached the output from wayland-info, and the stack traces follow.
> > Let me know if any other info would be helpful.
> > > > > > [ 4655.946669] [ cut here ] > > [ 4655.946677] WARNING: CPU: 12 PID: 3979 at > > drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_dpms.c:1484 > > link_set_dpms_on+0xbe5/0xca0 [amdgpu] > > [ 4655.947689] Modules linked in: uinput xt_mark rfcomm snd_seq_dummy > > snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core tun xt_CHECKSUM > > xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp > > nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 > > nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 > > nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw > > ip6table_security iptable_nat nf_nat bridge nf_conntrack stp llc > > nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security > > ip_set nf_tables nfnetlink ip6table_filter iptable_filter qrtr bnep > > binfmt_misc xfs vfat fat ppdev snd_hda_codec_realtek snd_hda_codec_generic > > intel_rapl_msr snd_hda_codec_hdmi ledtrig_audio snd_hda_intel > > intel_rapl_common snd_intel_dspcfg edac_mce_amd snd_intel_sdw_acpi > > snd_usb_audio snd_hda_codec uvcvideo kvm_amd btusb snd_usbmidi_lib > > snd_hda_core btrtl snd_ump btbcm snd_rawmidi btintel snd_hwdep uvc btmtk > > kvm videobuf2_vmalloc videobuf2_memops bluetooth snd_seq > > [ 4655.947889] snd_seq_device irqbypass videobuf2_v4l2 rapl xpad > > videobuf2_common snd_pcm ff_memless wmi_bmof mxm_wmi pcspkr acpi_cpufreq > > videodev k10temp rfkill i2c_piix4 snd_timer snd mc soundcore parport_pc > > parport gpio_amdpt gpio_generic joydev nfsd auth_rpcgss nfs_acl lockd grace > > sunrpc loop zram amdgpu i2c_algo_bit drm_ttm_helper ttm video > > drm_suballoc_helper uas amdxcp iommu_v2 crct10dif_pclmul crc32_pclmul > > drm_buddy crc32c_intel gpu_sched polyval_clmulni usb_storage r8169 > > polyval_generic drm_display_helper nvme ghash_clmulni_intel nvme_core ccp > > sha512_ssse3 cec sp5100_tco nvme_common wmi scsi_dh_rdac scsi_dh_emc > > scsi_dh_alua ip6_tables ip_tables dm_multipath 
fuse > > [ 4655.948051] CPU: 12 PID: 3979 Comm: KMS thread Kdump: loaded Not tainted > > 6.5.11-300.fc39.x86_64 #1 > > [ 4655.948058] Hardware name: Micro-Star International Co., Ltd. > > MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019 > > [ 4655.948062] RIP: 0010:link_set_dpms_on+0xbe5/0xca0 [amdgpu] > > [ 4655.949058] Code: e9 3f fc ff ff 48 c7 c7 98 c7 20 c1 e8 d4 33 e8 e4 e9 > > c0 fe ff ff 48 8b bb d0 01 00 00 48 89 de e8 40 d0 ed ff e9 25 ff ff ff > > <0f> 0b e9 88 fd ff ff 41 c6 85 50 04 00 00 00 e9 d1 f8 ff ff 49 8b > > [ 4655.949064] RSP: 0018:be344ac2b430 EFLAGS: 00010246 > > [ 4655.949071] RAX: RBX: 953f0bfab000 RCX: > > 0005 > > [ 4655.949076] RDX: c120c6a8 RSI: 0002 RDI: > > > > [ 4655.949080] RBP: 953f0bfab000 R08: R09: > > 0005 > > [ 4655.949084] R10: 953ece152800 R11: 953ed0eb9960 R12
WARNING in allocate_mst_payload
I have a recurring problem where my workstation tries to put the monitor to sleep, which triggers a warning down in the depths of the video card driver. When I return to the machine the monitor is black, but not in powersave mode, and all of the windows on my desktop have been shuffled off to the second monitor.

I've seen this since at least v6.3 or so (though the problem may predate that). The kernel is a stock Fedora kernel. It occurs fairly reliably, and I'm happy to help test patches.

I took a quick look at the sources and the reported line corresponds with this assertion in allocate_mst_payload:

    ASSERT(proposed_table.stream_count > 0);

I've attached the output from wayland-info, and the stack traces follow. Let me know if any other info would be helpful.

[ 4655.946669] [ cut here ]
[ 4655.946677] WARNING: CPU: 12 PID: 3979 at drivers/gpu/drm/amd/amdgpu/../display/dc/link/link_dpms.c:1484 link_set_dpms_on+0xbe5/0xca0 [amdgpu]
[ 4655.947689] Modules linked in: uinput xt_mark rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat bridge nf_conntrack stp llc nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter iptable_filter qrtr bnep binfmt_misc xfs vfat fat ppdev snd_hda_codec_realtek snd_hda_codec_generic intel_rapl_msr snd_hda_codec_hdmi ledtrig_audio snd_hda_intel intel_rapl_common snd_intel_dspcfg edac_mce_amd snd_intel_sdw_acpi snd_usb_audio snd_hda_codec uvcvideo kvm_amd btusb snd_usbmidi_lib snd_hda_core btrtl snd_ump btbcm snd_rawmidi btintel snd_hwdep uvc btmtk kvm videobuf2_vmalloc videobuf2_memops bluetooth snd_seq
[ 4655.947889] snd_seq_device
irqbypass videobuf2_v4l2 rapl xpad videobuf2_common snd_pcm ff_memless wmi_bmof mxm_wmi pcspkr acpi_cpufreq videodev k10temp rfkill i2c_piix4 snd_timer snd mc soundcore parport_pc parport gpio_amdpt gpio_generic joydev nfsd auth_rpcgss nfs_acl lockd grace sunrpc loop zram amdgpu i2c_algo_bit drm_ttm_helper ttm video drm_suballoc_helper uas amdxcp iommu_v2 crct10dif_pclmul crc32_pclmul drm_buddy crc32c_intel gpu_sched polyval_clmulni usb_storage r8169 polyval_generic drm_display_helper nvme ghash_clmulni_intel nvme_core ccp sha512_ssse3 cec sp5100_tco nvme_common wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables dm_multipath fuse [ 4655.948051] CPU: 12 PID: 3979 Comm: KMS thread Kdump: loaded Not tainted 6.5.11-300.fc39.x86_64 #1 [ 4655.948058] Hardware name: Micro-Star International Co., Ltd. MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019 [ 4655.948062] RIP: 0010:link_set_dpms_on+0xbe5/0xca0 [amdgpu] [ 4655.949058] Code: e9 3f fc ff ff 48 c7 c7 98 c7 20 c1 e8 d4 33 e8 e4 e9 c0 fe ff ff 48 8b bb d0 01 00 00 48 89 de e8 40 d0 ed ff e9 25 ff ff ff <0f> 0b e9 88 fd ff ff 41 c6 85 50 04 00 00 00 e9 d1 f8 ff ff 49 8b [ 4655.949064] RSP: 0018:be344ac2b430 EFLAGS: 00010246 [ 4655.949071] RAX: RBX: 953f0bfab000 RCX: 0005 [ 4655.949076] RDX: c120c6a8 RSI: 0002 RDI: [ 4655.949080] RBP: 953f0bfab000 R08: R09: 0005 [ 4655.949084] R10: 953ece152800 R11: 953ed0eb9960 R12: 95401e4c0b38 [ 4655.949088] R13: 0006 R14: 953ed99c R15: 95401e4c0df0 [ 4655.949093] FS: 7fba5b8856c0() GS:954d9ed0() knlGS: [ 4655.949099] CS: 0010 DS: ES: CR0: 80050033 [ 4655.949104] CR2: 1eca52061810 CR3: 0001ad4ac000 CR4: 003506e0 [ 4655.949109] Call Trace: [ 4655.949114] [ 4655.949118] ? link_set_dpms_on+0xbe5/0xca0 [amdgpu] [ 4655.950106] ? __warn+0x81/0x130 [ 4655.950118] ? link_set_dpms_on+0xbe5/0xca0 [amdgpu] [ 4655.951130] ? report_bug+0x171/0x1a0 [ 4655.951144] ? handle_bug+0x3c/0x80 [ 4655.951153] ? exc_invalid_op+0x17/0x70 [ 4655.951160] ? 
asm_exc_invalid_op+0x1a/0x20 [ 4655.951178] ? link_set_dpms_on+0xbe5/0xca0 [amdgpu] [ 4655.952193] dce110_apply_ctx_to_hw+0x535/0x700 [amdgpu] [ 4655.953141] dc_commit_state_no_check+0x3cd/0xef0 [amdgpu] [ 4655.954083] dc_commit_streams+0x29b/0x400 [amdgpu] [ 4655.955032] amdgpu_dm_atomic_commit_tail+0x5e8/0x3b10 [amdgpu] [ 4655.956023] ? dcn30_populate_dml_writeback_from_context+0x35/0x50 [amdgpu] [ 4655.956963] ? srso_return_thunk+0x5/0x10 [ 4655.956972] ? dcn30_populate_dml_writeback_from_context+0x35/0x50 [amdgpu] [ 4655.957912] ? srso_return_thunk+0x5/0x10 [ 4655.957926] ? srso_return_thunk+0x5/0x10 [ 4655.957934] ? srso_return_thunk+0x5/0x10 [ 4655.957940] ? dcn30_internal_validate_bw+0x992/0x9d0
DPMS problems with radeon card and dual monitor setup
98] ? srso_return_thunk+0x5/0x10 [ 4125.632905] ? srso_return_thunk+0x5/0x10 [ 4125.632912] ? srso_return_thunk+0x5/0x10 [ 4125.632920] ? raw_spin_rq_lock_nested+0x1c/0x80 [ 4125.632927] ? srso_return_thunk+0x5/0x10 [ 4125.632934] ? psi_group_change+0x213/0x3c0 [ 4125.632945] ? srso_return_thunk+0x5/0x10 [ 4125.632952] ? psi_task_switch+0xd6/0x230 [ 4125.632959] ? srso_return_thunk+0x5/0x10 [ 4125.632966] ? finish_task_switch.isra.0+0x94/0x2f0 [ 4125.632977] ? srso_return_thunk+0x5/0x10 [ 4125.632984] ? __schedule+0x3f6/0x14c0 [ 4125.632993] ? srso_return_thunk+0x5/0x10 [ 4125.633001] ? dma_fence_array_create+0x48/0x110 [ 4125.633015] ? srso_return_thunk+0x5/0x10 [ 4125.633022] ? __slab_free+0xf1/0x330 [ 4125.633029] ? srso_return_thunk+0x5/0x10 [ 4125.633036] ? __slab_free+0xf1/0x330 [ 4125.633049] ? srso_return_thunk+0x5/0x10 [ 4125.633056] ? wait_for_completion_timeout+0x13e/0x170 [ 4125.633063] ? wait_for_completion_interruptible+0x139/0x1e0 [ 4125.633072] ? srso_return_thunk+0x5/0x10 [ 4125.633086] commit_tail+0x94/0x130 [ 4125.633095] drm_atomic_helper_commit+0x11a/0x140 [ 4125.633103] drm_atomic_commit+0x9a/0xd0 [ 4125.633111] ? __pfx___drm_printfn_info+0x10/0x10 [ 4125.633122] drm_mode_atomic_ioctl+0x9b5/0xbc0 [ 4125.633135] ? __pfx_drm_mode_createblob_ioctl+0x10/0x10 [ 4125.633150] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10 [ 4125.633159] drm_ioctl_kernel+0xcd/0x170 [ 4125.633169] drm_ioctl+0x26d/0x4b0 [ 4125.633178] ? __pfx_drm_mode_atomic_ioctl+0x10/0x10 [ 4125.633198] amdgpu_drm_ioctl+0x4e/0x90 [amdgpu] [ 4125.633994] __x64_sys_ioctl+0x97/0xd0 [ 4125.634004] do_syscall_64+0x60/0x90 [ 4125.634013] ? srso_return_thunk+0x5/0x10 [ 4125.634020] ? syscall_exit_to_user_mode+0x2b/0x40 [ 4125.634028] ? srso_return_thunk+0x5/0x10 [ 4125.634036] ? do_syscall_64+0x6c/0x90 [ 4125.634042] ? srso_return_thunk+0x5/0x10 [ 4125.634049] ? syscall_exit_to_user_mode+0x2b/0x40 [ 4125.634057] ? srso_return_thunk+0x5/0x10 [ 4125.634064] ? do_syscall_64+0x6c/0x90 [ 4125.634071] ? 
srso_return_thunk+0x5/0x10 [ 4125.634078] ? __irq_exit_rcu+0x4b/0xc0 [ 4125.634088] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 4125.634095] RIP: 0033:0x7f8bbfd28edd [ 4125.634112] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 [ 4125.634117] RSP: 002b:7ffe0a954fa0 EFLAGS: 0246 ORIG_RAX: 0010 [ 4125.634124] RAX: ffda RBX: 560753e3f1a0 RCX: 7f8bbfd28edd [ 4125.634129] RDX: 7ffe0a955040 RSI: c03864bc RDI: 000a [ 4125.634133] RBP: 7ffe0a954ff0 R08: 0007 R09: 0014 [ 4125.634137] R10: 0011 R11: 0246 R12: 7ffe0a955040 [ 4125.634141] R13: c03864bc R14: 000a R15: 56075466b960 [ 4125.634156] [ 4125.634159] ---[ end trace 0000 ]--- Thanks, -- Jeff Layton
Re: NULL pointer dereference in drm_dp_add_payload_part2+0xca/0x100
On Sat, 2023-04-08 at 07:46 -0400, Jeff Layton wrote: > I've hit some repeated crashes in drm_dp_add_payload_part2. Here's one > from this morning that occurred not long after booting the machine. I > hadn't even logged in yet -- it was still at a gdm prompt: > > Apr 08 05:34:20 tleilax kernel: amdgpu :30:00.0: [drm] Failed to create > MST payload for port 74d1d8eb: -5 > Apr 08 05:34:20 tleilax kernel: BUG: kernel NULL pointer dereference, > address: 0008 > Apr 08 05:34:20 tleilax kernel: #PF: supervisor read access in kernel mode > Apr 08 05:34:20 tleilax kernel: #PF: error_code(0x) - not-present page > Apr 08 05:34:20 tleilax kernel: PGD 0 P4D 0 > Apr 08 05:34:20 tleilax kernel: Oops: [#1] PREEMPT SMP NOPTI > Apr 08 05:34:20 tleilax kernel: CPU: 8 PID: 2278 Comm: gnome-shell Kdump: > loaded Not tainted 6.2.9-200.fc37.x86_64 #1 > Apr 08 05:34:20 tleilax kernel: Hardware name: Micro-Star International Co., > Ltd. MS-7A33/X370 SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019 > Apr 08 05:34:20 tleilax kernel: RIP: 0010:drm_dp_add_payload_part2+0xca/0x100 > [drm_display_helper] > Apr 08 05:34:20 tleilax kernel: Code: 8b 7e 08 44 89 e9 4c 89 c2 48 c7 c6 60 > d2 55 c0 e8 ab 69 54 c5 44 89 e8 5b 5d 41 5c 41 5d e9 2d 73 a2 c5 48 8b 80 60 > 05 00 00 <48> 8b 76 08 4c 8b 40 60 48 85 f6 74 04 48 8b 76 08 4> > Apr 08 05:34:20 tleilax kernel: RSP: 0018:a4238a2db590 EFLAGS: 00010246 > Apr 08 05:34:20 tleilax kernel: RAX: 961550cac000 RBX: 961550cac000 > RCX: c055ca98 > Apr 08 05:34:20 tleilax kernel: RDX: 9615a6326140 RSI: > RDI: 9615578a4568 > Apr 08 05:34:20 tleilax kernel: RBP: 0001 R08: fffb > R09: > Apr 08 05:34:20 tleilax kernel: R10: 0002 R11: 0100 > R12: 9615578a4000 > Apr 08 05:34:20 tleilax kernel: R13: 96154a5b8de0 R14: c0d9d980 > R15: 9615589c1f90 > Apr 08 05:34:20 tleilax kernel: FS: 7f1c8ad775c0() > GS:96241f00() knlGS: > Apr 08 05:34:20 tleilax kernel: CS: 0010 DS: ES: CR0: > 80050033 > Apr 08 05:34:20 tleilax kernel: CR2: 0008 CR3: 00012f908000 > CR4: 003506e0 > Apr 08 
05:34:20 tleilax kernel: Call Trace: > Apr 08 05:34:20 tleilax kernel: > Apr 08 05:34:20 tleilax kernel: > dm_helpers_dp_mst_send_payload_allocation+0x83/0xb0 [amdgpu] > Apr 08 05:34:20 tleilax kernel: dc_link_allocate_mst_payload+0x16d/0x280 > [amdgpu] > Apr 08 05:34:20 tleilax kernel: core_link_enable_stream+0x8ec/0xa10 [amdgpu] > Apr 08 05:34:20 tleilax kernel: ? optc1_set_drr+0x136/0x1e0 [amdgpu] > Apr 08 05:34:20 tleilax kernel: dce110_apply_ctx_to_hw+0x61b/0x670 [amdgpu] > Apr 08 05:34:20 tleilax kernel: dc_commit_state_no_check+0x39b/0xcd0 [amdgpu] > Apr 08 05:34:20 tleilax kernel: dc_commit_state+0x107/0x120 [amdgpu] > Apr 08 05:34:20 tleilax kernel: amdgpu_dm_atomic_commit_tail+0x5bf/0x2d20 > [amdgpu] > Apr 08 05:34:20 tleilax kernel: ? cpufreq_this_cpu_can_update+0x12/0x60 > Apr 08 05:34:20 tleilax kernel: ? sugov_get_util+0x7e/0x90 > Apr 08 05:34:20 tleilax kernel: ? sugov_update_single_freq+0xb7/0x180 > Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock+0x13/0x40 > Apr 08 05:34:20 tleilax kernel: ? raw_spin_rq_lock_nested+0x1e/0x70 > Apr 08 05:34:20 tleilax kernel: ? psi_group_change+0x168/0x400 > Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock+0x15/0x30 > Apr 08 05:34:20 tleilax kernel: ? finish_task_switch.isra.0+0x9b/0x300 > Apr 08 05:34:20 tleilax kernel: ? __switch_to+0x106/0x410 > Apr 08 05:34:20 tleilax kernel: ? __schedule+0x3d4/0x13c0 > Apr 08 05:34:20 tleilax kernel: ? dma_resv_get_fences+0x11b/0x220 > Apr 08 05:34:20 tleilax kernel: ? get_nohz_timer_target+0x18/0x190 > Apr 08 05:34:20 tleilax kernel: ? lock_timer_base+0x61/0x80 > Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irqrestore+0x23/0x40 > Apr 08 05:34:20 tleilax kernel: ? __mod_timer+0x29e/0x3d0 > Apr 08 05:34:20 tleilax kernel: ? preempt_count_add+0x6a/0xa0 > Apr 08 05:34:20 tleilax kernel: ? _raw_spin_lock_irq+0x19/0x40 > Apr 08 05:34:20 tleilax kernel: ? _raw_spin_unlock_irq+0x1b/0x40 > Apr 08 05:34:20 tleilax kernel: ? 
wait_for_completion_timeout+0x13a/0x170 > Apr 08 05:34:20 tleilax kernel: ? > wait_for_completion_interruptible+0x135/0x1e0 > Apr 08 05:34:20 tleilax kernel: ? __pfx_dma_fence_default_wait_cb+0x10/0x10 > Apr 08 05:34:20 tleilax kernel: commit_tail+0x94/0x130 > Apr 08 05:34:20 tleilax kernel: drm_atomic_helper_commit+0x112/0x140 > Apr 08 05:34:20 tleilax kernel: drm_atomic_commit+0x96/0xc0 > Apr 08 05:34:20 tleilax kernel: ? __pfx___drm_printfn_info+0x10/0x10 > Apr 08 05:34:20 tleilax kernel: drm_mode_at
NULL pointer dereference in drm_dp_add_payload_part2+0xca/0x100
yscall_64+0x5b/0x80 Apr 08 05:34:20 tleilax kernel: ? __x64_sys_ioctl+0xa8/0xd0 Apr 08 05:34:20 tleilax kernel: ? syscall_exit_to_user_mode+0x17/0x40 Apr 08 05:34:20 tleilax kernel: ? do_syscall_64+0x67/0x80 Apr 08 05:34:20 tleilax kernel: ? sched_clock_cpu+0xb/0xc0 Apr 08 05:34:20 tleilax kernel: ? __irq_exit_rcu+0x3d/0x140 Apr 08 05:34:20 tleilax kernel: entry_SYSCALL_64_after_hwframe+0x72/0xdc Apr 08 05:34:20 tleilax kernel: RIP: 0033:0x7f1c8e723d6f Apr 08 05:34:20 tleilax kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 0> Apr 08 05:34:20 tleilax kernel: RSP: 002b:7ffea61067d0 EFLAGS: 0246 ORIG_RAX: 0010 Apr 08 05:34:20 tleilax kernel: RAX: ffda RBX: 5571af410fb0 RCX: 7f1c8e723d6f Apr 08 05:34:20 tleilax kernel: RDX: 7ffea6106870 RSI: c03864bc RDI: 000a Apr 08 05:34:20 tleilax kernel: RBP: 7ffea6106870 R08: 0011 R09: 0011 Apr 08 05:34:20 tleilax kernel: R10: 5571ae320010 R11: 0246 R12: c03864bc Apr 08 05:34:20 tleilax kernel: R13: 000a R14: 5571ae6ff140 R15: 5571b0261950 Apr 08 05:34:20 tleilax kernel: Apr 08 05:34:20 tleilax kernel: Modules linked in: rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbi> Apr 08 05:34:20 tleilax kernel: videobuf2_memops rapl mxm_wmi videobuf2_v4l2 wmi_bmof snd_pcm k10temp rfkill pcspkr videobuf2_common i2c_piix4 snd_timer joydev videodev snd mc parport_pc soundcore parport gpio_amdpt g> Apr 08 05:34:20 tleilax kernel: CR2: 0008 Apr 08 05:34:20 tleilax kernel: ---[ end trace ]--- $ ./scripts/faddr2line --list /usr/lib/debug/lib/modules/6.2.9-200.fc37.x86_64/kernel/drivers/gpu/drm/display/drm_display_helper.ko.debug drm_dp_add_payload_part2+0xca/0x100 drm_dp_add_payload_part2+0xca/0x100: drm_dp_add_payload_part2 at 
/usr/src/debug/kernel-6.2.9/linux-6.2.9-200.fc37.x86_64/drivers/gpu/drm/display/drm_dp_mst_topology.c:3407
 3402   {
 3403           int ret = 0;
 3404
 3405           /* Skip failed payloads */
 3406           if (payload->vc_start_slot == -1) {
>3407<                  drm_dbg_kms(state->dev, "Part 1 of payload creation for %s failed, skipping part 2\n",
 3408                               payload->port->connector->name);
 3409                   return -EIO;
 3410           }
 3411
 3412           ret = drm_dp_create_payload_step2(mgr, payload);

Since %rsi is NULL and the ->dev field is 8 bytes into the struct, I'm guessing that means that "state" was NULL here.

I'm assuming that the real bug is in the caller (and I'm happy to help track that down), but would it make sense to allow this function to gracefully handle a NULL state pointer? IOW something like this?

    drm_dbg_kms(state ? state->dev : NULL, "Part 1 of payload creation for %s failed, skipping part 2\n",

I think that would at least prevent this problem from crashing the machine.

Thanks,
--
Jeff Layton
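The guard proposed above can be illustrated in isolation with a small userspace sketch. Everything here is a hypothetical stand-in: the stub structs, the `dbg_kms()` logger and `add_payload_part2()` only mimic the shape of the kernel code, and the 8-byte offset of `dev` assumes a 64-bit build.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical minimal stand-ins; the real structs live in the DRM headers. */
struct device_stub {
	const char *name;
};

struct atomic_state_stub {
	void *first_field;          /* occupies the first 8 bytes on 64-bit */
	struct device_stub *dev;    /* the field the oops tried to read */
};

struct payload_stub {
	int vc_start_slot;
	const char *connector_name;
};

/* Stand-in logger that tolerates a NULL device, like the proposed call. */
static void dbg_kms(const struct device_stub *dev, const char *connector)
{
	fprintf(stderr,
		"[%s] Part 1 of payload creation for %s failed, skipping part 2\n",
		dev ? dev->name : "no device", connector);
}

/* Mirrors the failure path only: log without dereferencing a NULL state. */
static int add_payload_part2(const struct atomic_state_stub *state,
			     const struct payload_stub *payload)
{
	assert(payload != NULL);

	/* Skip failed payloads */
	if (payload->vc_start_slot == -1) {
		dbg_kms(state ? state->dev : NULL, payload->connector_name);
		return -EIO;
	}
	return 0;
}
```

Called with a NULL state and a failed payload, this returns -EIO and logs with a placeholder device name instead of faulting. It only papers over the symptom; whatever caller passes the NULL state would still need to be found.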
Re: softlockup in v5.15.12 in dcn20_post_unlock_program_front_end
On Sun, 2022-01-02 at 09:30 -0500, Jeff Layton wrote: > I'm seeing a reproducible softlockup on amdgpu on v5.15.12: > > [ 861.656146] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for > DMUB idle: status=3 > [ 861.914848] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for > DMUB idle: status=3 > [ 862.173368] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for > DMUB idle: status=3 > [ 862.381635] [drm] enabling link 0 failed: 15 > [ 862.640908] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for > DMUB idle: status=3 > [ 862.743704] [drm:dcn20_wait_for_blank_complete [amdgpu]] *ERROR* DC: > failed to blank crtc! > [ 863.002846] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for > DMUB idle: status=3 > [ 863.261451] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for > DMUB idle: status=3 > [ 863.262090] [drm] REG_WAIT timeout 1us * 10 tries - optc3_lock line:112 > [ 863.532231] [drm] REG_WAIT timeout 1us * 10 tries - > optc1_wait_for_state line:835 > [ 888.900914] watchdog: BUG: soft lockup - CPU#11 stuck for 26s! 
> [gnome-shell:2306] > [ 888.900921] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer > rpcrdma rdma_cm iw_cm ib_cm ib_core nft_objref nf_conntrack_netbios_ns > nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib > nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat > bridge stp llc ip6table_nat ip6table_mangle ip6table_raw ip6table_security > iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle > iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter > ip6_tables iptable_filter qrtr ns bnep vfat fat snd_hda_codec_realtek > intel_rapl_msr snd_hda_codec_generic intel_rapl_common ledtrig_audio > snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg edac_mce_amd > snd_intel_sdw_acpi snd_usb_audio snd_hda_codec kvm_amd snd_hda_core btusb > snd_usbmidi_lib btrtl snd_rawmidi snd_hwdep btbcm ppdev kvm snd_seq btintel > uvcvideo snd_seq_device videobuf2_vmalloc videobuf2_memops bluetooth > videobuf2_v4l2 snd_pcm videobuf2_common irqbypass wmi_bmof mxm_wmi > [ 888.900963] pcspkr snd_timer rapl k10temp i2c_piix4 videodev snd > ecdh_generic rfkill joydev soundcore mc parport_pc parport gpio_amdpt > gpio_generic acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc zram > ip_tables amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit > drm_kms_helper cec drm crct10dif_pclmul crc32_pclmul crc32c_intel uas ccp > ghash_clmulni_intel sp5100_tco usb_storage r8169 nvme nvme_core wmi > ipmi_devintf ipmi_msghandler fuse > [ 888.900989] CPU: 11 PID: 2306 Comm: gnome-shell Not tainted > 5.15.12-200.fc35.x86_64 #1 > [ 888.900992] Hardware name: Micro-Star International Co., Ltd. 
MS-7A33/X370 > SLI PLUS (MS-7A33), BIOS 3.JR 11/29/2019 > [ 888.900993] RIP: 0010:delay_halt_mwaitx+0x39/0x40 > [ 888.900999] Code: 03 05 cb b6 95 4d 31 d2 48 89 d1 0f 01 fa b8 ff ff ff ff > b9 02 00 00 00 48 39 c6 48 0f 46 c6 48 89 c3 b8 f0 00 00 00 0f 01 fb <5b> c3 > 0f 1f 44 00 00 0f 1f 44 00 00 48 8b 05 9c 2f 03 01 e9 7f 47 > [ 888.901001] RSP: 0018:b7f243e63878 EFLAGS: 0293 > [ 888.901003] RAX: 00f0 RBX: 002dc50a RCX: > 0002 > [ 888.901005] RDX: RSI: 002dc50a RDI: > 027b5712e506 > [ 888.901006] RBP: 002dc50a R08: b7f243e63824 R09: > 0001 > [ 888.901007] R10: b7f243e63660 R11: 000d R12: > 917bd719 > [ 888.901009] R13: 917dd450 R14: 917dd45006a0 R15: > 917bd541fc00 > [ 888.901010] FS: 7f2912683d80() GS:918a9ecc() > knlGS: > [ 888.901011] CS: 0010 DS: ES: CR0: 80050033 > [ 888.901013] CR2: 33910fce CR3: 000105b22000 CR4: > 003506e0 > [ 888.901014] Call Trace: > [ 888.901016] > [ 888.901018] delay_halt+0x3b/0x60 > [ 888.901021] dcn20_post_unlock_program_front_end+0xf4/0x2c0 [amdgpu] > [ 888.901209] dc_commit_state+0x4b6/0xa50 [amdgpu] > [ 888.901382] amdgpu_dm_atomic_commit_tail+0x55c/0x2610 [amdgpu] > [ 888.901557] ? dcn20_calculate_dlg_params+0x4f4/0x540 [amdgpu] > [ 888.901735] ? dcn20_calculate_dlg_params+0x4f4/0x540 [amdgpu] > [ 888.901916] ? dcn30_calculate_wm_and_dlg_fp+0x707/0x8a0 [amdgpu] > [ 888.902090] ? dcn30_validate_bandwidth+0x10f/0x240 [amdgpu] > [ 888.902261] ? kfree+0xaa/0x3f0 > [ 888.902265] ? dcn30_validate_bandwidth+0x10f/0x240 [amdgpu] > [ 888.902435] ? dc_validate_global_state+0x31f/0x3c0 [amdgpu] > [ 888.902604] ? ttm_bo_mem_compat+0x2c/0x90 [ttm] > [ 888.902609] ? ttm_bo_validate+0x42/0x100
softlockup in v5.15.12 in dcn20_post_unlock_program_front_end
/0x140 [drm_kms_helper]
[ 888.902969] drm_mode_atomic_ioctl+0x8fd/0xac0 [drm]
[ 888.902992] ? __cond_resched+0x16/0x40
[ 888.902994] ? drm_plane_get_damage_clips.cold+0x1c/0x1c [drm]
[ 888.903015] ? drm_atomic_set_property+0xb30/0xb30 [drm]
[ 888.903035] drm_ioctl_kernel+0x86/0xd0 [drm]
[ 888.903055] ? wp_page_reuse+0x61/0x70
[ 888.903057] drm_ioctl+0x220/0x3e0 [drm]
[ 888.903077] ? drm_atomic_set_property+0xb30/0xb30 [drm]
[ 888.903097] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 888.903222] __x64_sys_ioctl+0x82/0xb0
[ 888.903226] do_syscall_64+0x3b/0x90
[ 888.903228] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 888.903231] RIP: 0033:0x7f2918b362bb
[ 888.903234] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d 2b 0f 00 f7 d8 64 89 01 48
[ 888.903235] RSP: 002b:7ffdd451da48 EFLAGS: 0246 ORIG_RAX: 0010
[ 888.903237] RAX: ffda RBX: 7ffdd451da90 RCX: 7f2918b362bb
[ 888.903238] RDX: 7ffdd451da90 RSI: c03864bc RDI: 0009
[ 888.903239] RBP: c03864bc R08: 0012 R09: 0012
[ 888.903240] R10: 0002 R11: 0246 R12: 55bcad405c50
[ 888.903241] R13: 0009 R14: 55bca9f13bc0 R15: 55bcabe0d6a0
[ 888.903243]

(gdb) list *(dcn20_post_unlock_program_front_end+0xf4)
0x2561b4 is in dcn20_post_unlock_program_front_end (drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_hwseq.c:1766).
1761            if (pipe->plane_state && !pipe->top_pipe && pipe->update_flags.bits.enable) {
1762                    struct hubp *hubp = pipe->plane_res.hubp;
1763                    int j = 0;
1764
1765                    for (j = 0; j < TIMEOUT_FOR_PIPE_ENABLE_MS*1000
1766                                    && hubp->funcs->hubp_is_flip_pending(hubp); j++)
1767                            mdelay(1);
1768            }
1769    }
1770

I can reproduce this by logging in with GNOME on Wayland, starting up the Steam client and then letting the screen blank kick in. Note that starting an actual game is not necessary. Once I try to unblank the screen (keypress or mouse movement), it never does and the machine goes into a soft lockup.

The host is Fedora 35 with a stock kernel. I also see this with earlier v5.15 kernels. I haven't tested anything pre-5.15 though. If I log in with Xorg, I don't see the issue.

Video card is:

30:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 [Radeon RX 6600/6600 XT/6600M] (rev c7) (prog-if 00 [VGA controller])
        Subsystem: Sapphire Technology Limited Device e447
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ SERR-
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

I'm able to test patches if it helps. Let me know if you want other info as well.

Thanks!
--
Jeff Layton
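As a side note on the loop in the gdb listing: its bound is TIMEOUT_FOR_PIPE_ENABLE_MS*1000 passes and each pass calls mdelay(1), so if the flip never stops pending the worst case is TIMEOUT_FOR_PIPE_ENABLE_MS seconds, not milliseconds. The sketch below only does that arithmetic. The value 100 is an assumption picked for illustration, and the helper names are made up; the real constant is defined in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, for illustration only; check the driver for the real one. */
#define TIMEOUT_FOR_PIPE_ENABLE_MS 100

/*
 * Upper bound, in milliseconds, for a busy-wait loop that makes `iterations`
 * passes and delays `delay_us` microseconds per pass.
 */
static uint64_t worst_case_wait_ms(uint64_t iterations, uint64_t delay_us)
{
	/* Guard against overflow in the multiplication below. */
	assert(delay_us == 0 || iterations <= UINT64_MAX / delay_us);
	return iterations * delay_us / 1000;
}

/* The loop as listed: TIMEOUT*1000 passes of mdelay(1), i.e. 1000 us each. */
static uint64_t listed_loop_wait_ms(void)
{
	return worst_case_wait_ms((uint64_t)TIMEOUT_FOR_PIPE_ENABLE_MS * 1000, 1000);
}

/* The same pass count with a 1 us delay per pass, for comparison. */
static uint64_t microsecond_loop_wait_ms(void)
{
	return worst_case_wait_ms((uint64_t)TIMEOUT_FOR_PIPE_ENABLE_MS * 1000, 1);
}
```

Under the assumed value, the listed loop can spin for 100000 ms while the microsecond variant is bounded at 100 ms. A bound measured in tens of seconds would be consistent with the watchdog message above, though this is only arithmetic on the listing and not a confirmed diagnosis.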
Re: [PATCH 03/34] net/ceph: convert put_page() to put_user_page*()
On Thu, 2019-08-01 at 19:19 -0700, john.hubb...@gmail.com wrote:
> From: John Hubbard
>
> For pages that were retained via get_user_pages*(), release those pages
> via the new put_user_page*() routines, instead of via put_page() or
> release_pages().
>
> This is part a tree-wide conversion, as described in commit fc1d8e7cca2d
> ("mm: introduce put_user_page*(), placeholder versions").
>
> Cc: Ilya Dryomov
> Cc: Sage Weil
> Cc: David S. Miller
> Cc: ceph-de...@vger.kernel.org
> Cc: net...@vger.kernel.org
> Signed-off-by: John Hubbard
> ---
>  net/ceph/pagevec.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/net/ceph/pagevec.c b/net/ceph/pagevec.c
> index 64305e7056a1..c88fff2ab9bd 100644
> --- a/net/ceph/pagevec.c
> +++ b/net/ceph/pagevec.c
> @@ -12,13 +12,7 @@
>
>  void ceph_put_page_vector(struct page **pages, int num_pages, bool dirty)
>  {
> -	int i;
> -
> -	for (i = 0; i < num_pages; i++) {
> -		if (dirty)
> -			set_page_dirty_lock(pages[i]);
> -		put_page(pages[i]);
> -	}
> +	put_user_pages_dirty_lock(pages, num_pages, dirty);
>  	kvfree(pages);
>  }
>  EXPORT_SYMBOL(ceph_put_page_vector);

This patch looks sane enough. Assuming that the earlier patches are OK:

Acked-by: Jeff Layton
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx