Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
On Tue, Sep 10, 2024 at 8:47 PM Leo Li wrote:
>
> Thanks Mikhail, I think I know what's going on now.
>
> The `scale-monitor-framebuffer` experimental setting is what puts us down the
> bad code path. It seems VRR has nothing to do with this issue, just setting
> `scale-monitor-framebuffer` is enough to reproduce.

I ran some additional tests:

1) $ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate']"
   Symptoms: None.

2) $ gsettings set org.gnome.mutter experimental-features "['scale-monitor-framebuffer']"
   Symptoms: The screen flickers when moving the cursor.

3) $ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate', 'scale-monitor-framebuffer']"
   with Variable Refresh Rate disabled in the display settings.
   Symptoms: Same as in the previous case; the screen flickers when moving the cursor.

4) $ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate', 'scale-monitor-framebuffer']"
   with Variable Refresh Rate enabled in the display settings.
   Symptoms: On Radeon 7900XTX hardware, the computer hangs completely without any messages in the kernel logs.

On Wed, Sep 11, 2024 at 2:11 AM Leo Li wrote:
>
> Hi Mikhail,
>
> Can you give this patch a try to see if it helps?
> https://gist.github.com/leeonadoh/3271e90ec95d768424c572c970ada743
>

Thanks, with this patch, the issue is not reproduced anymore.

Tested-by: Mikhail Gavrilov

The only thing that worries me is the thought that the problem with the hang is now hidden.
It's one thing when the GPU hangs but the system continues to work; it's another when the system hangs completely and even Alt+SysRq+REISUB does not help to reboot it.
It shouldn't be like this...

--
Best Regards,
Mike Gavrilov.
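The four test cases above differ only in which experimental features are listed. As a minimal sketch, the value string passed to `gsettings set` can be generated instead of typed by hand. The function names `features_value` and `print_case` are hypothetical helpers invented for this illustration; they are not part of gsettings or mutter, and the second one only prints the command rather than running it.

```sh
#!/bin/sh
# Hypothetical helper: join feature names into the list literal used as
# the value argument in the test cases above.
features_value() {
    out=""
    for feature in "$@"; do
        if [ -n "$out" ]; then
            out="$out, "
        fi
        out="$out'$feature'"
    done
    printf '[%s]\n' "$out"
}

# Hypothetical helper: print (do not execute) the command for one test case.
print_case() {
    printf 'gsettings set org.gnome.mutter experimental-features "%s"\n' \
        "$(features_value "$@")"
}
```

For example, `print_case variable-refresh-rate scale-monitor-framebuffer` prints the command used in cases 3 and 4. Whether VRR is actually active still depends on the separate toggle in the display settings, which this sketch does not touch.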
Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
On Sat, Sep 7, 2024 at 12:47 AM Leo Li wrote:
>
> Hi Mikhail,
>
> I've tried to align my system with yours as best as I can, but so far, I've
> had no luck reproducing the hang. A video of what I'm doing:
> https://youtu.be/VeD-LPCnfWM?si=b2baF8MyDBuU4jRH
> (Under the hood, the W7900 and 7900xt should be the same)

I have done additional tests:

1. The computer does not hang with the 6900XT; instead, the screen flickers
   when moving the cursor.

2. The computer does not hang with the 7900XTX if I turn off VRR, but the
   screen flickers when moving the cursor, as on the 6900XT.
   To enable VRR, please set 'variable-refresh-rate' in experimental-features
   and enable Variable Refresh Rate in the Display settings.
   $ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate', 'scale-monitor-framebuffer']"
   https://postimg.cc/PvXYdvGR

3. The chances of the problem reoccurring are much higher when running the
   game "Play Innocence Or Money Season 1 - Episodes 1 to 3". There is a free
   demo version.
   https://store.steampowered.com/app/1958390/Innocence_Or_Money_Season_1__Episodes_1_to_3/
   Demonstration: https://youtu.be/XIe0pQYPVUo

> I have a few suggestions:
>
> First, can you also open an issue on the amd gitlab tracker? It gives more
> visibility to others, and makes working together a bit easier:
> https://gitlab.freedesktop.org/drm/amd/-/issues
>
> Second, can you try adding "amdgpu.dcdebugmask=0x40" to your kernel cmdline at
> boot, and see if you can still repro the hang?

Yes, I can still reproduce the hang with this setting. It didn't help.

> This setting disables hw planes. If it resolves the hang, then it's quite
> interesting, because it suggests that gnome may be using direct-scanout via hw
> planes. We may need to align our gnome configuration in that case, since I
> don't see any additional hw planes being used on my setup.
>
> Third, in case these two issues are related, can you give the attached patch
> on this issue thread a try as well?
> https://gitlab.freedesktop.org/drm/amd/-/issues/3569#note_2558359

This patch also didn't help.
Could you try compiling a kernel with the same config as mine and enabling VRR to reproduce the problem? I attached my build config to this message.

--
Best Regards,
Mike Gavrilov.

.config.zip
Description: Zip archive
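When testing a boot parameter such as "amdgpu.dcdebugmask=0x40", it is worth confirming that the running kernel actually received it before drawing conclusions. The following is a minimal sketch under that assumption; `cmdline_param` is a hypothetical helper name, and it simply scans a command-line file (by default /proc/cmdline) for `name=value`.

```sh
#!/bin/sh
# Hypothetical helper: print the value of kernel parameter $1, reading the
# command line from file $2 (defaults to /proc/cmdline).
# Returns non-zero if the parameter is not present.
cmdline_param() {
    name=$1
    file=${2:-/proc/cmdline}
    for word in $(cat "$file"); do
        case $word in
            "$name"=*)
                # Strip everything up to and including the first '='.
                printf '%s\n' "${word#*=}"
                return 0
                ;;
        esac
    done
    return 1
}
```

Running `cmdline_param amdgpu.dcdebugmask` after a reboot prints the value if the parameter was passed; an empty result with a non-zero exit status means the bootloader entry was not updated.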
Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
On Thu, Sep 5, 2024 at 4:06 AM Leo Li wrote:
>
> Can you delete ", new_cursor_state" on that line and try again? Seems to be a
> unused variable warning being elevated to an error.
>

Thanks, I applied both patches and can confirm that this solved the issue.
The first patch was definitely not enough.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
On Wed, Sep 4, 2024 at 4:15 AM Leo Li wrote:
> Hi Mike,
>
> Super sorry for the ridiculous wait. Your first two emails slipped by my
> inbox, which is really silly, given I'm first in the to field...
>
> Thanks for bisecting and finding a free game to reproduce it on. I did not
> have luck reproducing this today, but I am on sway and not gnome. While I get
> gnome set up, will you be able to test which one of these reverts fixes the
> hang for you? Whether just 1/2 is enough, or both 1/2 and 2/2 is required?
>
> I applied them on top of Linus's v6.11-rc6 tag, so hopefully they'll git am
> cleanly for you:
>
> 1/2:
> https://gist.github.com/leeonadoh/69147b5fa8d815b39c5f4c3e005cca28#file-0001-revert-drm-amd-display-move-primary-plane-zpos-highe-patch
> 2/2:
> https://gist.github.com/leeonadoh/69147b5fa8d815b39c5f4c3e005cca28#file-0002-revert-drm-amd-display-introduce-overlay-cursor-mode-patch
>

The first patch is not enough.
Yes, it fixes the system hang when I launch the game "Find the Orange Narwhal", but it does not fix the issue completely.
Some RenPy games can still hang the system, for example "Innocence Or Money Season 1"
https://store.steampowered.com/app/1958390/Innocence_Or_Money_Season_1__Episodes_1_to_3/
on the language selection screen.

Unfortunately, the kernel does not build with both patches applied.
I got a compilation error after applying the second patch:

  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/fifo/chid.o
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c: In function ‘amdgpu_dm_atomic_check’:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:11003:69: error: unused variable ‘new_cursor_state’ [-Werror=unused-variable]
11003 |         struct drm_plane_state *old_plane_state, *new_plane_state, *new_cursor_state;
      |                                                                     ^~~~
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/basics/conversion.o
***
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/tu102.o
cc1: all warnings being treated as errors
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/dml/calcs/dcn_calc_auto.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ga102.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ad102.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/r535.o
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/clk_mgr.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxnv40.o
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce60/dce60_clk_mgr.o
make[6]: *** [scripts/Makefile.build:244: drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.o] Error 1
make[6]: *** Waiting for unfinished jobs....
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxnv50.o
***
make[5]: *** [scripts/Makefile.build:485: drivers/gpu/drm/amd/amdgpu] Error 2
make[4]: *** [scripts/Makefile.build:485: drivers/gpu/drm] Error 2
make[3]: *** [scripts/Makefile.build:485: drivers/gpu] Error 2
make[2]: *** [scripts/Makefile.build:485: drivers] Error 2
make[1]: *** [/home/mikhail/packaging-work/git/linux-3/Makefile:1925: .] Error 2
make: *** [Makefile:224: __sub-make] Error 2

--
Best Regards,
Mike Gavrilov.
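In a parallel build like the one above, the single real diagnostic is buried among `CC [M]` progress lines. A minimal sketch of filtering a saved build log follows; `build_errors` and `werror_flags` are hypothetical helper names, and the sketch assumes the log was captured to a file or pipe (for example with `make 2>&1 | tee build.log`).

```sh
#!/bin/sh
# Hypothetical helper: keep only compiler diagnostics of the form
# "file:line:col: error: ..." from a build log read on stdin.
build_errors() {
    grep ': error: '
}

# Hypothetical helper: print the name of each warning that was promoted to
# an error, taken from the trailing "[-Werror=<name>]" tag.
werror_flags() {
    sed -n 's/.*\[-Werror=\([^]]*\)].*/\1/p'
}
```

For the log above, `werror_flags < build.log` would report `unused-variable`, which matches the later diagnosis in this thread: the revert left an unused `new_cursor_state` declaration behind, and the build treats that warning as an error.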
Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
On Sun, Aug 25, 2024 at 2:12 AM Mikhail Gavrilov wrote:
>
> Hi,
> Is anyone trying to look into it?
> I continue to reproduce this issue on fresh kernel builds 6.11-rc4+.
> In addition to the RenPy engine, the problem also reproduces on games
> from Ubisoft, such as Far Cry 4.
> A very important note that I missed in the first message.
> To reproduce the problem, you need to enable scaling in Gnome for
> HiDPI monitors.
> I am using 4K resolution with 200% of fractional scaling.

Sorry for the persistence, but I'm afraid there's no time left to fix this regression.
There's a week left until the release.
A month later, still no one has looked into the problem.

--
Best Regards,
Mike Gavrilov.
Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
On Mon, Aug 5, 2024 at 11:05 PM Mikhail Gavrilov wrote:
>
> Hi,
> After commit 1b04dcca4fb1, launching some RenPy games causes computer hang.
> [...]
6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang
Hi,
After commit 1b04dcca4fb1, launching some RenPy games causes the computer to hang.
After the hang, even Alt + SysRq + REISUB can't reboot the computer!
And there is no trace in the kernel log!
For demonstration, I'm going to use the game "Find the Orange Narwhal"
because it is free and reproduces this issue 100% of the time.
You can find it in the Steam Store:
https://store.steampowered.com/app/2946010/Find_the_Orange_Narwhal/
I uploaded a demonstration video to YouTube: https://youtu.be/yVW6rImRpXw

Unfortunately, I can't check the revert commit 1541d63c5fe2 because of
conflicts.

mikhail@primary-ws ~/p/g/linux (master)> git reset v6.11-rc1 --hard
HEAD is now at 8400291e289e Linux 6.11-rc1

mikhail@primary-ws ~/p/g/linux (master)> git revert -n 1b04dcca4fb1
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
error: could not revert 1b04dcca4fb1... drm/amd/display: Introduce overlay cursor mode
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add <paths>' or 'git rm <paths>'
hint: Disable this message with "git config advice.mergeConflict false"

commit 1b04dcca4fb10dd3834893a60de74edd99f2bfaf
Author: Leo Li
Date:   Thu Jan 18 16:29:49 2024 -0500

    drm/amd/display: Introduce overlay cursor mode

    [Why]

    DCN is the display hardware for amdgpu. DRM planes are backed by DCN
    hardware pipes, which carry pixel data from one end (memory), to the
    other (output encoder).

    Each DCN pipe has the ability to blend in a cursor early on in the
    pipeline. In other words, there are no dedicated cursor planes in DCN,
    which makes cursor behavior somewhat unintuitive for compositors.

    For example, if the cursor is in RGB format, but the top-most DRM plane
    is in YUV format, DCN will not be able to blend them. Because of this,
    amdgpu_dm rejects all configurations where a cursor needs to be enabled
    on top of a YUV formatted plane.

    From a compositor's perspective, when computing an allocation for
    hardware plane offloading, this cursor-on-yuv configuration result in an
    atomic test failure. Since the failure reason is not obvious at all,
    compositors will likely fall back to full rendering, which is not ideal.

    Instead, amdgpu_dm can try to accommodate the cursor-on-yuv
    configuration by opportunistically reserving a separate DCN pipe just
    for the cursor. We can refer to this as "overlay cursor mode". It is
    contrasted with "native cursor mode", where the native DCN per-pipe
    cursor is used.

    [How]

    On each crtc, compute whether the cursor plane should be enabled in
    overlay mode. If it is, mark the CRTC as requesting overlay cursor mode.

    Overlay cursor should be enabled whenever there exists a underlying
    plane that has YUV format, or is scaled differently than the cursor. It
    should also be enabled if there is no underlying plane, or if underlying
    planes do not cover the entire CRTC.

    During DC validation, attempt to enable a separate DCN pipe for the
    cursor if it's in overlay mode. If that fails, or if no overlay mode is
    requested, then fallback to native mode.

    v2:
    * Update commit message for when overlay cursor should be enabled
    * Also consider scale and no-underlying-plane case (cursor on crtc bg)
    * Consider all underlying planes when determinig overlay/native, not
      just the plane immediately beneath the cursor, as it may not cover the
      entire CRTC.
    * Fix typo s/decending/descending/
    * Force native cursor on pre-DCN hardware

    Reviewed-by: Harry Wentland
    Acked-by: Zaeem Mohamed
    Signed-off-by: Leo Li
    Acked-by: Harry Wentland
    Acked-by: Pekka Paalanen
    Tested-by: Daniel Wheeler
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c       | 490 +++---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h       |   7 +++
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c  |   1 +
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c |  13 -
 4 files changed, 389 insertions(+), 122 deletions(-)

My hardware specs are: https://linux-hardware.org/?probe=61bd7390a9

Leo, can you look into it, please?

--
Best Regards,
Mike Gavrilov.
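The failed revert above reports its conflicts inline with the auto-merge messages. As a small sketch, the conflicted paths can be pulled out of saved `git revert` output; `conflicted_files` is a hypothetical helper name and assumes each CONFLICT message sits on one line, as git prints it (the mail client wrapped some of them).

```sh
#!/bin/sh
# Hypothetical helper: read `git revert` or `git merge` output on stdin and
# print only the paths reported as content conflicts.
conflicted_files() {
    sed -n 's/^CONFLICT ([^)]*): Merge conflict in //p'
}
```

In a live working tree, `git diff --name-only --diff-filter=U` lists the same unmerged paths directly, so the parsing approach is mainly useful for logs pasted into a report like this one.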
Re: 6.10/bisected/regression - Since commit e356d321d024 in the kernel log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)" which were never seen before
On Wed, Jul 24, 2024 at 10:16 PM Mikhail Gavrilov wrote:
>
> > https://patchwork.freedesktop.org/patch/605201/
> For which kernel is this patch intended? The patch is not applied on
> top of d67978318827.

I was able to apply this patch on top of e4fc196f5ba3, and the issue is gone.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - Since commit e356d321d024 in the kernel log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)" which were never seen before
On Tue, Jul 23, 2024 at 2:34 AM Alex Deucher wrote:
> Do either of these patches help?
> https://patchwork.freedesktop.org/patch/605437/

Unfortunately, this patch didn't help.
Please see the attached kernel log.

> https://patchwork.freedesktop.org/patch/605201/

For which kernel is this patch intended?
The patch does not apply on top of d67978318827.

mikhail@primary-ws ~/p/g/linux-3 (master)> git reset d67978318827 --hard
HEAD is now at d67978318827 Merge tag 'x86_cpu_for_v6.11_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
mikhail@primary-ws ~/p/g/linux-3 (master)> git apply drm-amdgpu-mes-fix-mes-ring-buffer-overflow.patch
error: drivers/gpu/drm/amd/amdgpu/mes_v12_0.c: No such file or directory

--
Best Regards,
Mike Gavrilov.
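The `git apply` failure above comes from the patch touching a file that does not exist at the chosen base commit. A rough sketch of checking this up front is shown below; `patch_targets` and `missing_targets` are hypothetical helper names, and the sketch assumes a git-style patch whose file headers look like `+++ b/<path>`. Files that the patch itself creates would also be reported as missing, so treat the output as a hint only.

```sh
#!/bin/sh
# Hypothetical helper: list the paths a git-style patch ($1) modifies,
# taken from its "+++ b/<path>" header lines.
patch_targets() {
    sed -n 's|^+++ b/||p' "$1"
}

# Hypothetical helper: print each target of patch $1 that does not exist
# under source tree $2.
missing_targets() {
    patch_targets "$1" | while IFS= read -r path; do
        [ -e "$2/$path" ] || printf '%s\n' "$path"
    done
}
```

`git apply --check <patch>` performs the authoritative test without modifying the tree; the sketch only helps explain why that check fails, for instance when a patch was written against a tree that already contains a newer source file.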
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Tue, Jul 16, 2024 at 10:10 PM Alex Deucher wrote:
>
> Does the attached partial revert fix it?
>
> Alex
>

Yes, thanks.

Tested-by: Mikhail Gavrilov

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Wed, Jul 10, 2024 at 12:01 PM Mikhail Gavrilov wrote:
>
> On Tue, Jul 9, 2024 at 7:48 PM Rodrigo Siqueira Jordao wrote:
> > Hi,
> >
> > I also tried it with 6900XT. I got the same results on my side.
>
> This is weird.
>
> > Anyway, I could not reproduce the issue with the below components. I may
> > be missing something that will trigger this bug; in this sense, could
> > you describe the following:
> > - The display resolution and refresh rate.
>
> 3840x2160 and 120Hz
> At 60Hz issue not reproduced.
>
> > - Are you able to reproduce this issue with DP and HDMI?
>
> My TV, an OLED LG C3, has only an HDMI 2.1 port.
>
> > - Could you provide the firmware information: sudo cat
> > /sys/kernel/debug/dri/0/amdgpu_firmware_info
>
> $ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
> [sudo] password for mikhail:
> VCE feature version: 0, firmware version: 0x
> UVD feature version: 0, firmware version: 0x
> MC feature version: 0, firmware version: 0x
> ME feature version: 38, firmware version: 0x000e
> PFP feature version: 38, firmware version: 0x000e
> CE feature version: 38, firmware version: 0x0003
> RLC feature version: 1, firmware version: 0x001f
> RLC SRLC feature version: 1, firmware version: 0x0001
> RLC SRLG feature version: 1, firmware version: 0x0001
> RLC SRLS feature version: 1, firmware version: 0x0001
> RLCP feature version: 0, firmware version: 0x
> RLCV feature version: 0, firmware version: 0x
> MEC feature version: 38, firmware version: 0x0015
> MEC2 feature version: 38, firmware version: 0x0015
> IMU feature version: 0, firmware version: 0x
> SOS feature version: 0, firmware version: 0x
> ASD feature version: 553648344, firmware version: 0x21d8
> TA XGMI feature version: 0x, firmware version: 0x
> TA RAS feature version: 0x, firmware version: 0x
> TA HDCP feature version: 0x, firmware version: 0x173f
> TA DTM feature version: 0x, firmware version: 0x1216
> TA RAP feature version: 0x, firmware version: 0x
> TA SECUREDISPLAY feature version: 0x, firmware version: 0x
> SMC feature version: 0, program: 0, firmware version: 0x00544fdf (84.79.223)
> SDMA0 feature version: 52, firmware version: 0x0009
> VCN feature version: 0, firmware version: 0x0311f002
> DMCU feature version: 0, firmware version: 0x
> DMCUB feature version: 0, firmware version: 0x05000f00
> TOC feature version: 0, firmware version: 0x0007
> MES_KIQ feature version: 0, firmware version: 0x
> MES feature version: 0, firmware version: 0x
> VPE feature version: 0, firmware version: 0x
> VBIOS version: 102-RAPHAEL-008
>

I forgot to add output for discrete GPU:

$ sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
[sudo] password for mikhail:
VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 44, firmware version: 0x0040
PFP feature version: 44, firmware version: 0x0062
CE feature version: 44, firmware version: 0x0025
RLC feature version: 1, firmware version: 0x0060
RLC SRLC feature version: 0, firmware version: 0x
RLC SRLG feature version: 0, firmware version: 0x
RLC SRLS feature version: 0, firmware version: 0x
RLCP feature version: 0, firmware version: 0x
RLCV feature version: 0, firmware version: 0x
MEC feature version: 44, firmware version: 0x0076
MEC2 feature version: 44, firmware version: 0x0076
IMU feature version: 0, firmware version: 0x
SOS feature version: 0, firmware version: 0x00210e64
ASD feature version: 553648345, firmware version: 0x21d9
TA XGMI feature version: 0x, firmware version: 0x200f
TA RAS feature version: 0x, firmware version: 0x1b00013e
TA HDCP feature version: 0x, firmware version: 0x173f
TA DTM feature version: 0x, firmware version: 0x1216
TA RAP feature version: 0x, firmware version: 0x0716
TA SECUREDISPLAY feature version: 0x, firmware version: 0x
SMC feature version: 0, program: 0, firmware version: 0x003a5a00 (58.90.0)
SDMA0 feature version: 52, firmware version: 0x0053
SDMA1 feature version: 52, firmware version: 0x0053
SDMA2 feature version: 52, firmware version: 0x0053
SDMA3 feature version: 52, firmware version: 0x0053
VCN feature version: 0, firmware version: 0x0311f002
DMCU feature version: 0, firmware version: 0x
DMCUB feature version: 0, firmware version: 0x02020020
TOC feature version: 0, firmware version: 0x
MES_KIQ feature version: 0, firmware version: 0x
MES feature version: 0, firmware version: 0x
VPE feature version: 0, firmware version: 0x
VBIOS version: 113-D4120100-100

--
Best Regards,
Mike Gavrilov.
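Comparing the integrated and discrete GPU dumps above by eye is error-prone. As a minimal sketch, one component's firmware version can be extracted from saved `amdgpu_firmware_info` output so the two devices can be compared field by field; `fw_version` is a hypothetical helper name, and the line format it expects is simply the one visible in the dumps in this message.

```sh
#!/bin/sh
# Hypothetical helper: read amdgpu_firmware_info text on stdin and print
# the firmware version of component $1 (e.g. DMCUB, MEC, SMC).
# Matching is anchored at line start, so "MEC" does not match "MEC2".
fw_version() {
    awk -v name="$1" '
        index($0, name " feature version:") == 1 {
            n = split($0, parts, "firmware version: ")
            if (n >= 2) {
                # Drop any trailing annotation such as "(58.90.0)".
                split(parts[2], tail, " ")
                print tail[1]
            }
        }'
}
```

With each dump saved to a file, `fw_version DMCUB < dri0.txt` and `fw_version DMCUB < dri1.txt` give the two values to compare; the file names here are placeholders.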
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Tue, Jul 9, 2024 at 7:48 PM Rodrigo Siqueira Jordao wrote:
> Hi,
>
> I also tried it with 6900XT. I got the same results on my side.

This is weird.

> Anyway, I could not reproduce the issue with the below components. I may
> be missing something that will trigger this bug; in this sense, could
> you describe the following:
> - The display resolution and refresh rate.

3840x2160 at 120Hz.
At 60Hz the issue is not reproduced.

> - Are you able to reproduce this issue with DP and HDMI?

My TV, an OLED LG C3, has only an HDMI 2.1 port.

> - Could you provide the firmware information: sudo cat
> /sys/kernel/debug/dri/0/amdgpu_firmware_info

$ sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
[sudo] password for mikhail:
VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 38, firmware version: 0x000e
PFP feature version: 38, firmware version: 0x000e
CE feature version: 38, firmware version: 0x0003
RLC feature version: 1, firmware version: 0x001f
RLC SRLC feature version: 1, firmware version: 0x0001
RLC SRLG feature version: 1, firmware version: 0x0001
RLC SRLS feature version: 1, firmware version: 0x0001
RLCP feature version: 0, firmware version: 0x
RLCV feature version: 0, firmware version: 0x
MEC feature version: 38, firmware version: 0x0015
MEC2 feature version: 38, firmware version: 0x0015
IMU feature version: 0, firmware version: 0x
SOS feature version: 0, firmware version: 0x
ASD feature version: 553648344, firmware version: 0x21d8
TA XGMI feature version: 0x, firmware version: 0x
TA RAS feature version: 0x, firmware version: 0x
TA HDCP feature version: 0x, firmware version: 0x173f
TA DTM feature version: 0x, firmware version: 0x1216
TA RAP feature version: 0x, firmware version: 0x
TA SECUREDISPLAY feature version: 0x, firmware version: 0x
SMC feature version: 0, program: 0, firmware version: 0x00544fdf (84.79.223)
SDMA0 feature version: 52, firmware version: 0x0009
VCN feature version: 0, firmware version: 0x0311f002
DMCU feature version: 0, firmware version: 0x
DMCUB feature version: 0, firmware version: 0x05000f00
TOC feature version: 0, firmware version: 0x0007
MES_KIQ feature version: 0, firmware version: 0x
MES feature version: 0, firmware version: 0x
VPE feature version: 0, firmware version: 0x
VBIOS version: 102-RAPHAEL-008

> Also, could you conduct the below tests and report the results:
>
> - Test 1: Just revert the fallback patch (drm/amd/display: Add fallback
> configuration for set DRR in DCN10) and see if it solves the issue.

It's not enough. I checked a revert of commit bc87d666c05 on top of 34afb82a3c67.

> - Test 2: Try the latest amd-staging-drm-next
> (https://gitlab.freedesktop.org/agd5f/linux) and see if the issue is gone.

I checked commit 7cef45b1347a in the amd-staging-drm-next branch. Same here.

> - Test 3: In the kernel that you see the issue, could you install the
> latest firmware and see if it fix the issue? Check:
> https://gitlab.freedesktop.org/drm/firmware P.S.: Don't forget to update
> the initramfs or something similar in your system.

Is there any point in this? Fedora Rawhide always ships with the latest kernel and firmware.

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Sat, Jun 29, 2024 at 9:46 PM Rodrigo Siqueira Jordao wrote:
> Hi Mikhail,
>
> I'm trying to reproduce this issue, but until now, I've been unable to
> reproduce it. I tried some different scenarios with the following
> components:
>
> 1. Displays: I tried with one and two displays
> - 4k@120 - DP && 4k@60 - HDMI
> - 4k@244 Oled - DP
> 2. GPU: 7900XTX

The issue is only reproduced with RDNA2 (6900XT).
RDNA3 (7900XTX) is not affected.

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Fri, Jun 21, 2024 at 12:56 PM Linux regression tracking (Thorsten Leemhuis) wrote:
> Hmmm, I might have missed something, but it looks like nothing happened
> here since then. What's the status? Is the issue still happening?

Yes. Tested on e5b3efbe1ab1.
I noticed that the problem disappears after forcing the TV to sleep (activate screensaver + ) and then waking it up by pressing any button and entering a password.
I hope this information can help figure out how to fix it.

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Fri, Jun 7, 2024 at 5:29 PM Linux regression tracking (Thorsten Leemhuis) wrote:
>
> [CCing the other amd drm maintainers]
>
> Mikhail: are those details in any way relevant? Then in the future best
> leave them out (or make things easier to follow), they make the bug
> report confusing and sounds like this is just a bug, when it fact from
> your bisection is sounds like this is a regression.

Apologies if my backstory was confusing.
I just wanted to say that I moved completely to the 7900XTX more than a year ago, and I was surprised to see this regression on the old 6900XT.
I found this issue by accident, because I didn't plan to use the old hardware.

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Fri, Jun 7, 2024 at 6:39 PM Alex Deucher wrote:
>
> --- a/drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c
> +++ b/drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c
> @@ -944,7 +944,7 @@ void optc1_set_drr(
>                                 OTG_V_TOTAL_MAX_SEL, 1,
>                                 OTG_FORCE_LOCK_ON_EVENT, 0,
>                                 OTG_SET_V_TOTAL_MIN_MASK_EN, 0,
> -                               OTG_SET_V_TOTAL_MIN_MASK, 0);
> +                               OTG_SET_V_TOTAL_MIN_MASK, (1 << 1)); /* TRIGA */
>
>                 // Setup manual flow control for EOF via TRIG_A
>                 optc->funcs->setup_manual_trigger(optc);

Thanks, Alex.
I applied this patch on top of 771ed66105de and unfortunately the issue is not fixed.
I saw a green flashing bar on top of the screen again.

--
Best Regards,
Mike Gavrilov.
Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
On Sun, May 26, 2024 at 7:06 PM Mikhail Gavrilov wrote: > > Hi, > Day before yesterday I replaced 7900XTX to 6900XT for got clear in > which kernel first time appeared warning message "DMA-API: amdgpu > :0f:00.0: cacheline tracking EEXIST, overlapping mappings aren't > supported". > The kernel 6.3 and older won't boot on a computer with Radeon 7900XTX. > When I booted the system with 6900XT I saw a green flashing bar on top > of the screen when I typed commands in the gnome terminal which was > maximized on full screen. > Demonstration: https://youtu.be/tTvwQ_5pRkk > For reproduction you need Radeon 6900XT GPU connected to 120Hz OLED TV by > HDMI. > > I bisected the issue and the first commit which I found was 6d4279cb99ac. > commit 6d4279cb99ac4f51d10409501d29969f687ac8dc (HEAD) > Author: Rodrigo Siqueira > Date: Tue Mar 26 10:42:05 2024 -0600 > > drm/amd/display: Drop legacy code > > This commit removes code that are not used by display anymore. > > Acked-by: Hamza Mahfooz > Signed-off-by: Rodrigo Siqueira > Signed-off-by: Alex Deucher > > drivers/gpu/drm/amd/display/dc/inc/hw/stream_encoder.h | 4 > drivers/gpu/drm/amd/display/dc/inc/resource.h | 7 --- > drivers/gpu/drm/amd/display/dc/optc/dcn20/dcn20_optc.c | 10 > -- > drivers/gpu/drm/amd/display/dc/resource/dcn21/dcn21_resource.c | 33 > + > 4 files changed, 1 insertion(+), 53 deletions(-) > > Every time after bisecting I usually make sure that I found the right > commit and build the kernel with revert of the bad commit. > But this time I again observed an issue after running a kernel builded > without commit 6d4279cb99ac. > And I decided to find a second bad commit. > The second bad commit has been bc87d666c05. > commit bc87d666c05a13e6d4ae1ddce41fc43d2567b9a2 (HEAD) > Author: Rodrigo Siqueira > Date: Tue Mar 26 11:55:19 2024 -0600 > > drm/amd/display: Add fallback configuration for set DRR in DCN10 > > Set OTG/OPTC parameters to 0 if something goes wrong on DCN10. 
> > Acked-by: Hamza Mahfooz > Signed-off-by: Rodrigo Siqueira > Signed-off-by: Alex Deucher > > drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c | 15 --- > 1 file changed, 12 insertions(+), 3 deletions(-) > > After reverting both these commits on top of 54f71b0369c9 the issue is gone. > > I also attach the build config. > > My hardware specs: https://linux-hardware.org/?probe=f25a873c5e > > Rodrigo or anyone else from the AMD team can you look please. > Has anyone had a chance to look at this? -- Best Regards, Mike Gavrilov.
6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz
Hi, The day before yesterday I replaced the 7900XTX with a 6900XT to find out in which kernel the warning message "DMA-API: amdgpu :0f:00.0: cacheline tracking EEXIST, overlapping mappings aren't supported" first appeared. Kernel 6.3 and older won't boot on a computer with a Radeon 7900XTX. When I booted the system with the 6900XT I saw a green flashing bar at the top of the screen while typing commands in a GNOME terminal maximized to full screen. Demonstration: https://youtu.be/tTvwQ_5pRkk To reproduce it you need a Radeon 6900XT GPU connected to a 120Hz OLED TV via HDMI. I bisected the issue and the first commit I found was 6d4279cb99ac. commit 6d4279cb99ac4f51d10409501d29969f687ac8dc (HEAD) Author: Rodrigo Siqueira Date: Tue Mar 26 10:42:05 2024 -0600 drm/amd/display: Drop legacy code This commit removes code that are not used by display anymore. Acked-by: Hamza Mahfooz Signed-off-by: Rodrigo Siqueira Signed-off-by: Alex Deucher drivers/gpu/drm/amd/display/dc/inc/hw/stream_encoder.h | 4 drivers/gpu/drm/amd/display/dc/inc/resource.h | 7 --- drivers/gpu/drm/amd/display/dc/optc/dcn20/dcn20_optc.c | 10 -- drivers/gpu/drm/amd/display/dc/resource/dcn21/dcn21_resource.c | 33 + 4 files changed, 1 insertion(+), 53 deletions(-) After every bisect I usually confirm that I found the right commit by building a kernel with the bad commit reverted. But this time I still observed the issue on a kernel built without commit 6d4279cb99ac, so I decided to look for a second bad commit. The second bad commit turned out to be bc87d666c05. commit bc87d666c05a13e6d4ae1ddce41fc43d2567b9a2 (HEAD) Author: Rodrigo Siqueira Date: Tue Mar 26 11:55:19 2024 -0600 drm/amd/display: Add fallback configuration for set DRR in DCN10 Set OTG/OPTC parameters to 0 if something goes wrong on DCN10. 
Acked-by: Hamza Mahfooz Signed-off-by: Rodrigo Siqueira Signed-off-by: Alex Deucher drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) After reverting both of these commits on top of 54f71b0369c9 the issue is gone. I have also attached the build config. My hardware specs: https://linux-hardware.org/?probe=f25a873c5e Rodrigo or anyone else from the AMD team, could you please take a look? -- Best Regards, Mike Gavrilov. .config.zip Description: Zip archive
Re: regression/bisected/6.8 commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7 leads to GPU hang when I open GNOME activities
On Wed, Jan 24, 2024 at 7:19 AM Mikhail Gavrilov wrote: > > Who could dig into it, please? Did you decide to revert it? https://lkml.org/lkml/2024/1/22/1866 Also, I forgot to attach the kernel build .config to the previous message, so I am attaching it here. It may be useful for reproducing my bug script. -- Best Regards, Mike Gavrilov. .config.zip Description: Zip archive
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Dec 15, 2023 at 5:37 PM Christian König wrote: > > I have no idea :) > > From the logs I can see that the AMDGPU now has the proper BARs assigned: > > [5.722015] pci :03:00.0: [1002:73df] type 00 class 0x038000 > [5.722051] pci :03:00.0: reg 0x10: [mem > 0xf8-0xfb 64bit pref] > [5.722081] pci :03:00.0: reg 0x18: [mem > 0xfc-0xfc0fff 64bit pref] > [5.722112] pci :03:00.0: reg 0x24: [mem 0xfca0-0xfcaf] > [5.722134] pci :03:00.0: reg 0x30: [mem 0xfcb0-0xfcb1 pref] > [5.722368] pci :03:00.0: PME# supported from D1 D2 D3hot D3cold > [5.722484] pci :03:00.0: 63.008 Gb/s available PCIe bandwidth, > limited by 8.0 GT/s PCIe x8 link at :00:01.1 (capable of 252.048 > Gb/s with 16.0 GT/s PCIe x16 link) > > And with that the driver can work perfectly fine. > > Have you updated the BIOS or added/removed some other hardware? Maybe > somebody added a quirk for your BIOS into the PCIe code or something > like that. No, nothing changed in hardware. But I found the commit which fixes it. > git bisect unfixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 Author: Vasant Hegde Date: Thu Sep 21 09:21:45 2023 + iommu/amd: Introduce iommu_dev_data.flags to track device capabilities Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device ATS, PRI, and PASID capabilities. But these capabilities can be enabled independently (except PRI requires ATS support). Hence, replace the iommu_v2 variable with a flags variable, which keep track of the device capabilities. From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability with all VFs"), device PRI/PASID is shared between PF and any associated VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of pci_find_ext_capability() to check device PRI/PASID support. 
Signed-off-by: Vasant Hegde Reviewed-by: Jason Gunthorpe Reviewed-by: Jerry Snitselaar Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.he...@amd.com Signed-off-by: Joerg Roedel drivers/iommu/amd/amd_iommu_types.h | 3 ++- drivers/iommu/amd/iommu.c | 46 ++--- 2 files changed, 30 insertions(+), 19 deletions(-) > git bisect log git bisect start '--term-new=fixed' '--term-old=unfixed' # status: waiting for both good and bad commits # fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4 git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a # status: waiting for good commit(s), bad commit known # unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6 git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa # unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76 # unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP bus crash on early config change callback invocation git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347 # unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag 'media/v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff # fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag 'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0 # fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag 'iommu-updates-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f # unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag 'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000 # unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag 'exfat-for-6.7-rc1-part2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e # unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc # fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove DMA_FQ type from domain allocation path git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5 # unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd: Rename ats related variables git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744 # fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove iommu_v2 module git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b # fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd: Introduce iommu_dev_data.flags to track device capabilities git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 # unfixed: [739eb25514c90aa
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Tue, Feb 28, 2023 at 5:43 PM Christian König wrote: > > The point is it doesn't need to talk to the amdgpu hardware. What it > does is that it talks to the good old VGA/VESA emulation and that just > happens to be still enabled by the BIOS/GRUB. > > And that VGA/VESA emulation doesn't need any BAR or whatever to keep the > hw running in the state where it was initialized before the kernel > started. The kernel just grabs the addresses where it needs to write the > display data and keeps going with that. > > But when a hw specific driver wants to load this is the first thing > which gets disabled because we need to load new firmware. And with the > BARs disabled this can't be re-enabled without rebooting the system. > > > My suggestion is that if > > amdgpu fails to talk to the hardware, then let another suitable driver > > do it. I attached a system log when I apply "pci=nocrs" with > > "modprobe.blacklist=amdgpu" for showing that graphics work right in > > this case. > > To do this, does the Linux module loading mechanism need to be refined? > > That's actually working as expected. The real problem is that the BIOS > on that system is so broken that we can't access the hw correctly. > > What we could to do is to check the BARs very early on and refuse to > load when they are disable. The problem with this approach is that there > are systems where it is normal that the BARs are disable until the > driver loads and get enabled during the hardware initialization process. > > What you might want to look into is to find a quirk for the BIOS to > properly enable the nvme controller. > That's interesting. I noticed that amdgpu now works even with the [pci=nocrs] parameter on 6.7.0-0.rc4 and later kernels. Does that mean the BARs became available? I have attached the kernel log and lspci output here. What changed? -- Best Regards, Mike Gavrilov.
Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6
On Thu, Nov 16, 2023 at 11:56 PM Alex Deucher wrote: > > This patch should address the issue: > https://patchwork.freedesktop.org/patch/567101/ > If you still see issues, you may also need this series: > https://patchwork.freedesktop.org/series/126220/ > > Alex Thanks. The first patch alone is enough. Tested-on: 7900XTX, 6900XT and 6800M. Tested-by: Mikhail Gavrilov -- Best Regards, Mike Gavrilov.
Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6
On Wed, Nov 8, 2023 at 12:12 AM Alex Deucher wrote: > > The attached patch should fix it. Not sure why your GPU shows up as > busy. The AGP aperture was just disabled. Tested-by: Mikhail Gavrilov Thanks, after applying the patch the GPU load is back to the expected level. Games are working, so overall everything looks good for now. -- Best Regards, Mike Gavrilov.
Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6
On Mon, Nov 6, 2023 at 8:29 PM Alex Deucher wrote: > > Already fixed in this commit: > https://gitlab.freedesktop.org/agd5f/linux/-/commit/d1d4c0b7b65b7fab2bc6f97af9e823b1c42ccdb0 > Which is in included in last weeks PR. > Thanks, it fixed the issue above. But unfortunately this is not the only problem I see on my laptop. Now I am observing 100% GPU load all the time, as shown in this screenshot: https://postimg.cc/QHLQncMg Another bisect round says that this commit is to blame: ❯ git bisect good de59b69932e64d77445d973a101d81d6e7e670c6 is the first bad commit commit de59b69932e64d77445d973a101d81d6e7e670c6 Author: Alex Deucher Date: Wed Sep 20 13:27:58 2023 -0400 drm/amdgpu/gmc: set a default disable value for AGP To disable AGP, the start needs to be set to a higher value than the end. Set a default disable value for the AGP aperture and allow the IP specific GMC code to enable it selectively be calling amdgpu_gmc_agp_location(). Reviewed-by: Christian König Signed-off-by: Alex Deucher drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 27 --- drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 2 ++ drivers/gpu/drm/amd/amdgpu/amdgpu_object.c| 3 +++ drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c| 3 ++- drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c| 3 ++- drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c | 4 ++-- drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 3 ++- drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 2 +- 10 files changed, 37 insertions(+), 18 deletions(-) I checked twice and made sure that it does not happen on commit 29495d81457a483c2859ccde59cc063034bfe47d -- Best Regards, Mike Gavrilov.
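The rule quoted in the commit message above (an aperture is treated as disabled when its start is set higher than its end) can be illustrated with a minimal Python sketch. The class name, field names, the 48-bit address width, and the example addresses are all hypothetical and are not taken from the amdgpu driver; only the "start > end means disabled" convention comes from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Aperture:
    """Hypothetical model of an address aperture with inclusive bounds."""
    start: int
    end: int

    def is_enabled(self) -> bool:
        # Convention from the commit message: start > end means "disabled".
        return self.start <= self.end

    def size(self) -> int:
        # A disabled aperture covers no addresses at all.
        return self.end - self.start + 1 if self.is_enabled() else 0


def disabled_aperture(address_bits: int = 48) -> Aperture:
    # Illustrative "default disable value": start at the highest address
    # and end at 0, so that start > end. The 48-bit width is an assumption.
    return Aperture(start=(1 << address_bits) - 1, end=0)


if __name__ == "__main__":
    default = disabled_aperture()
    print(default.is_enabled(), default.size())

    enabled = Aperture(start=0x1000, end=0x1FFF)
    print(enabled.is_enabled(), hex(enabled.size()))
```

With a default like this, any code that consults the aperture sees an empty range unless something has explicitly given it valid bounds, which is roughly the behaviour the commit message describes for the IP-specific GMC code.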
Re: [PATCH] drm/ttm: check null pointer before accessing when swapping
On Thu, Jul 27, 2023 at 12:33 PM Chen, Guchun wrote: > > Reviewed-by: Christian König > > > > Has this already been pushed to drm-misc-next? > > > > Thanks, > > Christian. > > Not yet, Christian, as I don't have push permission. I saw you were on > vacation, so I would expect to ping you to push after you are back with full > recharge. I expect to see it in drm-fixes-6.5 because the problem appeared during the 6.5 release cycle. And yes, I follow all pull requests. This patch was not included in yesterday's pull request :( -- Best Regards, Mike Gavrilov.
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov wrote: > > Important don't give up. > https://youtu.be/25zhHBGIHJ8 [40 min] > https://youtu.be/utnDR26eYBY [50 min] > https://youtu.be/DJQ_tiimW6g [12 min] > https://youtu.be/Y6AH1oJKivA [6 min] > Yes the issue is everything reproducible, but time to time it not > happens at first attempt. > I also uploaded other videos which proves that the issue definitely > exists if someone will launch those games in turn. > Reproducibility is only a matter of time. > > Anyway I didn't want you to spend so much time trying to reproduce it. > This monkey business fits me more than you. > It would be better if I could collect more useful info. Christian, Did you manage to reproduce the problem? At the weekend I faced with slab-use-after-free in amdgpu_vm_handle_moved. I didn't play in the games at this time. The Xwayland process was affected so it leads to desktop hang. == BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] Read of size 8 at addr 888295c66190 by task Xwayland:cs0/173185 CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: GWL --- --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 Call Trace: dump_stack_lvl+0x76/0xd0 print_report+0xcf/0x670 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] kasan_report+0xa8/0xe0 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu] ? __pfx___lock_acquire+0x10/0x10 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? mark_lock+0x101/0x16e0 ? __lock_acquire+0xe54/0x59f0 ? __pfx_lock_release+0x10/0x10 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] drm_ioctl_kernel+0x1fc/0x3d0 ? __pfx_drm_ioctl_kernel+0x10/0x10 drm_ioctl+0x4c5/0xaa0 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? __pfx_drm_ioctl+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x66/0x80 ? 
lockdep_hardirqs_on+0x81/0x110 ? _raw_spin_unlock_irqrestore+0x4f/0x80 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 ? do_syscall_64+0x6c/0x90 ? lockdep_hardirqs_on+0x81/0x110 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7ffb71b0892d Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:7ffb677fe840 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7ffb677fe9f8 RCX: 7ffb71b0892d RDX: 7ffb677fe900 RSI: c0186444 RDI: 000d RBP: 7ffb677fe890 R08: 7ffb677fea50 R09: 7ffb677fe8e0 R10: 556c4611bec0 R11: 0246 R12: 7ffb677fe900 R13: c0186444 R14: 000d R15: 7ffb677fe9f8 Allocated by task 173181: kasan_save_stack+0x33/0x60 kasan_set_track+0x25/0x30 __kasan_kmalloc+0x8f/0xa0 __kmalloc_node+0x65/0x160 amdgpu_bo_create+0x31e/0xfb0 [amdgpu] amdgpu_bo_create_user+0xca/0x160 [amdgpu] amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu] drm_ioctl_kernel+0x1fc/0x3d0 drm_ioctl+0x4c5/0xaa0 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Freed by task 173185: kasan_save_stack+0x33/0x60 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2e/0x50 __kasan_slab_free+0x10b/0x1a0 slab_free_freelist_hook+0x11e/0x1d0 __kmem_cache_free+0xc0/0x2e0 ttm_bo_release+0x667/0x9e0 [ttm] amdgpu_bo_unref+0x35/0x70 [amdgpu] amdgpu_gem_object_free+0x73/0xb0 [amdgpu] drm_gem_handle_delete+0xe3/0x150 drm_ioctl_kernel+0x1fc/0x3d0 drm_ioctl+0x4c5/0xaa0 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Last potentially related work creation: kasan_save_stack+0x33/0x60 
__kasan_record_aux_stack+0x97/0xb0 __call_rcu_common.constprop.0+0xf8/0x1af0 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched] dma_resv_reserve_fences+0x4dc/0x7f0 ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm] amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu] drm_ioctl_kernel+0x1fc/0x3d0 drm_ioctl+0x4c5/0xaa0 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu] __x64_sys_ioctl+0x131/0x1a0 do_syscall_64+0x60/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Second to last potentially related work creation: kasan_save_stack+0x33/0x60 __kasan_record_aux_stack+0x97/0xb0 __call_rcu_common.constprop.0+0xf8/0x1af0 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched] amdgpu_ctx_add
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Thu, Apr 20, 2023 at 2:59 PM Christian König wrote: > Could you try drm-misc-next as well? If, as I assume, I cloned the right repo $ git clone -b drm-misc-next git://anongit.freedesktop.org/drm/drm-misc linux-drm-misc-next then on my hardware the last commit on this branch turned out to be completely broken. Instead of the GDM login screen I see a black screen and hear the GPU fans howling. In the kernel logs I see a general protection fault: general protection fault, probably for non-canonical address 0xdc2b: [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0158-0x015f] CPU: 0 PID: 749 Comm: sdma0 Tainted: GWL 6.3.0-rc4-misc-next-91c249b2b9f6a80c744387b6713adf275ffd296b+ #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched] Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8 RSP: 0018:c9000548fdb8 EFLAGS: 00010216 RAX: dc00 RBX: RCX: RDX: 002b RSI: 0004 RDI: 0158 RBP: 085c R08: R09: 888170711783 R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0 R13: 888170711780 R14: 888266f89820 R15: 888266f89808 FS: () GS:888fa200() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 560cea4a8000 CR3: 000191602000 CR4: 00350ef0 Call Trace: drm_sched_main+0xc3/0x930 [gpu_sched] ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] ? __pfx_autoremove_wake_function+0x10/0x10 ? __kthread_parkme+0xc1/0x1f0 ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] kthread+0x2a2/0x340 ? 
__pfx_kthread+0x10/0x10 ret_from_fork+0x2c/0x50 Modules linked in: amdgpu(+) drm_ttm_helper ttm video crct10dif_pclmul drm_suballoc_helper crc32_pclmul iommu_v2 crc32c_intel drm_buddy polyval_clmulni gpu_sched polyval_generic ucsi_ccg drm_display_helper typec_ucsi nvme ghash_clmulni_intel igb typec ccp sha512_ssse3 cec nvme_core sp5100_tco dca i2c_algo_bit nvme_common wmi ip6_tables ip_tables fuse ---[ end trace ]--- RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched] Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8 RSP: 0018:c9000548fdb8 EFLAGS: 00010216 RAX: dc00 RBX: RCX: RDX: 002b RSI: 0004 RDI: 0158 RBP: 085c R08: R09: 888170711783 R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0 R13: 888170711780 R14: 888266f89820 R15: 888266f89808 FS: () GS:888fa200() knlGS: I also attached a full system log. -- Best Regards, Mike Gavrilov. system-log.tar.xz Description: application/xz
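The link between the register values in the report above (RDI ending in 0158, RDX ending in 002b) and the "null-ptr-deref in range" line can be checked by hand. Generic KASAN maps every 8 bytes of memory to one shadow byte at (addr >> 3) + offset. The sketch below assumes the usual x86_64 shadow offset 0xdffffc0000000000; that value is consistent with the `48 b8 00 00 00 00 00 fc ff df` bytes in the Code: dump but is not printed in full anywhere in the log, because the archive has stripped leading digits from the addresses. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1
KASAN_SHADOW_SCALE_SHIFT = 3                 # one shadow byte per 8 bytes
KASAN_SHADOW_OFFSET = 0xDFFFFC0000000000     # assumed x86_64 value


def shadow_address(addr: int) -> int:
    """Shadow byte address that generic KASAN would check for `addr`."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET) & MASK64


def covered_range(shadow: int) -> tuple[int, int]:
    """Inclusive range of original addresses described by one shadow byte."""
    base = (((shadow - KASAN_SHADOW_OFFSET) & MASK64)
            << KASAN_SHADOW_SCALE_SHIFT) & MASK64
    return base, base + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1


if __name__ == "__main__":
    # RDI in the report holds the address being accessed (offset 0x158
    # from a NULL base pointer).
    fault = shadow_address(0x158)
    print(hex(fault))                         # 0xdffffc000000002b
    low, high = covered_range(fault)
    print(hex(low), hex(high))                # 0x158 0x15f
```

Under that assumption, the non-canonical address in the general protection fault is the shadow lookup for offset 0x158 from a NULL pointer, which is why the kernel annotates it as a null-ptr-deref in the range ending in 0158 to 015f.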
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Thu, Apr 20, 2023 at 2:59 PM Christian König wrote: > > Could you try drm-misc-next as well? > > Going to give drm-fixes another round of testing. > > Thanks, > Christian. The important thing is not to give up. https://youtu.be/25zhHBGIHJ8 [40 min] https://youtu.be/utnDR26eYBY [50 min] https://youtu.be/DJQ_tiimW6g [12 min] https://youtu.be/Y6AH1oJKivA [6 min] Yes, the issue is always reproducible, but from time to time it does not happen on the first attempt. I also uploaded other videos which prove that the issue definitely exists if someone launches those games in turn. Reproducibility is only a matter of time. Anyway, I didn't want you to spend so much time trying to reproduce it. This monkey business suits me more than you. It would be better if I could collect more useful info. -- Best Regards, Mike Gavrilov.
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Wed, Apr 19, 2023 at 1:12 PM Christian König wrote: > > I'm already looking into this, but can't figure out why we run into > problems here. > > What happens is that a CS is aborted without sending the job to the > scheduler and in this case the cleanup function doesn't seem to work. > > Christian. I can easily reproduce it on any AMD GPU hardware. You can add more debug logging, and I will come back with new logs that explain this. Thanks. -- Best Regards, Mike Gavrilov.
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
Christian? ❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line /lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug drm_sched_job_cleanup+0x9a drm_sched_job_cleanup+0x9a/0x130: drm_sched_job_cleanup at /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808 (discriminator 3) ❯ cat -s -n /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c | head -818 | tail -20 799 /* drm_sched_job_arm() has been called */ 800 dma_fence_put(&job->s_fence->finished); 801 } else { 802 /* aborted job before committing to run it */ 803 drm_sched_fence_free(job->s_fence); 804 } 805 806 job->s_fence = NULL; 807 808 xa_for_each(&job->dependencies, index, fence) { 809 dma_fence_put(fence); 810 } 811 xa_destroy(&job->dependencies); 812 813 } 814 EXPORT_SYMBOL(drm_sched_job_cleanup); 815 816 /** 817 * drm_sched_ready - is the scheduler ready 818 * > git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819 dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 800) dma_fence_put(&job->s_fence->finished); dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 801) } else { dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 802) /* aborted job before committing to run it */ d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c(Boris Brezillon 2021-09-03 14:05:54 +0200 803) drm_sched_fence_free(job->s_fence); dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 804) } dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-17 10:49:16 +0200 805) 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 806) job->s_fence = NULL; ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 807) 
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 808) xa_for_each(&job->dependencies, index, fence) { ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 809) dma_fence_put(fence); ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 810) } ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 811) xa_destroy(&job->dependencies); ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel Vetter 2021-08-05 12:46:49 +0200 812) 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 813) } 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 814) EXPORT_SYMBOL(drm_sched_job_cleanup); 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat Masetty 2018-10-29 15:02:28 +0530 815) e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian König 2015-08-20 17:01:01 +0200 816) /** 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 817) * drm_sched_ready - is the scheduler ready 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 818) * 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 819) * @sched: scheduler instance Daniel, because Christian, looks a little busy. Can you help? The git blame says that you are the author of code which KASAN mentions in its report. The issue is reproducible on all available AMD hardware: 6800M, 6900XT, 7900XTX. -- Best Regards, Mike Gavrilov.
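The `symbol+0xoffset/0xsize [module]` strings passed to faddr2line above appear in every call trace in this thread. When collecting them from many logs, a small parser helps. The sketch below is a hypothetical helper, not part of any kernel tooling, and it assumes trace entries have already been split into individual strings.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str
    offset: int
    size: int
    module: Optional[str]   # None for symbols built into the kernel image
    reliable: bool          # False when the entry is prefixed with "?"


# Matches e.g. "? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]".
_FRAME_RE = re.compile(
    r"^(?P<guess>\?\s*)?"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>\w+)\])?$"
)


def parse_frame(entry: str) -> Optional[Frame]:
    """Parse one call-trace entry; return None if it does not match."""
    match = _FRAME_RE.match(entry.strip())
    if match is None:
        return None
    return Frame(
        symbol=match["symbol"],
        offset=int(match["offset"], 16),
        size=int(match["size"], 16),
        module=match["module"],
        reliable=match["guess"] is None,
    )


if __name__ == "__main__":
    print(parse_frame("drm_sched_job_cleanup+0x96/0x290 [gpu_sched]"))
    print(parse_frame("? __pfx_drm_ioctl+0x10/0x10"))
```

The `symbol+offset` part of a parsed frame is what faddr2line expects as its address argument, alongside the matching debug object for the module.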
Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
On Tue, Apr 11, 2023 at 10:40 PM Mikhail Gavrilov wrote: > > Hi, > KASAN continues to find problems in the drm_sched_job_cleanup code at 6.3rc6. > I not got any feedback in the thread > https://lore.kernel.org/lkml/cabxgcsmvub2ra4d+k5cna0_2521tox++d4nmoukki4x2-q_...@mail.gmail.com/ > Therefore, I decided to start a separate thread. Since the problems > are different, the symptoms are also different. > > Reproduction scenario. > After launching one of the listed games: > - Cyberpunk 2077 > - Forza Horizon 4 > - Forza Horizon 5 > - Sackboy: A Big Adventure > > Firstly after some time (may be after several attempts) appears bug > message from KASAN: > == > BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched] > Read of size 4 at addr 0078 by task ForzaHorizon4.e/31587 > > CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GWL > --- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1 > Hardware name: System manufacturer System Product Name/ROG STRIX > X570-I GAMING, BIOS 4601 02/02/2023 > Call Trace: > > dump_stack_lvl+0x72/0xc0 > kasan_report+0xa4/0xe0 > ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched] > kasan_check_range+0x104/0x1b0 > drm_sched_job_cleanup+0x96/0x290 [gpu_sched] > ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched] > ? slab_free_freelist_hook+0x11e/0x1d0 > ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu] > amdgpu_job_free+0x40/0x1b0 [amdgpu] > amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu] > ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu] > amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu] > ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] > ? __kmem_cache_free+0xbc/0x2e0 > ? mark_lock+0x101/0x16e0 > ? __lock_acquire+0xe54/0x59f0 > ? kasan_save_stack+0x3f/0x50 > ? __pfx_lock_release+0x10/0x10 > ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] > drm_ioctl_kernel+0x1f8/0x3d0 > ? __pfx_drm_ioctl_kernel+0x10/0x10 > drm_ioctl+0x4c1/0xaa0 > ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] > ? __pfx_drm_ioctl+0x10/0x10 > ? _raw_spin_unlock_irqrestore+0x62/0x80 > ? 
lockdep_hardirqs_on+0x7d/0x100 > ? _raw_spin_unlock_irqrestore+0x4b/0x80 > amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] > __x64_sys_ioctl+0x12d/0x1a0 > do_syscall_64+0x5c/0x90 > ? do_syscall_64+0x68/0x90 > ? lockdep_hardirqs_on+0x7d/0x100 > ? do_syscall_64+0x68/0x90 > ? do_syscall_64+0x68/0x90 > ? lockdep_hardirqs_on+0x7d/0x100 > ? do_syscall_64+0x68/0x90 > ? asm_exc_page_fault+0x22/0x30 > ? lockdep_hardirqs_on+0x7d/0x100 > entry_SYSCALL_64_after_hwframe+0x72/0xdc > RIP: 0033:0x7fb8a270881d > Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 > 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 > 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 > RSP: 002b:467ad060 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: 467ad358 RCX: 7fb8a270881d > RDX: 467ad140 RSI: c0186444 RDI: 005a > RBP: 467ad0b0 R08: 7fb7f00d3eb0 R09: 467ad100 > R10: 7fb88c68fb20 R11: 0246 R12: 467ad140 > R13: c0186444 R14: 005a R15: 7fb7f00d3e50 > > == > > Finally it ends up with the games listed above stopping working they > stuck after a kernel warning: > general protection fault, probably for non-canonical address > 0xdc0f: [#1] PREEMPT SMP KASAN NOPTI > KASAN: null-ptr-deref in range [0x0078-0x007f] > CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GB WL > --- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1 > Hardware name: System manufacturer System Product Name/ROG STRIX > X570-I GAMING, BIOS 4601 02/02/2023 > RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched] > Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8 > ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 > 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a > RSP: 0018:c9003676f5a8 EFLAGS: 00010216 > RAX: dc00 RBX: 88816f81f020 RCX: 0001 > RDX: 000f RSI: 0008 RDI: 9053e5e0 > RBP: 88816f81f000 R08: 0001 R09: 9053e5e7 > R10: fbfff20a7cbc R11: 6e696c6261736944 R12: 0078 > R13: 192006cedeb5 R14: R15: c9003676f870 > FS: 4680f6c0() GS:888fa5c0() knlGS:2991 > 
CS: 0010 DS: ES: CR0: 80050033 > CR2: 7fb854d6f010 CR3: 00017b2d6000 CR4: 00350ee0 > Call Trace
Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
On Fri, Mar 24, 2023 at 7:37 PM Christian König wrote: > > Yeah, that one > > Thanks for the info, looks like this isn't fixed. > > Christian. > Hi, glad to see that "BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0" was fixed in 6.3-rc5. For the record, it would be good to know which commit fixed this issue. I was waiting for this moment because I know of another issue which was also found by the KASAN sanitizer. BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched] Read of size 4 at addr 0078 by task GameThread/23915 CPU: 10 PID: 23915 Comm: GameThread Tainted: GWL --- --- 6.3.0-0.rc5.42.fc39.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 Call Trace: dump_stack_lvl+0x72/0xc0 kasan_report+0xa4/0xe0 ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched] kasan_check_range+0x104/0x1b0 drm_sched_job_cleanup+0x96/0x290 [gpu_sched] ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched] ? slab_free_freelist_hook+0x11e/0x1d0 ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu] amdgpu_job_free+0x40/0x1b0 [amdgpu] amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu] ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu] amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu] ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? mark_lock+0x101/0x16e0 ? __lock_acquire+0xe54/0x59f0 ? __pfx_lock_release+0x10/0x10 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] drm_ioctl_kernel+0x1f8/0x3d0 ? __pfx_drm_ioctl_kernel+0x10/0x10 drm_ioctl+0x4c1/0xaa0 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? __pfx_drm_ioctl+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x62/0x80 ? lockdep_hardirqs_on+0x7d/0x100 ? _raw_spin_unlock_irqrestore+0x4b/0x80 amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] __x64_sys_ioctl+0x12d/0x1a0 do_syscall_64+0x5c/0x90 ? do_syscall_64+0x68/0x90 ? lockdep_hardirqs_on+0x7d/0x100 ? do_syscall_64+0x68/0x90 ? do_syscall_64+0x68/0x90 ? lockdep_hardirqs_on+0x7d/0x100 ? do_syscall_64+0x68/0x90 ? do_syscall_64+0x68/0x90 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fe97a50881d Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 RSP: 002b:7c35d3f0 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7c35d6e8 RCX: 7fe97a50881d RDX: 7c35d4d0 RSI: c0186444 RDI: 00ae RBP: 7c35d440 R08: 7fe8fc0f0970 R09: 7c35d490 R10: 7fb79000 R11: 0246 R12: 7c35d4d0 R13: c0186444 R14: 00ae R15: 7fe8fc0f0900 I know at least 3 games which 100% triggering this bug: - Cyberpunk 2077 - Forza Horizon 4 - Forza Horizon 5 We would continue to discuss it here or better create a new thread (for someone who is also faced with this issue could easily find a solution on the internet)? A full kernel log as usual attached here. -- Best Regards, Mike Gavrilov. dmesg.tar.xz Description: application/xz
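Both KASAN reports in this thread start with a line of the same shape ("BUG: KASAN: <kind> in <function>+<offset>/<size> [<module>]"), so they can be picked out of a long dmesg mechanically. The following is only a minimal Python sketch under that assumption; the regular expression and the helper name find_kasan_reports are illustrative and are not part of any kernel tooling.

```python
import re

# First line of a KASAN report, e.g.
# "BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]"
# The trailing "[module]" part is optional (built-in code has none).
KASAN_RE = re.compile(
    r"BUG: KASAN: (?P<kind>[\w-]+) in (?P<func>[\w.]+)"
    r"\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)\])?"
)


def find_kasan_reports(log_text):
    """Return one dict per KASAN report header found in log_text."""
    return [m.groupdict() for m in KASAN_RE.finditer(log_text)]


# The two report headers quoted in this thread.
sample = (
    "BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]\n"
    "[ 1984.295876] BUG: KASAN: slab-use-after-free in "
    "drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]\n"
)

reports = find_kasan_reports(sample)
for report in reports:
    print(report["kind"], report["func"], report["module"])
```

Grouping by the (kind, func) pair is one way to tell whether a new splat is the already-known use-after-free or a separate bug such as the null-ptr-deref above.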
Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
On Tue, Mar 21, 2023 at 11:47 PM Christian König wrote: > > Hi Mikhail, > > That looks like a reference counting issue to me. > > I'm going to take a look, but we have already fixed one of those recently. > > Probably best that you try this on drm-fixes, just to double check that > this isn't the same issue. > Hi Christian, you meant this branch? $ git clone -b drm-fixes git://anongit.freedesktop.org/drm/drm linux-drm If yes I just checked and unfortunately see this issue unfixed there. [ 1984.295833] == [ 1984.295876] BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.295898] Read of size 8 at addr 88814cadc4c0 by task sdma1/764 [ 1984.295924] CPU: 12 PID: 764 Comm: sdma1 Tainted: GWL 6.3.0-rc3-drm-fixes+ #1 [ 1984.295937] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023 [ 1984.295951] Call Trace: [ 1984.295963] [ 1984.295975] dump_stack_lvl+0x72/0xc0 [ 1984.295991] print_report+0xcf/0x670 [ 1984.296007] ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296030] ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296047] kasan_report+0xa4/0xe0 [ 1984.296118] ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296149] drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched] [ 1984.296175] drm_sched_main+0x643/0x990 [gpu_sched] [ 1984.296204] ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] [ 1984.296222] ? __pfx_autoremove_wake_function+0x10/0x10 [ 1984.296290] ? __kthread_parkme+0xc1/0x1f0 [ 1984.296304] ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched] [ 1984.296321] kthread+0x29e/0x340 [ 1984.296334] ? 
__pfx_kthread+0x10/0x10 [ 1984.296501] ret_from_fork+0x2c/0x50 [ 1984.296518] [ 1984.296539] Allocated by task 12194: [ 1984.296552] kasan_save_stack+0x2f/0x50 [ 1984.296566] kasan_set_track+0x21/0x30 [ 1984.296578] __kasan_kmalloc+0x8b/0x90 [ 1984.296590] amdgpu_driver_open_kms+0x10b/0x5a0 [amdgpu] [ 1984.297051] drm_file_alloc+0x46e/0x880 [ 1984.297064] drm_open_helper+0x161/0x460 [ 1984.297076] drm_open+0x1e7/0x5c0 [ 1984.297089] drm_stub_open+0x24d/0x400 [ 1984.297107] chrdev_open+0x215/0x620 [ 1984.297125] do_dentry_open+0x5f1/0x1000 [ 1984.297146] path_openat+0x1b3d/0x28a0 [ 1984.297164] do_filp_open+0x1bd/0x400 [ 1984.297180] do_sys_openat2+0x140/0x420 [ 1984.297197] __x64_sys_openat+0x11f/0x1d0 [ 1984.297213] do_syscall_64+0x5b/0x80 [ 1984.297231] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.297266] Freed by task 12195: [ 1984.297284] kasan_save_stack+0x2f/0x50 [ 1984.297303] kasan_set_track+0x21/0x30 [ 1984.297323] kasan_save_free_info+0x2a/0x50 [ 1984.297343] __kasan_slab_free+0x107/0x1a0 [ 1984.297361] slab_free_freelist_hook+0x11e/0x1d0 [ 1984.297373] __kmem_cache_free+0xbc/0x2e0 [ 1984.297385] amdgpu_driver_postclose_kms+0x582/0x8d0 [amdgpu] [ 1984.297821] drm_file_free.part.0+0x638/0xb70 [ 1984.297834] drm_release+0x1ea/0x470 [ 1984.297845] __fput+0x213/0x9e0 [ 1984.297857] task_work_run+0x11b/0x200 [ 1984.297869] exit_to_user_mode_prepare+0x23a/0x260 [ 1984.297883] syscall_exit_to_user_mode+0x16/0x50 [ 1984.297896] do_syscall_64+0x67/0x80 [ 1984.297907] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.298033] Last potentially related work creation: [ 1984.298044] kasan_save_stack+0x2f/0x50 [ 1984.298057] __kasan_record_aux_stack+0x97/0xb0 [ 1984.298075] __call_rcu_common.constprop.0+0xf8/0x1af0 [ 1984.298095] amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu] [ 1984.298557] amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu] [ 1984.299055] amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu] [ 1984.299624] drm_ioctl_kernel+0x1f8/0x3d0 [ 1984.299637] drm_ioctl+0x4c1/0xaa0 [ 
1984.299649] amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] [ 1984.300083] __x64_sys_ioctl+0x12d/0x1a0 [ 1984.300097] do_syscall_64+0x5b/0x80 [ 1984.300109] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.300135] Second to last potentially related work creation: [ 1984.300149] kasan_save_stack+0x2f/0x50 [ 1984.300167] __kasan_record_aux_stack+0x97/0xb0 [ 1984.300185] __call_rcu_common.constprop.0+0xf8/0x1af0 [ 1984.300203] amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu] [ 1984.300692] amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu] [ 1984.301133] amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu] [ 1984.301577] drm_ioctl_kernel+0x1f8/0x3d0 [ 1984.301598] drm_ioctl+0x4c1/0xaa0 [ 1984.301610] amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu] [ 1984.302043] __x64_sys_ioctl+0x12d/0x1a0 [ 1984.302056] do_syscall_64+0x5b/0x80 [ 1984.302068] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 1984.302090] The buggy address belongs to the object at 88814cadc000 which belongs to the cache kmalloc-4k of size 4096 [ 1984.302103] The buggy address is located 1216 bytes inside of freed 4096-byte region [88814cadc000, 88814cadd000) [ 1984.302129] The buggy address belongs to the physical page: [ 1984.302141] page:
[6.3][regression] commit a4e771729a51168bc36317effaa9962e336d4f5e lead to flood kernel logs with warning messages "at kernel/workqueue.c:3167 __flush_work+0x472/0x500"
Hi,
I never ran into the drm_bridge_hpd_enable+0x94/0x9c [drm] issue myself, but the fix for it leads to warning messages on my laptop, an ASUS ROG Strix G15 Advantage Edition G513QY-HQ007, which has two AMD GPUs: a discrete Radeon 6800M and a Cezanne Vega 8 integrated into the CPU.

I found the bad commit by bisecting:

❯ git bisect bad
a4e771729a51168bc36317effaa9962e336d4f5e is the first bad commit
commit a4e771729a51168bc36317effaa9962e336d4f5e
Author: Dmitry Baryshkov
Date: Tue Jan 24 12:45:48 2023 +0200

drm/probe_helper: sort out poll_running vs poll_enabled

There are two flags attemting to guard connector polling: poll_enabled and poll_running. While poll_enabled semantics is clearly defined and fully adhered (mark that drm_kms_helper_poll_init() was called and not finalized by the _fini() call), the poll_running flag doesn't have such clearliness. This flag is used only in drm_helper_probe_single_connector_modes() to guard calling of drm_kms_helper_poll_enable, it doesn't guard the drm_kms_helper_poll_fini(), etc. Change it to only be set if the polling is actually running. Tie HPD enablement to this flag.
This fixes the following warning reported after merging the HPD series: Hot plug detection already enabled WARNING: CPU: 2 PID: 9 at drivers/gpu/drm/drm_bridge.c:1257 drm_bridge_hpd_enable+0x94/0x9c [drm] Modules linked in: videobuf2_memops snd_soc_simple_card snd_soc_simple_card_utils fsl_imx8_ddr_perf videobuf2_common snd_soc_imx_spdif adv7511 etnaviv imx8m_ddrc imx_dcss mc cec nwl_dsi gov CPU: 2 PID: 9 Comm: kworker/u8:0 Not tainted 6.2.0-rc2-15208-g25b283acd578 #6 Hardware name: NXP i.MX8MQ EVK (DT) Workqueue: events_unbound deferred_probe_work_func pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : drm_bridge_hpd_enable+0x94/0x9c [drm] lr : drm_bridge_hpd_enable+0x94/0x9c [drm] sp : 89ef3740 x29: 89ef3740 x28: 09331f00 x27: 1000 x26: 0020 x25: 81148ed8 x24: 0a8fe000 x23: fffd x22: 05086348 x21: 81133ee0 x20: 0550d800 x19: 05086288 x18: 0006 x17: x16: 896ef008 x15: 972891004260 x14: 2a1403e19400 x13: 972891004260 x12: 2a1403e19400 x11: 7100385f29400801 x10: 0aa0 x9 : 88112744 x8 : 00250b00 x7 : 0003 x6 : 0011 x5 : x4 : bd986a48 x3 : 0001 x2 : x1 : x0 : 0025 Call trace: drm_bridge_hpd_enable+0x94/0x9c [drm] drm_bridge_connector_enable_hpd+0x2c/0x3c [drm_kms_helper] drm_kms_helper_poll_enable+0x94/0x10c [drm_kms_helper] drm_helper_probe_single_connector_modes+0x1a8/0x510 [drm_kms_helper] drm_client_modeset_probe+0x204/0x1190 [drm] __drm_fb_helper_initial_config_and_unlock+0x5c/0x4a4 [drm_kms_helper] drm_fb_helper_initial_config+0x54/0x6c [drm_kms_helper] drm_fbdev_client_hotplug+0xd0/0x140 [drm_kms_helper] drm_fbdev_generic_setup+0x90/0x154 [drm_kms_helper] dcss_kms_attach+0x1c8/0x254 [imx_dcss] dcss_drv_platform_probe+0x90/0xfc [imx_dcss] platform_probe+0x70/0xcc really_probe+0xc4/0x2e0 __driver_probe_device+0x80/0xf0 driver_probe_device+0xe0/0x164 __device_attach_driver+0xc0/0x13c bus_for_each_drv+0x84/0xe0 __device_attach+0xa4/0x1a0 device_initial_probe+0x1c/0x30 bus_probe_device+0xa4/0xb0 deferred_probe_work_func+0x90/0xd0 
process_one_work+0x200/0x474 worker_thread+0x74/0x43c kthread+0xfc/0x110 ret_from_fork+0x10/0x20 ---[ end trace ]---

Reported-by: Laurentiu Palcu
Fixes: c8268795c9a9 ("drm/probe-helper: enable and disable HPD on connectors")
Tested-by: Marek Szyprowski
Tested-by: Chen-Yu Tsai
Acked-by: Laurentiu Palcu
Tested-by: Laurentiu Palcu
Tested-by: Laurent Pinchart
Signed-off-by: Dmitry Baryshkov
Signed-off-by: Neil Armstrong
Link: https://patchwork.freedesktop.org/patch/msgid/20230124104548.3234554-2-dmitry.barysh...@linaro.org
(cherry picked from commit d33a54e3991dfce88b4fc6d9c3360951c2c5660d)
Signed-off-by: Thomas Zimmermann

drivers/gpu/drm/drm_probe_helper.c | 42 +++---
1 file changed, 21 insertions(+), 21 deletions(-)

Of course, I checked the bisect result by reverting this commit, and I can confirm that without commit a4e771729a51168bc36317effaa9962e336d4f5e the warning messages did not appear during a full day of use. I have attached a full kernel log in case someone is interested in seeing it.

-- 
Best Regards, Mike Gavrilov.

git bisect start
# status: waiting for both good and bad commits
# good: [5b7c4cabbb65f5c469464da6c5f614cbd7f730f2] Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bis
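The bisect log attached to this message is cut off in the archive, but the comment lines that git writes into such a log ("# good: [<sha>] <subject>", "# bad: ...", "# first bad commit: ...") are regular enough to summarise with a few lines of Python. This is a rough sketch only: parse_bisect_log is a made-up helper name, and the "first bad commit" sample line is assembled from the hash and subject quoted in the email rather than copied from the attached log.

```python
import re

# Comment lines that "git bisect log" emits for each marked commit.
MARK_RE = re.compile(
    r"^# (good|bad|skip|first bad commit): \[([0-9a-f]{40})\] (.*)$"
)


def parse_bisect_log(text):
    """Return (mark, sha, subject) tuples in the order they appear."""
    marks = []
    for line in text.splitlines():
        m = MARK_RE.match(line.strip())
        if m:
            marks.append((m.group(1), m.group(2), m.group(3)))
    return marks


sample = """\
git bisect start
# status: waiting for both good and bad commits
# good: [5b7c4cabbb65f5c469464da6c5f614cbd7f730f2] Merge tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
# first bad commit: [a4e771729a51168bc36317effaa9962e336d4f5e] drm/probe_helper: sort out poll_running vs poll_enabled
"""

marks = parse_bisect_log(sample)
for mark, sha, subject in marks:
    print(f"{mark:>16}  {sha[:12]}  {subject}")
```

Lines that are not commit marks, such as the "# status:" comment and the "git bisect ..." commands themselves, are simply ignored.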
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Mon, Feb 27, 2023 at 3:22 PM Christian König wrote:
>
> Unfortunately yes. We could clean that up a bit more so that you don't
> run into a BUG() assertion, but what essentially happens here is that we
> completely fail to talk to the hardware.
>
> In this situation we can't even re-enable vesa or text console any more.
>

Then I don't understand why, when amdgpu is blacklisted via modprobe.blacklist=amdgpu, I still get graphics and can log in to GNOME. Yes, it is without hardware acceleration, but that is better than no working graphics at all. It means there is some other driver (I assume this is "video") which can successfully talk to the AMD hardware under conditions where amdgpu cannot.

My suggestion is that if amdgpu fails to talk to the hardware, another suitable driver should be allowed to take over. I have attached a system log taken with both "pci=nocrs" and "modprobe.blacklist=amdgpu" applied, to show that graphics work correctly in this case. Would the Linux module loading mechanism need to be refined to make this possible?

-- 
Best Regards, Mike Gavrilov.

system-without-amdgpu.tar.xz Description: application/xz
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Feb 24, 2023 at 8:31 PM Christian König wrote: > > Sorry I totally missed that you attached the full dmesg to your original > mail. > > Yeah, the driver did fail gracefully. But then X doesn't come up and > then gdm just dies. Are you sure that these messages should be present when the driver fails gracefully? turning off the locking correctness validator. CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L --- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 Call Trace: dump_stack_lvl+0x57/0x90 register_lock_class+0x47d/0x490 __lock_acquire+0x74/0x21f0 ? lock_release+0x155/0x450 lock_acquire+0xd2/0x320 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] ? lock_is_held_type+0xce/0x120 _raw_spin_lock_irqsave+0x4d/0xa0 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_irq_disable_all+0x37/0xf0 [amdgpu] amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu] amdgpu_driver_load_kms+0xe8/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? 
lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010 RBP: 7ffd1d106690 R08: 55b0b5a93bd0 R09: 016b6ff0 R10: 55b5eea2c333 R11: 0246 R12: 55b0b5a96670 R13: 0002 R14: 55b0b5a9c170 R15: 55b0b5aa58a0 amdgpu: probe of :03:00.0 failed with error -12 amdgpu :08:00.0: enabling device (0006 -> 0007) [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4). list_add corruption. prev->next should be next (c0940328), but was . (prev=8c9b734062b0). [ cut here ] kernel BUG at lib/list_debug.c:30! invalid opcode: [#1] PREEMPT SMP NOPTI CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L --- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1 Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022 RIP: 0010:__list_add_valid+0x74/0x90 Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d RSP: 0018:a50f81aafa00 EFLAGS: 00010246 RAX: 0075 RBX: 8c9b734062b0 RCX: RDX: RSI: 0027 RDI: RBP: 8c9b734062b0 R08: R09: a50f81aaf8a0 R10: 0003 R11: 8caa1d2fffe8 R12: 8c9b7c0a5e48 R13: R14: c13a6d20 R15: FS: 7fd58c6a5940() GS:8ca9d9a0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 55b0b5a955e0 CR3: 00017e86 CR4: 00750ee0 PKRU: 5554 Call Trace: ttm_device_init+0x184/0x1c0 [ttm] amdgpu_ttm_init+0xb8/0x610 [amdgpu] ? 
_printk+0x60/0x80 gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu] amdgpu_device_init+0x14e5/0x2520 [amdgpu] amdgpu_driver_load_kms+0x15/0x190 [amdgpu] amdgpu_pci_probe+0x140/0x420 [amdgpu] local_pci_probe+0x41/0x90 pci_device_probe+0xc3/0x230 really_probe+0x1b6/0x410 __driver_probe_device+0x78/0x170 driver_probe_device+0x1f/0x90 __driver_attach+0xd2/0x1c0 ? __pfx___driver_attach+0x10/0x10 bus_for_each_dev+0x8a/0xd0 bus_add_driver+0x141/0x230 driver_register+0x77/0x120 ? __pfx_init_module+0x10/0x10 [amdgpu] do_one_initcall+0x6e/0x350 do_init_module+0x4a/0x220 __do_sys_init_module+0x192/0x1c0 do_syscall_64+0x5b/0x80 ? asm_exc_page_fault+0x22/0x30 ? lockdep_hardirqs_on+0x7d/0x100 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7fd58cfcb1be Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48 RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010 RBP: 7f
Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
On Fri, Feb 24, 2023 at 12:13 PM Christian König wrote: > > Hi Mikhail, > > this is pretty clearly a problem with the system and/or it's BIOS and > not the GPU hw or the driver. > > The option pci=nocrs makes the kernel ignore additional resource windows > the BIOS reports through ACPI. This then most likely leads to problems > with amdgpu because it can't bring up its PCIe resources any more. > > The output of "sudo lspci - -s $BUSID_OF_AMDGPU" might help > understand the problem I attach both lspci for pci=nocrs and without pci=nocrs. The differences for Cezanne Radeon Vega Series: with pci=nocrs: Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Interrupt: pin A routed to IRQ 255 Region 4: I/O ports at e000 [disabled] [size=256] Capabilities: [c0] MSI-X: Enable- Count=4 Masked- Without pci=nocrs: Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Interrupt: pin A routed to IRQ 44 Region 4: I/O ports at e000 [size=256] Capabilities: [c0] MSI-X: Enable+ Count=4 Masked- The differences for Navi 22 Radeon 6800M: with pci=nocrs: Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Interrupt: pin A routed to IRQ 255 Region 0: Memory at f8 (64-bit, prefetchable) [disabled] [size=16G] Region 2: Memory at fc (64-bit, prefetchable) [disabled] [size=256M] Region 5: Memory at fca0 (32-bit, non-prefetchable) [disabled] [size=1M] AtomicOpsCtl: ReqEn- Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+ Address: Data: Without pci=nocrs: Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+ Latency: 0, Cache Line Size: 64 bytes Interrupt: pin A routed to IRQ 103 Region 0: Memory at f8 (64-bit, prefetchable) [size=16G] Region 2: Memory at fc (64-bit, prefetchable) [size=256M] Region 5: Memory at fca0 (32-bit, non-prefetchable) [size=1M] AtomicOpsCtl: ReqEn+ Capabilities: [a0] 
MSI: Enable+ Count=1/1 Maskable- 64bit+ Address: fee0 Data: > but I strongly suggest to try a BIOS update first. This is the first thing that was done. And I am afraid no more BIOS updates. https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/ I also have experience in dealing with manufacturers' tech support. Usually it ends with "we do not provide drivers for Linux". -- Best Regards, Mike Gavrilov. ❯ sudo lspci - -s 08:00.0 08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 [VGA controller]) Subsystem: ASUSTeK Computer Inc. Radeon Vega 8 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ SERR- Capabilities: [50] Power Management version 3 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+) Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ MaxPayload 256 bytes, MaxReadReq 512 bytes DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend- LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+ LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- LnkSta: Speed 8GT/s, Width x16 TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR- 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ EETLPPrefix+, MaxEETLPPrefixes 1 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- AtomicOpsCap: 32bit- 64bit- 
128bitCAS- DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, AtomicOpsCtl: ReqEn- LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS- LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
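The differences listed above come down to individual "+"/"-" flags on the lspci "Control:" line. A small Python sketch of how two such lines could be compared automatically is shown below; parse_flags and changed_flags are illustrative helper names, and the two input strings are the Cezanne "Control:" lines quoted in the message.

```python
def parse_flags(line):
    """Parse 'Control: I/O- Mem+ ...' into {'I/O': False, 'Mem': True, ...}."""
    _, _, rest = line.partition(":")
    flags = {}
    for token in rest.split():
        # Every flag token ends in '+' (set) or '-' (clear).
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def changed_flags(before, after):
    """Return {flag: (before_value, after_value)} for flags that differ."""
    fb, fa = parse_flags(before), parse_flags(after)
    return {
        name: (fb[name], fa[name])
        for name in fb
        if name in fa and fb[name] != fa[name]
    }


with_nocrs = (
    "Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- "
    "ParErr- Stepping- SERR- FastB2B- DisINTx-"
)
without_nocrs = (
    "Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- "
    "ParErr- Stepping- SERR- FastB2B- DisINTx+"
)

diff = changed_flags(with_nocrs, without_nocrs)
print(diff)
```

For the Cezanne lines this reports only I/O and DisINTx as changed, which matches the manual comparison in the message: with pci=nocrs, I/O decoding is off and the device is left on legacy interrupts (IRQ 255, MSI-X disabled).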
amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"
Hi,
I have an ASUS ROG Strix G15 Advantage Edition G513QY-HQ007 laptop, but it is impossible to use without AC power because the system loses the NVMe drive when I disconnect the power adapter. Messages from the kernel log when it happens:

nvme nvme0: controller is down; will reset: CSTS=0x, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug

I tried the recommended parameters (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve this issue, but without success. On the linux-nvme mailing list, the last advice was to try the "pci=nocrs" parameter. But with this parameter the amdgpu driver refuses to work and makes the system unbootable. I can work around the boot problem by blacklisting the driver, but that is not a good solution because I don't want to lose the GPU.

Why does amdgpu not work with "pci=nocrs"? And is it possible to resolve this incompatibility? It is very important, because when I boot the system with "pci=nocrs" and without the amdgpu driver, the NVMe drive is not lost when I disconnect the power adapter. So "pci=nocrs" really helps.

Below is what I see in the kernel log when I add the "pci=nocrs" parameter:

amdgpu :03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu :03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit amdgpu :03:00.0: amdgpu: VRAM: 12272M 0x0080 - 0x0082FEFF (12272M used) amdgpu :03:00.0: amdgpu: GART: 512M 0x - 0x1FFF amdgpu :03:00.0: amdgpu: AGP: 267894784M 0x0084 - 0x [drm] Detected VRAM RAM=12272M, BAR=16384M [drm] RAM width 192bits GDDR6 [drm] amdgpu: 12272M of VRAM memory ready [drm] amdgpu: 31774M of GTT memory ready. amdgpu :03:00.0: amdgpu: (-14) failed to allocate kernel bo [drm] Debug VRAM access will use slowpath MM access amdgpu :03:00.0: amdgpu: Failed to DMA MAP the dummy page [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block failed -12 amdgpu :03:00.0: amdgpu: amdgpu_device_ip_init failed amdgpu :03:00.0: amdgpu: Fatal error during GPU init amdgpu :03:00.0: amdgpu: amdgpu: finishing device. Of course a full system log is also attached. -- Best Regards, Mike Gavrilov. system-log-Fatal-error-during-GPU-init.tar.xz Description: application/xz
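The failure messages in this log carry negative errno values ("(-14) failed to allocate kernel bo", "sw_init of IP block failed -12"). A short Python sketch for turning them into symbolic names follows. It assumes the script runs on a Linux machine whose userspace errno numbering matches the kernel that produced the log; the regular expression covers only the three message shapes seen in this thread, and decode_errnos is an illustrative name.

```python
import errno
import os
import re

# Message shapes seen in this thread:
#   "(-14) failed to allocate kernel bo"
#   "sw_init of IP block failed -12"
#   "probe of :03:00.0 failed with error -12"
ERRNO_RE = re.compile(r"\(-(\d+)\)|failed(?: with error)? -(\d+)")


def decode_errnos(log_text):
    """Return (negative_code, symbolic_name) for each errno found."""
    found = []
    for m in ERRNO_RE.finditer(log_text):
        code = int(m.group(1) or m.group(2))
        found.append((-code, errno.errorcode.get(code, "UNKNOWN")))
    return found


sample = (
    "amdgpu :03:00.0: amdgpu: (-14) failed to allocate kernel bo\n"
    "[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block failed -12\n"
    "amdgpu: probe of :03:00.0 failed with error -12\n"
)

decoded = decode_errnos(sample)
for code, name in decoded:
    # os.strerror() gives the human-readable text for the positive code.
    print(code, name, os.strerror(-code))
```

Under that assumption, -14 decodes to EFAULT and -12 to ENOMEM, which is at least consistent with the driver being unable to set up its memory resources once the ACPI resource windows are ignored.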
Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
On Thu, Feb 9, 2023 at 10:17 PM Leo Li wrote:
>
> Hi Mikhail, seems like your report flew past me, thanks for the ping.
>
> This might be a simple issue of not backing off when deadlock was hit.
> drm_atomic_normalize_zpos() can return an error code, and I ignored it
> (oops!)
>
> Can you give this patch a try?
> https://gitlab.freedesktop.org/-/snippets/7414
>
> - Leo
>

Thanks, I think that was enough time for testing. I observed three computers with different GPUs (6800M, 6900XT and 7900XTX) for more than 3 days, and the warning message about drm_modeset_drop_locks no longer appears. I hope this patch can be merged into 6.2 before the release.

Tested-by: Mikhail Gavrilov

-- 
Best Regards, Mike Gavrilov.

uptime.tar.xz Description: application/xz
Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
Harry, please don't ignore me. This issue still happens in 6.1 and 6.2. Leo, you are the author of the problematic commit, so please don't stand aside. Is really nobody interested in clean logs without warnings and errors? I am 100% sure that reverting commit b261509952bc19d1012cf732f853659be6ebc61e will stop these warnings. I have also attached fresh logs from 6.2.0-0.rc6. For 6.2-rc7 I have started building the kernel without commit b261509952bc19d1012cf732f853659be6ebc61e to avoid these warnings.

On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov wrote: > > Hi! > I bisected an issue of the 6.0 kernel which started happening after > 6.0-rc7 on all my machines. > > Backtrace of this issue looks like as: > > [ 2807.339439] [ cut here ] > [ 2807.339445] WARNING: CPU: 11 PID: 2061 at > drivers/gpu/drm/drm_modeset_lock.c:276 > drm_modeset_drop_locks+0x63/0x70 > [ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy > snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast > nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet > nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat > nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink > qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir > snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof > snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel > snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core > mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd > mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x > snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm > mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm > snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass > snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr > asus_nb_wmi wmi_bmof > [ 2807.339519] snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc > asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi > 
crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel > gpu_sched sparse_keymap platform_profile hid_multitouch > polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic > drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp > typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video > i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse > [ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1 > acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 > acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 > amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1 > amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 > amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 > fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 > amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 > fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 > amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1 > pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1 > pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 > fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 > acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1 > acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 > acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 > pcc_cpufreq():1 fjes():1 > [ 2807.339579] acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 > acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 > acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 > acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1 > fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 > fjes():1 fjes():1 fjes():1 fjes():1 > [ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW >L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16 > [ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix > G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 > [ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70 > [ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48 > 89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc > cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 > 41 54 > [ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282 > [ 2807.339606] RAX: 0001 RBX: RCX: > 0002 > [ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: > b6ad46e07c00 > [ 2807.339608] RBP: b6ad46e07c00 R08: R09: > > [ 2807.339609] R10: R11: 0001 R12: > > [ 2807.339610]
[6.2][regression] looks like commit aab9cf7b6954136f4339136a1a7fc0602a2c4d8b leads to use-after-free and random computer hangs
Hi, The kernel 6.2 preparation cycle has begun. And after the kernel was updated on my Fedora Rawhide I started receiving use-after-free errors with complete computer hangs. At least a good reproducer of this behaviour is launch of the game "Marvel's Avengers". The backtrace of the issue looks like: [ 550.435083] [ cut here ] [ 550.435110] refcount_t: underflow; use-after-free. [ 550.435808] WARNING: CPU: 9 PID: 738 at lib/refcount.c:25 refcount_warn_saturate+0x97/0x110 [ 550.435812] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack [ 550.435887] refcount_t: saturated; leaking memory. [ 550.435893] nf_defrag_ipv6 nf_defrag_ipv4 [ 550.435902] WARNING: CPU: 26 PID: 5032 at lib/refcount.c:19 refcount_warn_saturate+0x74/0x110 [ 550.435907] ip_set [ 550.435909] Modules linked in: [ 550.435910] nf_tables [ 550.435912] uinput rfcomm [ 550.435918] nfnetlink [ 550.435919] snd_seq_dummy snd_hrtimer [ 550.435925] qrtr [ 550.435926] netconsole nft_objref [ 550.435931] bnep [ 550.435933] nf_conntrack_netbios_ns nf_conntrack_broadcast [ 550.435938] sunrpc [ 550.435939] nft_fib_inet [ 550.435941] binfmt_misc [ 550.435942] nft_fib_ipv4 [ 550.435943] iwlmvm [ 550.435130] WARNING: CPU: 25 PID: 740 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 550.435945] nft_fib_ipv6 [ 550.435946] btusb [ 550.435947] nft_fib nft_reject_inet [ 550.435954] btrtl [ 550.435955] nf_reject_ipv4 nf_reject_ipv6 [ 550.435963] btbcm [ 550.435964] nft_reject nft_ct [ 550.435969] btintel [ 550.435971] nft_chain_nat nf_nat [ 550.435977] btmtk [ 550.435979] nf_conntrack nf_defrag_ipv6 [ 550.435984] snd_seq_midi [ 550.435985] nf_defrag_ipv4 ip_set [ 550.435991] snd_seq_midi_event [ 550.435992] nf_tables [ 550.435993] bluetooth [ 550.435995] nfnetlink [ 550.435996] hid_logitech_hidpp [ 
550.435142] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc iwlmvm btusb btrtl btbcm btintel btmtk snd_seq_midi snd_seq_midi_event bluetooth hid_logitech_hidpp snd_usb_audio iwlwifi xpad ff_memless snd_usbmidi_lib snd_rawmidi mc ecdh_generic intel_rapl_msr intel_rapl_common mt76x2u mt76x2_common joydev snd_hda_codec_realtek mt76x02_usb edac_mce_amd snd_hda_codec_generic mt76_usb snd_hda_codec_hdmi mt76x02_lib kvm_amd snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec mt76 vfat kvm snd_hda_core fat snd_seq snd_hwdep irqbypass snd_seq_device mac80211 snd_pcm eeepc_wmi asus_wmi ledtrig_audio sparse_keymap rapl platform_profile wmi_bmof snd_timer snd pcspkr i2c_piix4 [ 550.435997] qrtr bnep [ 550.436003] snd_usb_audio [ 550.436004] sunrpc binfmt_misc [ 550.436010] iwlwifi [ 550.436012] iwlmvm btusb [ 550.436018] xpad [ 550.436019] btrtl btbcm [ 550.436025] ff_memless [ 550.436026] btintel [ 550.436027] snd_usbmidi_lib [ 550.436029] btmtk [ 550.436030] snd_rawmidi [ 550.436031] snd_seq_midi snd_seq_midi_event [ 550.436037] mc [ 550.436038] bluetooth [ 550.436039] ecdh_generic [ 550.436041] hid_logitech_hidpp snd_usb_audio [ 550.436046] intel_rapl_msr [ 550.436048] iwlwifi xpad [ 550.436054] intel_rapl_common [ 550.436055] ff_memless [ 550.436056] mt76x2u [ 550.436058] snd_usbmidi_lib snd_rawmidi [ 550.436063] mt76x2_common [ 550.436064] mc ecdh_generic [ 550.436070] joydev [ 550.436071] intel_rapl_msr intel_rapl_common [ 550.436076] snd_hda_codec_realtek [ 550.436078] mt76x2u [ 550.436079] mt76x02_usb [ 550.436080] mt76x2_common joydev [ 550.436086] edac_mce_amd [ 550.436088] snd_hda_codec_realtek mt76x02_usb [ 550.436094] snd_hda_codec_generic [ 
550.436095] edac_mce_amd [ 550.436096] mt76_usb [ 550.436098] snd_hda_codec_generic mt76_usb [ 550.436104] snd_hda_codec_hdmi [ 550.436106] snd_hda_codec_hdmi [ 550.436107] mt76x02_lib [ 550.435234] k10temp soundcore libarc4 acpi_cpufreq cfg80211 hid_logitech_dj rfkill zram amdgpu drm_ttm_helper ttm video iommu_v2 gpu_sched drm_buddy crct10dif_pclmul crc32_pclmul crc32c_intel igb ucsi_ccg drm_display_helper nvme typec_ucsi ghash_clmulni_intel ccp typec cec sp5100_tco dca sha512_ssse3 nvme_core wmi ip6_tables ip_tables fuse [ 550.436108] mt76x02_lib kvm_amd [ 550.436115] kvm_amd [ 550.436116] snd_hda_intel snd_intel_dspcfg [ 550.436122] snd_hda_intel [ 550.436123] snd_intel_sdw_acpi [ 550.435284] CPU: 25 PID: 740 Comm: sdma2 Tainted: GWL 6.1.0-rc1-13-aab9cf7b6954136f4339136a1a7fc0602a2c4d
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Tue, Nov 22, 2022 at 12:16 PM Christian König wrote:
>
> Ah, thanks a lot for this. I've already pushed the patches into our
> internal branch, but getting this confirmation is still great!
>
> This was quite some fundamental bug in the handling and I hope to get
> this completely reworked at some point since it is currently only mitigated.

It looks like the final version of this patch was successfully merged in 6.1-rc7.
Big thanks, all games work again!

> No idea what that could be. Modesetting is not something I work on.
>
> The best advice I can give you is to maybe ping Harry and our other
> display people, they should know that stuff better than I do.

Unfortunately, Harry didn't answer. I hope my email wasn't marked as spam.

--
Best Regards,
Mike Gavrilov.
Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov wrote:
>
> Hi!
> I bisected an issue of the 6.0 kernel which started happening after
> 6.0-rc7 on all my machines.
>
> Backtrace of this issue looks like as:
>
> [ 2807.339439] [ cut here ]
> [ 2807.339445] WARNING: CPU: 11 PID: 2061 at
> drivers/gpu/drm/drm_modeset_lock.c:276
> drm_modeset_drop_locks+0x63/0x70
>
> bisect points to this commit: b261509952bc19d1012cf732f853659be6ebc61e.
>
> After reverting this commit the WARNING messages described here disappeared.
>

Hi Harry,
Christian says that you may be able to help with this.

Thanks.

--
Best Regards,
Mike Gavrilov.
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Mon, Nov 14, 2022 at 6:22 PM Christian König wrote:
>
> I've found and fixed a few problems around the userptr handling which
> might explain what you see here.
>
> A series of four patches starting with "drm/amdgpu: always register an
> MMU notifier for userptr" is under review now.
>
> Going to give that a bit cleanup later today and will CC you when I send
> that out. Would be nice if you could give that some testing.
>
> Thanks,
> Christian.
>

Christian, I tested all four patches for about a week and can say that this issue is completely gone. All the known broken games work.

Tested-by: Mikhail Gavrilov

The only thing I don't like is the flood in the kernel logs of the message "WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70", but this is not related to the patches being checked.

All kernel logs are uploaded to pastebin [1][2][3][4][5][6][7][8].

I wrote a separate bug report about "drm_modeset_lock" [9]; it's a pity that no one paid attention to it. I even found the first bad commit. It is b261509952bc19d1012cf732f853659be6ebc61e.

[1] https://pastebin.com/WZWczupk
[2] https://pastebin.com/f4i9pvjS
[3] https://pastebin.com/rsDWaMR1
[4] https://pastebin.com/tDNEYJq0
[5] https://pastebin.com/xfZVbm1f
[6] https://pastebin.com/Vx9gDyKt
[7] https://pastebin.com/XvRkLckV
[8] https://pastebin.com/pd8WBkgx
[9] https://www.spinics.net/lists/dri-devel/msg367543.html

Thanks.

--
Best Regards,
Mike Gavrilov.
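To put a number on the warning flood described in the message above, the WARNING headers in a saved kernel log can be tallied per source location. What follows is only a minimal sketch: the regular expression is an assumption based on the header format visible in the traces in this thread, the helper name `count_warnings` is hypothetical, and the last sample line is a made-up repeat added purely to show counting.

```python
import re
from collections import Counter

# Assumed header shape, taken from the traces quoted in this thread:
# "WARNING: CPU: <n> PID: <n> at <file>:<line> <function>+0x<off>/0x<size>"
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<site>\S+:\d+) (?P<func>[^\s+]+)\+0x"
)


def count_warnings(log_text):
    """Count WARNING headers, keyed by (source file:line, function)."""
    return Counter(
        (m.group("site"), m.group("func")) for m in WARNING_RE.finditer(log_text)
    )


# The first three lines are copied from the reports in this thread;
# the fourth is a hypothetical repeat, not taken from any real log.
sample = """\
[ 2807.339445] WARNING: CPU: 11 PID: 2061 at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
[ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70
[ 231.331262] WARNING: CPU: 11 PID: 6555 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675 amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[ 2911.000001] WARNING: CPU: 3 PID: 2061 at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
"""

counts = count_warnings(sample)
for (site, func), n in counts.most_common():
    print(f"{n:4d}  {site}  {func}")
```

The same function could be fed a saved log file, for example the text of one of the pastebin logs, instead of the inline sample.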
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Tue, Nov 1, 2022 at 10:52 PM Christian König wrote: > > Let's focus on one problem at a time. > > The issue here is that somehow userptr handling became racy after we > removed the lock, but I don't see why. > > We need to fix this ASAP since it is probably a much wider problem and > the additional lock just hides it somehow. > > Going to provide you with an updated patch tomorrow. > > Thanks, > Christian. Recently sackboy has been updated and now the kernel log contains a trace very similar to the one in the first post, even with the patch applied. [ 155.948044] [ cut here ] [ 155.948164] WARNING: CPU: 3 PID: 4850 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:678 amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu] [ 155.948342] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_sof_amd_renoir snd_sof_amd_acp snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_pci sunrpc binfmt_misc snd_sof snd_hda_intel snd_sof_utils snd_intel_dspcfg mt7921e snd_intel_sdw_acpi snd_hda_codec mt7921_common snd_soc_core edac_mce_amd mt76_connac_lib btusb snd_hda_core snd_compress snd_hwdep mt76 btrtl ac97_bus kvm_amd snd_pcm_dmaengine btbcm snd_rpl_pci_acp6x snd_pci_acp6x btintel mac80211 btmtk snd_seq snd_seq_device kvm snd_pcm snd_pci_acp5x libarc4 bluetooth irqbypass vfat snd_timer snd_rn_pci_acp3x fat rapl snd_acp_config asus_nb_wmi snd cfg80211 snd_soc_acpi wmi_bmof k10temp pcspkr [ 155.948436] snd_pci_acp3x i2c_piix4 soundcore asus_wireless amd_pmc joydev zram amdgpu drm_ttm_helper ttm crct10dif_pclmul hid_asus crc32_pclmul asus_wmi crc32c_intel iommu_v2 ledtrig_audio polyval_clmulni gpu_sched sparse_keymap polyval_generic platform_profile drm_buddy 
drm_display_helper nvme rfkill ghash_clmulni_intel hid_multitouch ucsi_acpi sha512_ssse3 nvme_core typec_ucsi serio_raw sp5100_tco r8169 ccp cec nvme_common typec i2c_hid_acpi i2c_hid video wmi ip6_tables ip_tables fuse [ 155.948540] CPU: 3 PID: 4850 Comm: Sackboy-Win64-T Tainted: G WL--- --- 6.1.0-0.rc3.20221101git5aaef24b5c6d.29.fc38.x86_64 #1 [ 155.948544] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 155.948547] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu] [ 155.948748] Code: 9e f1 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 a8 a3 fd c0 48 c7 c7 88 81 1e c1 e8 af 97 ea f1 eb 8e 66 90 bd f2 ff ff ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff ff 48 [ 155.948751] RSP: 0018:960b544d3a50 EFLAGS: 00010282 [ 155.948756] RAX: 8a4e40d44e00 RBX: 8a4f0e564140 RCX: 0001 [ 155.948759] RDX: RSI: 8a4e40d44e00 RDI: 8a4f4b52b400 [ 155.948761] RBP: 8a4e8c979000 R08: 0dc0 R09: [ 155.948764] R10: 0001 R11: R12: 8a4e8aaad558 [ 155.948767] R13: 3b91 R14: 8a4f0e667180 R15: 8a4f4b52b458 [ 155.948770] FS: 7fa13fe006c0() GS:8a5d16e0() knlGS:36f8 [ 155.948772] CS: 0010 DS: ES: CR0: 80050033 [ 155.948775] CR2: 25c9e1d0 CR3: 00036199 CR4: 00750ee0 [ 155.948778] PKRU: 5554 [ 155.948780] Call Trace: [ 155.948783] [ 155.948790] amdgpu_cs_ioctl+0x9fd/0x2030 [amdgpu] [ 155.948992] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 155.949155] drm_ioctl_kernel+0xac/0x160 [ 155.949165] drm_ioctl+0x1e7/0x450 [ 155.949172] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 155.949344] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 155.949528] __x64_sys_ioctl+0x90/0xd0 [ 155.949537] do_syscall_64+0x5b/0x80 [ 155.949547] ? lock_is_held_type+0xe8/0x140 [ 155.949559] ? do_syscall_64+0x67/0x80 [ 155.949565] ? lockdep_hardirqs_on+0x7d/0x100 [ 155.949573] ? do_syscall_64+0x67/0x80 [ 155.949579] ? do_syscall_64+0x67/0x80 [ 155.949586] ? do_syscall_64+0x67/0x80 [ 155.949592] ? 
lockdep_hardirqs_on+0x7d/0x100 [ 155.949597] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 155.949603] RIP: 0033:0x7fa1b7fd912f [ 155.949610] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 155.949615] RSP: 002b:7fa13fdfe920 EFLAGS: 0246 ORIG_RAX: 0010 [ 155.949621] RAX: ffda RBX: 7fa13fdfebe8 RCX: 7fa1b7fd912f [ 155.949625] RDX: 7fa13fdfea10 RSI: c0186444 RDI: 0165 [ 155.949629] RBP: 7fa13fdfea10 R08: 7f9ff80018e0 R09: 7fa13fdfe9c0 [ 155.949633] R10: 7eb11590 R11: 0246 R12: c0186444 [ 15
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Wed, Oct 26, 2022 at 12:29 PM Christian König wrote:
>
> Attached is the original test patch rebased on current amd-staging-drm-next.
>
> Can you test if this is enough to make sure that the games start without
> crashing by fetching the userptrs?

1. Over the past week the list of games affected by this issue has grown with new games: The Outlast Trials, Gotham Knights, Sackboy: A Big Adventure.

2. I tested the patch and it really solves the problem with the launch of all the listed games and does not create new problems.

3. The only thing I noticed is that in the game Sackboy: A Big Adventure, when using the kernel built from commit b229b6ca5abbd63ff40c1396095b1b36b18139c3 + the attached patch, I can't connect to a friend's co-op session because the Steam client hangs. The kernel built from commit 736ec9fadd7a1fde8480df7e5cfac465c07ff6f3 (this is the commit prior to dd80d9c8eecac8c516da5b240d01a35660ba6cb6) is free of this problem. I need to spend some more time to find the commit that leads to the Steam client hang [3].

Thanks.

--
Best Regards,
Mike Gavrilov.
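As a rough guide to how much time finding such a commit can take, the worst-case number of kernels to build and test grows with the base-2 logarithm of the number of candidate commits. The sketch below assumes a linear history with exactly one first bad commit; the function name and the commit counts are hypothetical and are not taken from this report. Real kernel history contains merges, and commits that have to be skipped add extra builds, so the "roughly N steps" estimate that git prints while bisecting may differ.

```python
def worst_case_bisect_steps(candidates: int) -> int:
    """Worst-case number of test builds when binary-searching `candidates`
    commits on a linear history, exactly one of which is the first bad one."""
    if candidates < 1:
        raise ValueError("need at least one candidate commit")
    # ceil(log2(candidates)) using integer arithmetic only
    return (candidates - 1).bit_length()


# Hypothetical range sizes, for illustration only.
for n in (1, 2, 100, 1000, 15000):
    print(f"{n:6d} candidate commits -> {worst_case_bisect_steps(n)} builds")
```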
Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
On Fri, Oct 21, 2022 at 1:33 PM Christian König wrote:
>
> Hi,
>
> yes Bas already reported this issue, but I couldn't reproduce it. Need
> to come up with a patch to narrow this down further.
>
> Can I send you something to test?

I would be glad to test any patches and ideas.

--
Best Regards,
Mike Gavrilov.
[6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start
Hi!
I found that some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6.

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König
Date: Thu Jul 14 10:23:38 2022 +0200

    drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

    This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

    We found that the bo_list is missing a protection for its list entries.
    Since that is fixed now this workaround can be removed again.

    Signed-off-by: Christian König
    Reviewed-by: Alex Deucher
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
 3 files changed, 6 insertions(+), 18 deletions(-)

When the hang happens, the following backtrace appears in the kernel log:

[ 231.331210] [ cut here ]
[ 231.331262] WARNING: CPU: 11 PID: 6555 at drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675 amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[ 231.331424] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_utils mt7921e snd_hda_intel sunrpc snd_intel_dspcfg mt7921_common binfmt_misc snd_intel_sdw_acpi snd_hda_codec mt76_connac_lib edac_mce_amd btusb snd_soc_core mt76 snd_hda_core btrtl snd_hwdep snd_compress kvm_amd ac97_bus snd_seq btbcm snd_pcm_dmaengine btintel snd_rpl_pci_acp6x mac80211 btmtk snd_pci_acp6x kvm snd_seq_device snd_pcm snd_pci_acp5x libarc4 irqbypass bluetooth snd_rn_pci_acp3x snd_timer pcspkr asus_nb_wmi rapl
joydev wmi_bmof snd_acp_config cfg80211 snd_soc_acpi vfat snd [ 231.331490] snd_pci_acp3x i2c_piix4 soundcore fat k10temp amd_pmc asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi iommu_v2 crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel ledtrig_audio sparse_keymap polyval_clmulni platform_profile drm_buddy polyval_generic hid_multitouch drm_display_helper rfkill nvme ucsi_acpi ghash_clmulni_intel nvme_core video typec_ucsi serio_raw ccp sha512_ssse3 sp5100_tco r8169 cec nvme_common typec wmi i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse [ 231.331532] CPU: 11 PID: 6555 Comm: GameThread Tainted: GW L--- --- 6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1 [ 231.331534] Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 231.331537] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu] [ 231.331654] Code: a8 d0 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 40 82 f3 c0 48 c7 c7 10 60 14 c1 e8 2f a0 f4 d0 eb 8e 66 90 bd f2 ff ff ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff ff 48 [ 231.331656] RSP: 0018:aad4c705bae8 EFLAGS: 00010286 [ 231.331659] RAX: 8e9cbdbe3200 RBX: 8e997e3f2440 RCX: [ 231.331661] RDX: RSI: 8e9cbdbe3200 RDI: 8e9c31208000 [ 231.331663] RBP: 0001 R08: 0dc0 R09: [ 231.331665] R10: 0001 R11: R12: aad4c705bb90 [ 231.331666] R13: 7651 R14: 8e9c89f334e0 R15: 8e991fda8000 [ 231.331668] FS: 7c2af6c0() GS:8ea7d8e0() knlGS:7b2c [ 231.331671] CS: 0010 DS: ES: CR0: 80050033 [ 231.331673] CR2: 7ff65ffd8000 CR3: 0004f90f CR4: 00750ee0 [ 231.331674] PKRU: 5554 [ 231.331676] Call Trace: [ 231.331678] [ 231.331682] amdgpu_cs_ioctl+0x87e/0x1fc0 [amdgpu] [ 231.331824] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 231.331981] drm_ioctl_kernel+0xac/0x160 [ 231.331990] drm_ioctl+0x1e7/0x450 [ 231.331994] ? 
amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 231.332118] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 231.332233] __x64_sys_ioctl+0x90/0xd0 [ 231.332238] do_syscall_64+0x5b/0x80 [ 231.332243] ? asm_exc_page_fault+0x22/0x30 [ 231.332247] ? lockdep_hardirqs_on+0x7d/0x100 [ 231.332250] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 231.332253] RIP: 0033:0x7ff677c5704f [ 231.332256] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00 [ 231.332258] RSP: 002b:7c2ad470 EFLAGS: 0246 ORIG_RAX: 0010 [ 231.332261] RAX: ffda RBX: 7c2ad718 RCX: 7ff677c5704f [ 231.332263] RDX: 7c2ad540 RSI: c0186444
Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga
On Wed, May 11, 2022 at 5:01 PM Christian König wrote:
>
>
> We have implemented a workaround, but still don't know the exact root cause.
>
> If anybody wants to look into this it would be rather helpful to be able
> to reproduce the issue.
>
> Regards,
> Christian.

I see that the issue returned after this commit:

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König
Date: Thu Jul 14 10:23:38 2022 +0200

    drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

    This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

    We found that the bo_list is missing a protection for its list entries.
    Since that is fixed now this workaround can be removed again.

    Signed-off-by: Christian König
    Reviewed-by: Alex Deucher
    Signed-off-by: Alex Deucher

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c | 2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h | 1 -
 3 files changed, 6 insertions(+), 18 deletions(-)

The games Forza Horizon 4 and Cyberpunk 2077 hang at start again.

--
Best Regards,
Mike Gavrilov.
[regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
Hi!
I bisected an issue in the 6.0 kernel which started happening after 6.0-rc7 on all my machines.

The backtrace of this issue looks like this:

[ 2807.339439] [ cut here ]
[ 2807.339445] WARNING: CPU: 11 PID: 2061 at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70
[ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr asus_nb_wmi wmi_bmof
[ 2807.339519] snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel gpu_sched sparse_keymap platform_profile hid_multitouch polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse
[ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
acpi_cpufreq():1 amd64_edac():1 amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1 acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 [ 2807.339579] acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16 [ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. 
ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70 [ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48 89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55 41 54 [ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282 [ 2807.339606] RAX: 0001 RBX: RCX: 0002 [ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: b6ad46e07c00 [ 2807.339608] RBP: b6ad46e07c00 R08: R09: [ 2807.339609] R10: R11: 0001 R12: [ 2807.339610] R13: 9801ca24bb00 R14: 9801ca24bb00 R15: [ 2807.339611] FS: 7f57445b0600() GS:981198e0() knlGS: [ 2807.339613] CS: 0010 DS: ES: CR0: 80050033 [ 2807.339614] CR2: 7f574367f000 CR3: 0001236ae000 CR4: 00750ee0 [ 2807.339615] PKRU: 5554 [ 2807.339616] Call Trace: [ 2807.339618] [ 2807.339621] drm_mode_atomic_ioctl+0x3b9/0xac0 [ 2807.339627] ? drm_atomic_set_property+0xb60/0xb60 [ 2807.339629] drm_ioctl_kernel+0xac/0x160 [ 2807.339633] drm_ioctl+0x22d/0x410 [ 2807.339635] ? drm_atomic_set_property+0xb60/0xb60 [ 2807.339639] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 2807.339834] __x64_sys_ioctl+0x90/0xd0 [ 2807.339838] do_syscall_64+0x5b/0x80 [ 2807.339843] ? rcu_read_lock_sched_held+0x10/0x80 [ 2807.339846] ? trace_hardirqs_on_prepare+0x55/0xe0 [ 2807.339849] ? do_syscall_64+0x67/0x80 [ 2807.339851] ? rcu_read_loc
[regression][6.1] After commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 system randomly hungs
Hi!
The hangs occur randomly, but I found a good reproducer: running the campaign in the game Halo Infinite.

The backtrace looks like this:

[ 147.260971] BUG: kernel NULL pointer dereference, address: 0088
[ 147.260987] [ cut here ]
[ 147.260988] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:321 __local_bh_disable_ip+0x9e/0xb0
[ 147.260993] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc snd_sof_amd_renoir intel_rapl_msr snd_sof_amd_acp intel_rapl_common mt7921e snd_sof_pci mt7921_common binfmt_misc snd_sof mt76_connac_lib snd_sof_utils vfat snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic mt76 fat snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_compress ac97_bus btusb kvm_amd snd_intel_dspcfg snd_pcm_dmaengine btrtl snd_intel_sdw_acpi btbcm snd_hda_codec snd_pci_acp6x mac80211 kvm snd_hda_core btintel btmtk irqbypass snd_hwdep snd_seq libarc4 snd_seq_device bluetooth snd_pcm snd_pci_acp5x snd_timer snd_rn_pci_acp3x cfg80211 rapl pcspkr joydev asus_nb_wmi wmi_bmof snd_acp_config snd snd_soc_acpi k10temp
[ 147.261033] soundcore i2c_piix4 snd_pci_acp3x asus_wireless amd_pmc zram amdgpu drm_ttm_helper ttm hid_asus iommu_v2 asus_wmi gpu_sched ledtrig_audio sparse_keymap drm_buddy platform_profile drm_display_helper crct10dif_pclmul crc32_pclmul nvme rfkill crc32c_intel ucsi_acpi hid_multitouch video ghash_clmulni_intel nvme_core ccp typec_ucsi serio_raw r8169 cec sp5100_tco typec i2c_hid_acpi wmi i2c_hid ip6_tables ip_tables fuse
[ 147.261045] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GWL 6.0.0-rc2-02-907cc346ff6a69a08b4786c4ed2a78ac0120b9da+ #124
[ 147.261046] Hardware name: ASUSTeK COMPUTER INC.
ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022 [ 147.261047] RIP: 0010:__local_bh_disable_ip+0x9e/0xb0 [ 147.261048] Code: 25 00 1e 02 00 48 89 df e8 6f 23 08 00 85 c0 75 0e 48 89 9d 30 1c 00 00 5b 5d c3 cc cc cc cc 31 ff 31 db e8 54 23 08 00 eb e7 <0f> 0b e9 76 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 [ 147.261049] RSP: 0018:a4e1c028c8d8 EFLAGS: 00010006 [ 147.261050] RAX: 80010005 RBX: 0201 RCX: 0018 [ 147.261051] RDX: 0f440b255950 RSI: 0201 RDI: c1b652e5 [ 147.261051] RBP: 93a4eaf00fd8 R08: 0001 R09: [ 147.261052] R10: 7635d840c31a8942 R11: fcca632b3d1b0d46 R12: 93a4f7831000 [ 147.261052] R13: 93a4eaf00ee0 R14: 93a4efd84178 R15: 93a4efd84000 [ 147.261053] FS: () GS:93b396e0() knlGS: [ 147.261054] CS: 0010 DS: ES: CR0: 80050033 [ 147.261055] CR2: 0088 CR3: 00012a61 CR4: 00750ee0 [ 147.261056] PKRU: 5554 [ 147.261056] Call Trace: [ 147.261060] [ 147.261068] _raw_spin_lock_bh+0x1d/0x80 [ 147.261074] ieee80211_queue_skb+0x125/0x7a0 [mac80211] [ 147.261113] ? __skb_get_hash+0x55/0x200 [ 147.261117] ieee80211_tx_8023+0x9c/0x1c0 [mac80211] [ 147.261155] ieee80211_subif_start_xmit_8023+0x2b5/0x510 [mac80211] [ 147.261191] netpoll_start_xmit+0x121/0x190 [ 147.261199] netpoll_send_skb+0x1fc/0x300 [ 147.261202] write_msg+0xdc/0xf0 [netconsole] [ 147.261207] console_emit_next_record.constprop.0+0x17d/0x300 [ 147.261214] console_unlock+0xf3/0x1f0 [ 147.261215] vprintk_emit+0x152/0x350 [ 147.261217] ? plist_add+0xba/0xf0 [ 147.261223] _printk+0x48/0x4e [ 147.261231] ? rcu_read_lock_sched_held+0x10/0x80 [ 147.261235] page_fault_oops.cold+0xcf/0x1f9 [ 147.261240] ? do_user_addr_fault+0x65/0x6b0 [ 147.261243] ? 
_raw_spin_unlock_irqrestore+0x40/0x60 [ 147.261247] exc_page_fault+0x7e/0x300 [ 147.261249] asm_exc_page_fault+0x22/0x30 [ 147.261252] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x1e0 [gpu_sched] [ 147.261255] Code: 89 d7 e8 87 02 0d f0 e9 54 ff ff ff 48 89 d7 e8 ea 66 37 f0 e9 47 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 53 48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d 70 02 00 00 48 8b 85 a8 03 00 00 f0 [ 147.261256] RSP: 0018:a4e1c028cdc8 EFLAGS: 00010093 [ 147.261257] RAX: c06dc380 RBX: RCX: 0018 [ 147.261257] RDX: 0efa9afe3594 RSI: 93a7a4c1ec90 RDI: [ 147.261258] RBP: 93a7a4c1ee10 R08: 0001 R09: [ 147.261259] R10: R11: 0001 R12: a4e1c028cde8 [ 147.261259] R13: 0086 R14: R15: 93a4fbed0198 [ 147.261261] ? drm_sched_job_done.isra.0+0x1e0/0x1e0 [gpu_sched] [ 147.261266] dma_fence_signal_timestamp_locked+0x9e/0x1c0 [ 147.261274] dma_fence_signal+0x36/0x70 [ 147.261276] amdgpu_fence_process+
Re: [BUG][5.20] refcount_t: underflow; use-after-free
Hi! Unfortunately the use-after-free issue still happens on the 6.0-rc5 kernel. The issue became hard to repeat. I spent the whole day at the computer when use-after-free again happened, I was playing the game Tiny Tina's Wonderlands. Therefore, forget about repeatability. It remains only to hope for logs and tracing. I didn't see anything new in the logs. It seems that we need to somehow expand the logging so that the next time this happens we have more information. Sep 18 20:52:16 primary-ws gnome-shell[2388]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed Sep 18 20:52:27 primary-ws gnome-shell[2388]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed Sep 18 20:53:44 primary-ws gnome-shell[2388]: Window manager warning: Window 0x4e3 sets an MWM hint indicating it isn't resizable, but sets min size 1 x 1 and max size 2147483647 x 2147483647; this doesn't make much sense. Sep 18 20:53:45 primary-ws kernel: umip_printk: 11 callbacks suppressed Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by applications. Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns the result. Sep 18 20:53:53 primary-ws gnome-shell[2388]: meta_window_set_stack_position_no_sync: assertion 'window->stack_position >= 0' failed Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by applications. Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853] ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns the result. Sep 18 20:54:15 primary-ws kernel: umip: Wonderlands.exe[214194] ip:15a270815 sp:6eaef490: SGDT instruction cannot be used by applications. 
Sep 18 20:56:01 primary-ws kernel: umip_printk: 15 callbacks suppressed Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4ed178: SGDT instruction cannot be used by applications. Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4ed178: For now, expensive software emulation returns the result. Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4edbe8: SGDT instruction cannot be used by applications. Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4edbe8: For now, expensive software emulation returns the result. Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853] ip:15e3a82b0 sp:4ebf18: SGDT instruction cannot be used by applications. Sep 18 20:57:55 primary-ws kernel: [ cut here ] Sep 18 20:57:55 primary-ws kernel: refcount_t: underflow; use-after-free. Sep 18 20:57:55 primary-ws kernel: WARNING: CPU: 22 PID: 235114 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 Sep 18 20:57:55 primary-ws kernel: Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_> Sep 18 20:57:55 primary-ws kernel: asus_wmi ledtrig_audio sparse_keymap platform_profile irqbypass rfkill mc rapl snd_timer video wmi_bmof pcspkr snd k10temp i2c_piix4 soundcore acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 crct1> Sep 18 20:57:55 primary-ws kernel: Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_eda> Sep 18 20:57:55 primary-ws kernel: pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 Sep 18 20:57:55 primary-ws kernel: CPU: 22 PID: 235114 Comm: kworker/22:0 Tainted: GWL--- --- 
6.0.0-0.rc5.20220914git3245cb65fd91.39.fc38.x86_64 #1 Sep 18 20:57:55 primary-ws kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 Sep 18 20:57:55 primary-ws kernel: Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] Sep 18 20:57:55 primary-ws kernel: RIP: 0010:refcount_warn_saturate+0xba/0x110 Sep 18 20:57:55 primary-ws kernel: Code: 01 01 e8 69 6b 6f 00 0f 0b e9 32 38 a5 00 80 3d 4d 7d be 01 00 75 85 48 c7 c7 80 b7 8e 95 c6 05 3d 7d be 01 01 e8 46 6b 6f 00 <0f> 0b e9 0f 38 a5 00 80 3d 28 7d be 01 00 0f 85 5e ff ff ff 48 c7 Sep 18 20:57:55 primary-ws kernel: RSP: 0018:a1a853ccbe60 EFLAGS: 00010286 Sep 18 20:57:55 primary-ws kernel: RAX: 0026 RBX: 8e0e60a96c28 RCX: Sep 18 20:57:55 primary-ws kernel: RDX: 0001 RSI: 958d255c RDI: Sep 18 20:57:55 primary-ws kernel: RBP: 8e19a83f5600 R08: R09: a1a853ccbd10 Sep 18 20:57:55 primary-ws kernel: R10: 0003 R11: 8e19ee2fffe8 R12: 8e19a83fc800 Sep 18 20:
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Fri, Aug 19, 2022 at 5:13 PM Maíra Canal wrote:
>
> Hi Mikhail,
>
> Could you please specify the steps to reproduce this use-after-free? I
> will try to reproduce it on the RX5700 XT and bisect the issue.
>

Hi Maíra, thanks for the help. I'm afraid it will be unrealistic to reproduce, because on a laptop with a 6800M (also RDNA 2 graphics) the problem does not occur. Sorry for the long silence; I was trying to bisect the problem myself.

git bisect start
# status: waiting for both good and bad commits
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
# status: waiting for bad commit, 1 good commit known
# bad: [7ebfc85e2cd7b08f518b526173e9a33b56b3913b] Merge tag 'net-6.0-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect bad 7ebfc85e2cd7b08f518b526173e9a33b56b3913b
# bad: [b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1] Merge tag 'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm
# 001: GPU hangs + use-after-free issue - https://pastebin.com/z86E9ydx
git bisect bad b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1
# good: [526942b8134cc34d25d27f95dfff98b8ce2f6fcd] Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
# 002: good - https://pastebin.com/9qki65Sj
git bisect good 526942b8134cc34d25d27f95dfff98b8ce2f6fcd
# good: [45490ce2ff833c4ec0de66705e46ba41320860cb] nfp: flower: add support for tunnel offload without key ID
# 003: good - https://pastebin.com/vHk5eRkw
git bisect good 45490ce2ff833c4ec0de66705e46ba41320860cb
# skip: [e23a5e14aa278858c2e3d81ec34e83aa9a4177c5] Backmerge tag 'v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into drm-next
# 004: GPU not switched in graphic mode - https://pastebin.com/RmqCTMLD
git bisect skip e23a5e14aa278858c2e3d81ec34e83aa9a4177c5
# bad: [b2065fb21d9a789b14f737ea90facedabadeb8a4] drm/amdgpu: fix i2s_pdata out of bound array access
# 005: GPU hangs + use-after-free issue - https://pastebin.com/Zgw5Hc48
git bisect bad b2065fb21d9a789b14f737ea90facedabadeb8a4
# skip: [344feb7ccf764756937cfd74fa4ac5caba069c99] Merge tag 'amd-drm-next-5.20-2022-07-05' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
# 006: GPU not switched in graphic mode - https://pastebin.com/b8BUBE7Q
git bisect skip 344feb7ccf764756937cfd74fa4ac5caba069c99
# skip: [869b10ac8d2300327f554d83f4dbab041bf27d49] drm/amdgpu: add dm ip block for dcn 3.1.4
# 007: GPU not switched in graphic mode - https://pastebin.com/byd7HECH
git bisect skip 869b10ac8d2300327f554d83f4dbab041bf27d49
# skip: [676ad8e997036e2f815c293b76c356fb7cc97a08] drm: rcar-du: Lift z-pos restriction on primary plane for Gen3
# 008: GPU not switched in graphic mode - https://pastebin.com/3fXCTinb
git bisect skip 676ad8e997036e2f815c293b76c356fb7cc97a08
# skip: [5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d] drm/bridge: lt9211: Convert to drm_of_get_data_lanes_count
# 009: Build error - https://pastebin.com/rxHe9QRB
git bisect skip 5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d
# skip: [6db5e0c8692e590734a7ec7455365d9cbaa15ef1] Merge tag 'drm-intel-next-2022-07-06' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
# 010: GPU not switched in graphic mode - https://pastebin.com/rqubSuc8
git bisect skip 6db5e0c8692e590734a7ec7455365d9cbaa15ef1
# skip: [5d763a9955f0fbf2681a2f1fa87c416056bd0c89] drm/amd/display: Remove compiler warning
# 011: GPU not switched in graphic mode - https://pastebin.com/BrJs6ybP
git bisect skip 5d763a9955f0fbf2681a2f1fa87c416056bd0c89
# skip: [e6c2db2be986158afb9991d9fa8a38fe65a88516] drm/i915: Don't use DRM_DEBUG_WARN_ON for unexpected l3bank/mslice config
# 012: GPU not switched in graphic mode - https://pastebin.com/yxppyqbD
git bisect skip e6c2db2be986158afb9991d9fa8a38fe65a88516
# bad: [cb6b81b21bd9cf09d72b7fe711be1b55001eb166] Merge tag 'drm-misc-next-fixes-2022-07-21' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
# 013: GPU hangs without use-after-free issue - https://pastebin.com/iRek4bBy
git bisect bad cb6b81b21bd9cf09d72b7fe711be1b55001eb166
# skip: [48b927770f8ad3f8cf4a024a552abf272af9f592] drm/exynos/exynos7_drm_decon: free resources when clk_set_parent() failed.
# 014: GPU not switched in graphic mode - https://pastebin.com/ekp10xhP
git bisect skip 48b927770f8ad3f8cf4a024a552abf272af9f592
# skip: [c5da61cf5bab30059f22ea368702c445ee87171a] drm/amdgpu/display: add missing FP_START/END checks dcn32_clk_mgr.c
# 015: GPU not switched in graphic mode - https://pastebin.com/YbskKWmA
git bisect skip c5da61cf5bab30059f22ea368702c445ee87171a
# skip: [a77f7c89e62c6dfe405a64995812746f27adc510] drm/edid: convert drm_gtf_modes_for_range() to drm_edid
# 016: GPU not switched in graphic mode - https://pastebin.com/bA2AwkJ7
git bisect skip a77f7c89e62c6dfe405a64995812746f27adc510
# skip: [6fde8eec71796f3534f0c274066862829813b21f] drm/doc: Add KUnit documentation
# 017: GPU not switched in graphic mode - https://pasteb
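The bisect log above can be tallied mechanically, which helps when many steps end in a skip. The following is only a sketch: it assumes the log uses the comment format that `git bisect log` writes (`# good|bad|skip: [<sha>] <subject>`), and the helper name `summarize_bisect_log` is made up for illustration. The sample entries are copied from the log in this message.

```python
import re
from collections import Counter

# Verdict comments as written by `git bisect log`, e.g.
# "# bad: [b2065fb2...] drm/amdgpu: fix i2s_pdata out of bound array access"
VERDICT_RE = re.compile(r"^# (good|bad|skip): \[([0-9a-f]{40})\] (.*)$")


def summarize_bisect_log(text):
    """Hypothetical helper: return (Counter of verdicts, list of (verdict, sha, subject))."""
    entries = []
    for line in text.splitlines():
        m = VERDICT_RE.match(line.strip())
        if m:
            entries.append((m.group(1), m.group(2), m.group(3)))
    counts = Counter(verdict for verdict, _, _ in entries)
    return counts, entries


# A few entries copied from the log above; the full log works the same way.
SAMPLE = """\
git bisect start
# status: waiting for both good and bad commits
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
# bad: [b2065fb21d9a789b14f737ea90facedabadeb8a4] drm/amdgpu: fix i2s_pdata out of bound array access
# 005: GPU hangs + use-after-free issue - https://pastebin.com/Zgw5Hc48
git bisect bad b2065fb21d9a789b14f737ea90facedabadeb8a4
# skip: [5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d] drm/bridge: lt9211: Convert to drm_of_get_data_lanes_count
# 009: Build error - https://pastebin.com/rxHe9QRB
git bisect skip 5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d
"""

counts, entries = summarize_bisect_log(SAMPLE)
for verdict, sha, subject in entries:
    print(f"{verdict:5} {sha[:12]} {subject}")
print(dict(counts))
```

The numbered notes (`# 001: ...`) and the `# status:` lines are ignored because they do not match the verdict pattern. A saved log in this format can also be fed back to git with `git bisect replay <file>`.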
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Wed, Aug 17, 2022 at 11:43 PM Maíra Canal wrote:
>
> Hi Mikhail,
>
> Looks like 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial
> revert of 'drm/sched: Keep s_fence->parent pointer'") introduced the
> error. Try reverting it and check if the use-after-free still happens.

Thanks, but unfortunately this did not lead to the expected result. The use-after-free happens again, in a context I cannot make sense of. What is new: a "suspicious RCU usage" warning was added, but it looks completely unrelated to the use-after-free issue.

[ 215.434115] [ cut here ] [ 215.434184] refcount_t: underflow; use-after-free. [ 215.434204] WARNING: CPU: 7 PID: 1258 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 215.434214] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event intel_rapl_msr intel_rapl_common snd_hda_codec_realtek vfat snd_hda_codec_generic snd_hda_codec_hdmi mt76x2u fat mt76x2_common snd_hda_intel mt76x02_usb snd_intel_dspcfg snd_intel_sdw_acpi mt76_usb iwlmvm edac_mce_amd snd_usb_audio snd_hda_codec mt76x02_lib snd_hda_core snd_usbmidi_lib snd_hwdep snd_rawmidi uvcvideo mt76 kvm_amd snd_seq videobuf2_vmalloc videobuf2_memops snd_seq_device mac80211 videobuf2_v4l2 videobuf2_common kvm btusb iwlwifi snd_pcm btrtl videodev libarc4 eeepc_wmi btbcm asus_wmi iwlmei btintel ledtrig_audio xpad irqbypass sparse_keymap btmtk platform_profile joydev [ 215.434436] hid_logitech_hidpp rapl ff_memless mc snd_timer bluetooth cfg80211 video pcspkr wmi_bmof snd soundcore k10temp i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 ucsi_ccg gpu_sched crct10dif_pclmul crc32_pclmul typec_ucsi drm_buddy
crc32c_intel ghash_clmulni_intel ccp igb sp5100_tco typec drm_display_helper nvme dca nvme_core cec wmi ip6_tables ip_tables fuse [ 215.434528] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 [ 215.434672] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 215.434702] CPU: 7 PID: 1258 Comm: kworker/7:3 Tainted: G W L --- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1 [ 215.434709] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 215.434715] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 215.434728] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 215.434734] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d be 7d be 01 00 75 85 48 c7 c7 c0 99 8e 92 c6 05 ae 7d be 01 01 e8 36 59 6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff 48 c7 [ 215.434740] RSP: 0018:9ccb0237fe60 EFLAGS: 00010286 [ 215.434747] RAX: 0026 RBX: 8d531f6f2828 RCX: [ 215.434753] RDX: 0001 RSI: 928d07a4 RDI: [ 215.434757] RBP: 
8d61e47f5600 R08: R09: 9ccb0237fd10 [ 215.434762] R10: 0003 R11: 8d622e2fffe8 R12: 8d61e47fc800 [ 215.434767] R13: 8d5313e95500 R14: 8d61e47fc805 R15: 8d531f6f2830 [ 215.434772] FS: () GS:8d61e460() knlGS: [ 215.434777] CS: 0010 DS: ES: CR0: 80050033 [ 215.434782] CR2: 7f0c8b815048 CR3: 0001ab0e8000 CR4: 00350ee0 [ 215.434788] Call Trace: [ 215.434792] [ 215.434797] process_one_work+0x2a0/0x600 [ 215.434819] worker_thread+0x4f/0x3a0 [ 215.434830] ? process_one_work+0x600/0x600 [ 215.434836] kthread+0xf5/0x120 [ 215.434842] ? kthread_complete_and_exit+0x20/0x20 [ 215.434854] ret_from_fork+0x22/0x30 [ 215.434881] [ 215.434885] irq event stamp: 134873 [ 215.434890] hardirqs last enabled at (134881): [] __up_console_sem+0x5e/0x70 [ 215.434897] hardirqs l
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Wed, Aug 17, 2022 at 9:08 PM Melissa Wen wrote:
>
> Hi Mikhail,
>
> IIUC, you got this second use-after-free by applying the first version
> of Maíra's patch, right? So, that version was adding another unbalanced
> unlock to the cs ioctl flow, but it was solved in the latest version,
> that you can find here: https://patchwork.freedesktop.org/patch/497680/
> If this is the situation, can you check this last version?
>
> Thanks,
>
> Melissa

With the latest version the "bad unlock balance detected!" warning is gone, but the use-after-free issue remains, again with "Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]".

[ 297.834779] [ cut here ] [ 297.834818] refcount_t: underflow; use-after-free. [ 297.834831] WARNING: CPU: 30 PID: 2377 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 297.834838] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u mt76x2_common mt76x02_usb mt76_usb mt76x02_lib snd_hda_codec_realtek iwlmvm intel_rapl_msr snd_hda_codec_generic snd_hda_codec_hdmi mt76 vfat fat snd_hda_intel intel_rapl_common mac80211 snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio snd_hda_codec snd_usbmidi_lib btusb edac_mce_amd iwlwifi libarc4 uvcvideo snd_hda_core btrtl snd_rawmidi snd_hwdep videobuf2_vmalloc btbcm kvm_amd videobuf2_memops snd_seq iwlmei btintel videobuf2_v4l2 eeepc_wmi snd_seq_device videobuf2_common btmtk kvm xpad videodev joydev irqbypass snd_pcm asus_wmi hid_logitech_hidpp ff_memless cfg80211 bluetooth rapl mc [ 297.834932] ledtrig_audio snd_timer sparse_keymap platform_profile wmi_bmof snd video pcspkr k10temp i2c_piix4 rfkill soundcore mei asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper
ttm crct10dif_pclmul crc32_pclmul crc32c_intel iommu_v2 ucsi_ccg gpu_sched typec_ucsi drm_buddy ghash_clmulni_intel drm_display_helper ccp igb typec sp5100_tco nvme cec nvme_core dca wmi ip6_tables ip_tables fuse [ 297.834978] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 [ 297.835055] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 297.835071] CPU: 30 PID: 2377 Comm: kworker/30:6 Tainted: G WL--- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1 [ 297.835075] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 297.835078] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 297.835085] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 297.835088] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d be 7d be 01 00 75 85 48 c7 c7 c0 99 8e aa c6 05 ae 7d be 01 01 e8 36 59 6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff 48 c7 [ 297.835091] RSP: 0018:bd3506df7e60 EFLAGS: 00010286 [ 297.835095] RAX: 0026 
RBX: 961b250cbc28 RCX: [ 297.835097] RDX: 0001 RSI: aa8d07a4 RDI: [ 297.835100] RBP: 96276a3f5600 R08: R09: bd3506df7d10 [ 297.835102] R10: 0003 R11: 9627ae2fffe8 R12: 96276a3fc800 [ 297.835105] R13: 9618c03e6600 R14: 96276a3fc805 R15: 961b250cbc30 [ 297.835108] FS: () GS:96276a20() knlGS: [ 297.835110] CS: 0010 DS: ES: CR0: 80050033 [ 297.835113] CR2: 621001e4a000 CR3: 00018d958000 CR4: 00350ee0 [ 297.835116] Call Trace: [ 297.835118] [ 297.835121] process_one_work+0x2a0/0x600 [ 297.835133] worker_thread+0x4f/0x3a0 [ 297.835139] ? process_one_work+0x600/0x600 [ 297.835142] kthread+0xf5/0x120 [ 297.835145] ? kthread_complete_and_exit+0x20/0x20 [ 297.835151] ret_from_fork+0x22/0x30 [ 297.835166] [
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Mon, Aug 15, 2022 at 3:37 PM Mikhail Gavrilov wrote:
>
> Thanks, I tested this patch.
> But with this patch use-after-free problem happening in another place:

Does anyone have an idea why the second use-after-free happened? From the trace I can't tell which code is involved. I also don't quite understand what the "Workqueue" entry in the trace means.

[ 408.358737] [ cut here ] [ 408.358743] refcount_t: underflow; use-after-free. [ 408.358760] WARNING: CPU: 9 PID: 62 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 408.358769] Modules linked in: uinput snd_seq_dummy rfcomm snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u mt76x2_common snd_hda_codec_realtek mt76x02_usb snd_hda_codec_generic iwlmvm snd_hda_codec_hdmi mt76_usb intel_rapl_msr snd_hda_intel mt76x02_lib intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi mt76 snd_hda_codec vfat fat snd_usb_audio snd_hda_core edac_mce_amd mac80211 snd_usbmidi_lib snd_hwdep snd_rawmidi mc snd_seq btusb kvm_amd iwlwifi snd_seq_device btrtl btbcm libarc4 btintel eeepc_wmi snd_pcm iwlmei kvm btmtk asus_wmi ledtrig_audio irqbypass joydev snd_timer sparse_keymap bluetooth platform_profile rapl cfg80211 snd video wmi_bmof soundcore i2c_piix4 k10temp rfkill mei [ 408.358853] asus_ec_sensors acpi_cpufreq zram hid_logitech_hidpp amdgpu igb dca drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched crc32_pclmul ucsi_ccg crc32c_intel drm_buddy nvme typec_ucsi drm_display_helper ghash_clmulni_intel ccp typec nvme_core sp5100_tco cec wmi ip6_tables ip_tables fuse [ 408.358880] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 [ 408.358953] pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 408.358967] CPU: 9 PID: 62 Comm: kworker/9:0 Tainted: G W L --- --- 6.0.0-0.rc1.13.fc38.x86_64+debug #1 [ 408.358971] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 408.358982] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 408.358987] Code: 01 01 e8 d9 59 6f 00 0f 0b e9 a2 46 a5 00 80 3d 3e 7e be 01 00 75 85 48 c7 c7 70 99 8e 92 c6 05 2e 7e be 01 01 e8 b6 59 6f 00 <0f> 0b e9 7f 46 a5 00 80 3d 19 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 408.358990] RSP: 0018:b124003efe60 EFLAGS: 00010286 [ 408.358994] RAX: 0026 RBX: 9987a025d428 RCX: [ 408.358997] RDX: 0001 RSI: 928d0754 RDI: [ 408.358999] RBP: 9994e4ff5600 R08: R09: b124003efd10 [ 408.359001] R10: 0003 R11: 99952e2fffe8 R12: 9994e4ffc800 [ 408.359004] R13: 998600228cc0 R14: 9994e4ffc805 R15: 9987a025d430 [ 408.359006] FS: () GS:9994e4e0() knlGS: [ 408.359009] CS: 0010 DS: ES: CR0: 80050033 [ 408.359012] CR2: 27ac39e78000 CR3: 0001a66d8000 CR4: 00350ee0 [ 
408.359015] Call Trace: [ 408.359017] [ 408.359020] process_one_work+0x2a0/0x600 [ 408.359032] worker_thread+0x4f/0x3a0 [ 408.359036] ? process_one_work+0x600/0x600 [ 408.359039] kthread+0xf5/0x120 [ 408.359044] ? kthread_complete_and_exit+0x20/0x20 [ 408.359049] ret_from_fork+0x22/0x30 [ 408.359061] [ 408.359063] irq event stamp: 5468 [ 408.359064] hardirqs last enabled at (5467): [] _raw_spin_unlock_irq+0x24/0x50 [ 408.359071] hardirqs last disabled at (5468): [] __schedule+0xe2c/0x16d0 [ 408.359076] softirqs last enabled at (2482): [] rht_deferred_worker+0x708/0xc00 [ 408.359079] softirqs last disabled at (2480): [] rht_deferred_worker+0x1f7/0xc00 [ 408.359082] ---[ end trace ]--- Full ke
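On the question about the "Workqueue" entry: as far as I know, the kernel prints that line when the warning fires inside a kworker thread, and it names the workqueue followed by the work function that was executing, with its module in brackets. Read that way, the kworker here was running drm_sched_entity_kill_jobs_work from gpu_sched on the shared "events" workqueue. A small sketch for pulling those fields out of a saved log follows; the helper name `parse_workqueue` is hypothetical and the regex is derived only from the lines quoted in this thread.

```python
import re

# Assumed shape, taken from the traces in this thread:
#   "Workqueue: <workqueue name> <work function> [<module>]"
# The module part is treated as optional (built-in code has no bracketed module).
WORKQUEUE_RE = re.compile(r"Workqueue: (\S+) (\S+)(?: \[(\w+)\])?")


def parse_workqueue(line):
    """Hypothetical helper: split a 'Workqueue:' log line into its three fields."""
    m = WORKQUEUE_RE.search(line)
    if m is None:
        return None
    return {
        "workqueue": m.group(1),  # queue the work item was submitted to
        "function": m.group(2),   # work function running when the warning fired
        "module": m.group(3),     # owning module, or None
    }


info = parse_workqueue(
    "[ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]"
)
print(info)
```

This also explains why the call trace only shows process_one_work and worker_thread: the work function itself is named on the "Workqueue:" line rather than in the stack.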
Re: [BUG][5.20] refcount_t: underflow; use-after-free
On Mon, Aug 15, 2022 at 5:20 AM Maíra Canal wrote:
>
> Hi Mikhail
>
> Looks like this use-after-free problem was introduced on
> 90af0ca047f3049c4b46e902f432ad6ef1e2ded6. Checking this patch it seems
> like: if amdgpu_cs_vm_handling return r != 0, then it will unlock
> bo_list_mutex inside the function amdgpu_cs_vm_handling and again on
> amdgpu_cs_parser_fini.
>
> Maybe the following patch will help:

Thanks, I tested this patch. With it applied, the use-after-free now happens in another place:

[ 894.012920] [ cut here ] [ 894.012939] refcount_t: underflow; use-after-free. [ 894.012968] WARNING: CPU: 14 PID: 205 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 894.012999] Modules linked in: tls uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event snd_hda_codec_realtek mt76x2u mt76x2_common snd_hda_codec_generic snd_hda_codec_hdmi intel_rapl_msr mt76x02_usb intel_rapl_common snd_hda_intel mt76_usb snd_intel_dspcfg vfat iwlmvm snd_intel_sdw_acpi mt76x02_lib fat snd_usb_audio snd_hda_codec mt76 edac_mce_amd snd_usbmidi_lib snd_hda_core btusb snd_rawmidi snd_hwdep mac80211 mc iwlwifi btrtl eeepc_wmi asus_wmi btbcm snd_seq kvm_amd libarc4 ledtrig_audio snd_seq_device btintel iwlmei sparse_keymap btmtk kvm snd_pcm irqbypass platform_profile snd_timer xpad joydev cfg80211 rapl hid_logitech_hidpp bluetooth ff_memless wmi_bmof video pcspkr snd k10temp i2c_piix4 [ 894.013086] soundcore rfkill mei asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper ttm iommu_v2 crct10dif_pclmul ucsi_ccg gpu_sched crc32_pclmul crc32c_intel typec_ucsi drm_buddy typec drm_display_helper ghash_clmulni_intel igb ccp cec nvme sp5100_tco nvme_core dca wmi ip6_tables ip_tables fuse [
894.013322] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 [ 894.013455] pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 894.013690] CPU: 14 PID: 205 Comm: kworker/14:1 Tainted: GW L--- --- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.11.fc38.x86_64 #1 [ 894.013725] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 894.013756] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched] [ 894.013779] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 894.013796] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 9c c6 05 ce 7e be 01 01 e8 56 4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 894.013842] RSP: 0018:b48681153e60 EFLAGS: 00010286 [ 894.013858] RAX: 0026 RBX: 9bad16f1f028 RCX: [ 894.013878] RDX: 0001 RSI: 9c8d06dc RDI: [ 894.013897] RBP: 9bba663f5600 R08: R09: b48681153d10 [ 894.013916] R10: 0003 R11: 9bbaae2fffe8 R12: 9bba663fc800 [ 894.013934] R13: 9bab93fcab40 
R14: 9bba663fc805 R15: 9bad16f1f030 [ 894.013954] FS: () GS:9bba6620() knlGS: [ 894.013975] CS: 0010 DS: ES: CR0: 80050033 [ 894.013991] CR2: 1aa46b2ec008 CR3: 000101516000 CR4: 00350ee0 [ 894.014011] Call Trace: [ 894.014022] [ 894.014030] process_one_work+0x2a0/0x600 [ 894.014051] worker_thread+0x4f/0x3a0 [ 894.014065] ? process_one_work+0x600/0x600 [ 894.014079] kthread+0xf5/0x120 [ 894.014092] ? kthread_complete_and_exit+0x20/0x20 [ 894.014109] ret_from_fork+0x22/0x30 [ 894.014129] [ 894.014137] irq event stamp: 5802 [ 894.014148] hardirqs last enabled at (5801): [] _raw_spin_unlock_irq+0x24/0x50 [ 894.014178] hardirqs last disabled at (5802): [] __schedule+0xe2c/0x16d0 [ 894.014206] softirq
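The double unlock that Maíra describes in the quoted text (an error path that unlocks bo_list_mutex itself, followed by a cleanup function that unlocks it again) can be illustrated with a toy model. This is not the amdgpu code: Python's `threading.Lock` stands in for the kernel mutex, and `vm_handling`, `parser_fini` and `submit` are hypothetical stand-ins named after the functions mentioned in the mail.

```python
import threading

# Stand-in for bo_list_mutex; a plain Python lock, not the kernel object.
bo_list_mutex = threading.Lock()


def vm_handling(fail):
    """Stand-in for amdgpu_cs_vm_handling(): unlocks on its own error path."""
    if fail:
        bo_list_mutex.release()  # first unlock, on the error path
        return -1
    return 0


def parser_fini():
    """Stand-in for amdgpu_cs_parser_fini(): unconditionally unlocks."""
    bo_list_mutex.release()


def submit(fail):
    """Take the lock, run the handler, then always run the cleanup."""
    bo_list_mutex.acquire()
    r = vm_handling(fail)
    try:
        parser_fini()  # second unlock when vm_handling() already released
    except RuntimeError:
        # Python refuses to release an unlocked lock; the kernel's lockdep
        # reports the analogous situation as a bad unlock balance.
        return "unbalanced unlock"
    return "ok" if r == 0 else "error"


print(submit(fail=False))
print(submit(fail=True))
```

In the success case the lock is taken once and released once. In the failure case the cleanup tries to release a lock that the error path already dropped, which is the imbalance the patch discussed in this thread is meant to remove.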
[BUG][5.20] refcount_t: underflow; use-after-free
Hi folks. I joined testing of 5.20 today (7ebfc85e2cd7). I ran into frequent GPU freezes, after which the following message appears in the kernel log:

[ 220.280990] [ cut here ] [ 220.281000] refcount_t: underflow; use-after-free. [ 220.281019] WARNING: CPU: 1 PID: 3746 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 220.281029] Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event vfat intel_rapl_msr fat intel_rapl_common snd_hda_codec_realtek mt76x2u snd_hda_codec_generic snd_hda_codec_hdmi mt76x2_common iwlmvm mt76x02_usb edac_mce_amd mt76_usb snd_hda_intel snd_intel_dspcfg mt76x02_lib snd_intel_sdw_acpi snd_usb_audio snd_hda_codec mt76 kvm_amd uvcvideo mac80211 snd_hda_core btusb eeepc_wmi snd_usbmidi_lib videobuf2_vmalloc videobuf2_memops kvm btrtl snd_rawmidi asus_wmi snd_hwdep videobuf2_v4l2 btbcm iwlwifi ledtrig_audio libarc4 btintel snd_seq videobuf2_common sparse_keymap btmtk irqbypass videodev snd_seq_device joydev xpad iwlmei platform_profile bluetooth ff_memless snd_pcm mc rapl [ 220.281185] video snd_timer cfg80211 wmi_bmof snd pcspkr soundcore k10temp i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram hid_logitech_hidpp amdgpu igb dca drm_ttm_helper ttm crct10dif_pclmul iommu_v2 crc32_pclmul gpu_sched crc32c_intel ucsi_ccg drm_buddy nvme typec_ucsi ghash_clmulni_intel drm_display_helper ccp nvme_core typec sp5100_tco cec wmi ip6_tables ip_tables fuse [ 220.281258] Unloaded tainted modules: amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 [ 220.281388] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 [ 220.281415] CPU: 1 PID: 3746 Comm: chrome:cs0 Tainted: G W L --- --- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1 [ 220.281421] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022 [ 220.281426] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 220.281431] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 98 c6 05 ce 7e be 01 01 e8 56 4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48 c7 [ 220.281437] RSP: 0018:b4b0d18d7a80 EFLAGS: 00010282 [ 220.281443] RAX: 0026 RBX: 0003 RCX: [ 220.281448] RDX: 0001 RSI: 988d06dc RDI: [ 220.281452] RBP: R08: R09: b4b0d18d7930 [ 220.281457] R10: 0003 R11: a0672e2fffe8 R12: a058ca360400 [ 220.281461] R13: a05846c50a18 R14: fe00 R15: 0003 [ 220.281465] FS: 7f82683e06c0() GS:a066e2e0() knlGS: [ 220.281470] CS: 0010 DS: ES: CR0: 80050033 [ 220.281475] CR2: 3590005cc000 CR3: 0001fca46000 CR4: 00350ee0 [ 220.281480] Call Trace: [ 220.281485] [ 220.281490] amdgpu_cs_ioctl+0x4e2/0x2070 [amdgpu] [ 220.281806] ? 
amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 220.282028] drm_ioctl_kernel+0xa4/0x150 [ 220.282043] drm_ioctl+0x21f/0x420 [ 220.282053] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu] [ 220.282275] ? lock_release+0x14f/0x460 [ 220.282282] ? _raw_spin_unlock_irqrestore+0x30/0x60 [ 220.282290] ? _raw_spin_unlock_irqrestore+0x30/0x60 [ 220.282297] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282305] ? _raw_spin_unlock_irqrestore+0x40/0x60 [ 220.282317] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu] [ 220.282534] __x64_sys_ioctl+0x90/0xd0 [ 220.282545] do_syscall_64+0x5b/0x80 [ 220.282551] ? futex_wake+0x6c/0x150 [ 220.282568] ? lock_is_held_type+0xe8/0x140 [ 220.282580] ? do_syscall_64+0x67/0x80 [ 220.282585] ? lockdep_hardirqs_on+0x7d/0x100 [ 220.282592] ? do_syscall_64+0x67/0x80 [ 220.282597] ? do_syscall_64+0x67/0x80 [ 220.282602] ? lockdep_hardi
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Thu, 21 Jan 2021 at 18:27, Christian König wrote: > > I still have no idea what's going on here. > > The KASAN messages from the DC code are completely unrelated. > > Please add the full dmesg to your bug report. > I did it. https://gitlab.freedesktop.org/drm/amd/-/issues/1439#note_776267 -- Best Regards, Mike Gavrilov.
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Fri, 15 Jan 2021 at 03:43, Mikhail Gavrilov wrote: > In rc4, the number of warnings has dropped dramatically. There are no more "kasan slab-out-of-bounds" errors and no "DMA-API device driver failed to check map error". Still not fixed are "sleeping function called from invalid context at include/linux/sched/mm.h:196" and "BUG: key 88810b0d9148 has not been registered!". The second issue is Navi-specific, because it started to happen on the 5.10 kernel after I replaced the Radeon VII with a 6900XT. 1. BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 500, name: systemd-udevd 1 lock held by systemd-udevd/500: #0: 888107690258 (&dev->mutex){}-{3:3}, at: device_driver_attach+0xa3/0x250 CPU: 9 PID: 500 Comm: systemd-udevd Not tainted 5.11.0-0.rc4.129.fc34.x86_64+debug #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0xae/0xe5 ___might_sleep.cold+0x150/0x17e ? dcn30_clock_source_create+0x53/0x110 [amdgpu] kmem_cache_alloc_trace+0x23f/0x270 dcn30_clock_source_create+0x53/0x110 [amdgpu] dcn30_create_resource_pool+0x998/0x4890 [amdgpu] ? dcn30_calc_max_scaled_time+0x40/0x40 [amdgpu] ? lock_is_held_type+0xb8/0xf0 ? unpoison_range+0x3a/0x60 ? kasan_kmalloc.constprop.0+0x84/0xa0 ? dc_create_resource_pool+0x26e/0x5e0 [amdgpu] dc_create_resource_pool+0x26e/0x5e0 [amdgpu] dc_create+0x636/0x1bc0 [amdgpu] ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0x18/0x170 ? find_held_lock+0x33/0x110 ? dc_create_state+0xa0/0xa0 [amdgpu] ? lock_downgrade+0x6b0/0x6b0 ? module_assert_mutex_or_preempt+0x3e/0x70 ? lock_is_held_type+0xb8/0xf0 ? unpoison_range+0x3a/0x60 ? kasan_kmalloc.constprop.0+0x84/0xa0 amdgpu_dm_init.isra.0+0x479/0x640 [amdgpu] ? vprintk_emit+0x1c0/0x460 ? dev_vprintk_emit+0x2d8/0x31a ? sched_clock+0x5/0x10 ? dm_resume+0x13b0/0x13b0 [amdgpu] ? dev_attr_show.cold+0x35/0x35 ? lock_downgrade+0x6b0/0x6b0 ?
dev_printk_emit+0x8c/0xa8 ? dev_vprintk_emit+0x31a/0x31a ? wait_for_completion_io+0x240/0x240 ? __dev_printk+0x71/0xdf ? smu_hw_init.cold+0x16b/0x18a [amdgpu] ? smu_suspend+0x240/0x240 [amdgpu] ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu] dm_hw_init+0xe/0x20 [amdgpu] amdgpu_device_init.cold+0x3031/0x4940 [amdgpu] ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu] ? pci_bus_read_config_byte+0x140/0x140 ? do_pci_enable_device+0x1f8/0x260 ? pci_find_saved_ext_cap+0x110/0x110 ? pci_enable_bridge+0xf9/0x1e0 ? pci_dev_check_d3cold+0x107/0x250 ? pci_enable_device_flags+0x201/0x340 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu] amdgpu_pci_probe+0x235/0x360 [amdgpu] ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu] local_pci_probe+0xd8/0x170 pci_device_probe+0x318/0x5c0 ? kernfs_create_link+0x16c/0x230 ? pci_device_remove+0x1d0/0x1d0 really_probe+0x224/0xc40 driver_probe_device+0x1f2/0x380 device_driver_attach+0x1df/0x250 __driver_attach+0xf6/0x260 ? device_driver_attach+0x250/0x250 bus_for_each_dev+0x114/0x180 ? subsys_dev_iter_exit+0x10/0x10 bus_add_driver+0x352/0x570 driver_register+0x20f/0x390 ? __pci_register_driver+0x13a/0x210 ? 0xc1d8d000 do_one_initcall+0xfb/0x530 ? perf_trace_initcall_level+0x3d0/0x3d0 ? __memset+0x2b/0x30 ? unpoison_range+0x3a/0x60 do_init_module+0x1ce/0x7a0 load_module+0x9841/0xa380 ? module_frob_arch_sections+0x20/0x20 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0 ? sched_clock_cpu+0x18/0x170 ? sched_clock+0x5/0x10 ? lock_acquire+0x2dd/0x7a0 ? sched_clock+0x5/0x10 ? lock_is_held_type+0xb8/0xf0 ? __do_sys_init_module+0x18b/0x220 __do_sys_init_module+0x18b/0x220 ? load_module+0xa380/0xa380 ? 
ktime_get_coarse_real_ts64+0x12f/0x160 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f2c109da07e Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48 RSP: 002b:7ffc84d33f88 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 55b87f8260a0 RCX: 7f2c109da07e RDX: 55b87f834060 RSI: 01e2cbf6 RDI: 7f2c0b7e0010 RBP: 7f2c0b7e0010 R08: 55b87f8281e0 R09: 7ffc84d30a26 R10: 55bd2404cc18 R11: 0246 R12: 55b87f834060 R13: 55b87f831ca0 R14: R15: 55b87f832640 [drm] Display Core initialized with v3.2.116! [drm] DMUB hardware initialized: version=0x0201 usb 1-3.2: Device not responding to setup address. usb 1-3.2: device not accepting address 5, error -71 [drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480 2. BUG: key 88810b0d9148 has not been registered! [ cut here ] DEBUG_LOCKS_WARN_ON(1) WARNING: CPU: 25 PID: 500 at kernel/locking/lockdep.c:4618 lockdep_init_map_waits+0x592/0x770 Modules li
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Thu, 14 Jan 2021 at 18:56, Christian König wrote: > Unfortunately not of hand. > > I also don't see any bug reports from other people and can't reproduce > the last backtrace you send out TTM here. Because only the most desperate will install kernels with enabled debug flags and then load the system by opening a huge number of programs and tabs. So you shouldn't be surprised that I'm the only one here. This is what my desktop looks like every day: https://imgur.com/a/Kxlmrem > Do you have any local modifications or special setup in your system? > Like bpf scripts or something like that? No, my I didn't write any bpf scripts, but looks like my distribution Fedora Rawhide uses some bpf scripts by default out of box: # bpftool prog 20: cgroup_device tag 40ddf486530245f5 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 504B jited 309B memlock 4096B 21: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 64B jited 54B memlock 4096B 22: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:04+0500 uid 0 xlated 64B jited 54B memlock 4096B 23: cgroup_device tag ca8e50a3c7fb034b gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 496B jited 307B memlock 4096B 24: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 64B jited 54B memlock 4096B 25: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:05+0500 uid 0 xlated 64B jited 54B memlock 4096B 26: cgroup_device tag be31ae23198a0378 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 464B jited 288B memlock 4096B 27: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 416B jited 255B memlock 4096B 28: cgroup_device tag 438c5618576e5b0c gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 568B jited 354B memlock 4096B 29: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B 30: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 
2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B 31: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B 32: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:13+0500 uid 0 xlated 64B jited 54B memlock 4096B 33: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B 34: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B 35: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 416B jited 255B memlock 4096B 38: cgroup_device tag 3a0ef5414c2f6fca gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 744B jited 447B memlock 4096B 39: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B 40: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:14+0500 uid 0 xlated 64B jited 54B memlock 4096B 41: cgroup_device tag ee0e253c78993a24 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 416B jited 255B memlock 4096B 42: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 64B jited 54B memlock 4096B 43: cgroup_skb tag 6deef7357e7b4530 gpl loaded_at 2021-01-15T01:30:18+0500 uid 0 xlated 64B jited 54B memlock 4096B I catched yet another couples of leaks , but nothing new: https://pastebin.com/2EgvYJdz [1] do_detailed_mode+0x7c1/0x13d0 [drm] [2] drm_mode_duplicate+0x45/0x220 [drm] [3] do_seccomp+0x215/0x2280 [4] __vmalloc_node_range+0x464/0x7b0 [5] bpf_prog_alloc_no_stats+0xa2/0x2b0 [6] bpf_prog_store_orig_filter+0x7b/0x1c0 [7] kmemdup+0x1a/0x40 Did the following trace message confuse anyone? 
== BUG: KASAN: slab-out-of-bounds in kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] Read of size 1 at addr 88812a6b4181 by task systemd-udevd/491 CPU: 20 PID: 491 Comm: systemd-udevd Not tainted 5.11.0-0.rc3.20210114git65f0d2414b70.125.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0xae/0xe5 print_address_description.constprop.0+0x18/0x160 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] kasan_report.cold+0x7f/0x10e ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu] ? kfd_create_crat_image_acpi+0x340/0x340 [amdgpu] ? __raw_spin_lock_init+0x39/0x110 kfd_topology_init+0x2ac/0x400 [amdgpu] ? kfd_create_topology_device+0x320/0x320 [amdgpu] ? __class_register+0x2ad/0x430 ? __class_create+0xc5/0x130 kgd2kfd_init+0x95/0xf0 [amdgpu] amdgpu_a
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Tue, 12 Jan 2021 at 01:45, Christian König wrote:
>
> But what you have in your logs so far are only unrelated symptoms, the
> root of the problem is that somebody is leaking memory.
>
> What you could do as well is to try to enable kmemleak

I captured some memleaks. Do they contain any useful information?

[1] https://pastebin.com/n0FE7Hsu
[2] https://pastebin.com/MUX55L1k
[3] https://pastebin.com/a3FT7DVG
[4] https://pastebin.com/1ALvJKz7

--
Best Regards,
Mike Gavrilov.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Hi Christian,

On Tue, 12 Jan 2021 at 01:45, Christian König wrote:
>
> Hi Mike,
>
> Unfortunately not, that's DC stuff. Easiest is to assign this as a bug
> tracker to our DC team.

Ok

> At least some progress. Any objections that I add your e-mail address as
> tested-by tag?

No objections, feel free to add me.

> I can take a look at this one here. Looks like some missing error
> handling when allocating memory.
> Can you decode to which line number ttm_tt_swapin+0x34 points to?

$ /usr/src/kernels/`uname -r`/scripts/faddr2line /lib/debug/lib/modules/`uname -r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_tt_swapin+0x34
ttm_tt_swapin+0x34/0xd0:
mapping_gfp_mask at /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/./include/linux/pagemap.h:105 (discriminator 2)
(inlined by) ttm_tt_swapin at /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c:210 (discriminator 2)

$ cat -s -n /usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c | head -220 | tail -20
   201      struct page *from_page;
   202      struct page *to_page;
   203      gfp_t gfp_mask;
   204      int i, ret;
   205
   206      swap_storage = ttm->swap_storage;
   207      BUG_ON(swap_storage == NULL);
   208
   209      swap_space = swap_storage->f_mapping;
   210      gfp_mask = mapping_gfp_mask(swap_space);
   211
   212      for (i = 0; i < ttm->num_pages; ++i) {
   213          from_page = shmem_read_mapping_page_gfp(swap_space, i,
   214                                                  gfp_mask);
   215          if (IS_ERR(from_page)) {
   216              ret = PTR_ERR(from_page);
   217              goto out_err;
   218          }
   219          to_page = ttm->pages[i];
   220          if (unlikely(to_page == NULL)) {

> Please use this one here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/new
>
> If you can't find the DC guys of hand in the assignee list just assign
> to me and I will forward.

https://gitlab.freedesktop.org/drm/amd/-/issues/1439
Ok, let's continue there.

--
Best Regards,
Mike Gavrilov.
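Editorial note: entries such as ttm_tt_swapin+0x34/0x1b0 [ttm] in the traces of this thread follow the kernel's symbol+offset/size notation, where the first hex number is the offset into the function, the second is the function's size, and the optional bracketed name is the module. The sketch below shows one way to split such an entry before handing the symbol+offset part to faddr2line. The names TraceSymbol and parse_trace_symbol are hypothetical helpers made up for this illustration, not part of any kernel tooling.

```python
import re
from typing import NamedTuple, Optional


class TraceSymbol(NamedTuple):
    """One decoded call-trace entry (hypothetical helper type)."""
    function: str
    offset: int            # byte offset into the function
    size: int              # total size of the function in bytes
    module: Optional[str]  # None for symbols built into vmlinux


# symbol+0xOFFSET/0xSIZE, optionally followed by " [module]"
_SYMBOL_RE = re.compile(
    r"(?P<function>[A-Za-z_][\w.]*)"
    r"\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_trace_symbol(text: str) -> Optional[TraceSymbol]:
    """Return the first symbol+offset/size entry found in text, or None."""
    match = _SYMBOL_RE.search(text)
    if match is None:
        return None
    return TraceSymbol(
        function=match["function"],
        offset=int(match["offset"], 16),
        size=int(match["size"], 16),
        module=match["module"],
    )


if __name__ == "__main__":
    print(parse_trace_symbol("RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]"))
```

The function name and hex offset recovered this way are the same two pieces that the faddr2line invocation above takes as its last argument.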
Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
On Mon, 11 Jan 2021 at 19:01, Christian König wrote: > Changing the page table attributes while releasing memory might sleep. > So we can't use a spinlock here. > > Thanks for the report, a patch to fix this is on the mailing list now. Can you look also the first trace? Here a same error message "sleeping function called from invalid context" and a lot of [amdgpu] code. BUG: sleeping function called from invalid context at include/linux/sched/mm.h:196 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 501, name: systemd-udevd 1 lock held by systemd-udevd/501: #0: 978e0278d258 (&dev->mutex){}-{3:3}, at: device_driver_attach+0x3b/0xb0 CPU: 25 PID: 501 Comm: systemd-udevd Not tainted 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 Call Trace: dump_stack+0x8b/0xb0 ___might_sleep.cold+0xb6/0xc6 ? dcn30_clock_source_create+0x34/0xb0 [amdgpu] kmem_cache_alloc_trace+0x204/0x230 dcn30_clock_source_create+0x34/0xb0 [amdgpu] dcn30_create_resource_pool+0x1d9/0x13a0 [amdgpu] ? rcu_read_lock_sched_held+0x3f/0x80 ? trace_kmalloc+0xb2/0xe0 ? __kmalloc+0x191/0x280 ? dc_create_resource_pool+0x110/0x1d0 [amdgpu] dc_create_resource_pool+0x110/0x1d0 [amdgpu] dc_create+0x205/0x790 [amdgpu] ? trace_kmalloc+0xb2/0xe0 ? kmem_cache_alloc_trace+0x174/0x230 amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu] ? dev_vprintk_emit+0x171/0x195 ? dev_printk_emit+0x3e/0x40 dm_hw_init+0xe/0x20 [amdgpu] amdgpu_device_init.cold+0x179f/0x1afd [amdgpu] ? pci_conf1_read+0xa4/0x100 amdgpu_driver_load_kms+0x68/0x280 [amdgpu] amdgpu_pci_probe+0x129/0x1b0 [amdgpu] local_pci_probe+0x42/0x80 pci_device_probe+0xd9/0x1a0 really_probe+0x205/0x460 driver_probe_device+0xe1/0x150 device_driver_attach+0xa8/0xb0 __driver_attach+0x8c/0x150 ? device_driver_attach+0xb0/0xb0 ? device_driver_attach+0xb0/0xb0 bus_for_each_dev+0x67/0x90 bus_add_driver+0x12e/0x1f0 driver_register+0x8f/0xe0 ? 
0xc0d9c000 do_one_initcall+0x67/0x320 ? rcu_read_lock_sched_held+0x3f/0x80 ? trace_kmalloc+0xb2/0xe0 ? kmem_cache_alloc_trace+0x174/0x230 do_init_module+0x5c/0x270 __do_sys_init_module+0x130/0x190 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f363661deee Code: 48 8b 0d 85 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0c 00 f7 d8 64 89 01 48 RSP: 002b:7ffeb7191588 EFLAGS: 0246 ORIG_RAX: 00af RAX: ffda RBX: 561b94563170 RCX: 7f363661deee RDX: 561b94579df0 RSI: 00b8a356 RDI: 7f3633b9e010 RBP: 7f3633b9e010 R08: 561b94565240 R09: 7ffeb718d786 R10: 561ef5ef1595 R11: 0246 R12: 561b94579df0 R13: 561b9457a3e0 R14: R15: 561b94576530 [drm] Display Core initialized with v3.2.116! [drm] DMUB hardware initialized: version=0x0201 usb 1-3.2: new high-speed USB device number 5 using xhci_hcd [drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480 > > -12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by > > the problem above, maybe something completely unrelated. > > > > I will take a look. > > The looks like a completely unrelated memory leak to me. > > Probably best if you open up a bug report for this. Yes, the monitor still turns off after applying patch "make the pool shrinker lock a mutex". Anyway patch fixed the issue with flood of message "BUG: sleeping function called from invalid context at mm/vmalloc.c:1756" so kernel log became cleaner. 
Now the issue with turns off monitor looks in logs so: DMA-API: cacheline tracking ENOMEM, dma-debug disabled amdgpu :0b:00.0: amdgpu: 6b791523 pin failed [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12 BUG: kernel NULL pointer dereference, address: 0060 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 P4D 0 Oops: [#1] SMP NOPTI CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: GW- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1 Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020 RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm] Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0 0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b 45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65 RSP: 0018:a7400532b9c0 EFLAGS: 00010286 RAX: 978e2ae25800 RBX: 97910ec12058 RCX: 978e12caac70 RDX: 8010 RSI: RDI: 97912c3d99c0 RBP: 97912c3d99c0 R08: R09: 70b3a000 R10: 0002 R11: R12: 97912c3d99c0 R13: R14: a7400532ba90 R15: 978e182c6350 FS: 7f070bb1b640(00
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12
Hi folks,
today I joined the testing of kernel 5.11 and saw that the kernel log was flooded with BUG messages:

BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 266, name: kswapd0
INFO: lockdep is turned off.
CPU: 15 PID: 266 Comm: kswapd0 Tainted: GW- --- 5.11.0-0.rc2.20210108gitf5e6c330254a.119.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 vm_unmap_aliases+0x21/0x40
 change_page_attr_set_clr+0x9e/0x190
 set_memory_wb+0x2f/0x80
 ttm_pool_free_page+0x28/0x90 [ttm]
 ttm_pool_shrink+0x45/0xb0 [ttm]
 ttm_pool_shrinker_scan+0xa/0x20 [ttm]
 do_shrink_slab+0x177/0x3a0
 shrink_slab+0x9c/0x290
 shrink_node+0x2e6/0x700
 balance_pgdat+0x2f5/0x650
 kswapd+0x21d/0x4d0
 ? do_wait_intr_irq+0xd0/0xd0
 ? balance_pgdat+0x650/0x650
 kthread+0x13a/0x150
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30

But the most unpleasant thing is that after a while the monitor turns off and does not come back on until a restart. This is accompanied by an entry in the kernel log:

amdgpu :0b:00.0: amdgpu: ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

$ grep "Failed to pin framebuffer with error" -Rn .
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:5816: DRM_ERROR("Failed to pin framebuffer with error %d\n", r);

$ git blame -L 5811,5821 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Blaming lines: 0% (11/9167), done.
5d43be0ccbc2f (Christian König 2017-10-26 18:06:23 +0200 5811)         domain = AMDGPU_GEM_DOMAIN_VRAM;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5812)
7b7c6c81b3a37 (Junwei Zhang    2018-06-25 12:51:14 +0800 5813)     r = amdgpu_bo_pin(rbo, domain);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5814)     if (unlikely(r != 0)) {
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5815)         if (r != -ERESTARTSYS)
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5816)             DRM_ERROR("Failed to pin framebuffer with error %d\n", r);
0f257b09531b4 (Chunming Zhou   2019-05-07 19:45:31 +0800 5817)         ttm_eu_backoff_reservation(&ticket, &list);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5818)         return r;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5819)     }
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5820)
bb812f1ea87dd (Junwei Zhang    2018-06-25 13:32:24 +0800 5821)     r = amdgpu_ttm_alloc_gart(&rbo->tbo);

Does anyone know how to fix it?

The full kernel logs are here:
[1] https://pastebin.com/fLasjDHX
[2] https://pastebin.com/g3wR2r9e

--
Best Regards,
Mike Gavrilov.
Re: [bug] Radeon 3900XT not switch to graphic mode on kernel 5.10
On Tue, 29 Dec 2020 at 20:15, Deucher, Alexander wrote:
>
> It looks like the driver is not able to access the firmware for some reason.
> Please make sure it is available in your initrd or compiled into the kernel
> depending on your config.

Exactly! Thanks!

# lsinitrd /boot/initramfs-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64.img | grep sienna_cichlid
# ls /usr/lib/firmware/amdgpu | grep sienna_cichlid
sienna_cichlid_ce.bin
sienna_cichlid_dmcub.bin
sienna_cichlid_me.bin
sienna_cichlid_mec2.bin
sienna_cichlid_mec.bin
sienna_cichlid_pfp.bin
sienna_cichlid_rlc.bin
sienna_cichlid_sdma.bin
sienna_cichlid_smc.bin
sienna_cichlid_sos.bin
sienna_cichlid_ta.bin
sienna_cichlid_vcn.bin
# dracut -f --regenerate-all
# lsinitrd /boot/initramfs-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64.img | grep sienna_cichlid
-rw-r--r-- 1 root root 263296 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_ce.bin
-rw-r--r-- 1 root root  80244 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_dmcub.bin
-rw-r--r-- 1 root root 263424 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_me.bin
-rw-r--r-- 2 root root 268592 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_mec2.bin
-rw-r--r-- 2 root root      0 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_mec.bin
-rw-r--r-- 1 root root 263424 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_pfp.bin
-rw-r--r-- 1 root root 128592 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_rlc.bin
-rw-r--r-- 1 root root  34048 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_sdma.bin
-rw-r--r-- 1 root root 247396 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_smc.bin
-rw-r--r-- 1 root root 215152 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_sos.bin
-rw-r--r-- 1 root root 333568 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_ta.bin
-rw-r--r-- 1 root root 504224 Dec 15 14:00 usr/lib/firmware/amdgpu/sienna_cichlid_vcn.bin
# grep '20201204git34816d20f173\|linux-firmware-20201218-116' /var/log/dnf.rpm.log
2020-12-06T12:40:44+0500 SUBDEBUG Installed: kernel-core-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-06T12:40:46+0500 SUBDEBUG Installed: kernel-modules-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-06T12:41:03+0500 SUBDEBUG Installed: kernel-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-06T12:41:03+0500 SUBDEBUG Installed: kernel-modules-extra-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-21T10:52:43+0500 SUBDEBUG Upgrade: linux-firmware-20201218-116.fc34.noarch

I think every update of linux-firmware should regenerate the initramfs, but my downstream report was closed:
https://bugzilla.redhat.com/show_bug.cgi?id=1911745

--
Best Regards,
Mike Gavrilov.
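Editorial note: the check done by hand in this message (compare the firmware directory with what the initramfs image actually contains) can be sketched as a small script. This is only an illustration under stated assumptions: the function name missing_from_initramfs is made up, the listing is assumed to be the `ls -l` style text that lsinitrd printed above, and each entry's path is assumed to be the last whitespace-separated field of its line (which would not hold for symlink entries printed as "name -> target").

```python
from typing import Iterable, List


def missing_from_initramfs(listing: str, firmware_names: Iterable[str]) -> List[str]:
    """Return firmware file names that do not appear in an lsinitrd-style listing.

    listing        -- text output of lsinitrd (one archive entry per line)
    firmware_names -- base names expected, e.g. from ls /usr/lib/firmware/amdgpu
    """
    packed = set()
    for line in listing.splitlines():
        fields = line.split()
        if fields:
            # Keep only the base name of the path in the last field.
            packed.add(fields[-1].rsplit("/", 1)[-1])
    return sorted(name for name in firmware_names if name not in packed)


if __name__ == "__main__":
    sample = (
        "-rw-r--r-- 1 root root 263296 Dec 15 14:00 "
        "usr/lib/firmware/amdgpu/sienna_cichlid_ce.bin\n"
    )
    # sienna_cichlid_sos.bin is reported because the sample listing lacks it.
    print(missing_from_initramfs(sample, ["sienna_cichlid_ce.bin", "sienna_cichlid_sos.bin"]))
```

An empty listing, like the first lsinitrd run above that printed nothing, makes every expected file show up as missing, which matches the state before dracut was re-run.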
Re: [bug] Radeon 3900XT not switch to graphic mode on kernel 5.10
On Sun, 27 Dec 2020 at 21:39, Mikhail Gavrilov wrote:
> I suppose the root of cause my problem here:
>
> [3.961326] amdgpu :0b:00.0: Direct firmware load for
> amdgpu/sienna_cichlid_sos.bin failed with error -2
> [3.961359] amdgpu :0b:00.0: amdgpu: failed to init sos firmware
> [3.961433] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
> [3.961529] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init
> of IP block failed -2
> [3.961549] amdgpu :0b:00.0: amdgpu: amdgpu_device_ip_init failed
> [3.961569] amdgpu :0b:00.0: amdgpu: Fatal error during GPU init
> [3.961911] amdgpu: probe of :0b:00.0 failed with error -2
>

# dnf provides */sienna_cichlid_sos.bin
Last metadata expiration check: 3:01:27 ago on Sun 27 Dec 2020 06:53:25 PM +05.
linux-firmware-20201218-116.fc34.noarch : Firmware files used by the Linux kernel
Repo: @System
Matched from:
Filename: /usr/lib/firmware/amdgpu/sienna_cichlid_sos.bin

linux-firmware-20201218-116.fc34.noarch : Firmware files used by the Linux kernel
Repo: rawhide
Matched from:
Filename: /usr/lib/firmware/amdgpu/sienna_cichlid_sos.bin

# dnf install linux-firmware-20201218-116.fc34.noarch
Last metadata expiration check: 3:02:11 ago on Sun 27 Dec 2020 06:53:25 PM +05.
Package linux-firmware-20201218-116.fc34.noarch is already installed.
Dependencies resolved.
Nothing to do.
Complete!

It looks like the firmware is present, so I don't understand why the kernel cannot read it.

--
Best Regards,
Mike Gavrilov.
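Editorial note: the "failed with error -2" in the log quoted above is a negated errno value, which is how kernel functions conventionally report failures. The sketch below decodes such values with Python's standard errno module. It assumes it is run on a Linux host, since the module reflects the numbering of the platform it runs on, and describe_kernel_error is a hypothetical helper name chosen for this example.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Turn a kernel-style return value such as -2 into 'ENOENT (<message>)'."""
    number = abs(code)  # kernel code returns errno values negated
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name} ({os.strerror(number)})"


if __name__ == "__main__":
    # -2 is the firmware load failure from this thread,
    # -12 is the framebuffer pin failure from the 5.11 thread.
    for value in (-2, -12):
        print(value, describe_kernel_error(value))
```

Under that assumption, -2 decodes to ENOENT: the firmware file was not found at the place the kernel looked for it, even though the package that ships it was installed.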
[bugreport] [5.10-rc1] Oops: 0000 [#1] SMP NOPTI bug which always starts as page allocation failure
Hi folks.
I have observed a hard-to-reproduce set of bugs.

It always starts with:
1) kworker/u64:2: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0

continues with:
2) WARNING: CPU: 21 PID: 806649 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7505 amdgpu_dm_atomic_commit_tail+0x23bd/0x24e0 [amdgpu]

and ends with:
3) BUG: unable to handle page fault for address: 00012488

This is annoying because it leads to a complete computer hang.

Example of one log:
[11561.927250] kworker/u64:10: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[11561.927472] CPU: 18 PID: 39985 Comm: kworker/u64:10 Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
[11561.927475] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
[11561.927485] Workqueue: events_unbound commit_work [drm_kms_helper]
[11561.927489] Call Trace:
[11561.927496] dump_stack+0x8b/0xb0
[11561.927501] warn_alloc.cold+0x75/0xd9
[11561.927507] ? _cond_resched+0x16/0x50
[11561.927512] ? __alloc_pages_direct_compact+0x159/0x180
[11561.927518] __alloc_pages_slowpath.constprop.0+0x103f/0x1070
[11561.927531] __alloc_pages_nodemask+0x37d/0x400
[11561.927538] kmalloc_order+0x33/0xc0
[11561.927542] kmalloc_order_trace+0x19/0x110
[11561.927614] dc_create_state+0x26/0x60 [amdgpu]
[11561.927677] amdgpu_dm_atomic_commit_tail+0x1cee/0x24e0 [amdgpu]
[11561.927686] ? find_busiest_group+0x33/0x350
[11561.927698] ? __lock_acquire+0x3b0/0x21f0
[11561.927707] ? lock_acquire+0xc8/0x400
[11561.927710] ? wait_for_completion_timeout+0x3b/0xf0
[11561.927715] ? mark_held_locks+0x50/0x80
[11561.927719] ? lockdep_hardirqs_on_prepare+0xff/0x180
[11561.927722] ? _raw_spin_unlock_irq+0x24/0x40
[11561.927726] ? _raw_spin_unlock_irq+0x24/0x40
[11561.927729] ?
wait_for_completion_timeout+0xdb/0xf0 [11561.927740] commit_tail+0x94/0x130 [drm_kms_helper] [11561.927745] process_one_work+0x27d/0x5b0 [11561.927753] worker_thread+0x55/0x3c0 [11561.927756] ? process_one_work+0x5b0/0x5b0 [11561.927760] kthread+0x13a/0x150 [11561.927763] ? __kthread_bind_mask+0x60/0x60 [11561.927769] ret_from_fork+0x22/0x30 [11561.927809] Mem-Info: [11561.927816] active_anon:933848 inactive_anon:4558268 isolated_anon:118 active_file:154021 inactive_file:80446 isolated_file:0 unevictable:1586 dirty:32469 writeback:700 slab_reclaimable:185330 slab_unreclaimable:176202 mapped:514440 shmem:592199 pagetables:81732 bounce:0 free:99082 free_pcp:2104 free_cma:0 [11561.927820] Node 0 active_anon:3735392kB inactive_anon:18233072kB active_file:616084kB inactive_file:321784kB unevictable:6344kB isolated(anon):472kB isolated(file):0kB mapped:2057760kB dirty:129876kB writeback:2800kB shmem:2368796kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:8kB kernel_stack:96608kB all_unreclaimable? 
no [11561.927824] Node 0 DMA free:11800kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15900kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [11561.927829] lowmem_reserve[]: 0 3136 31809 31809 31809 [11561.927839] Node 0 DMA32 free:142632kB min:26264kB low:29472kB high:32680kB reserved_highatomic:0KB active_anon:131568kB inactive_anon:1625184kB active_file:57556kB inactive_file:13532kB unevictable:0kB writepending:2428kB present:3317760kB managed:3317572kB mlocked:0kB pagetables:25624kB bounce:0kB free_pcp:1764kB local_pcp:0kB free_cma:0kB [11561.927844] lowmem_reserve[]: 0 0 28673 28673 28673 [11561.927854] Node 0 Normal free:241896kB min:240300kB low:269660kB high:299020kB reserved_highatomic:2048KB active_anon:3603472kB inactive_anon:16607812kB active_file:558660kB inactive_file:308056kB unevictable:6344kB writepending:130596kB present:30133248kB managed:29370624kB mlocked:6344kB pagetables:301304kB bounce:0kB free_pcp:6656kB local_pcp:60kB free_cma:0kB [11561.927859] lowmem_reserve[]: 0 0 0 0 0 [11561.927871] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB [11561.927900] Node 0 DMA32: 15432*4kB (UME) 4963*8kB (UME) 2169*16kB (UME) 201*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 142568kB [11561.927923] Node 0 Normal: 49027*4kB (UMEH) 5656*8kB (MH) 20*16kB (H) 10*32kB (H) 2*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 242380kB [11561.927951] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [11561.927954] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [11561.927956] 847580 total pagecache pages [11561.927967] 19862 pages in swap cache [11561.927970
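Editorial note: "order:5" in the allocation failure above means the kernel asked for 2^5 physically contiguous pages, and the per-zone lines of the form "15432*4kB ... 0*128kB" list how many free blocks of each size the allocator had at that moment. The sketch below parses one such line and counts the blocks large enough for a given order. It assumes 4 KiB base pages (the usual x86_64 configuration), and the helper names are invented for this illustration.

```python
import re
from typing import Dict

PAGE_KB = 4  # assumption: 4 KiB base pages, as on a typical x86_64 kernel

_BLOCK_RE = re.compile(r"(\d+)\*(\d+)kB")


def parse_free_blocks(line: str) -> Dict[int, int]:
    """Map block size in kB to free block count for one 'Node N ZONE:' line."""
    return {int(size): int(count) for count, size in _BLOCK_RE.findall(line)}


def order_to_kb(order: int) -> int:
    """Size in kB of one block of the given allocation order."""
    return PAGE_KB << order


def free_blocks_at_least(blocks: Dict[int, int], order: int) -> int:
    """Number of free blocks that could satisfy an allocation of this order."""
    needed = order_to_kb(order)
    return sum(count for size, count in blocks.items() if size >= needed)


if __name__ == "__main__":
    dma32 = ("Node 0 DMA32: 15432*4kB (UME) 4963*8kB (UME) 2169*16kB (UME) "
             "201*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB "
             "0*2048kB 0*4096kB = 142568kB")
    blocks = parse_free_blocks(dma32)
    print(order_to_kb(5), "kB needed;", free_blocks_at_least(blocks, 5), "blocks available")
    print(sum(size * count for size, count in blocks.items()), "kB free in total")
```

For the DMA32 line from this log, the per-size counts add up to the 142568kB the kernel printed, yet none of the free blocks is 128 kB or larger, so free memory in that zone was fragmented rather than exhausted.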
[BUG] general protection fault, probably for non-canonical address 0xfe5d6f0af7831e5e: 0000 [#1] SMP NOPTI (5.7RC4 GIT 79dede78c057)
Hi folks.
I didn't do anything unusual: I just restarted the computer after the update, launched all the applications that I usually launch, and went to drink tea. When I returned, I found that the monitor was still on (it should have turned off, since I had set energy saving to 5 minutes in the DE). I tried to move the mouse and realized that the computer was completely frozen. Even Alt+PrnScr+B did not help to reboot the computer. I decided to file the bug report here since this is a really serious problem.

general protection fault, probably for non-canonical address 0xfe5d6f0af7831e5e: 0000 [#1] SMP NOPTI
CPU: 16 PID: 6372 Comm: chrome:cs0 Not tainted 5.7.0-0.rc4.20200508git79dede78c057.1.fc33.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 1405 11/19/2019
RIP: 0010:kmem_cache_alloc+0x83/0x310
Code: 02 00 00 4c 8b 45 00 65 49 8b 50 08 65 4c 03 05 5b a3 cc 5e 4d 8b 20 4d 85 e4 0f 84 3e 02 00 00 8b 45 20 48 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 48 33 9d d0 01 00 00 48 0f c9 48 31 cb 40 f6 c7
RSP: 0018:a8398b357b08 EFLAGS: 00010282
RAX: fe5d6f0af7831e5e RBX: RCX:
RDX: 62b6 RSI: 0400 RDI: 001f83c0
RBP: 9513740e9200 R08: 95137c3f83c0 R09:
R10: R11: R12: fe5d6f0af7831dee
R13: 0dc0 R14: 9513740e9200 R15: c03a3e92
FS: 7fd77db5c700() GS:95137c20() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 7fea1fe56540 CR3: 00060424a000 CR4: 00340ee0
Call Trace:
 drm_sched_fence_create+0x22/0xc0 [gpu_sched]
 drm_sched_job_init+0x5d/0xa0 [gpu_sched]
 amdgpu_cs_ioctl+0x17d5/0x1eb0 [amdgpu]
 ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
 drm_ioctl_kernel+0x86/0xd0 [drm]
 drm_ioctl+0x206/0x390 [drm]
 ?
amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] ksys_ioctl+0x82/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x5c/0xa0 entry_SYSCALL_64_after_hwframe+0x49/0xb3 RIP: 0033:0x7fd7954654bb Code: 0f 1e fa 48 8b 05 cd b9 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9d b9 0c 00 f7 d8 64 89 01 48 RSP: 002b:7fd77db5b628 EFLAGS: 0246 ORIG_RAX: 0010 RAX: ffda RBX: 7fd77db5b690 RCX: 7fd7954654bb RDX: 7fd77db5b690 RSI: c0186444 RDI: 0016 RBP: c0186444 R08: 7fd77db5b7a0 R09: 7fd77db5b670 R10: R11: 0246 R12: 3a732f36f000 R13: 0016 R14: 3a732f5122ec R15: 3a732f50a0f8 Modules linked in: snd_seq_dummy snd_hrtimer uinput rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp tun nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat hid_logitech_hidpp xpad ff_memless joydev edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_logitech_dj eeepc_wmi asus_wmi sparse_keymap video snd_usb_audio btusb btrtl wmi_bmof btbcm snd_usbmidi_lib btintel snd_rawmidi bluetooth mc ecdh_generic ecc pcspkr sp5100_tco k10temp iwlmvm i2c_piix4 snd_hda_codec_realtek mac80211 snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi libarc4 snd_hda_intel snd_intel_dspcfg snd_hda_codec iwlwifi snd_hda_core snd_hwdep cfg80211 snd_seq snd_seq_device snd_pcm rfkill snd_timer snd ccp soundcore acpi_cpufreq binfmt_misc ip_tables xfs libcrc32c amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper cec drm crc32c_intel igb nvme dca nvme_core 
i2c_algo_bit wmi pinctrl_amd br_netfilter bridge stp llc fuse ---[ end trace 4528e591387ed399 ]--- RIP: 0010:kmem_cache_alloc+0x83/0x310 Code: 02 00 00 4c 8b 45 00 65 49 8b 50 08 65 4c 03 05 5b a3 cc 5e 4d 8b 20 4d 85 e4 0f 84 3e 02 00 00 8b 45 20 48 8b 7d 00 4c 01 e0 <48> 8b 18 48 89 c1 48 33 9d d0 01 00 00 48 0f c9 48 31 cb 40 f6 c7 RSP: 0018:a8398b357b08 EFLAGS: 00010282 RAX: fe5d6f0af7831e5e RBX: RCX: RDX: 62b6 RSI: 0400 RDI: 001f83c0 RBP: 9513740e9200 R08: 95137c3f83c0 R09: R10: R11: R12: fe5d6f0af7831dee R13: 0dc0 R14: 9513740e9200 R15: c03a3e92 FS: 7fd77db5c700() GS:95137c20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fea1fe56540 CR3: 00060424a000 CR4: 00340ee0 $ /usr/src/kernels/`uname -r`/scripts/faddr2line /lib/debug/lib/modules/`uname -r`/vmlinux
Re: BUG: kernel NULL pointer dereference, address: 0000000000000026 after switching to 5.7 kernel
On Sat, 11 Apr 2020 at 14:56, Christian König wrote:
>
> Yeah, that is a known issue.
>
> You could try the attached patch, but please be aware that it is not
> even compile tested because of the Easter holidays here.
>

Looks good to me, so it's a pity that this patch was not included in the pull request:
https://patchwork.kernel.org/patch/11492083/

--
Best Regards,
Mike Gavrilov.
BUG: kernel NULL pointer dereference, address: 0000000000000026 after switching to 5.7 kernel
Hi folks.
After upgrading the kernel to 5.7, I see the following error messages in the kernel log on every boot:

[2.569513] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[2.569538] [drm] PSP loading UVD firmware
[2.570038] BUG: kernel NULL pointer dereference, address: 0026
[2.570045] #PF: supervisor read access in kernel mode
[2.570050] #PF: error_code(0x) - not-present page
[2.570055] PGD 0 P4D 0
[2.570060] Oops: [#1] SMP NOPTI
[2.570065] CPU: 5 PID: 667 Comm: uvd_enc_1.1 Not tainted 5.7.0-0.rc0.git6.1.2.fc33.x86_64 #1
[2.570072] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 1405 11/19/2019
[2.570085] RIP: 0010:__kthread_should_park+0x5/0x30
[2.570090] Code: 00 e9 fe fe ff ff e8 ca 3a 08 00 e9 49 fe ff ff 48 89 df e8 dd 38 08 00 84 c0 0f 84 6a ff ff ff e9 a6 fe ff ff 0f 1f 44 00 00 47 26 20 74 12 48 8b 87 88 09 00 00 48 8b 00 48 c1 e8 02 83 e0
[2.570103] RSP: 0018:ad8141723e50 EFLAGS: 00010246
[2.570107] RAX: 7fff RBX: 8a8d1d116ed8 RCX:
[2.570112] RDX: RSI: RDI:
[2.570116] RBP: 8a8d28c11300 R08: R09:
[2.570120] R10: R11: R12: 8a8d1d152e40
[2.570125] R13: 8a8d1d117280 R14: 8a8d1d116ed8 R15: 8a8d1ca68000
[2.570131] FS: () GS:8a8d3aa0() knlGS:
[2.570137] CS: 0010 DS: ES: CR0: 80050033
[2.570142] CR2: 0026 CR3: 0007e3dc6000 CR4: 003406e0
[2.570147] Call Trace:
[2.570157] drm_sched_get_cleanup_job+0x42/0x130 [gpu_sched]
[2.570166] drm_sched_main+0x6f/0x530 [gpu_sched]
[2.570173] ? lockdep_hardirqs_on+0x11e/0x1b0
[2.570179] ? drm_sched_get_cleanup_job+0x130/0x130 [gpu_sched]
[2.570185] kthread+0x131/0x150
[2.570189] ? __kthread_bind_mask+0x60/0x60
[2.570196] ret_from_fork+0x27/0x50
[2.570203] Modules linked in: fjes(-) amdgpu(+) amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel igb nvme nvme_core dca i2c_algo_bit wmi pinctrl_amd br_netfilter bridge stp llc fuse
[2.570223] CR2: 0026
[2.570228] ---[ end trace 80c25d326e1e0d7c ]---
[2.570233] RIP: 0010:__kthread_should_park+0x5/0x30
[2.570238] Code: 00 e9 fe fe ff ff e8 ca 3a 08 00 e9 49 fe ff ff 48 89 df e8 dd 38 08 00 84 c0 0f 84 6a ff ff ff e9 a6 fe ff ff 0f 1f 44 00 00 47 26 20 74 12 48 8b 87 88 09 00 00 48 8b 00 48 c1 e8 02 83 e0
[2.570250] RSP: 0018:ad8141723e50 EFLAGS: 00010246
[2.570255] RAX: 7fff RBX: 8a8d1d116ed8 RCX:
[2.570260] RDX: RSI: RDI:
[2.570265] RBP: 8a8d28c11300 R08: R09:
[2.570271] R10: R11: R12: 8a8d1d152e40
[2.570276] R13: 8a8d1d117280 R14: 8a8d1d116ed8 R15: 8a8d1ca68000
[2.570281] FS: () GS:8a8d3aa0() knlGS:
[2.570287] CS: 0010 DS: ES: CR0: 80050033
[2.570292] CR2: 0026 CR3: 0007e3dc6000 CR4: 003406e0
[2.570299] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49
[2.570306] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 667, name: uvd_enc_1.1
[2.570311] INFO: lockdep is turned off.
[2.570315] irq event stamp: 14
[2.570319] hardirqs last enabled at (13): [] _raw_spin_unlock_irqrestore+0x46/0x60
[2.570330] hardirqs last disabled at (14): [] trace_hardirqs_off_thunk+0x1a/0x1c
[2.570338] softirqs last enabled at (0): [] copy_process+0x706/0x1bc0
[2.570345] softirqs last disabled at (0): [<>] 0x0
[2.570351] CPU: 5 PID: 667 Comm: uvd_enc_1.1 Tainted: G D 5.7.0-0.rc0.git6.1.2.fc33.x86_64 #1
[2.570358] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 1405 11/19/2019
[2.570365] Call Trace:
[2.570373] dump_stack+0x8b/0xc8
[2.570380] ___might_sleep.cold+0xb6/0xc6
[2.570385] exit_signals+0x1c/0x2d0
[2.570390] do_exit+0xb1/0xc30
[2.570395] ? kthread+0x131/0x150
[2.570400] rewind_stack_do_exit+0x17/0x20
[2.570559] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[2.570572] [drm] PSP loading VCE firmware
[3.146462] [drm] reserve 0x40 from 0x83fe80 for PSP TMR

$ /usr/src/kernels/`uname -r`/scripts/faddr2line /lib/debug/lib/modules/`uname -r`/vmlinux __kthread_should_park+0x5
__kthread_should_park+0x5/0x30:
to_kthread at kernel/kthread.c:75
(inlined by) __kthread_should_park at kernel/kthread.c:109

I think this issue is related to the amdgpu driver. Can anyone look into it? Thanks.

Full kernel log here: https://pastebin.com/RrSp6KYL

--
Best Regards,
Mike Gavrilov.
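A side note for anyone reading traces like the one above: a token such as `__kthread_should_park+0x5/0x30` is a symbol name, the offset into that function, and the function's size, both in hex. The faddr2line invocation above passes the symbol and offset part of the same token. The Python sketch below only illustrates how such a token splits up; the helper name `split_symbol_offset` is made up for this note and is not part of any kernel tooling. It expects the bare token, without the `0010:` segment prefix or a trailing `[module]` tag.

```python
import re

# Matches bare kernel symbol tokens such as "__kthread_should_park+0x5/0x30".
# Dots are allowed so that compiler-generated names like "foo.cold" or
# "foo.constprop.0" are accepted as well.
_TOKEN = re.compile(
    r"^(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)$"
)


def split_symbol_offset(token):
    """Split 'symbol+0xOFF/0xSIZE' into (symbol, offset, size) with integer values."""
    match = _TOKEN.match(token.strip())
    if match is None:
        raise ValueError("not a symbol+offset/size token: %r" % token)
    return (
        match.group("sym"),
        int(match.group("off"), 16),
        int(match.group("size"), 16),
    )


if __name__ == "__main__":
    sym, off, size = split_symbol_offset("__kthread_should_park+0x5/0x30")
    # The fault is reported 5 bytes into a function that is 0x30 (48) bytes long.
    print(sym, off, size)
```

A small offset like this one means the fault happened almost immediately on entry to the function, which fits the faddr2line result pointing at the inlined `to_kthread` helper.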
BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 60s!
Hi folks,
I just wanted to share my logs via a paste service but didn't check how large they were. I opened the file in Geany, pressed Ctrl + A and Ctrl + C, then switched to the Chrome tab with pastebin.com open and pressed Ctrl + V. I did not expect the system's GUI to hang after that. I connected via ssh and saw the following messages:

[ 317.662558] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 2003.644286] GpuWatchdog[4339]: segfault at 0 ip 562357dfa40c sp 7fbc6bdc3500 error 6 in chrome[562353e82000+731f000]
[ 2003.644341] Code: 3d bd 02 47 fb be 01 00 00 00 ba 07 00 00 00 e8 3a 9f a6 fe 48 8d 3d 0f 41 48 fb be 01 00 00 00 ba 03 00 00 00 e8 24 9f a6 fe 04 25 00 00 00 00 37 13 00 00 c6 05 82 a8 bd 03 01 80 7d 87 00
[ 2032.449702] GpuWatchdog[10475]: segfault at 0 ip 55ad62b7240c sp 7f81bc7ff500 error 6 in chrome[55ad5ebfa000+731f000]
[ 2032.449719] Code: 3d bd 02 47 fb be 01 00 00 00 ba 07 00 00 00 e8 3a 9f a6 fe 48 8d 3d 0f 41 48 fb be 01 00 00 00 ba 03 00 00 00 e8 24 9f a6 fe 04 25 00 00 00 00 37 13 00 00 c6 05 82 a8 bd 03 01 80 7d 87 00
[ 2060.726076] GpuWatchdog[10663]: segfault at 0 ip 558ea234c40c sp 7f26a3d3e500 error 6 in chrome[558e9e3d4000+731f000]
[ 2060.726093] Code: 3d bd 02 47 fb be 01 00 00 00 ba 07 00 00 00 e8 3a 9f a6 fe 48 8d 3d 0f 41 48 fb be 01 00 00 00 ba 03 00 00 00 e8 24 9f a6 fe 04 25 00 00 00 00 37 13 00 00 c6 05 82 a8 bd 03 01 80 7d 87 00
[ 2253.777053] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 60s!
[ 2253.777144] Showing busy workqueues and worker pools: [ 2253.777149] workqueue events: flags=0x0 [ 2253.777313] pwq 22: cpus=11 node=0 flags=0x0 nice=0 active=1/256 refcnt=2 [ 2253.777849] in-flight: 10359:key_garbage_collector [ 2253.777856] == [ 2253.777856] WARNING: possible circular locking dependency detected [ 2253.777857] 5.5.0-0.rc5.git3.2.fc32.x86_64 #1 Not tainted [ 2253.777857] -- [ 2253.777858] WRRende~ckend#1/6583 is trying to acquire lock: [ 2253.777858] b866aa40 (console_owner){-.-.}, at: console_unlock+0x197/0x5c0 [ 2253.777860] but task is already holding lock: [ 2253.777861] 9a5a3b9ee798 (&(&pool->lock)->rlock){-.-.}, at: show_workqueue_state.cold+0x7c/0x2d1 [ 2253.777863] which lock already depends on the new lock. [ 2253.777864] the existing dependency chain (in reverse order) is: [ 2253.777864] -> #1 (&(&pool->lock)->rlock){-.-.}: [ 2253.777866]_raw_spin_lock+0x31/0x80 [ 2253.777866]__queue_work+0x36b/0x610 [ 2253.777866]queue_work_on+0x85/0x90 [ 2253.777867]soft_cursor+0x19f/0x220 [ 2253.777867]bit_cursor+0x3d4/0x5f0 [ 2253.777868]hide_cursor+0x2a/0x90 [ 2253.777868]vt_console_print+0x3ef/0x400 [ 2253.777868]console_unlock+0x41a/0x5c0 [ 2253.777869]register_framebuffer+0x28f/0x300 [ 2253.777870] __drm_fb_helper_initial_config_and_unlock+0x32e/0x4e0 [drm_kms_helper] [ 2253.777870]amdgpu_fbdev_init+0xbc/0xf0 [amdgpu] [ 2253.777870]amdgpu_device_init.cold+0x1674/0x1acc [amdgpu] [ 2253.777871]amdgpu_driver_load_kms+0x53/0x1a0 [amdgpu] [ 2253.777871]drm_dev_register+0x113/0x150 [drm] [ 2253.777872]amdgpu_pci_probe+0xec/0x150 [amdgpu] [ 2253.777872]local_pci_probe+0x42/0x80 [ 2253.777872]pci_device_probe+0x107/0x1a0 [ 2253.777873]really_probe+0x147/0x3c0 [ 2253.777873]driver_probe_device+0xb6/0x100 [ 2253.777874]device_driver_attach+0x53/0x60 [ 2253.777874]__driver_attach+0x8c/0x150 [ 2253.777874]bus_for_each_dev+0x7b/0xc0 [ 2253.777875]bus_add_driver+0x150/0x1f0 [ 2253.777875]driver_register+0x6c/0xc0 [ 2253.777875]do_one_initcall+0x5d/0x2f0 [ 
2253.777876]do_init_module+0x5c/0x230 [ 2253.777876]load_module+0x2400/0x2650 [ 2253.777877]__do_sys_init_module+0x181/0x1b0 [ 2253.777877]do_syscall_64+0x5c/0xa0 [ 2253.777877]entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 2253.777878] -> #0 (console_owner){-.-.}: [ 2253.777879]__lock_acquire+0xe13/0x1a30 [ 2253.777880]lock_acquire+0xa2/0x1b0 [ 2253.777880]console_unlock+0x1f0/0x5c0 [ 2253.777880]vprintk_emit+0x180/0x350 [ 2253.777881]printk+0x58/0x6f [ 2253.777881]show_pwq+0x6c/0x298 [ 2253.777882]show_workqueue_state.cold+0x91/0x2d1 [ 2253.777882]wq_watchdog_timer_fn+0x1ba/0x240 [ 2253.777882]call_timer_fn+0xaf/0x2c0 [ 2253.777883]run_timer_softirq+0x3a0/0x5e0 [ 2253.777883]__do_softirq+0xe1/0x45d [ 2253.777884]irq_exit+0xf7/0x100 [ 2253.777884]smp_apic_timer_interrupt+0xa4/0x230 [ 2253.777884]apic_timer_interrupt+0xf/0x20 [ 2253.777885] other info that might help us de
Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]
On Mon, 9 Sep 2019 at 14:15, Koenig, Christian wrote:
>
> I agree with Daniel's analysis.
>
> It looks like the problem is simply that PM turns off a block before all
> work is done on that block.
>
> Have you opened a bug report yet? If not then that would certainly help
> cause it is really hard to extract all necessary information from that
> mail thread.

https://bugs.freedesktop.org/show_bug.cgi?id=111689

Will this do?

--
Best Regards,
Mike Gavrilov.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]
On Thu, 5 Sep 2019 at 12:58, Daniel Vetter wrote: > > I think those fences are only emitted for CS, not display related. > Adding Christian König. More fresh kernel log with 5.3RC7 - the issue still happens. https://pastebin.com/tyxkWJYV -- Best Regards, Mike Gavrilov. On Thu, 5 Sep 2019 at 12:58, Daniel Vetter wrote: > > On Thu, Sep 5, 2019 at 12:27 AM Mikhail Gavrilov > wrote: > > > > On Wed, 4 Sep 2019 at 13:37, Daniel Vetter wrote: > > > > > > Extend your backtrac warning slightly like > > > > > > WARN(r, "we're stuck on fence %pS\n", fence->ops); > > > > > > Also adding Harry and Alex, I'm not really working on amdgpu ... > > > > [ 3511.998320] [ cut here ] > > [ 3511.998714] we're stuck on fence > > amdgpu_fence_ops+0x0/0xc220 [amdgpu]$ > > I think those fences are only emitted for CS, not display related. > Adding Christian König. > -Daniel > > > [ 3511.998991] WARNING: CPU: 10 PID: 1811 at > > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:332 > > amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu] > > [ 3511.999009] Modules linked in: rfcomm fuse xt_CHECKSUM > > xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc > > nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT > > nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack > > ebtable_nat ip6table_nat ip6table_mangle ip6table_raw > > ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw > > iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c > > ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables > > iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd > > snd_hda_codec_realtek rtwpci snd_hda_codec_generic kvm ledtrig_audio > > snd_hda_codec_hdmi uvcvideo rtw88 videobuf2_vmalloc snd_hda_intel > > videobuf2_memops videobuf2_v4l2 irqbypass snd_usb_audio snd_hda_codec > > videobuf2_common crct10dif_pclmul snd_usbmidi_lib crc32_pclmul > > mac80211 snd_rawmidi videodev snd_hda_core ghash_clmulni_intel btusb > > snd_hwdep btrtl 
snd_seq btbcm btintel snd_seq_device eeepc_wmi > > bluetooth xpad joydev mc snd_pcm > > [ 3511.999076] asus_wmi ff_memless cfg80211 sparse_keymap video > > wmi_bmof ecdh_generic snd_timer ecc sp5100_tco k10temp snd i2c_piix4 > > ccp rfkill soundcore libarc4 gpio_amdpt gpio_generic acpi_cpufreq > > binfmt_misc ip_tables hid_logitech_hidpp hid_logitech_dj amdgpu > > amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel igb dca > > nvme i2c_algo_bit nvme_core wmi pinctrl_amd > > [ 3511.999126] CPU: 10 PID: 1811 Comm: Xorg Not tainted > > 5.3.0-0.rc6.git2.1c.fc32.x86_64 #1 > > [ 3511.999131] Hardware name: System manufacturer System Product > > Name/ROG STRIX X470-I GAMING, BIOS 2703 08/20/2019 > > [ 3511.999253] RIP: 0010:amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu] > > [ 3511.999278] Code: fe ff ff 31 c0 c3 48 89 ef e8 36 29 04 cb 84 c0 > > 74 08 48 89 ef e8 8a a9 21 cb 48 8b 75 08 48 c7 c7 2c 16 86 c0 e8 82 > > b8 b9 ca <0f> 0b b8 ea ff ff ff 5d c3 e8 ec 57 c3 ca 84 c0 0f 85 6f ff > > ff ff > > [ 3511.999282] RSP: 0018:b9c04170f798 EFLAGS: 00210282 > > [ 3511.999288] RAX: RBX: 8d2ce5205a80 RCX: > > 0006 > > [ 3511.999292] RDX: 0007 RSI: 8d2c5bea4070 RDI: > > 8d2cfb5d9e00 > > [ 3511.999296] RBP: 8d28becae480 R08: 0331b36fd503 R09: > > > > [ 3511.999299] R10: R11: R12: > > 8d2ce520 > > [ 3511.999303] R13: R14: R15: > > 8d2ce154 > > [ 3511.999308] FS: 7f59a5bc6f00() GS:8d2cfb40() > > knlGS: > > [ 3511.999311] CS: 0010 DS: ES: CR0: 80050033 > > [ 3511.999315] CR2: 1108bc475960 CR3: 00075bf32000 CR4: > > 003406e0 > > [ 3511.999319] Call Trace: > > [ 3511.999394] amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu] > > [ 3511.999503] dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu] > > [ 3511.999609] dce12_update_clocks+0xd8/0x110 [amdgpu] > > [ 3511.999712] dc_commit_state+0x414/0x590 [amdgpu] > > [ 3511.999725] ? find_held_lock+0x32/0x90 > > [ 3511.999832] amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu] > > [ 3511.999844] ? 
reacquire_held_locks+0xed/0x210 > > [ 3511.999859] ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm] > > [ 3511.999866] ? find_held_lock+0x32/0x90 > > [ 3511.999872] ? find_held_lock+0x32/0x90 > > [ 3511.
Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]
On Wed, 4 Sep 2019 at 13:37, Daniel Vetter wrote: > > Extend your backtrac warning slightly like > > WARN(r, "we're stuck on fence %pS\n", fence->ops); > > Also adding Harry and Alex, I'm not really working on amdgpu ... [ 3511.998320] [ cut here ] [ 3511.998714] we're stuck on fence amdgpu_fence_ops+0x0/0xc220 [amdgpu] [ 3511.998991] WARNING: CPU: 10 PID: 1811 at drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:332 amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu] [ 3511.999009] Modules linked in: rfcomm fuse xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd snd_hda_codec_realtek rtwpci snd_hda_codec_generic kvm ledtrig_audio snd_hda_codec_hdmi uvcvideo rtw88 videobuf2_vmalloc snd_hda_intel videobuf2_memops videobuf2_v4l2 irqbypass snd_usb_audio snd_hda_codec videobuf2_common crct10dif_pclmul snd_usbmidi_lib crc32_pclmul mac80211 snd_rawmidi videodev snd_hda_core ghash_clmulni_intel btusb snd_hwdep btrtl snd_seq btbcm btintel snd_seq_device eeepc_wmi bluetooth xpad joydev mc snd_pcm [ 3511.999076] asus_wmi ff_memless cfg80211 sparse_keymap video wmi_bmof ecdh_generic snd_timer ecc sp5100_tco k10temp snd i2c_piix4 ccp rfkill soundcore libarc4 gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc ip_tables hid_logitech_hidpp hid_logitech_dj amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel igb dca nvme i2c_algo_bit nvme_core wmi pinctrl_amd [ 3511.999126] CPU: 10 PID: 1811 Comm: Xorg Not tainted 5.3.0-0.rc6.git2.1c.fc32.x86_64 #1 [ 3511.999131] Hardware name: System manufacturer System Product 
Name/ROG STRIX X470-I GAMING, BIOS 2703 08/20/2019 [ 3511.999253] RIP: 0010:amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu] [ 3511.999278] Code: fe ff ff 31 c0 c3 48 89 ef e8 36 29 04 cb 84 c0 74 08 48 89 ef e8 8a a9 21 cb 48 8b 75 08 48 c7 c7 2c 16 86 c0 e8 82 b8 b9 ca <0f> 0b b8 ea ff ff ff 5d c3 e8 ec 57 c3 ca 84 c0 0f 85 6f ff ff ff [ 3511.999282] RSP: 0018:b9c04170f798 EFLAGS: 00210282 [ 3511.999288] RAX: RBX: 8d2ce5205a80 RCX: 0006 [ 3511.999292] RDX: 0007 RSI: 8d2c5bea4070 RDI: 8d2cfb5d9e00 [ 3511.999296] RBP: 8d28becae480 R08: 0331b36fd503 R09: [ 3511.999299] R10: R11: R12: 8d2ce520 [ 3511.999303] R13: R14: R15: 8d2ce154 [ 3511.999308] FS: 7f59a5bc6f00() GS:8d2cfb40() knlGS: [ 3511.999311] CS: 0010 DS: ES: CR0: 80050033 [ 3511.999315] CR2: 1108bc475960 CR3: 00075bf32000 CR4: 003406e0 [ 3511.999319] Call Trace: [ 3511.999394] amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu] [ 3511.999503] dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu] [ 3511.999609] dce12_update_clocks+0xd8/0x110 [amdgpu] [ 3511.999712] dc_commit_state+0x414/0x590 [amdgpu] [ 3511.999725] ? find_held_lock+0x32/0x90 [ 3511.999832] amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu] [ 3511.999844] ? reacquire_held_locks+0xed/0x210 [ 3511.999859] ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm] [ 3511.999866] ? find_held_lock+0x32/0x90 [ 3511.999872] ? find_held_lock+0x32/0x90 [ 3511.999881] ? __lock_acquire+0x247/0x1910 [ 3511.999893] ? find_held_lock+0x32/0x90 [ 3511.01] ? mark_held_locks+0x50/0x80 [ 3511.07] ? _raw_spin_unlock_irq+0x29/0x40 [ 3511.13] ? lockdep_hardirqs_on+0xf0/0x180 [ 3511.19] ? _raw_spin_unlock_irq+0x29/0x40 [ 3511.24] ? wait_for_completion_timeout+0x75/0x190 [ 3511.52] ? commit_tail+0x3c/0x70 [drm_kms_helper] [ 3511.66] commit_tail+0x3c/0x70 [drm_kms_helper] [ 3511.79] drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper] [ 3512.02] drm_mode_atomic_ioctl+0x793/0x9b0 [drm] [ 3512.14] ? __lock_acquire+0x247/0x1910 [ 3512.44] ? 
drm_atomic_set_property+0xa50/0xa50 [drm] [ 3512.66] drm_ioctl_kernel+0xaa/0xf0 [drm] [ 3512.88] drm_ioctl+0x208/0x390 [drm] [ 3512.000108] ? drm_atomic_set_property+0xa50/0xa50 [drm] [ 3512.000120] ? lockdep_hardirqs_on+0xf0/0x180 [ 3512.000205] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [ 3512.000216] do_vfs_ioctl+0x411/0x750 [ 3512.000229] ksys_ioctl+0x5e/0x90 [ 3512.000237] __x64_sys_ioctl+0x16/0x20 [ 3512.000242] do_syscall_64+0x5c/0xb0 [ 3512.000249] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3512.000254] RIP: 0033:0x7f59a603d00b [ 3512.000259] Code: 0f 1e fa 48 8b 05 7d 9e 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d
Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]
On Tue, 3 Sep 2019 at 13:21, Hillf Danton wrote:
>
> Describe the problems you are experiencing please.
> Say is the screen locked up? Machine lockedup?
> Anything unnormal after you see the warning?
>

According to my observations, every "gnome shell stuck warning" happened while I was away from the computer and the screen was locked. I did not notice any problems in the morning (I did not even look at the kernel logs). I found out that the problem had happened when I connected to my computer remotely via ssh from work and happened to look at the dmesg output. In the evening after work I even played "Division" and still did not notice any problems. It is now 11:01 pm and "gnome shell stuck warning" has not appeared since 19:17. So it looks like the issue happens only when the computer is locked and the monitor is in power save mode.

$ dmesg -T | grep gnome
---> I went to sleep
[Tue Sep 3 01:00:10 2019] gnome shell stuck warning
[Tue Sep 3 01:00:55 2019] gnome shell stuck warning
[Tue Sep 3 06:54:50 2019] gnome shell stuck warning
<--- I woke up at 8:00 am and sat down at the computer again
---> I left for work at 9:30
[Tue Sep 3 10:00:05 2019] gnome shell stuck warning
[Tue Sep 3 10:10:01 2019] gnome shell stuck warning
[Tue Sep 3 10:13:43 2019] gnome shell stuck warning
[Tue Sep 3 10:23:37 2019] gnome shell stuck warning
[Tue Sep 3 10:42:07 2019] gnome shell stuck warning
[Tue Sep 3 10:42:57 2019] gnome shell stuck warning
[Tue Sep 3 10:59:25 2019] gnome shell stuck warning
[Tue Sep 3 11:08:35 2019] gnome shell stuck warning
[Tue Sep 3 11:13:19 2019] gnome shell stuck warning
[Tue Sep 3 11:15:20 2019] gnome shell stuck warning
[Tue Sep 3 11:26:20 2019] gnome shell stuck warning
[Tue Sep 3 11:26:20 2019] gnome shell stuck warning
[Tue Sep 3 11:36:30 2019] gnome shell stuck warning
[Tue Sep 3 11:46:08 2019] gnome shell stuck warning
[Tue Sep 3 11:53:52 2019] gnome shell stuck warning
[Tue Sep 3 11:56:36 2019] gnome shell stuck warning
[Tue Sep 3 12:17:10 2019] gnome shell stuck warning
[Tue Sep 3 12:20:20 2019] gnome shell stuck warning
[Tue Sep 3 12:20:20 2019] gnome shell stuck warning
[Tue Sep 3 12:30:46 2019] gnome shell stuck warning
[Tue Sep 3 12:40:52 2019] gnome shell stuck warning
[Tue Sep 3 12:55:30 2019] gnome shell stuck warning
[Tue Sep 3 12:57:52 2019] gnome shell stuck warning
[Tue Sep 3 13:04:00 2019] gnome shell stuck warning
[Tue Sep 3 13:12:38 2019] gnome shell stuck warning
[Tue Sep 3 13:14:32 2019] gnome shell stuck warning
[Tue Sep 3 13:53:12 2019] gnome shell stuck warning
[Tue Sep 3 14:12:52 2019] gnome shell stuck warning
[Tue Sep 3 14:15:54 2019] gnome shell stuck warning
[Tue Sep 3 14:17:04 2019] gnome shell stuck warning
[Tue Sep 3 14:21:57 2019] gnome shell stuck warning
[Tue Sep 3 14:22:10 2019] gnome shell stuck warning
[Tue Sep 3 14:37:42 2019] gnome shell stuck warning
[Tue Sep 3 14:41:51 2019] gnome shell stuck warning
[Tue Sep 3 14:42:52 2019] gnome shell stuck warning
[Tue Sep 3 14:46:35 2019] gnome shell stuck warning
[Tue Sep 3 15:03:18 2019] gnome shell stuck warning
[Tue Sep 3 15:16:50 2019] gnome shell stuck warning
[Tue Sep 3 15:27:30 2019] gnome shell stuck warning
[Tue Sep 3 15:27:41 2019] gnome shell stuck warning
[Tue Sep 3 16:08:06 2019] gnome shell stuck warning
[Tue Sep 3 16:24:16 2019] gnome shell stuck warning
[Tue Sep 3 16:33:04 2019] gnome shell stuck warning
[Tue Sep 3 16:52:10 2019] gnome shell stuck warning
[Tue Sep 3 17:18:27 2019] gnome shell stuck warning
[Tue Sep 3 17:25:30 2019] gnome shell stuck warning
[Tue Sep 3 17:41:16 2019] gnome shell stuck warning
[Tue Sep 3 17:43:32 2019] gnome shell stuck warning
[Tue Sep 3 17:51:10 2019] gnome shell stuck warning
[Tue Sep 3 18:41:44 2019] gnome shell stuck warning
[Tue Sep 3 18:44:18 2019] gnome shell stuck warning
[Tue Sep 3 19:03:07 2019] gnome shell stuck warning
[Tue Sep 3 19:17:58 2019] gnome shell stuck warning
<--- Returned home and sat down at the computer again

--
Best Regards,
Mike Gavrilov.
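The observation above (warnings only while the machine is locked and the monitor is in power save) is easier to check when the `dmesg -T` timestamps are bucketed by hour. A minimal Python sketch follows, assuming the human-readable timestamp format shown in the message; the function name `warnings_per_hour` and the default search string are just illustrative choices.

```python
import re
from collections import Counter

# Extracts the hour from a `dmesg -T` stamp such as "[Tue Sep 3 10:00:05 2019]".
# \s+ tolerates the double space that appears before single-digit days.
STAMP = re.compile(r"\[\w{3}\s+\w{3}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s+\d{4}\]")


def warnings_per_hour(lines, needle="gnome shell stuck warning"):
    """Count lines containing `needle`, grouped by the hour in their timestamp."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = STAMP.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def format_histogram(counts):
    """Render one text row per hour that has at least one hit."""
    return "\n".join(
        "%02d:00  %s (%d)" % (hour, "#" * counts[hour], counts[hour])
        for hour in sorted(counts)
    )
```

To use it on a live system, feed it the lines of `dmesg -T` output (for example `warnings_per_hour(sys.stdin)` in a small wrapper script) and print `format_histogram(...)`. Hours with no rows are the ones where no warning was logged, which can then be compared with the times the session was actually in use.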
Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]
On Fri, 30 Aug 2019 at 08:30, Hillf Danton wrote: > > Add a warning to show if it makes sense in field: neither regression nor > problem will have been observed with the warning printed. > I caught the problem. [21793.094289] [ cut here ] [21793.094296] gnome shell stuck warning [21793.094391] WARNING: CPU: 14 PID: 1768 at drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:332 amdgpu_fence_wait_empty+0x1c2/0x230 [amdgpu] [21793.094394] Modules linked in: rfcomm fuse xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd snd_hda_codec_realtek rtwpci rtw88 snd_hda_codec_generic snd_usb_audio kvm ledtrig_audio snd_hda_codec_hdmi snd_hda_intel mac80211 snd_hda_codec snd_usbmidi_lib irqbypass uvcvideo snd_rawmidi snd_hda_core videobuf2_vmalloc videobuf2_memops crct10dif_pclmul btusb videobuf2_v4l2 snd_hwdep crc32_pclmul btrtl videobuf2_common snd_seq eeepc_wmi btbcm xpad asus_wmi btintel snd_seq_device ghash_clmulni_intel cfg80211 sparse_keymap [21793.094426] ff_memless joydev bluetooth videodev video snd_pcm wmi_bmof mc ecdh_generic snd_timer ecc snd ccp rfkill libarc4 soundcore sp5100_tco k10temp i2c_piix4 gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc ip_tables hid_logitech_hidpp hid_logitech_dj amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper igb drm nvme dca crc32c_intel i2c_algo_bit nvme_core wmi pinctrl_amd [21793.094449] CPU: 14 PID: 1768 Comm: Xorg Tainted: GW 5.3.0-0.rc6.git2.1b.fc32.x86_64 #1 [21793.094452] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2406 
06/21/2019 [21793.094499] RIP: 0010:amdgpu_fence_wait_empty+0x1c2/0x230 [amdgpu] [21793.094502] Code: b5 f4 e9 c1 fe ff ff 31 c0 c3 48 89 ef e8 36 69 f8 f4 84 c0 74 08 48 89 ef e8 8a e9 15 f5 48 c7 c7 2c d6 91 c0 e8 86 f8 ad f4 <0f> 0b b8 ea ff ff ff 5d c3 e8 f0 97 b7 f4 84 c0 0f 85 73 ff ff ff [21793.094505] RSP: 0018:ae13418c3798 EFLAGS: 00010282 [21793.094508] RAX: RBX: 8aa065f85a80 RCX: 0006 [21793.094511] RDX: 0007 RSI: 8a9fe32ec070 RDI: 8aa07bdd9e00 [21793.094513] RBP: 8aa069469d00 R08: 13d219a4ead6 R09: [21793.094516] R10: R11: R12: 8aa065f8 [21793.094518] R13: R14: R15: 8aa065fb [21793.094521] FS: 7f586201cf00() GS:8aa07bc0() knlGS: [21793.094524] CS: 0010 DS: ES: CR0: 80050033 [21793.094526] CR2: 7f57fc5b5000 CR3: 00076334 CR4: 003406e0 [21793.094528] Call Trace: [21793.094580] amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu] [21793.094655] dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu] [21793.094728] dce12_update_clocks+0xd8/0x110 [amdgpu] [21793.094799] dc_commit_state+0x414/0x590 [amdgpu] [21793.094807] ? find_held_lock+0x32/0x90 [21793.094880] amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu] [21793.094888] ? reacquire_held_locks+0xed/0x210 [21793.094898] ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm] [21793.094903] ? find_held_lock+0x32/0x90 [21793.094906] ? find_held_lock+0x32/0x90 [21793.094912] ? __lock_acquire+0x247/0x1910 [21793.094920] ? find_held_lock+0x32/0x90 [21793.094925] ? mark_held_locks+0x50/0x80 [21793.094929] ? _raw_spin_unlock_irq+0x29/0x40 [21793.094933] ? lockdep_hardirqs_on+0xf0/0x180 [21793.094937] ? _raw_spin_unlock_irq+0x29/0x40 [21793.094941] ? wait_for_completion_timeout+0x75/0x190 [21793.094954] ? commit_tail+0x3c/0x70 [drm_kms_helper] [21793.094962] commit_tail+0x3c/0x70 [drm_kms_helper] [21793.094971] drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper] [21793.094986] drm_mode_atomic_ioctl+0x793/0x9b0 [drm] [21793.094994] ? __lock_acquire+0x247/0x1910 [21793.095013] ? 
drm_atomic_set_property+0xa50/0xa50 [drm] [21793.095025] drm_ioctl_kernel+0xaa/0xf0 [drm] [21793.095039] drm_ioctl+0x208/0x390 [drm] [21793.095053] ? drm_atomic_set_property+0xa50/0xa50 [drm] [21793.095060] ? lockdep_hardirqs_on+0xf0/0x180 [21793.095108] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [21793.095114] do_vfs_ioctl+0x411/0x750 [21793.095121] ksys_ioctl+0x5e/0x90 [21793.095126] __x64_sys_ioctl+0x16/0x20 [21793.095130] do_syscall_64+0x5c/0xb0 [21793.095135] entry_SYSCALL_64_after_hwframe+0x49/0xbe [21793.095138] RIP: 0033:0x7f586249300b [21793.095142] Code: 0f 1e fa 48 8b 05 7d 9e 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4d 9e 0c 00 f7 d8 64 8
gnome-shell stuck because of amdgpu driver [5.3 RC5]
Hi folks,
I left gnome-shell unlocked at noon, and when I returned in the evening I discovered that the monitor had not gone to sleep and was still showing the open GNOME activities view. At first I thought that some application was keeping the system from going to sleep. But when I tried to move the mouse, I realized that the system had hung. So I connected via ssh and tried to investigate the problem. I did not see anything strange in the kernel logs. My last idea before trying to kill the gnome-shell process was to dump the tasks that are in uninterruptible (blocked) state. After [Alt + PrnScr + W] I saw this:

[32840.701909] sysrq: Show Blocked State
[32840.701976] taskPC stack pid father
[32840.702407] gnome-shell D11240 1900 1830 0x
[32840.702438] Call Trace:
[32840.702446] ? __schedule+0x352/0x900
[32840.702453] schedule+0x3a/0xb0
[32840.702457] schedule_timeout+0x289/0x3c0
[32840.702461] ? find_held_lock+0x32/0x90
[32840.702464] ? find_held_lock+0x32/0x90
[32840.702469] ? mark_held_locks+0x50/0x80
[32840.702473] ? _raw_spin_unlock_irqrestore+0x4b/0x60
[32840.702478] dma_fence_default_wait+0x1f5/0x340
[32840.702482] ? dma_fence_free+0x20/0x20
[32840.702487] dma_fence_wait_timeout+0x182/0x1e0
[32840.702533] amdgpu_fence_wait_empty+0xe7/0x210 [amdgpu]
[32840.702577] amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu]
[32840.702641] dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
[32840.702705] dce12_update_clocks+0xd8/0x110 [amdgpu]
[32840.702766] dc_commit_state+0x414/0x590 [amdgpu]
[32840.702834] amdgpu_dm_atomic_commit_tail+0xd1e/0x1cf0 [amdgpu]
[32840.702840] ? reacquire_held_locks+0xed/0x210
[32840.702848] ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm]
[32840.702853] ? find_held_lock+0x32/0x90
[32840.702855] ? find_held_lock+0x32/0x90
[32840.702860] ? __lock_acquire+0x247/0x1910
[32840.702867] ? find_held_lock+0x32/0x90
[32840.702871] ? mark_held_locks+0x50/0x80
[32840.702874] ? _raw_spin_unlock_irq+0x29/0x40
[32840.702877] ? lockdep_hardirqs_on+0xf0/0x180
[32840.702881] ? _raw_spin_unlock_irq+0x29/0x40
[32840.702884] ? wait_for_completion_timeout+0x75/0x190
[32840.702895] ? commit_tail+0x3c/0x70 [drm_kms_helper]
[32840.702902] commit_tail+0x3c/0x70 [drm_kms_helper]
[32840.702909] drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[32840.702922] drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[32840.702936] set_property_atomic+0xcc/0x140 [drm]
[32840.702955] drm_mode_obj_set_property_ioctl+0xcb/0x1c0 [drm]
[32840.702968] ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[32840.702978] drm_ioctl_kernel+0xaa/0xf0 [drm]
[32840.702990] drm_ioctl+0x208/0x390 [drm]
[32840.703003] ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[32840.703007] ? sched_clock_cpu+0xc/0xc0
[32840.703012] ? lockdep_hardirqs_on+0xf0/0x180
[32840.703053] amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[32840.703058] do_vfs_ioctl+0x411/0x750
[32840.703065] ksys_ioctl+0x5e/0x90
[32840.703069] __x64_sys_ioctl+0x16/0x20
[32840.703072] do_syscall_64+0x5c/0xb0
[32840.703076] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[32840.703079] RIP: 0033:0x7f8bcab0f00b
[32840.703084] Code: Bad RIP value.
[32840.703086] RSP: 002b:7ffe76c62338 EFLAGS: 0246 ORIG_RAX: 0010
[32840.703089] RAX: ffda RBX: 7ffe76c62370 RCX: 7f8bcab0f00b
[32840.703092] RDX: 7ffe76c62370 RSI: c01864ba RDI: 0009
[32840.703094] RBP: c01864ba R08: 0003 R09: c0c0c0c0
[32840.703096] R10: 56476c86a018 R11: 0246 R12: 56476c8ad940
[32840.703098] R13: 0009 R14: 0002 R15: 0003

[root@localhost ~]#
[root@localhost ~]# ps aux | grep gnome-shell
mikhail   1900  0.3  1.1 6447496 378696 tty2   Dl+  Aug24   2:10 /usr/bin/gnome-shell
mikhail   2099  0.0  0.0  519984  23392 ?      Ssl  Aug24   0:00 /usr/libexec/gnome-shell-calendar-server
mikhail  12214  0.0  0.0  399484  29660 pts/2  Sl+  Aug24   0:00 /usr/bin/python3 /usr/bin/chrome-gnome-shell chrome-extension://gphhapmejobijbbhgpjhcjognlahblep/
root     22957  0.0  0.0  216120   2456 pts/10 S+   03:59   0:00 grep --color=auto gnome-shell

After that, I tried to kill the gnome-shell process with signal 9, but it would not terminate even after several attempts. Only [Alt + PrnScr + B] managed to reboot the hung system.

I am writing here because I hope some amdgpu hackers can look at the trace and understand what is happening.

Sorry, I don't know how to reproduce this bug. But the problem itself is very annoying.

Thanks.

GPU: AMD Radeon VII
Kernel: 5.3 RC5

--
Best Regards,
Mike Gavrilov.
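For context on why signal 9 had no effect: the `ps` output above shows gnome-shell in state `D` (uninterruptible sleep), and a task in that state does not act on signals until the wait it is blocked in returns, here the fence wait under `amdgpu_fence_wait_empty`. The same state can be read without SysRq from `/proc/<pid>/stat`. A rough Python sketch, assuming a Linux `/proc`; the function names are made up for this note, and it only looks at thread-group leaders, not the per-thread entries under `/proc/<pid>/task/`.

```python
import os


def parse_stat_line(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The comm field is wrapped in parentheses and may itself contain spaces
    or ')' characters, so split on the first '(' and the last ')'.
    """
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx])
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            # The process exited while we were scanning, or its stat
            # contents could not be parsed; skip it.
            continue
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling `blocked_tasks()` over ssh on a hung session would be expected to list gnome-shell in a situation like the one above, but unlike the SysRq dump it gives no kernel stack, so it only helps to spot which process is stuck.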
Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)
On Fri, 9 Aug 2019 at 23:55, Mikhail Gavrilov wrote:
> Finally, the initial problem "gnome-shell: page allocation failure:
> order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
> nodemask=(null),cpuset=/,mems_allowed=0" did not happen anymore with
> the latest version of the patch (I tested for more than 23 hours).
>
> But I hit a new problem:
>
> [73808.088801] ------------[ cut here ]------------
> [73808.088806] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)
> [73808.088813] WARNING: CPU: 8 PID: 1348877 at
> kernel/locking/mutex.c:757 __ww_mutex_lock.constprop.0+0xb0f/0x10c0
[pruned]
> So do I need to report it separately (in another thread), or do we continue here?

Today, after a reboot, the "DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)" issue happened again.

--
Best Regards,
Mike Gavrilov.

[ 5406.584851] ------------[ cut here ]------------
[ 5406.584855] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)
[ 5406.584862] WARNING: CPU: 2 PID: 4865 at kernel/locking/mutex.c:757 __ww_mutex_lock.constprop.0+0xb0f/0x10c0
[ 5406.584865] Modules linked in: macvtap macvlan tap rfcomm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd ledtrig_audio kvm_amd snd_hda_codec_hdmi snd_hda_intel kvm rtwpci snd_hda_codec rtw88 irqbypass snd_hda_core snd_usb_audio mac80211 snd_usbmidi_lib crct10dif_pclmul uvcvideo snd_hwdep snd_rawmidi crc32_pclmul btusb videobuf2_vmalloc videobuf2_memops snd_seq videobuf2_v4l2 btrtl btbcm ghash_clmulni_intel snd_seq_device btintel videobuf2_common xpad eeepc_wmi joydev ff_memless
[ 5406.584895] bluetooth cfg80211 snd_pcm asus_wmi videodev snd_timer sparse_keymap video wmi_bmof snd ecdh_generic mc ecc soundcore ccp k10temp sp5100_tco rfkill libarc4 i2c_piix4 gpio_amdpt gpio_generic acpi_cpufreq binfmt_misc ip_tables hid_logitech_hidpp amdgpu crc32c_intel amd_iommu_v2 gpu_sched ttm drm_kms_helper igb drm nvme dca hid_logitech_dj i2c_algo_bit nvme_core wmi pinctrl_amd
[ 5406.584915] CPU: 2 PID: 4865 Comm: firefox:cs0 Not tainted 5.3.0-0.rc3.git1.2.fc31.x86_64 #1
[ 5406.584917] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2406 06/21/2019
[ 5406.584920] RIP: 0010:__ww_mutex_lock.constprop.0+0xb0f/0x10c0
[ 5406.584922] Code: 28 00 74 28 e8 42 29 a6 ff 85 c0 74 1f 8b 05 f8 6a e0 00 85 c0 75 15 48 c7 c6 70 35 32 92 48 c7 c7 f0 67 30 92 e8 e9 84 5c ff <0f> 0b 4d 89 74 24 28 b8 dd ff ff ff 65 48 8b 14 25 40 8e 01 00 48
[ 5406.584924] RSP: 0018:b738cca4f760 EFLAGS: 00010286
[ 5406.584926] RAX: RBX: 8e1732e13300 RCX:
[ 5406.584927] RDX: 0002 RSI: 0001 RDI: 0246
[ 5406.584929] RBP: b738cca4f820 R08: R09:
[ 5406.584931] R10: 93d3f740 R11: 93d3f373 R12: b738cca4fb90
[ 5406.584932] R13: b738cca4f7c0 R14: 8e172e0fb258 R15: 8e172e0fb260
[ 5406.584934] FS: 7fc2d5c6b700() GS:8e18ba40() knlGS:
[ 5406.584935] CS: 0010 DS: ES: CR0: 80050033
[ 5406.584937] CR2: 7ff54bbd CR3: 0005ad12a000 CR4: 003406e0
[ 5406.584938] Call Trace:
[ 5406.584943]  ? _raw_spin_unlock_irq+0x29/0x40
[ 5406.584951]  ? ttm_mem_evict_first+0x1ed/0x4f0 [ttm]
[ 5406.584955]  ? ww_mutex_lock_interruptible+0x43/0xb0
[ 5406.584957]  ww_mutex_lock_interruptible+0x43/0xb0
[ 5406.584961]  ttm_mem_evict_first+0x1ed/0x4f0 [ttm]
[ 5406.584969]  ttm_bo_mem_space+0x229/0x2c0 [ttm]
[ 5406.584974]  ttm_bo_validate+0xe5/0x190 [ttm]
[ 5406.584979]  ? lockdep_hardirqs_on+0xf0/0x180
[ 5406.585033]  amdgpu_cs_bo_validate+0xaa/0x1b0 [amdgpu]
[ 5406.585082]  amdgpu_cs_validate+0x3b/0x260 [amdgpu]
[ 5406.585131]  amdgpu_cs_list_validate+0x110/0x180 [amdgpu]
[ 5406.585184]  amdgpu_cs_ioctl+0x5a9/0x1d10 [amdgpu]
[ 5406.585189]  ? sched_clock+0x5/0x10
[ 5406.585247]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[ 5406.585260]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 5406.585271]  drm_ioctl+0x208/0x390 [drm]
[ 5406.585316]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[ 5406.585319]  ? sched_clock_cpu+0xc/0xc0
[ 5406.585322]  ? lockdep_hardirqs_on+0xf0/0x180
[ 5406.585366]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 5406.585371]  do_vfs_ioctl+0x411/0x750
[ 5406.585375]  ksys_ioctl+0x5e/0x90
[ 5406.585378]  __x64_sys_ioctl+0x16/0x20
[ 5406.585381]  do_syscall_64+0x5c/0xb0
[ 5406.585385]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 5406.585387] RIP:
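
For reference, the "order:4" in the page allocation failure quoted above is a request for 2^4 = 16 physically contiguous pages, which is 64 KiB assuming the usual 4 KiB page size on x86_64. Requests of that size can fail on a fragmented system even when plenty of memory is free. A minimal sketch of the arithmetic follows; the function names are illustrative only and are not kernel APIs (the kernel's own helper for the size-to-order direction is get_order()).

```c
#include <stddef.h>

/* Bytes covered by a block of 2^order contiguous pages. */
static size_t order_to_bytes(unsigned int order, size_t page_size)
{
	return page_size << order;
}

/* Smallest order whose block is large enough to hold `size` bytes. */
static unsigned int size_to_order(size_t size, size_t page_size)
{
	unsigned int order = 0;

	while ((page_size << order) < size)
		order++;
	return order;
}
```

With 4096-byte pages, order_to_bytes(4, 4096) gives 65536, and any size from 32769 to 65536 bytes maps back to order 4. That would be consistent with the failing object being larger than 32 KiB, although the thread does not state its exact size.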
Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)
On Mon, 5 Aug 2019 at 08:21, Hillf Danton wrote:
>
>
> Try to fix the failure above using vmalloc + kmalloc.
>
> --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> @@ -1174,8 +1174,12 @@ struct dc_state *dc_create_state(struct
>         struct dc_state *context = kzalloc(sizeof(struct dc_state),
>                                            GFP_KERNEL);
>
> -       if (!context)
> -               return NULL;
> +       if (!context) {
> +               context = kvzalloc(sizeof(struct dc_state),
> +                                  GFP_KERNEL);
> +               if (!context)
> +                       return NULL;
> +       }
>         /* Each context must have their own instance of VBA and in order to
>          * initialize and obtain IP and SOC the base DML instance from DC is
>          * initially copied into every context
> @@ -1195,8 +1199,13 @@ struct dc_state *dc_copy_state(struct dc
>         struct dc_state *new_ctx = kmemdup(src_ctx,
>                         sizeof(struct dc_state), GFP_KERNEL);
>
> -       if (!new_ctx)
> -               return NULL;
> +       if (!new_ctx) {
> +               new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
> +               if (new_ctx)
> +                       *new_ctx = *src_ctx;
> +               else
> +                       return NULL;
> +       }
>
>         for (i = 0; i < MAX_PIPES; i++) {
>                         struct pipe_ctx *cur_pipe =
> &new_ctx->res_ctx.pipe_ctx[i];
> @@ -1230,7 +1239,7 @@ static void dc_state_free(struct kref *k
>  {
>         struct dc_state *context = container_of(kref, struct dc_state,
> refcount);
>         dc_resource_state_destruct(context);
> -       kfree(context);
> +       kvfree(context);
>  }
>
>  void dc_release_state(struct dc_state *context)
> --

Unfortunately, I couldn't test this patch because the kernel did not compile with it applied.
Here are the compile error messages:

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_create_state':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1178:13: error: implicit declaration of function 'kvzalloc'; did you mean 'kzalloc'? [-Werror=implicit-function-declaration]
 1178 |   context = kvzalloc(sizeof(struct dc_state),
      |             ^~~~
      |             kzalloc
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1178:11: warning: assignment to 'struct dc_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
 1178 |   context = kvzalloc(sizeof(struct dc_state),
      |           ^
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_copy_state':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1203:13: error: implicit declaration of function 'kvmalloc'; did you mean 'kmalloc'? [-Werror=implicit-function-declaration]
 1203 |   new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
      |             ^~~~
      |             kmalloc
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1203:11: warning: assignment to 'struct dc_state *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
 1203 |   new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
      |           ^
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_state_free':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1242:2: error: implicit declaration of function 'kvfree'; did you mean 'kzfree'? [-Werror=implicit-function-declaration]
 1242 |  kvfree(context);
      |  ^~
      |  kzfree
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:274: drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.o] Error 1
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [scripts/Makefile.build:490: drivers/gpu/drm/amd/amdgpu] Error 2
make[3]: *** Waiting for unfinished jobs....
make: *** [Makefile:1084: drivers] Error 2

--
Best Regards,
Mike Gavrilov.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
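
A note on the build errors above: "implicit declaration" means that no prototype for kvzalloc(), kvmalloc() or kvfree() was visible in dc.c. In kernels of this era these helpers are declared in <linux/mm.h>, so the patch presumably only lacks that include; this is an assumption and was not verified against this tree. Note also that kvmalloc() already tries kmalloc first and falls back to vmalloc on its own, so calling it directly would cover both paths. The fallback idea from the patch can be sketched in userspace C as follows. Everything here (contig_alloc, virt_alloc, CONTIG_LIMIT, the fallback_* helpers) is a hypothetical stand-in built on malloc, not a kernel API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical limit that mimics a high-order contiguous allocation failing. */
#define CONTIG_LIMIT (64 * 1024)

/* Stand-in for a kmalloc-like allocator: refuses large requests. */
static void *contig_alloc(size_t size)
{
	if (size > CONTIG_LIMIT)
		return NULL;
	return malloc(size);
}

/* Stand-in for a vmalloc-like allocator: no contiguity requirement. */
static void *virt_alloc(size_t size)
{
	return malloc(size);
}

/* Zeroed allocation: try the contiguous path first, then fall back. */
static void *fallback_zalloc(size_t size)
{
	void *p = contig_alloc(size);

	if (!p)
		p = virt_alloc(size);
	if (p)
		memset(p, 0, size);
	return p;
}

/* Duplicate `size` bytes from `src` with the same fallback. */
static void *fallback_memdup(const void *src, size_t size)
{
	void *p = contig_alloc(size);

	if (!p)
		p = virt_alloc(size);
	if (p)
		memcpy(p, src, size);
	return p;
}

/* One free routine must handle both origins, which is the role kvfree() plays. */
static void fallback_free(void *p)
{
	free(p);
}
```

The point of the last helper is the same as the kfree() to kvfree() change in the quoted diff: once an object may come from either allocator, the release path has to accept both.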