Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-09-10 Thread Mikhail Gavrilov
On Tue, Sep 10, 2024 at 8:47 PM Leo Li  wrote:
>
> Thanks Mikhail, I think I know what's going on now.
>
> The `scale-monitor-framebuffer` experimental setting is what puts us down the
> bad code path. It seems VRR has nothing to do with this issue, just setting
> `scale-monitor-framebuffer` is enough to reproduce.

I ran some additional tests:

1)
$ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate']"
Symptoms: No

2)
$ gsettings set org.gnome.mutter experimental-features "['scale-monitor-framebuffer']"
Symptoms: The screen flickers when moving the cursor.

3)
$ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate', 'scale-monitor-framebuffer']"
But Variable Refresh Rate is disabled in the display settings.
Symptoms: Same as the previous case: the screen flickers when moving the cursor.

4)
$ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate', 'scale-monitor-framebuffer']"
And Variable Refresh Rate is enabled in the display settings.
Symptoms: On Radeon 7900XTX hardware the computer hangs completely, without
any messages in the kernel logs.
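
For anyone repeating these four runs, the quoted value passed to gsettings
can be generated instead of typed by hand. This is only a sketch: the helper
name `mutter_features_command` is made up for illustration, it only builds
the command string (it does not run anything), and it assumes feature names
that contain no quotes or shell metacharacters, as is the case for the two
names used above.

```python
def mutter_features_command(features):
    """Build the gsettings command line for a list of mutter experimental features.

    Hypothetical helper: it returns a string and never executes it.
    """
    # The value gsettings expects here is an array of strings such as
    # ['a', 'b']. For plain names without quotes or backslashes, Python's
    # repr() of a list of str produces text in that same form.
    value = repr(list(features))
    # Wrap the value in double quotes so the shell passes it as one argument.
    return 'gsettings set org.gnome.mutter experimental-features "' + value + '"'


# The feature combinations used in tests 1, 2 and 3/4 above.
COMBINATIONS = [
    ["variable-refresh-rate"],
    ["scale-monitor-framebuffer"],
    ["variable-refresh-rate", "scale-monitor-framebuffer"],
]

for combo in COMBINATIONS:
    print(mutter_features_command(combo))
```

Tests 3 and 4 share the same command; they differ only in the Variable
Refresh Rate toggle in the display settings, which this sketch does not touch.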

On Wed, Sep 11, 2024 at 2:11 AM Leo Li  wrote:
>
> Hi Mikhail,
>
> Can you give this patch a try to see if it helps?
> https://gist.github.com/leeonadoh/3271e90ec95d768424c572c970ada743
>

Thanks, with this patch, the issue is not reproduced anymore.
Tested-by: Mikhail Gavrilov 

The only thing that worries me is that the hang itself may now just be
hidden.
It's one thing when the GPU hangs but the system keeps working; it's
another when the whole system hangs and even Alt+SysRq+REISUB cannot
reboot it. It shouldn't be like this...

-- 
Best Regards,
Mike Gavrilov.


Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-09-08 Thread Mikhail Gavrilov
On Sat, Sep 7, 2024 at 12:47 AM Leo Li  wrote:
>
>
> Hi Mikhail,
>
> I've tried to align my system with yours as best as I can, but so far, I've
> had no luck reproducing the hang. A video of what I'm doing:
> https://youtu.be/VeD-LPCnfWM?si=b2baF8MyDBuU4jRH
> (Under the hood, the W7900 and 7900xt should be the same)

I have done additional tests:
1. The computer does not hang with the 6900XT; instead, the screen flickers
when moving the cursor.
2. The computer does not hang with the 7900XTX if I turn off VRR, but the
screen flickers when moving the cursor, as on the 6900XT.
To enable VRR, please set 'variable-refresh-rate' in
experimental-features, and enable Variable Refresh Rate in the Display
settings.
$ gsettings set org.gnome.mutter experimental-features "['variable-refresh-rate', 'scale-monitor-framebuffer']"
https://postimg.cc/PvXYdvGR

3. The problem is much more likely to occur when running
the game "Play Innocence Or Money Season 1 - Episodes 1 to 3". There
is a free demo version.
https://store.steampowered.com/app/1958390/Innocence_Or_Money_Season_1__Episodes_1_to_3/
Demonstration: https://youtu.be/XIe0pQYPVUo

>
> I have a few suggestions:
>
> First, can you also open an issue on the amd gitlab tracker? It gives more
> visibility to others, and makes working together a bit easier:
> https://gitlab.freedesktop.org/drm/amd/-/issues
>
> Second, can you try adding "amdgpu.dcdebugmask=0x40" to your kernel cmdline at
> boot, and see if you can still repro the hang?
Yes. This didn't help.
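
When repeating this test, it is worth confirming that the parameter really
reached the kernel before drawing conclusions. A minimal sketch, assuming the
kernel command line is read from /proc/cmdline and contains no quoted values
with embedded spaces; the function names `parse_cmdline` and `dcdebugmask`
are made up for illustration:

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "quiet" map to None. Quoted values that contain
    spaces are not handled (a simplifying assumption of this sketch).
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def dcdebugmask(cmdline):
    """Return amdgpu.dcdebugmask as an int, or None if it is not set."""
    value = parse_cmdline(cmdline).get("amdgpu.dcdebugmask")
    # int(value, 0) accepts both "0x40" and "64".
    return int(value, 0) if value else None


# Typical use on the machine under test:
#     with open("/proc/cmdline") as f:
#         print(dcdebugmask(f.read()))
```

A result of None means the option was not on the command line of the running
kernel, so the test result says nothing about hw planes.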

> This setting disables hw planes. If it resolves the hang, then it's quite
> interesting, because it suggests that gnome may be using direct-scanout via hw
> planes. We may need to align our gnome configuration in that case, since I
> don't see any additional hw planes being used on my setup.
>
> Third, in case these two issues are related, can you give the attached patch
> on this issue thread a try as well?
> https://gitlab.freedesktop.org/drm/amd/-/issues/3569#note_2558359
This patch also didn't help.

Maybe try compiling a kernel with the same config as mine and enabling
VRR to reproduce the problem?
I attached my build config to this message.

-- 
Best Regards,
Mike Gavrilov.


.config.zip
Description: Zip archive


Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-09-04 Thread Mikhail Gavrilov
On Thu, Sep 5, 2024 at 4:06 AM Leo Li  wrote:
>
> Can you delete ", new_cursor_state" on that line and try again? Seems to be an
> unused variable warning being elevated to an error.
>

Thanks, I applied both patches and can confirm that this solved the issue.
The first patch was definitely not enough.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.


Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-09-04 Thread Mikhail Gavrilov
On Wed, Sep 4, 2024 at 4:15 AM Leo Li  wrote:
> Hi Mike,
>
> Super sorry for the ridiculous wait. Your first two emails slipped by my
> inbox, which is really silly, given I'm first in the to field...
>
> Thanks for bisecting and finding a free game to reproduce it on. I did not
> have luck reproducing this today, but I am on sway and not gnome. While I get gnome
> set up, will you be able to test which one of these reverts fixes the hang for
> you? Whether just 1/2 is enough, or both 1/2 and 2/2 is required?
>
> I applied them on top of Linus's v6.11-rc6 tag, so hopefully they'll git am
> cleanly for you:
>
> 1/2:
> https://gist.github.com/leeonadoh/69147b5fa8d815b39c5f4c3e005cca28#file-0001-revert-drm-amd-display-move-primary-plane-zpos-highe-patch
> 2/2:
> https://gist.github.com/leeonadoh/69147b5fa8d815b39c5f4c3e005cca28#file-0002-revert-drm-amd-display-introduce-overlay-cursor-mode-patch
>

The first patch is not enough.
Yes, it fixes the system hang when I launch the game "Find the Orange Narwhal".
But it does not fix the issue completely.
Some RenPy games still can lead the system to hang.
For example "Innocence Or Money Season 1"
https://store.steampowered.com/app/1958390/Innocence_Or_Money_Season_1__Episodes_1_to_3/
on the language selection screen.

Unfortunately, I could not build the kernel with both patches.
I got a compilation error after applying the second patch:

  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/fifo/chid.o
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c: In function ‘amdgpu_dm_atomic_check’:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:11003:69: error: unused variable ‘new_cursor_state’ [-Werror=unused-variable]
11003 | struct drm_plane_state *old_plane_state, *new_plane_state, *new_cursor_state;
      |                                                             ^~~~
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/basics/conversion.o
***
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/tu102.o
cc1: all warnings being treated as errors
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/dml/calcs/dcn_calc_auto.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ga102.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ad102.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/r535.o
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/clk_mgr.o
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxnv40.o
  CC [M]  drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dce60/dce60_clk_mgr.o
make[6]: *** [scripts/Makefile.build:244:
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.o] Error 1
make[6]: *** Waiting for unfinished jobs
  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/gr/ctxnv50.o
***
make[5]: *** [scripts/Makefile.build:485: drivers/gpu/drm/amd/amdgpu] Error 2
make[4]: *** [scripts/Makefile.build:485: drivers/gpu/drm] Error 2
make[3]: *** [scripts/Makefile.build:485: drivers/gpu] Error 2
make[2]: *** [scripts/Makefile.build:485: drivers] Error 2
make[1]: *** [/home/mikhail/packaging-work/git/linux-3/Makefile:1925: .] Error 2
make: *** [Makefile:224: __sub-make] Error 2
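
With a parallel build, the one real error is buried between unrelated CC
lines, as in the log above. A small filter can pull it out. This is a sketch
under assumptions: it expects each GCC diagnostic on a single unwrapped line
in the usual "path:line:col: error: message" form, and the name `gcc_errors`
is made up for illustration.

```python
import re

# Matches "path:line:col: error: message" as printed by GCC.
GCC_ERROR = re.compile(
    r"^(?P<path>[^\s:]+):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$"
)


def gcc_errors(log_text):
    """Return (path, line, col, message) tuples for each GCC error in a build log."""
    found = []
    for raw in log_text.splitlines():
        m = GCC_ERROR.match(raw.strip())
        if m:
            found.append((m["path"], int(m["line"]), int(m["col"]), m["msg"]))
    return found


sample = (
    "  CC [M]  drivers/gpu/drm/nouveau/nvkm/engine/fifo/chid.o\n"
    "drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:11003:69: "
    "error: unused variable ‘new_cursor_state’ [-Werror=unused-variable]\n"
    "cc1: all warnings being treated as errors\n"
)
for path, line, col, msg in gcc_errors(sample):
    print(f"{path}:{line}:{col}: {msg}")
```

The "[-Werror=unused-variable]" suffix in the message shows that this is a
warning promoted to an error, not a defect in the revert itself.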

-- 
Best Regards,
Mike Gavrilov.


Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-09-02 Thread Mikhail Gavrilov
On Sun, Aug 25, 2024 at 2:12 AM Mikhail Gavrilov
 wrote:
>
> Hi,
> Is anyone trying to look into it?
> I continue to reproduce this issue on fresh kernel builds 6.11-rc4+.
> In addition to the RenPy engine, the problem also reproduces on games
> from Ubisoft, such as Far Cry 4.
> A very important note that I missed in the first message.
> To reproduce the problem, you need to enable scaling in Gnome for
> HiDPI monitors.
> I am using 4K resolution with 200% of fractional scaling.

Sorry for the persistence, but I'm afraid there is almost no time left to
fix this regression.
There's a week left until the release.
A month has passed, and no one has looked into what the problem is.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-08-24 Thread Mikhail Gavrilov
On Mon, Aug 5, 2024 at 11:05 PM Mikhail Gavrilov
 wrote:
>
> Hi,
> After commit 1b04dcca4fb1, launching some RenPy games causes computer hang.
> After the hang, even Alt + sysrq + REISUB can't reboot the computer!
> And no trace in the kernel log!
> For demonstration, I'm going to use the game "Find the Orange Narwhal"
> because it is free and reproduces this issue 100% of the time.
> You can find it in the Steam Store:
> https://store.steampowered.com/app/2946010/Find_the_Orange_Narwhal/
> I uploaded demonstration video to youtube: https://youtu.be/yVW6rImRpXw
>
> Unfortunately, I can't check the revert commit 1541d63c5fe2 because of
> conflicts.
>
> mikhail@primary-ws ~/p/g/linux (master)> git reset v6.11-rc1 --hard
> HEAD is now at 8400291e289e Linux 6.11-rc1
>
> mikhail@primary-ws ~/p/g/linux (master)> git revert -n 1b04dcca4fb1
> Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> CONFLICT (content): Merge conflict in
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
> Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
> Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> CONFLICT (content): Merge conflict in
> drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
> error: could not revert 1b04dcca4fb1... drm/amd/display: Introduce
> overlay cursor mode
> hint: after resolving the conflicts, mark the corrected paths
> hint: with 'git add ' or 'git rm '
> hint: Disable this message with "git config advice.mergeConflict false"
>
> commit 1b04dcca4fb10dd3834893a60de74edd99f2bfaf
> Author: Leo Li 
> Date:   Thu Jan 18 16:29:49 2024 -0500
>
> drm/amd/display: Introduce overlay cursor mode
>
> [Why]
>
> DCN is the display hardware for amdgpu. DRM planes are backed by DCN
> hardware pipes, which carry pixel data from one end (memory), to the
> other (output encoder).
>
> Each DCN pipe has the ability to blend in a cursor early on in the
> pipeline. In other words, there are no dedicated cursor planes in DCN,
> which makes cursor behavior somewhat unintuitive for compositors.
>
> For example, if the cursor is in RGB format, but the top-most DRM plane
> is in YUV format, DCN will not be able to blend them. Because of this,
> amdgpu_dm rejects all configurations where a cursor needs to be enabled
> on top of a YUV formatted plane.
>
> From a compositor's perspective, when computing an allocation for
> hardware plane offloading, this cursor-on-yuv configuration result in an
> atomic test failure. Since the failure reason is not obvious at all,
> compositors will likely fall back to full rendering, which is not ideal.
>
> Instead, amdgpu_dm can try to accommodate the cursor-on-yuv
> configuration by opportunistically reserving a separate DCN pipe just
> for the cursor. We can refer to this as "overlay cursor mode". It is
> contrasted with "native cursor mode", where the native DCN per-pipe
> cursor is used.
>
> [How]
>
> On each crtc, compute whether the cursor plane should be enabled in
> overlay mode. If it is, mark the CRTC as requesting overlay cursor mode.
>
> Overlay cursor should be enabled whenever there exists a underlying
> plane that has YUV format, or is scaled differently than the cursor. It
> should also be enabled if there is no underlying plane, or if underlying
> planes do not cover the entire CRTC.
>
> During DC validation, attempt to enable a separate DCN pipe for the
> cursor if it's in overlay mode. If that fails, or if no overlay mode is
> requested, then fallback to native mode.
>
> v2:
> * Update commit message for when overlay cursor should be enabled
> * Also consider scale and no-underlying-plane case (cursor on crtc bg)
> * Consider all underlying planes when determinig overlay/native, not
>   just the plane immediately beneath the cursor, as it may not cover the
>   entire CRTC.
> * Fix typo s/decending/descending/
> * Force native cursor on pre-DCN hardware
>
> Reviewed-by: Harry Wentland 
> Acked-by: Zaeem Mohamed 
> Signed-off-by: Leo Li 
> Acked-by: Harry Wentland 
> Acked-by: Pekka Paalanen 
> Tested-by: Daniel Wheeler 
> Signed-off-by: Alex Deucher 
>
>  drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c   | 490
> +++---
>  drivers/gpu

6.11/regression/bisected - after commit 1b04dcca4fb1, launching some RenPy games causes computer hang

2024-08-05 Thread Mikhail Gavrilov
Hi,
After commit 1b04dcca4fb1, launching some RenPy games causes computer hang.
After the hang, even Alt + sysrq + REISUB can't reboot the computer!
And no trace in the kernel log!
For demonstration, I'm going to use the game "Find the Orange Narwhal"
because it is free and reproduces this issue 100% of the time.
You can find it in the Steam Store:
https://store.steampowered.com/app/2946010/Find_the_Orange_Narwhal/
I uploaded demonstration video to youtube: https://youtu.be/yVW6rImRpXw

Unfortunately, I can't check the revert commit 1541d63c5fe2 because of
conflicts.

mikhail@primary-ws ~/p/g/linux (master)> git reset v6.11-rc1 --hard
HEAD is now at 8400291e289e Linux 6.11-rc1

mikhail@primary-ws ~/p/g/linux (master)> git revert -n 1b04dcca4fb1
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c
Auto-merging drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c
error: could not revert 1b04dcca4fb1... drm/amd/display: Introduce overlay cursor mode
hint: after resolving the conflicts, mark the corrected paths
hint: with 'git add ' or 'git rm '
hint: Disable this message with "git config advice.mergeConflict false"

commit 1b04dcca4fb10dd3834893a60de74edd99f2bfaf
Author: Leo Li 
Date:   Thu Jan 18 16:29:49 2024 -0500

drm/amd/display: Introduce overlay cursor mode

[Why]

DCN is the display hardware for amdgpu. DRM planes are backed by DCN
hardware pipes, which carry pixel data from one end (memory), to the
other (output encoder).

Each DCN pipe has the ability to blend in a cursor early on in the
pipeline. In other words, there are no dedicated cursor planes in DCN,
which makes cursor behavior somewhat unintuitive for compositors.

For example, if the cursor is in RGB format, but the top-most DRM plane
is in YUV format, DCN will not be able to blend them. Because of this,
amdgpu_dm rejects all configurations where a cursor needs to be enabled
on top of a YUV formatted plane.

From a compositor's perspective, when computing an allocation for
hardware plane offloading, this cursor-on-yuv configuration result in an
atomic test failure. Since the failure reason is not obvious at all,
compositors will likely fall back to full rendering, which is not ideal.

Instead, amdgpu_dm can try to accommodate the cursor-on-yuv
configuration by opportunistically reserving a separate DCN pipe just
for the cursor. We can refer to this as "overlay cursor mode". It is
contrasted with "native cursor mode", where the native DCN per-pipe
cursor is used.

[How]

On each crtc, compute whether the cursor plane should be enabled in
overlay mode. If it is, mark the CRTC as requesting overlay cursor mode.

Overlay cursor should be enabled whenever there exists a underlying
plane that has YUV format, or is scaled differently than the cursor. It
should also be enabled if there is no underlying plane, or if underlying
planes do not cover the entire CRTC.

During DC validation, attempt to enable a separate DCN pipe for the
cursor if it's in overlay mode. If that fails, or if no overlay mode is
requested, then fallback to native mode.

v2:
* Update commit message for when overlay cursor should be enabled
* Also consider scale and no-underlying-plane case (cursor on crtc bg)
* Consider all underlying planes when determinig overlay/native, not
  just the plane immediately beneath the cursor, as it may not cover the
  entire CRTC.
* Fix typo s/decending/descending/
* Force native cursor on pre-DCN hardware

Reviewed-by: Harry Wentland 
Acked-by: Zaeem Mohamed 
Signed-off-by: Leo Li 
Acked-by: Harry Wentland 
Acked-by: Pekka Paalanen 
Tested-by: Daniel Wheeler 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c   | 490 +++---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h   |   7 +++
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_crtc.c  |   1 +
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_plane.c |  13 -
 4 files changed, 389 insertions(+), 122 deletions(-)
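
The rule in the [How] section of the commit message above can be restated as
a small model. This is a reading of the commit text only, not the kernel
code: the `Plane` type, its fields, and `wants_overlay_cursor` are
hypothetical names, scaling is reduced to a single factor, and "underlying
planes cover the entire CRTC" is simplified to "at least one underlying plane
covers it by itself".

```python
from dataclasses import dataclass


@dataclass
class Plane:
    """Hypothetical, simplified stand-in for a DRM plane beneath the cursor."""
    is_yuv: bool       # plane uses a YUV format
    scale: float       # scaling factor applied to the plane
    covers_crtc: bool  # plane spans the whole CRTC area on its own


def wants_overlay_cursor(underlying, cursor_scale, is_dcn=True):
    """Return True if the CRTC would request overlay cursor mode.

    Follows the wording of the commit message; per that message, the driver
    may still fall back to native mode later, during DC validation.
    """
    if not is_dcn:
        # v2 note: "Force native cursor on pre-DCN hardware".
        return False
    if not underlying:
        # No underlying plane: the cursor sits on the CRTC background.
        return True
    if any(p.is_yuv or p.scale != cursor_scale for p in underlying):
        # A YUV plane, or a plane scaled differently than the cursor.
        return True
    # Otherwise overlay is requested only if the CRTC is not fully covered.
    return not any(p.covers_crtc for p in underlying)


print(wants_overlay_cursor([Plane(False, 1.0, True)], cursor_scale=1.0))
print(wants_overlay_cursor([Plane(False, 2.0, True)], cursor_scale=1.0))
```

In this model, a full-screen RGB plane at the cursor's scale keeps the native
cursor, while any difference in scaling between a plane and the cursor is
enough to request overlay mode.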


My hardware specs are: https://linux-hardware.org/?probe=61bd7390a9

Leo, can you look into it, please?

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - Since commit e356d321d024 in the kernel log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)" which were never seen before

2024-08-02 Thread Mikhail Gavrilov
On Wed, Jul 24, 2024 at 10:16 PM Mikhail Gavrilov
 wrote:
> > https://patchwork.freedesktop.org/patch/605201/
> For which kernel is this patch intended? The patch does not apply on
> top of d67978318827.

I am able to apply this patch on top of e4fc196f5ba3 and the issue is gone.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - Since commit e356d321d024 in the kernel log appears the message "MES failed to respond to msg=MISC (WAIT_REG_MEM)" which were never seen before

2024-07-24 Thread Mikhail Gavrilov
On Tue, Jul 23, 2024 at 2:34 AM Alex Deucher  wrote:
> Do either of these patches help?

> https://patchwork.freedesktop.org/patch/605437/
Unfortunately, this patch didn't help. Please see the attached kernel log.

> https://patchwork.freedesktop.org/patch/605201/
For which kernel is this patch intended? The patch does not apply on
top of d67978318827.

mikhail@primary-ws ~/p/g/linux-3 (master)> git reset d67978318827 --hard
HEAD is now at d67978318827 Merge tag 'x86_cpu_for_v6.11_rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

mikhail@primary-ws ~/p/g/linux-3 (master)> git apply drm-amdgpu-mes-fix-mes-ring-buffer-overflow.patch
error: drivers/gpu/drm/amd/amdgpu/mes_v12_0.c: No such file or directory

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-07-16 Thread Mikhail Gavrilov
On Tue, Jul 16, 2024 at 10:10 PM Alex Deucher  wrote:
>
> Does the attached partial revert fix it?
>
> Alex
>

Yes, thanks.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-07-10 Thread Mikhail Gavrilov
On Wed, Jul 10, 2024 at 12:01 PM Mikhail Gavrilov
 wrote:
>
> On Tue, Jul 9, 2024 at 7:48 PM Rodrigo Siqueira Jordao
>  wrote:
> > Hi,
> >
> > I also tried it with 6900XT. I got the same results on my side.
>
> This is weird.
>
> > Anyway, I could not reproduce the issue with the below components. I may
> > be missing something that will trigger this bug; in this sense, could
> > you describe the following:
> > - The display resolution and refresh rate.
>
> 3840x2160 and 120Hz
> At 60Hz the issue does not reproduce.
>
> > - Are you able to reproduce this issue with DP and HDMI?
>
> My TV, an OLED LG C3, has only an HDMI 2.1 port.
>
> > - Could you provide the firmware information: sudo cat
> > /sys/kernel/debug/dri/0/amdgpu_firmware_info
>
> > sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
> [sudo] password for mikhail:
> VCE feature version: 0, firmware version: 0x
> UVD feature version: 0, firmware version: 0x
> MC feature version: 0, firmware version: 0x
> ME feature version: 38, firmware version: 0x000e
> PFP feature version: 38, firmware version: 0x000e
> CE feature version: 38, firmware version: 0x0003
> RLC feature version: 1, firmware version: 0x001f
> RLC SRLC feature version: 1, firmware version: 0x0001
> RLC SRLG feature version: 1, firmware version: 0x0001
> RLC SRLS feature version: 1, firmware version: 0x0001
> RLCP feature version: 0, firmware version: 0x
> RLCV feature version: 0, firmware version: 0x
> MEC feature version: 38, firmware version: 0x0015
> MEC2 feature version: 38, firmware version: 0x0015
> IMU feature version: 0, firmware version: 0x
> SOS feature version: 0, firmware version: 0x
> ASD feature version: 553648344, firmware version: 0x21d8
> TA XGMI feature version: 0x, firmware version: 0x
> TA RAS feature version: 0x, firmware version: 0x
> TA HDCP feature version: 0x, firmware version: 0x173f
> TA DTM feature version: 0x, firmware version: 0x1216
> TA RAP feature version: 0x, firmware version: 0x
> TA SECUREDISPLAY feature version: 0x, firmware version: 0x
> SMC feature version: 0, program: 0, firmware version: 0x00544fdf (84.79.223)
> SDMA0 feature version: 52, firmware version: 0x0009
> VCN feature version: 0, firmware version: 0x0311f002
> DMCU feature version: 0, firmware version: 0x
> DMCUB feature version: 0, firmware version: 0x05000f00
> TOC feature version: 0, firmware version: 0x0007
> MES_KIQ feature version: 0, firmware version: 0x
> MES feature version: 0, firmware version: 0x
> VPE feature version: 0, firmware version: 0x
> VBIOS version: 102-RAPHAEL-008
>

I forgot to add output for discrete GPU:

> sudo cat /sys/kernel/debug/dri/1/amdgpu_firmware_info
[sudo] password for mikhail:
VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 44, firmware version: 0x0040
PFP feature version: 44, firmware version: 0x0062
CE feature version: 44, firmware version: 0x0025
RLC feature version: 1, firmware version: 0x0060
RLC SRLC feature version: 0, firmware version: 0x
RLC SRLG feature version: 0, firmware version: 0x
RLC SRLS feature version: 0, firmware version: 0x
RLCP feature version: 0, firmware version: 0x
RLCV feature version: 0, firmware version: 0x
MEC feature version: 44, firmware version: 0x0076
MEC2 feature version: 44, firmware version: 0x0076
IMU feature version: 0, firmware version: 0x
SOS feature version: 0, firmware version: 0x00210e64
ASD feature version: 553648345, firmware version: 0x21d9
TA XGMI feature version: 0x, firmware version: 0x200f
TA RAS feature version: 0x, firmware version: 0x1b00013e
TA HDCP feature version: 0x, firmware version: 0x173f
TA DTM feature version: 0x, firmware version: 0x1216
TA RAP feature version: 0x, firmware version: 0x0716
TA SECUREDISPLAY feature version: 0x, firmware version: 0x
SMC feature version: 0, program: 0, firmware version: 0x003a5a00 (58.90.0)
SDMA0 feature version: 52, firmware version: 0x0053
SDMA1 feature version: 52, firmware version: 0x0053
SDMA2 feature version: 52, firmware version: 0x0053
SDMA3 feature version: 52, firmware version: 0x0053
VCN feature version: 0, firmware version: 0x0311f002
DMCU feature version: 0, firmware version: 0x
DMCUB feature version: 0, firmware version: 0x02020020
TOC feature version: 0, firmware version: 0x
MES_KIQ feature version: 0, firmware version: 0x
MES feature version: 0, firmware version: 0x
VPE feature version: 0, firmware version: 0x
VBIOS version: 113-D4120100-100
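
Since there are two dumps here (dri/0 and dri/1), comparing them by eye is
error-prone. A rough sketch of a parser and a diff helper follows; the names
`parse_firmware_info` and `firmware_diff` are made up for illustration, and
the line format is inferred only from the output pasted in this thread.
Values are kept as strings, because several of them appear cut off to a bare
"0x" in this copy.

```python
import re

# "<name> feature version: <v>, [program: <n>, ]firmware version: <fw>"
FW_LINE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>[^,]*),"
    r"(?: program: \d+,)? firmware version: (?P<fw>\S*)"
)


def parse_firmware_info(text):
    """Map each firmware block name to a (feature_version, firmware_version) pair."""
    info = {}
    for line in text.splitlines():
        m = FW_LINE.match(line.strip())
        if m:  # lines such as "VBIOS version: ..." do not match and are skipped
            info[m["name"]] = (m["feature"], m["fw"])
    return info


def firmware_diff(a, b):
    """Return {name: (value_in_a, value_in_b)} for entries that differ."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}


igpu = parse_firmware_info(
    "ME feature version: 38, firmware version: 0x000e\n"
    "VCN feature version: 0, firmware version: 0x0311f002\n"
    "VBIOS version: 102-RAPHAEL-008\n"
)
dgpu = parse_firmware_info(
    "ME feature version: 44, firmware version: 0x0040\n"
    "VCN feature version: 0, firmware version: 0x0311f002\n"
)
for name, (left, right) in firmware_diff(igpu, dgpu).items():
    print(name, left, right)
```

Entries that exist on only one GPU, such as SDMA1 to SDMA3 in the second
dump, show up in the diff with None on the other side.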


-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-07-10 Thread Mikhail Gavrilov
On Tue, Jul 9, 2024 at 7:48 PM Rodrigo Siqueira Jordao
 wrote:
> Hi,
>
> I also tried it with 6900XT. I got the same results on my side.

This is weird.

> Anyway, I could not reproduce the issue with the below components. I may
> be missing something that will trigger this bug; in this sense, could
> you describe the following:
> - The display resolution and refresh rate.

3840x2160 and 120Hz
At 60Hz the issue does not reproduce.

> - Are you able to reproduce this issue with DP and HDMI?

My TV, an OLED LG C3, has only an HDMI 2.1 port.

> - Could you provide the firmware information: sudo cat
> /sys/kernel/debug/dri/0/amdgpu_firmware_info

> sudo cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
[sudo] password for mikhail:
VCE feature version: 0, firmware version: 0x
UVD feature version: 0, firmware version: 0x
MC feature version: 0, firmware version: 0x
ME feature version: 38, firmware version: 0x000e
PFP feature version: 38, firmware version: 0x000e
CE feature version: 38, firmware version: 0x0003
RLC feature version: 1, firmware version: 0x001f
RLC SRLC feature version: 1, firmware version: 0x0001
RLC SRLG feature version: 1, firmware version: 0x0001
RLC SRLS feature version: 1, firmware version: 0x0001
RLCP feature version: 0, firmware version: 0x
RLCV feature version: 0, firmware version: 0x
MEC feature version: 38, firmware version: 0x0015
MEC2 feature version: 38, firmware version: 0x0015
IMU feature version: 0, firmware version: 0x
SOS feature version: 0, firmware version: 0x
ASD feature version: 553648344, firmware version: 0x21d8
TA XGMI feature version: 0x, firmware version: 0x
TA RAS feature version: 0x, firmware version: 0x
TA HDCP feature version: 0x, firmware version: 0x173f
TA DTM feature version: 0x, firmware version: 0x1216
TA RAP feature version: 0x, firmware version: 0x
TA SECUREDISPLAY feature version: 0x, firmware version: 0x
SMC feature version: 0, program: 0, firmware version: 0x00544fdf (84.79.223)
SDMA0 feature version: 52, firmware version: 0x0009
VCN feature version: 0, firmware version: 0x0311f002
DMCU feature version: 0, firmware version: 0x
DMCUB feature version: 0, firmware version: 0x05000f00
TOC feature version: 0, firmware version: 0x0007
MES_KIQ feature version: 0, firmware version: 0x
MES feature version: 0, firmware version: 0x
VPE feature version: 0, firmware version: 0x
VBIOS version: 102-RAPHAEL-008

> Also, could you conduct the below tests and report the results:
>
> - Test 1: Just revert the fallback patch (drm/amd/display: Add fallback
> configuration for set DRR in DCN10) and see if it solves the issue.

It's not enough.
I tested reverting commit bc87d666c05 on top of 34afb82a3c67.

> - Test 2: Try the latest amd-staging-drm-next
> (https://gitlab.freedesktop.org/agd5f/linux) and see if the issue is gone.

I checked commit 7cef45b1347a in the amd-staging-drm-next branch. Same here.

> - Test 3: In the kernel that you see the issue, could you install the
> latest firmware and see if it fix the issue? Check:
> https://gitlab.freedesktop.org/drm/firmware P.S.: Don't forget to update
> the initramfs or something similar in your system.

Does this make sense? Fedora Rawhide always ships with the latest kernel
and firmware.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-29 Thread Mikhail Gavrilov
On Sat, Jun 29, 2024 at 9:46 PM Rodrigo Siqueira Jordao
 wrote:
> Hi Mikhail,
>
> I'm trying to reproduce this issue, but until now, I've been unable to
> reproduce it. I tried some different scenarios with the following
> components:
>
> 1. Displays: I tried with one and two displays
>   - 4k@120 - DP && 4k@60 - HDMI
>   - 4k@244 Oled - DP
> 2. GPU: 7900XTX

The issue only reproduces with RDNA2 (6900XT).
RDNA3 (7900XTX) is not affected.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-21 Thread Mikhail Gavrilov
On Fri, Jun 21, 2024 at 12:56 PM Linux regression tracking (Thorsten
Leemhuis)  wrote:
> Hmmm, I might have missed something, but it looks like nothing happened
> here since then. What's the status? Is the issue still happening?

Yes. Tested on e5b3efbe1ab1.

I spotted that the problem disappears after forcing the TV to sleep
(activate screensaver  + ) and then wake it up by pressing
any button and entering a password.
I hope this information can help figure out how to fix it.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-10 Thread Mikhail Gavrilov
On Fri, Jun 7, 2024 at 5:29 PM Linux regression tracking (Thorsten
Leemhuis)  wrote:
>
> [CCing the other amd drm maintainers]
>
> Mikhail: are those details in any way relevant? Then in the future best
> leave them out (or make things easier to follow), they make the bug
> report confusing and sounds like this is just a bug, when it fact from
> your bisection is sounds like this is a regression.

Apologies if my backstory was confusing. I just wanted to say that I
moved completely to the 7900XTX more than a year ago, and I was surprised to
see this regression on the old 6900XT. I found this issue by accident,
because I didn't plan to use the old hardware.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-09 Thread Mikhail Gavrilov
On Fri, Jun 7, 2024 at 6:39 PM Alex Deucher  wrote:
>
> --- a/drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c
> +++ b/drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c
> @@ -944,7 +944,7 @@ void optc1_set_drr(
> OTG_V_TOTAL_MAX_SEL, 1,
> OTG_FORCE_LOCK_ON_EVENT, 0,
> OTG_SET_V_TOTAL_MIN_MASK_EN, 0,
> -   OTG_SET_V_TOTAL_MIN_MASK, 0);
> +   OTG_SET_V_TOTAL_MIN_MASK, (1 << 1)); /* TRIGA */
>
> // Setup manual flow control for EOF via TRIG_A
> optc->funcs->setup_manual_trigger(optc);

Thanks, Alex.
I applied this patch on top of 771ed66105de and unfortunately the
issue is not fixed.
I saw a green flashing bar on top of the screen again.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-06-05 Thread Mikhail Gavrilov
On Sun, May 26, 2024 at 7:06 PM Mikhail Gavrilov
 wrote:
>
> Hi,
> The day before yesterday I replaced my 7900XTX with a 6900XT to find out
> in which kernel the warning message "DMA-API: amdgpu
> :0f:00.0: cacheline tracking EEXIST, overlapping mappings aren't
> supported" first appeared.
> Kernel 6.3 and older won't boot on a computer with a Radeon 7900XTX.
> When I booted the system with 6900XT I saw a green flashing bar on top
> of the screen when I typed commands in the gnome terminal which was
> maximized on full screen.
> Demonstration: https://youtu.be/tTvwQ_5pRkk
> To reproduce, you need a Radeon 6900XT GPU connected to a 120Hz OLED TV over HDMI.
>
> I bisected the issue and the first commit which I found was 6d4279cb99ac.
> commit 6d4279cb99ac4f51d10409501d29969f687ac8dc (HEAD)
> Author: Rodrigo Siqueira 
> Date:   Tue Mar 26 10:42:05 2024 -0600
>
> drm/amd/display: Drop legacy code
>
> This commit removes code that are not used by display anymore.
>
> Acked-by: Hamza Mahfooz 
> Signed-off-by: Rodrigo Siqueira 
> Signed-off-by: Alex Deucher 
>
>  drivers/gpu/drm/amd/display/dc/inc/hw/stream_encoder.h |  4 
>  drivers/gpu/drm/amd/display/dc/inc/resource.h  |  7 ---
>  drivers/gpu/drm/amd/display/dc/optc/dcn20/dcn20_optc.c | 10 --
>  drivers/gpu/drm/amd/display/dc/resource/dcn21/dcn21_resource.c | 33 +
>  4 files changed, 1 insertion(+), 53 deletions(-)
>
> After every bisect I usually verify that I found the right commit by
> building a kernel with the bad commit reverted.
> But this time I still observed the issue when running a kernel built
> without commit 6d4279cb99ac.
> So I decided to look for a second bad commit.
> The second bad commit turned out to be bc87d666c05.
> commit bc87d666c05a13e6d4ae1ddce41fc43d2567b9a2 (HEAD)
> Author: Rodrigo Siqueira 
> Date:   Tue Mar 26 11:55:19 2024 -0600
>
> drm/amd/display: Add fallback configuration for set DRR in DCN10
>
> Set OTG/OPTC parameters to 0 if something goes wrong on DCN10.
>
> Acked-by: Hamza Mahfooz 
> Signed-off-by: Rodrigo Siqueira 
> Signed-off-by: Alex Deucher 
>
>  drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c | 15 ---
>  1 file changed, 12 insertions(+), 3 deletions(-)
>
> After reverting both these commits on top of 54f71b0369c9 the issue is gone.
>
> I also attach the build config.
>
> My hardware specs: https://linux-hardware.org/?probe=f25a873c5e
>
> Rodrigo, or anyone else from the AMD team, could you please take a look?
>

Has anyone had a chance to look at this?

-- 
Best Regards,
Mike Gavrilov.


6.10/bisected/regression - commits bc87d666c05 and 6d4279cb99ac cause appearing green flashing bar on top of screen on Radeon 6900XT and 120Hz

2024-05-26 Thread Mikhail Gavrilov
Hi,
The day before yesterday I replaced the 7900XTX with a 6900XT to find out
in which kernel the warning message "DMA-API: amdgpu
:0f:00.0: cacheline tracking EEXIST, overlapping mappings aren't
supported" first appeared.
Kernel 6.3 and older won't boot on a computer with a Radeon 7900XTX.
When I booted the system with the 6900XT I saw a green flashing bar at the
top of the screen while typing commands in a GNOME terminal that was
maximized to full screen.
Demonstration: https://youtu.be/tTvwQ_5pRkk
To reproduce, you need a Radeon 6900XT GPU connected to a 120Hz OLED TV via HDMI.

I bisected the issue and the first commit which I found was 6d4279cb99ac.
commit 6d4279cb99ac4f51d10409501d29969f687ac8dc (HEAD)
Author: Rodrigo Siqueira 
Date:   Tue Mar 26 10:42:05 2024 -0600

drm/amd/display: Drop legacy code

This commit removes code that are not used by display anymore.

Acked-by: Hamza Mahfooz 
Signed-off-by: Rodrigo Siqueira 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/display/dc/inc/hw/stream_encoder.h |  4 
 drivers/gpu/drm/amd/display/dc/inc/resource.h  |  7 ---
 drivers/gpu/drm/amd/display/dc/optc/dcn20/dcn20_optc.c | 10 --
 drivers/gpu/drm/amd/display/dc/resource/dcn21/dcn21_resource.c | 33 +
 4 files changed, 1 insertion(+), 53 deletions(-)

After every bisect I usually verify that I found the right commit by
building a kernel with the bad commit reverted.
But this time I still observed the issue when running a kernel built
without commit 6d4279cb99ac.
So I decided to look for a second bad commit.
The second bad commit turned out to be bc87d666c05.
commit bc87d666c05a13e6d4ae1ddce41fc43d2567b9a2 (HEAD)
Author: Rodrigo Siqueira 
Date:   Tue Mar 26 11:55:19 2024 -0600

drm/amd/display: Add fallback configuration for set DRR in DCN10

Set OTG/OPTC parameters to 0 if something goes wrong on DCN10.

Acked-by: Hamza Mahfooz 
Signed-off-by: Rodrigo Siqueira 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/display/dc/optc/dcn10/dcn10_optc.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

After reverting both these commits on top of 54f71b0369c9 the issue is gone.

I also attach the build config.

My hardware specs: https://linux-hardware.org/?probe=f25a873c5e

Rodrigo, or anyone else from the AMD team, could you please take a look?

-- 
Best Regards,
Mike Gavrilov.


.config.zip
Description: Zip archive


Re: regression/bisected/6.8 commit f7fe64ad0f22ff034f8ebcfbd7299ee9cc9b57d7 leads to GPU hang when I open GNOME activities

2024-01-24 Thread Mikhail Gavrilov
On Wed, Jan 24, 2024 at 7:19 AM Mikhail Gavrilov
 wrote:
>
> Who could dig into it, please?

Did you decide to revert it?
https://lkml.org/lkml/2024/1/22/1866

Also, I forgot to attach the kernel build .config to the previous
message, so I am attaching it here.
It may be useful for reproducing my bug script.

-- 
Best Regards,
Mike Gavrilov.


.config.zip
Description: Zip archive


Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-12-19 Thread Mikhail Gavrilov
On Fri, Dec 15, 2023 at 5:37 PM Christian König
 wrote:
>
> I have no idea :)
>
>  From the logs I can see that the AMDGPU now has the proper BARs assigned:
>
> [5.722015] pci :03:00.0: [1002:73df] type 00 class 0x038000
> [5.722051] pci :03:00.0: reg 0x10: [mem
> 0xf8-0xfb 64bit pref]
> [5.722081] pci :03:00.0: reg 0x18: [mem
> 0xfc-0xfc0fff 64bit pref]
> [5.722112] pci :03:00.0: reg 0x24: [mem 0xfca0-0xfcaf]
> [5.722134] pci :03:00.0: reg 0x30: [mem 0xfcb0-0xfcb1 pref]
> [5.722368] pci :03:00.0: PME# supported from D1 D2 D3hot D3cold
> [5.722484] pci :03:00.0: 63.008 Gb/s available PCIe bandwidth,
> limited by 8.0 GT/s PCIe x8 link at :00:01.1 (capable of 252.048
> Gb/s with 16.0 GT/s PCIe x16 link)
>
> And with that the driver can work perfectly fine.
>
> Have you updated the BIOS or added/removed some other hardware? Maybe
> somebody added a quirk for your BIOS into the PCIe code or something
> like that.
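
For reference, the two bandwidth figures in the quoted log line (63.008 Gb/s and 252.048 Gb/s) can be reproduced with simple arithmetic. The sketch below assumes the 128b/130b line encoding used by 8.0 and 16.0 GT/s links and truncation to whole Mb/s per lane. The helper name pcie_link_mbps() is hypothetical; this is an illustration, not the kernel's code.

```c
#include <assert.h>

/*
 * Usable bandwidth in Mb/s for a PCIe link running at 8.0 or 16.0 GT/s.
 * Assumption: 128b/130b encoding, truncated to whole Mb/s per lane.
 * pcie_link_mbps() is a hypothetical helper, not a kernel API.
 */
static unsigned long pcie_link_mbps(unsigned long rate_mtps, unsigned int lanes)
{
	unsigned long per_lane;

	assert(lanes > 0);
	/* 128 payload bits are carried in every 130 bits on the wire. */
	per_lane = rate_mtps * 128UL / 130UL;

	return per_lane * lanes;
}
```

Calling pcie_link_mbps(8000, 8) models the x8 link at 8.0 GT/s from the log, and pcie_link_mbps(16000, 16) models the x16 link at 16.0 GT/s that the device is capable of.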

No, nothing changed in hardware.
But I found the commit which fixes it.

> git bisect unfixed
92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit
commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
Author: Vasant Hegde 
Date:   Thu Sep 21 09:21:45 2023 +

iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device
ATS, PRI, and PASID capabilities. But these capabilities can be enabled
independently (except PRI requires ATS support). Hence, replace
the iommu_v2 variable with a flags variable, which keep track of the device
capabilities.

From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability
with all VFs"), device PRI/PASID is shared between PF and any associated
VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of
pci_find_ext_capability() to check device PRI/PASID support.

Signed-off-by: Vasant Hegde 
Reviewed-by: Jason Gunthorpe 
Reviewed-by: Jerry Snitselaar 
Link: https://lore.kernel.org/r/20230921092147.5930-13-vasant.he...@amd.com
Signed-off-by: Joerg Roedel 

 drivers/iommu/amd/amd_iommu_types.h |  3 ++-
 drivers/iommu/amd/iommu.c   | 46 ++---
 2 files changed, 30 insertions(+), 19 deletions(-)


> git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag
'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP
bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag
'media/v6.7-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag
'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag
'iommu-updates-v6.7' of
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag
'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag
'exfat-for-6.7-rc1-part2' of
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag
'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove
DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd:
Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove
iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd:
Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed: [739eb25514c90aa

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-12-15 Thread Mikhail Gavrilov
On Tue, Feb 28, 2023 at 5:43 PM Christian König
 wrote:
>
> The point is it doesn't need to talk to the amdgpu hardware. What it
> does is that it talks to the good old VGA/VESA emulation and that just
> happens to be still enabled by the BIOS/GRUB.
>
> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
> hw running in the state where it was initialized before the kernel
> started. The kernel just grabs the addresses where it needs to write the
> display data and keeps going with that.
>
> But when a hw specific driver wants to load this is the first thing
> which gets disabled because we need to load new firmware. And with the
> BARs disabled this can't be re-enabled without rebooting the system.
>
> > My suggestion is that if
> > amdgpu fails to talk to the hardware, then let another suitable driver
> > do it. I attached a system log when I apply "pci=nocrs" with
> > "modprobe.blacklist=amdgpu" for showing that graphics work right in
> > this case.
> > To do this, does the Linux module loading mechanism need to be refined?
>
> That's actually working as expected. The real problem is that the BIOS
> on that system is so broken that we can't access the hw correctly.
>
> What we could do is to check the BARs very early on and refuse to
> load when they are disabled. The problem with this approach is that there
> are systems where it is normal that the BARs are disabled until the
> driver loads and get enabled during the hardware initialization process.
>
> What you might want to look into is to find a quirk for the BIOS to
> properly enable the nvme controller.
>

That's interesting. I noticed that amdgpu can now work even with the
[pci=nocrs] parameter on 6.7.0-0.rc4 and later kernels.
Does it mean the BARs became available?
I attached the kernel log and lspci output here. What changed?

-- 
Best Regards,
Mike Gavrilov.


Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

2023-11-17 Thread Mikhail Gavrilov
On Thu, Nov 16, 2023 at 11:56 PM Alex Deucher  wrote:
>
> This patch should address the issue:
> https://patchwork.freedesktop.org/patch/567101/
> If you still see issues, you may also need this series:
> https://patchwork.freedesktop.org/series/126220/
>
> Alex

Thanks.
The first patch alone is enough.
Tested-on: 7900XTX, 6900XT and 6800M.
Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.


Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

2023-11-07 Thread Mikhail Gavrilov
On Wed, Nov 8, 2023 at 12:12 AM Alex Deucher  wrote:
>
> The attached patch should fix it.  Not sure why your GPU shows up as
> busy.  The AGP aperture was just disabled.

Tested-by: Mikhail Gavrilov 
Thanks, after applying the patch the GPU load is back to what I expect.
Games are working, so overall everything looks good for now.

-- 
Best Regards,
Mike Gavrilov.


Re: 6.7/regression/KASAN: null-ptr-deref in amdgpu_ras_reset_error_count+0x2d6

2023-11-07 Thread Mikhail Gavrilov
On Mon, Nov 6, 2023 at 8:29 PM Alex Deucher  wrote:
>
> Already fixed in this commit:
> https://gitlab.freedesktop.org/agd5f/linux/-/commit/d1d4c0b7b65b7fab2bc6f97af9e823b1c42ccdb0
> Which is included in last week's PR.
>

Thanks, it fixed the issue above.
But unfortunately this is not the only problem I see on my laptop.
Now I am observing 100% GPU load all the time,
as shown in this screenshot: https://postimg.cc/QHLQncMg

Another round of bisecting says that this commit is to blame:
❯ git bisect good
de59b69932e64d77445d973a101d81d6e7e670c6 is the first bad commit
commit de59b69932e64d77445d973a101d81d6e7e670c6
Author: Alex Deucher 
Date:   Wed Sep 20 13:27:58 2023 -0400

drm/amdgpu/gmc: set a default disable value for AGP

To disable AGP, the start needs to be set to a higher
value than the end.  Set a default disable value for
the AGP aperture and allow the IP specific GMC code
to enable it selectively be calling amdgpu_gmc_agp_location().

Reviewed-by: Christian König 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c   | 27 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h   |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c|  3 +++
 drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|  3 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v11_0.c|  3 ++-
 drivers/gpu/drm/amd/amdgpu/gmc_v6_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v7_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v8_0.c |  4 ++--
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |  3 ++-
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c |  2 +-
 10 files changed, 37 insertions(+), 18 deletions(-)

I checked twice and made sure that it does not happen on commit
29495d81457a483c2859ccde59cc063034bfe47d
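
The rule stated in the commit message above, that an aperture is disabled by setting its start higher than its end, can be illustrated with a small sketch. The struct, the helper names and the particular "disabled" values (all-ones start, zero end) are assumptions made for this example; they are not taken from amdgpu_gmc.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical aperture description; field names are illustrative. */
struct aperture {
	uint64_t start;
	uint64_t end;	/* inclusive */
};

/* Mark the aperture as disabled by making start greater than end. */
static void aperture_disable(struct aperture *ap)
{
	assert(ap != NULL);
	ap->start = UINT64_MAX;	/* assumed value, any start > end works */
	ap->end = 0;
}

/* An address matches only if start <= addr <= end. */
static bool aperture_contains(const struct aperture *ap, uint64_t addr)
{
	return addr >= ap->start && addr <= ap->end;
}

/* A range with start > end is empty, which is what "disabled" means here. */
static bool aperture_enabled(const struct aperture *ap)
{
	return ap->start <= ap->end;
}
```

Because no address can be both at or above the start and at or below the end of such a range, a disabled aperture never claims an access.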

-- 
Best Regards,
Mike Gavrilov.


Re: [PATCH] drm/ttm: check null pointer before accessing when swapping

2023-07-27 Thread Mikhail Gavrilov
On Thu, Jul 27, 2023 at 12:33 PM Chen, Guchun  wrote:
> > Reviewed-by: Christian König 
> >
> > Has this already been pushed to drm-misc-next?
> >
> > Thanks,
> > Christian.
>
> Not yet, Christian, as I don't have push permission. I saw you were on
> vacation, so I would expect to ping you to push after you are back with
> full recharge.

I expect to see it in drm-fixes-6.5 because the problem appeared during
the 6.5 release cycle.
And yes, I follow all pull requests. This patch was not included in
yesterday's pull request :(

-- 
Best Regards,
Mike Gavrilov.


Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-25 Thread Mikhail Gavrilov
On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
 wrote:
>
> It is important not to give up.
> https://youtu.be/25zhHBGIHJ8 [40 min]
> https://youtu.be/utnDR26eYBY [50 min]
> https://youtu.be/DJQ_tiimW6g [12 min]
> https://youtu.be/Y6AH1oJKivA [6 min]
> Yes, the issue is always reproducible, but from time to time it does not
> happen on the first attempt.
> I also uploaded other videos which prove that the issue definitely
> occurs if someone launches those games in turn.
> Reproducing it is only a matter of time.
>
> Anyway, I didn't want you to spend so much time trying to reproduce it.
> This monkey business suits me more than you.
> It would be better if I could collect more useful info.

Christian,
Did you manage to reproduce the problem?

At the weekend I ran into a slab-use-after-free in amdgpu_vm_handle_moved.
I was not playing any games at the time.
The Xwayland process was affected, which led to a desktop hang.

==
BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
Read of size 8 at addr 888295c66190 by task Xwayland:cs0/173185

CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: GWL
---  ---  6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug
#1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
 
 dump_stack_lvl+0x76/0xd0
 print_report+0xcf/0x670
 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 kasan_report+0xa8/0xe0
 ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
 amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? mark_lock+0x101/0x16e0
 ? __lock_acquire+0xe54/0x59f0
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 drm_ioctl_kernel+0x1fc/0x3d0
 ? __pfx_drm_ioctl_kernel+0x10/0x10
 drm_ioctl+0x4c5/0xaa0
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? __pfx_drm_ioctl+0x10/0x10
 ? _raw_spin_unlock_irqrestore+0x66/0x80
 ? lockdep_hardirqs_on+0x81/0x110
 ? _raw_spin_unlock_irqrestore+0x4f/0x80
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 ? do_syscall_64+0x6c/0x90
 ? lockdep_hardirqs_on+0x81/0x110
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7ffb71b0892d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:7ffb677fe840 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7ffb677fe9f8 RCX: 7ffb71b0892d
RDX: 7ffb677fe900 RSI: c0186444 RDI: 000d
RBP: 7ffb677fe890 R08: 7ffb677fea50 R09: 7ffb677fe8e0
R10: 556c4611bec0 R11: 0246 R12: 7ffb677fe900
R13: c0186444 R14: 000d R15: 7ffb677fe9f8
 

Allocated by task 173181:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 __kasan_kmalloc+0x8f/0xa0
 __kmalloc_node+0x65/0x160
 amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
 amdgpu_bo_create_user+0xca/0x160 [amdgpu]
 amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
 drm_ioctl_kernel+0x1fc/0x3d0
 drm_ioctl+0x4c5/0xaa0
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 173185:
 kasan_save_stack+0x33/0x60
 kasan_set_track+0x25/0x30
 kasan_save_free_info+0x2e/0x50
 __kasan_slab_free+0x10b/0x1a0
 slab_free_freelist_hook+0x11e/0x1d0
 __kmem_cache_free+0xc0/0x2e0
 ttm_bo_release+0x667/0x9e0 [ttm]
 amdgpu_bo_unref+0x35/0x70 [amdgpu]
 amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
 drm_gem_handle_delete+0xe3/0x150
 drm_ioctl_kernel+0x1fc/0x3d0
 drm_ioctl+0x4c5/0xaa0
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Last potentially related work creation:
 kasan_save_stack+0x33/0x60
 __kasan_record_aux_stack+0x97/0xb0
 __call_rcu_common.constprop.0+0xf8/0x1af0
 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
 dma_resv_reserve_fences+0x4dc/0x7f0
 ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
 amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
 drm_ioctl_kernel+0x1fc/0x3d0
 drm_ioctl+0x4c5/0xaa0
 amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x131/0x1a0
 do_syscall_64+0x60/0x90
 entry_SYSCALL_64_after_hwframe+0x72/0xdc

Second to last potentially related work creation:
 kasan_save_stack+0x33/0x60
 __kasan_record_aux_stack+0x97/0xb0
 __call_rcu_common.constprop.0+0xf8/0x1af0
 drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
 amdgpu_ctx_add

Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-20 Thread Mikhail Gavrilov
On Thu, Apr 20, 2023 at 2:59 PM Christian König
 wrote:
> Could you try drm-misc-next as well?

Assuming I cloned the right repo with
$ git clone -b drm-misc-next
git://anongit.freedesktop.org/drm/drm-misc linux-drm-misc-next
the last commit on this branch turned out to be completely unusable on my hardware.
Instead of the GDM login screen I see a black screen and hear the GPU fans howling.

In the kernel logs I see general protection fault:
general protection fault, probably for non-canonical address
0xdc2b:  [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0158-0x015f]
CPU: 0 PID: 749 Comm: sdma0 Tainted: GWL
6.3.0-rc4-misc-next-91c249b2b9f6a80c744387b6713adf275ffd296b+ #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched]
Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00
00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c
02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8
RSP: 0018:c9000548fdb8 EFLAGS: 00010216
RAX: dc00 RBX:  RCX: 
RDX: 002b RSI: 0004 RDI: 0158
RBP: 085c R08:  R09: 888170711783
R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0
R13: 888170711780 R14: 888266f89820 R15: 888266f89808
FS:  () GS:888fa200() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 560cea4a8000 CR3: 000191602000 CR4: 00350ef0
Call Trace:
 
 drm_sched_main+0xc3/0x930 [gpu_sched]
 ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
 ? __pfx_autoremove_wake_function+0x10/0x10
 ? __kthread_parkme+0xc1/0x1f0
 ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
 kthread+0x2a2/0x340
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2c/0x50
 
Modules linked in: amdgpu(+) drm_ttm_helper ttm video crct10dif_pclmul
drm_suballoc_helper crc32_pclmul iommu_v2 crc32c_intel drm_buddy
polyval_clmulni gpu_sched polyval_generic ucsi_ccg drm_display_helper
typec_ucsi nvme ghash_clmulni_intel igb typec ccp sha512_ssse3 cec
nvme_core sp5100_tco dca i2c_algo_bit nvme_common wmi ip6_tables
ip_tables fuse
---[ end trace  ]---
RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched]
Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00
00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c
02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8
RSP: 0018:c9000548fdb8 EFLAGS: 00010216
RAX: dc00 RBX:  RCX: 
RDX: 002b RSI: 0004 RDI: 0158
RBP: 085c R08:  R09: 888170711783
R10: ed102e0e22f0 R11: 8da81678 R12: 8881707116b0
R13: 888170711780 R14: 888266f89820 R15: 888266f89808
FS:  () GS:888fa200() knlGS:
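
As a side note on reading this report: the non-canonical fault address and the "null-ptr-deref in range" line are two views of the same access. With generic KASAN on x86_64, the instrumented code checks one shadow byte per 8 bytes of memory, and the bytes `48 b8 00 00 00 00 00 fc ff df` and `48 c1 ea 03` in the Code: line are the load of the shadow offset constant and the shift by 3. The sketch below reproduces that arithmetic under the assumption of the default shadow offset 0xdffffc0000000000; the helper names are hypothetical, and the addresses in the archived log above appear shortened.

```c
#include <assert.h>
#include <stdint.h>

/* Generic KASAN on x86_64: one shadow byte covers 8 bytes of memory. */
#define KASAN_SHADOW_SCALE_SHIFT 3
/* Assumed default shadow offset for x86_64. */
#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL

/* Shadow byte address that is checked before touching 'addr'. */
static uint64_t kasan_shadow_addr(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

/* Inverse: first byte of the 8-byte granule a shadow address describes. */
static uint64_t kasan_granule_start(uint64_t shadow)
{
	assert(shadow >= KASAN_SHADOW_OFFSET);
	return (shadow - KASAN_SHADOW_OFFSET) << KASAN_SHADOW_SCALE_SHIFT;
}
```

For the access at 0x158 (RDI in the dump) the shadow address ends in 2b, which matches RDX and the tail of the faulting address, and mapping it back gives the granule 0x158 to 0x15f that the kernel prints as the null-ptr-deref range.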


I also attached a full system log.

-- 
Best Regards,
Mike Gavrilov.


system-log.tar.xz
Description: application/xz


Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-20 Thread Mikhail Gavrilov
On Thu, Apr 20, 2023 at 2:59 PM Christian König
 wrote:
>
> Could you try drm-misc-next as well?
>
> Going to give drm-fixes another round of testing.
>
> Thanks,
> Christian.

It is important not to give up.
https://youtu.be/25zhHBGIHJ8 [40 min]
https://youtu.be/utnDR26eYBY [50 min]
https://youtu.be/DJQ_tiimW6g [12 min]
https://youtu.be/Y6AH1oJKivA [6 min]
Yes, the issue is always reproducible, but from time to time it does not
happen on the first attempt.
I also uploaded other videos which prove that the issue definitely
occurs if someone launches those games in turn.
Reproducing it is only a matter of time.

Anyway, I didn't want you to spend so much time trying to reproduce it.
This monkey business suits me more than you.
It would be better if I could collect more useful info.

-- 
Best Regards,
Mike Gavrilov.


Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-19 Thread Mikhail Gavrilov
On Wed, Apr 19, 2023 at 1:12 PM Christian König
 wrote:
>
> I'm already looking into this, but can't figure out why we run into
> problems here.
>
> What happens is that a CS is aborted without sending the job to the
> scheduler and in this case the cleanup function doesn't seem to work.
>
> Christian.

I can easily reproduce it on any AMD GPU hardware.
You can add more debug logging and I will come back with new logs that explain this.
Thanks.

-- 
Best Regards,
Mike Gavrilov.


Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-19 Thread Mikhail Gavrilov
Christian?

❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line
/lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug
drm_sched_job_cleanup+0x9a
drm_sched_job_cleanup+0x9a/0x130:
drm_sched_job_cleanup at
/usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808
(discriminator 3)

❯ cat -s -n 
/usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c
| head -818 | tail -20
   799                  /* drm_sched_job_arm() has been called */
   800                  dma_fence_put(&job->s_fence->finished);
   801          } else {
   802                  /* aborted job before committing to run it */
   803                  drm_sched_fence_free(job->s_fence);
   804          }
   805
   806          job->s_fence = NULL;
   807
   808          xa_for_each(&job->dependencies, index, fence) {
   809                  dma_fence_put(fence);
   810          }
   811          xa_destroy(&job->dependencies);
   812
   813  }
   814  EXPORT_SYMBOL(drm_sched_job_cleanup);
   815
   816  /**
   817   * drm_sched_ready - is the scheduler ready
   818   *

> git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-17 10:49:16 +0200 800)
dma_fence_put(&job->s_fence->finished);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-17 10:49:16 +0200 801) } else {
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-17 10:49:16 +0200 802) /* aborted job
before committing to run it */
d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c(Boris
Brezillon 2021-09-03 14:05:54 +0200 803)
drm_sched_fence_free(job->s_fence);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-17 10:49:16 +0200 804) }
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-17 10:49:16 +0200 805)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat
Masetty  2018-10-29 15:02:28 +0530 806) job->s_fence = NULL;
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-05 12:46:49 +0200 807)
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-05 12:46:49 +0200 808)
xa_for_each(&job->dependencies, index, fence) {
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-05 12:46:49 +0200 809)
dma_fence_put(fence);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-05 12:46:49 +0200 810) }
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-05 12:46:49 +0200 811)
xa_destroy(&job->dependencies);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c(Daniel
Vetter   2021-08-05 12:46:49 +0200 812)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat
Masetty  2018-10-29 15:02:28 +0530 813) }
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat
Masetty  2018-10-29 15:02:28 +0530 814)
EXPORT_SYMBOL(drm_sched_job_cleanup);
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c(Sharat
Masetty  2018-10-29 15:02:28 +0530 815)
e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian
König 2015-08-20 17:01:01 +0200 816) /**
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
Deshmukh  2018-05-29 11:23:07 +0530 817)  * drm_sched_ready - is the
scheduler ready
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
Deshmukh  2018-05-29 11:23:07 +0530 818)  *
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
Deshmukh  2018-05-29 11:23:07 +0530 819)  * @sched: scheduler instance

Daniel, since Christian looks a little busy, can you help? Git
blame says that you are the author of the code which KASAN mentions in its
report.
The issue is reproducible on all available AMD hardware: 6800M, 6900XT, 7900XTX.
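
A general note on why the KASAN report in this thread shows a read at a small address such as 0x78: when a structure member is read through a NULL pointer, the address that is actually touched equals the member's offset inside the structure. The sketch below shows this with a made-up layout. struct fake_fence and its members are hypothetical and only chosen so that the 4-byte member lands at offset 0x78; the real layout of the scheduler structures is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 0x78 bytes of other members, then a 4-byte counter. */
struct fake_fence {
	unsigned char other_members[0x78];
	uint32_t refcount;
};

/* Address the CPU would touch when reading ->refcount through 'base'. */
static uintptr_t refcount_access_addr(uintptr_t base)
{
	/* Guard against wrap-around when adding the member offset. */
	assert(base <= UINTPTR_MAX - offsetof(struct fake_fence, refcount));
	return base + offsetof(struct fake_fence, refcount);
}
```

With base equal to 0 the function returns 0x78, so "Read of size 4 at addr ...0078" is what one would expect if a pointer to such a structure was NULL when a 4-byte member at that offset was read.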

-- 
Best Regards,
Mike Gavrilov.


Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

2023-04-14 Thread Mikhail Gavrilov
On Tue, Apr 11, 2023 at 10:40 PM Mikhail Gavrilov
 wrote:
>
> Hi,
> KASAN continues to find problems in the drm_sched_job_cleanup code in 6.3-rc6.
> I did not get any feedback in the thread
> https://lore.kernel.org/lkml/cabxgcsmvub2ra4d+k5cna0_2521tox++d4nmoukki4x2-q_...@mail.gmail.com/
> Therefore, I decided to start a separate thread. Since the problems
> are different, the symptoms are also different.
>
> Reproduction scenario.
> After launching one of the listed games:
> - Cyberpunk 2077
> - Forza Horizon 4
> - Forza Horizon 5
> - Sackboy: A Big Adventure
>
> First, after some time (possibly after several attempts), a bug
> message from KASAN appears:
> ==
> BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
> Read of size 4 at addr 0078 by task ForzaHorizon4.e/31587
>
> CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GWL
> ---  ---  6.3.0-0.rc6.49.fc39.x86_64+debug #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023
> Call Trace:
>  
>  dump_stack_lvl+0x72/0xc0
>  kasan_report+0xa4/0xe0
>  ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>  kasan_check_range+0x104/0x1b0
>  drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>  ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
>  ? slab_free_freelist_hook+0x11e/0x1d0
>  ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
>  amdgpu_job_free+0x40/0x1b0 [amdgpu]
>  amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
>  ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
>  amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
>  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>  ? __kmem_cache_free+0xbc/0x2e0
>  ? mark_lock+0x101/0x16e0
>  ? __lock_acquire+0xe54/0x59f0
>  ? kasan_save_stack+0x3f/0x50
>  ? __pfx_lock_release+0x10/0x10
>  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>  drm_ioctl_kernel+0x1f8/0x3d0
>  ? __pfx_drm_ioctl_kernel+0x10/0x10
>  drm_ioctl+0x4c1/0xaa0
>  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>  ? __pfx_drm_ioctl+0x10/0x10
>  ? _raw_spin_unlock_irqrestore+0x62/0x80
>  ? lockdep_hardirqs_on+0x7d/0x100
>  ? _raw_spin_unlock_irqrestore+0x4b/0x80
>  amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
>  __x64_sys_ioctl+0x12d/0x1a0
>  do_syscall_64+0x5c/0x90
>  ? do_syscall_64+0x68/0x90
>  ? lockdep_hardirqs_on+0x7d/0x100
>  ? do_syscall_64+0x68/0x90
>  ? do_syscall_64+0x68/0x90
>  ? lockdep_hardirqs_on+0x7d/0x100
>  ? do_syscall_64+0x68/0x90
>  ? asm_exc_page_fault+0x22/0x30
>  ? lockdep_hardirqs_on+0x7d/0x100
>  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fb8a270881d
> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
> 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> RSP: 002b:467ad060 EFLAGS: 0246 ORIG_RAX: 0010
> RAX: ffda RBX: 467ad358 RCX: 7fb8a270881d
> RDX: 467ad140 RSI: c0186444 RDI: 005a
> RBP: 467ad0b0 R08: 7fb7f00d3eb0 R09: 467ad100
> R10: 7fb88c68fb20 R11: 0246 R12: 467ad140
> R13: c0186444 R14: 005a R15: 7fb7f00d3e50
>  
> ==
>
> Finally, the games listed above stop working: they get
> stuck after a kernel warning:
> general protection fault, probably for non-canonical address
> 0xdc0f:  [#1] PREEMPT SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0078-0x007f]
> CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: GB   WL
> ---  ---  6.3.0-0.rc6.49.fc39.x86_64+debug #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023
> RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched]
> Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8
> ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6
> 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a
> RSP: 0018:c9003676f5a8 EFLAGS: 00010216
> RAX: dc00 RBX: 88816f81f020 RCX: 0001
> RDX: 000f RSI: 0008 RDI: 9053e5e0
> RBP: 88816f81f000 R08: 0001 R09: 9053e5e7
> R10: fbfff20a7cbc R11: 6e696c6261736944 R12: 0078
> R13: 192006cedeb5 R14:  R15: c9003676f870
> FS:  4680f6c0() GS:888fa5c0() knlGS:2991
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7fb854d6f010 CR3: 00017b2d6000 CR4: 00350ee0
> Call Trace

Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]

2023-04-04 Thread Mikhail Gavrilov
On Fri, Mar 24, 2023 at 7:37 PM Christian König
 wrote:
>
> Yeah, that one
>
> Thanks for the info, looks like this isn't fixed.
>
> Christian.
>

Hi,
I'm glad to see that "BUG: KASAN: slab-use-after-free in
drm_sched_get_cleanup_job+0x47b/0x5c0" was fixed in 6.3-rc5.
For the record, it would be good to know which commit fixed this issue.
I waited for this moment because I know of another issue which was also
found by the KASAN sanitizer.

BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
Read of size 4 at addr 0078 by task GameThread/23915

CPU: 10 PID: 23915 Comm: GameThread Tainted: GWL
---  ---  6.3.0-0.rc5.42.fc39.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
 
 dump_stack_lvl+0x72/0xc0
 kasan_report+0xa4/0xe0
 ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
 kasan_check_range+0x104/0x1b0
 drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
 ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
 ? slab_free_freelist_hook+0x11e/0x1d0
 ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
 amdgpu_job_free+0x40/0x1b0 [amdgpu]
 amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
 ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
 amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? mark_lock+0x101/0x16e0
 ? __lock_acquire+0xe54/0x59f0
 ? __pfx_lock_release+0x10/0x10
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 drm_ioctl_kernel+0x1f8/0x3d0
 ? __pfx_drm_ioctl_kernel+0x10/0x10
 drm_ioctl+0x4c1/0xaa0
 ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
 ? __pfx_drm_ioctl+0x10/0x10
 ? _raw_spin_unlock_irqrestore+0x62/0x80
 ? lockdep_hardirqs_on+0x7d/0x100
 ? _raw_spin_unlock_irqrestore+0x4b/0x80
 amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
 __x64_sys_ioctl+0x12d/0x1a0
 do_syscall_64+0x5c/0x90
 ? do_syscall_64+0x68/0x90
 ? lockdep_hardirqs_on+0x7d/0x100
 ? do_syscall_64+0x68/0x90
 ? do_syscall_64+0x68/0x90
 ? lockdep_hardirqs_on+0x7d/0x100
 ? do_syscall_64+0x68/0x90
 ? do_syscall_64+0x68/0x90
 ? lockdep_hardirqs_on+0x7d/0x100
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fe97a50881d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:7c35d3f0 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7c35d6e8 RCX: 7fe97a50881d
RDX: 7c35d4d0 RSI: c0186444 RDI: 00ae
RBP: 7c35d440 R08: 7fe8fc0f0970 R09: 7c35d490
R10: 7fb79000 R11: 0246 R12: 7c35d4d0
R13: c0186444 R14: 00ae R15: 7fe8fc0f0900
 

I know at least 3 games which trigger this bug 100% of the time:
- Cyberpunk 2077
- Forza Horizon 4
- Forza Horizon 5

Should we continue to discuss it here, or would it be better to create a
new thread (so that anyone who also faces this issue can easily find a
solution on the internet)?

A full kernel log is attached here, as usual.

-- 
Best Regards,
Mike Gavrilov.


dmesg.tar.xz
Description: application/xz


Re: BUG: KASAN: slab-use-after-free in drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]

2023-03-23 Thread Mikhail Gavrilov
On Tue, Mar 21, 2023 at 11:47 PM Christian König
 wrote:
>
> Hi Mikhail,
>
> That looks like a reference counting issue to me.
>
> I'm going to take a look, but we have already fixed one of those recently.
>
> Probably best that you try this on drm-fixes, just to double check that
> this isn't the same issue.
>

Hi Christian,
did you mean this branch?
$ git clone -b drm-fixes git://anongit.freedesktop.org/drm/drm linux-drm

If so, I just checked, and unfortunately the issue is still not fixed there.

[ 1984.295833] 
==
[ 1984.295876] BUG: KASAN: slab-use-after-free in
drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.295898] Read of size 8 at addr 88814cadc4c0 by task sdma1/764

[ 1984.295924] CPU: 12 PID: 764 Comm: sdma1 Tainted: GWL
  6.3.0-rc3-drm-fixes+ #1
[ 1984.295937] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4601 02/02/2023
[ 1984.295951] Call Trace:
[ 1984.295963]  
[ 1984.295975]  dump_stack_lvl+0x72/0xc0
[ 1984.295991]  print_report+0xcf/0x670
[ 1984.296007]  ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296030]  ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296047]  kasan_report+0xa4/0xe0
[ 1984.296118]  ? drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296149]  drm_sched_get_cleanup_job+0x47b/0x5c0 [gpu_sched]
[ 1984.296175]  drm_sched_main+0x643/0x990 [gpu_sched]
[ 1984.296204]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[ 1984.296222]  ? __pfx_autoremove_wake_function+0x10/0x10
[ 1984.296290]  ? __kthread_parkme+0xc1/0x1f0
[ 1984.296304]  ? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
[ 1984.296321]  kthread+0x29e/0x340
[ 1984.296334]  ? __pfx_kthread+0x10/0x10
[ 1984.296501]  ret_from_fork+0x2c/0x50
[ 1984.296518]  

[ 1984.296539] Allocated by task 12194:
[ 1984.296552]  kasan_save_stack+0x2f/0x50
[ 1984.296566]  kasan_set_track+0x21/0x30
[ 1984.296578]  __kasan_kmalloc+0x8b/0x90
[ 1984.296590]  amdgpu_driver_open_kms+0x10b/0x5a0 [amdgpu]
[ 1984.297051]  drm_file_alloc+0x46e/0x880
[ 1984.297064]  drm_open_helper+0x161/0x460
[ 1984.297076]  drm_open+0x1e7/0x5c0
[ 1984.297089]  drm_stub_open+0x24d/0x400
[ 1984.297107]  chrdev_open+0x215/0x620
[ 1984.297125]  do_dentry_open+0x5f1/0x1000
[ 1984.297146]  path_openat+0x1b3d/0x28a0
[ 1984.297164]  do_filp_open+0x1bd/0x400
[ 1984.297180]  do_sys_openat2+0x140/0x420
[ 1984.297197]  __x64_sys_openat+0x11f/0x1d0
[ 1984.297213]  do_syscall_64+0x5b/0x80
[ 1984.297231]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.297266] Freed by task 12195:
[ 1984.297284]  kasan_save_stack+0x2f/0x50
[ 1984.297303]  kasan_set_track+0x21/0x30
[ 1984.297323]  kasan_save_free_info+0x2a/0x50
[ 1984.297343]  __kasan_slab_free+0x107/0x1a0
[ 1984.297361]  slab_free_freelist_hook+0x11e/0x1d0
[ 1984.297373]  __kmem_cache_free+0xbc/0x2e0
[ 1984.297385]  amdgpu_driver_postclose_kms+0x582/0x8d0 [amdgpu]
[ 1984.297821]  drm_file_free.part.0+0x638/0xb70
[ 1984.297834]  drm_release+0x1ea/0x470
[ 1984.297845]  __fput+0x213/0x9e0
[ 1984.297857]  task_work_run+0x11b/0x200
[ 1984.297869]  exit_to_user_mode_prepare+0x23a/0x260
[ 1984.297883]  syscall_exit_to_user_mode+0x16/0x50
[ 1984.297896]  do_syscall_64+0x67/0x80
[ 1984.297907]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.298033] Last potentially related work creation:
[ 1984.298044]  kasan_save_stack+0x2f/0x50
[ 1984.298057]  __kasan_record_aux_stack+0x97/0xb0
[ 1984.298075]  __call_rcu_common.constprop.0+0xf8/0x1af0
[ 1984.298095]  amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu]
[ 1984.298557]  amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu]
[ 1984.299055]  amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu]
[ 1984.299624]  drm_ioctl_kernel+0x1f8/0x3d0
[ 1984.299637]  drm_ioctl+0x4c1/0xaa0
[ 1984.299649]  amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
[ 1984.300083]  __x64_sys_ioctl+0x12d/0x1a0
[ 1984.300097]  do_syscall_64+0x5b/0x80
[ 1984.300109]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.300135] Second to last potentially related work creation:
[ 1984.300149]  kasan_save_stack+0x2f/0x50
[ 1984.300167]  __kasan_record_aux_stack+0x97/0xb0
[ 1984.300185]  __call_rcu_common.constprop.0+0xf8/0x1af0
[ 1984.300203]  amdgpu_bo_list_put+0x1a4/0x1f0 [amdgpu]
[ 1984.300692]  amdgpu_cs_parser_fini+0x293/0x5a0 [amdgpu]
[ 1984.301133]  amdgpu_cs_ioctl+0x4f2a/0x5630 [amdgpu]
[ 1984.301577]  drm_ioctl_kernel+0x1f8/0x3d0
[ 1984.301598]  drm_ioctl+0x4c1/0xaa0
[ 1984.301610]  amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
[ 1984.302043]  __x64_sys_ioctl+0x12d/0x1a0
[ 1984.302056]  do_syscall_64+0x5b/0x80
[ 1984.302068]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

[ 1984.302090] The buggy address belongs to the object at 88814cadc000
which belongs to the cache kmalloc-4k of size 4096
[ 1984.302103] The buggy address is located 1216 bytes inside of
freed 4096-byte region [88814cadc000, 88814cadd000)

[ 1984.302129] The buggy address belongs to the physical page:
[ 1984.302141] page:
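
The offset quoted in the KASAN report above can be cross-checked by hand: subtracting the start of the object from the faulting address gives the position inside the freed kmalloc-4k object. A minimal Python sketch, using the addresses exactly as they are printed in this archive:

```python
# Addresses copied from the report above, as printed in this archive.
buggy_addr = 0x88814cadc4c0
object_start = 0x88814cadc000
object_size = 4096  # kmalloc-4k

# Byte offset of the faulting read inside the freed object.
offset = buggy_addr - object_start
# Exclusive end of the region, as in "[start, end)".
object_end = object_start + object_size

print(offset)           # 1216
print(hex(object_end))  # 0x88814cadd000
```

Both values agree with the "1216 bytes inside of freed 4096-byte region" line of the report.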

[6.3][regression] commit a4e771729a51168bc36317effaa9962e336d4f5e lead to flood kernel logs with warning messages "at kernel/workqueue.c:3167 __flush_work+0x472/0x500"

2023-03-08 Thread Mikhail Gavrilov
Hi,
I didn't run into the drm_bridge_hpd_enable+0x94/0x9c [drm] issue myself,
but the fix for it leads to warning messages on my laptop, an ASUS ROG
Strix G15 Advantage Edition G513QY-HQ007, which has two AMD GPUs:
a discrete Radeon 6800M and a Vega 8 integrated in the Cezanne CPU.

I found the bad commit by bisecting:
❯ git bisect bad
a4e771729a51168bc36317effaa9962e336d4f5e is the first bad commit
commit a4e771729a51168bc36317effaa9962e336d4f5e
Author: Dmitry Baryshkov 
Date:   Tue Jan 24 12:45:48 2023 +0200

drm/probe_helper: sort out poll_running vs poll_enabled

There are two flags attemting to guard connector polling:
poll_enabled and poll_running. While poll_enabled semantics is clearly
defined and fully adhered (mark that drm_kms_helper_poll_init() was
called and not finalized by the _fini() call), the poll_running flag
doesn't have such clearliness.

This flag is used only in drm_helper_probe_single_connector_modes() to
guard calling of drm_kms_helper_poll_enable, it doesn't guard the
drm_kms_helper_poll_fini(), etc. Change it to only be set if the polling
is actually running. Tie HPD enablement to this flag.

This fixes the following warning reported after merging the HPD series:

Hot plug detection already enabled
WARNING: CPU: 2 PID: 9 at drivers/gpu/drm/drm_bridge.c:1257
drm_bridge_hpd_enable+0x94/0x9c [drm]
Modules linked in: videobuf2_memops snd_soc_simple_card
snd_soc_simple_card_utils fsl_imx8_ddr_perf videobuf2_common
snd_soc_imx_spdif adv7511 etnaviv imx8m_ddrc imx_dcss mc cec nwl_dsi
gov
CPU: 2 PID: 9 Comm: kworker/u8:0 Not tainted
6.2.0-rc2-15208-g25b283acd578 #6
Hardware name: NXP i.MX8MQ EVK (DT)
Workqueue: events_unbound deferred_probe_work_func
pstate: 6005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : drm_bridge_hpd_enable+0x94/0x9c [drm]
lr : drm_bridge_hpd_enable+0x94/0x9c [drm]
sp : 89ef3740
x29: 89ef3740 x28: 09331f00 x27: 1000
x26: 0020 x25: 81148ed8 x24: 0a8fe000
x23: fffd x22: 05086348 x21: 81133ee0
x20: 0550d800 x19: 05086288 x18: 0006
x17:  x16: 896ef008 x15: 972891004260
x14: 2a1403e19400 x13: 972891004260 x12: 2a1403e19400
x11: 7100385f29400801 x10: 0aa0 x9 : 88112744
x8 : 00250b00 x7 : 0003 x6 : 0011
x5 :  x4 : bd986a48 x3 : 0001
x2 :  x1 :  x0 : 0025
Call trace:
 drm_bridge_hpd_enable+0x94/0x9c [drm]
 drm_bridge_connector_enable_hpd+0x2c/0x3c [drm_kms_helper]
 drm_kms_helper_poll_enable+0x94/0x10c [drm_kms_helper]
 drm_helper_probe_single_connector_modes+0x1a8/0x510 [drm_kms_helper]
 drm_client_modeset_probe+0x204/0x1190 [drm]
 __drm_fb_helper_initial_config_and_unlock+0x5c/0x4a4 [drm_kms_helper]
 drm_fb_helper_initial_config+0x54/0x6c [drm_kms_helper]
 drm_fbdev_client_hotplug+0xd0/0x140 [drm_kms_helper]
 drm_fbdev_generic_setup+0x90/0x154 [drm_kms_helper]
 dcss_kms_attach+0x1c8/0x254 [imx_dcss]
 dcss_drv_platform_probe+0x90/0xfc [imx_dcss]
 platform_probe+0x70/0xcc
 really_probe+0xc4/0x2e0
 __driver_probe_device+0x80/0xf0
 driver_probe_device+0xe0/0x164
 __device_attach_driver+0xc0/0x13c
 bus_for_each_drv+0x84/0xe0
 __device_attach+0xa4/0x1a0
 device_initial_probe+0x1c/0x30
 bus_probe_device+0xa4/0xb0
 deferred_probe_work_func+0x90/0xd0
 process_one_work+0x200/0x474
 worker_thread+0x74/0x43c
 kthread+0xfc/0x110
 ret_from_fork+0x10/0x20
---[ end trace  ]---

Reported-by: Laurentiu Palcu 
Fixes: c8268795c9a9 ("drm/probe-helper: enable and disable HPD on
connectors")
Tested-by: Marek Szyprowski 
Tested-by: Chen-Yu Tsai 
Acked-by: Laurentiu Palcu 
Tested-by: Laurentiu Palcu 
Tested-by: Laurent Pinchart 
Signed-off-by: Dmitry Baryshkov 
Signed-off-by: Neil Armstrong 
Link: 
https://patchwork.freedesktop.org/patch/msgid/20230124104548.3234554-2-dmitry.barysh...@linaro.org
(cherry picked from commit d33a54e3991dfce88b4fc6d9c3360951c2c5660d)
Signed-off-by: Thomas Zimmermann 

 drivers/gpu/drm/drm_probe_helper.c | 42 +++---
 1 file changed, 21 insertions(+), 21 deletions(-)

Of course, I checked the bisect result by reverting this
commit. I can confirm that without commit
a4e771729a51168bc36317effaa9962e336d4f5e the warning messages did not
appear during a full day of use.

I attached a full kernel log in case someone is interested in seeing it.
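
For readers unfamiliar with bisecting, the search behind the result above is a binary search over the commit history. The following Python sketch only illustrates the idea; it is not how git implements it, and the commit names and the `is_bad` check are hypothetical placeholders:

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, the first one is known
    good, the last one is known bad, and every commit after the first
    bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return commits[hi]


# Hypothetical history: the names are placeholders, not real hashes.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad(history, lambda c: int(c[1:]) >= 5)
print(culprit)  # c5
```

Reverting the reported commit and confirming that the warnings disappear, as done above, is the usual way to double-check such a result.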

-- 
Best Regards,
Mike Gavrilov.
git bisect start
# status: waiting for both good and bad commits
# good: [5b7c4cabbb65f5c469464da6c5f614cbd7f730f2] Merge tag 'net-next-6.3' of 
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bis

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-28 Thread Mikhail Gavrilov
On Mon, Feb 27, 2023 at 3:22 PM Christian König wrote:
>
> Unfortunately yes. We could clean that up a bit more so that you don't
> run into a BUG() assertion, but what essentially happens here is that we
> completely fail to talk to the hardware.
>
> In this situation we can't even re-enable vesa or text console any more.
>
Then I don't understand why, when amdgpu is blacklisted via
modprobe.blacklist=amdgpu, I see graphics and can log in to
GNOME. Yes, without hardware acceleration, but that is better than
non-working graphics. It means there is some other driver (I assume it
is "video") which can successfully talk to the AMD hardware under
conditions where amdgpu cannot. My suggestion is that if
amdgpu fails to talk to the hardware, another suitable driver should be
allowed to do it. I attached a system log from booting with "pci=nocrs"
and "modprobe.blacklist=amdgpu" to show that graphics work correctly in
this case.
Would the Linux module loading mechanism need to be refined to do this?


-- 
Best Regards,
Mike Gavrilov.


system-without-amdgpu.tar.xz
Description: application/xz


Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-24 Thread Mikhail Gavrilov
On Fri, Feb 24, 2023 at 8:31 PM Christian König
 wrote:
>
> Sorry I totally missed that you attached the full dmesg to your original
> mail.
>
> Yeah, the driver did fail gracefully. But then X doesn't come up and
> then gdm just dies.

Are you sure that these messages should be present when the driver
fails gracefully?

turning off the locking correctness validator.
CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
---  ---  6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
#1
Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
BIOS G513QY.320 09/07/2022
Call Trace:
 
 dump_stack_lvl+0x57/0x90
 register_lock_class+0x47d/0x490
 __lock_acquire+0x74/0x21f0
 ? lock_release+0x155/0x450
 lock_acquire+0xd2/0x320
 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
 ? lock_is_held_type+0xce/0x120
 _raw_spin_lock_irqsave+0x4d/0xa0
 ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
 amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
 amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu]
 amdgpu_driver_load_kms+0xe8/0x190 [amdgpu]
 amdgpu_pci_probe+0x140/0x420 [amdgpu]
 local_pci_probe+0x41/0x90
 pci_device_probe+0xc3/0x230
 really_probe+0x1b6/0x410
 __driver_probe_device+0x78/0x170
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 ? __pfx___driver_attach+0x10/0x10
 bus_for_each_dev+0x8a/0xd0
 bus_add_driver+0x141/0x230
 driver_register+0x77/0x120
 ? __pfx_init_module+0x10/0x10 [amdgpu]
 do_one_initcall+0x6e/0x350
 do_init_module+0x4a/0x220
 __do_sys_init_module+0x192/0x1c0
 do_syscall_64+0x5b/0x80
 ? asm_exc_page_fault+0x22/0x30
 ? lockdep_hardirqs_on+0x7d/0x100
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd58cfcb1be
Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01
RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be
RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010
RBP: 7ffd1d106690 R08: 55b0b5a93bd0 R09: 016b6ff0
R10: 55b5eea2c333 R11: 0246 R12: 55b0b5a96670
R13: 0002 R14: 55b0b5a9c170 R15: 55b0b5aa58a0
 
amdgpu: probe of :03:00.0 failed with error -12
amdgpu :08:00.0: enabling device (0006 -> 0007)
[drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4).


list_add corruption. prev->next should be next (c0940328), but
was . (prev=8c9b734062b0).
[ cut here ]
kernel BUG at lib/list_debug.c:30!
invalid opcode:  [#1] PREEMPT SMP NOPTI
CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
---  ---  6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
#1
Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
BIOS G513QY.320 09/07/2022
RIP: 0010:__list_add_valid+0x74/0x90
Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
RSP: 0018:a50f81aafa00 EFLAGS: 00010246
RAX: 0075 RBX: 8c9b734062b0 RCX: 
RDX:  RSI: 0027 RDI: 
RBP: 8c9b734062b0 R08:  R09: a50f81aaf8a0
R10: 0003 R11: 8caa1d2fffe8 R12: 8c9b7c0a5e48
R13:  R14: c13a6d20 R15: 
FS:  7fd58c6a5940() GS:8ca9d9a0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 55b0b5a955e0 CR3: 00017e86 CR4: 00750ee0
PKRU: 5554
Call Trace:
 
 ttm_device_init+0x184/0x1c0 [ttm]
 amdgpu_ttm_init+0xb8/0x610 [amdgpu]
 ? _printk+0x60/0x80
 gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu]
 amdgpu_device_init+0x14e5/0x2520 [amdgpu]
 amdgpu_driver_load_kms+0x15/0x190 [amdgpu]
 amdgpu_pci_probe+0x140/0x420 [amdgpu]
 local_pci_probe+0x41/0x90
 pci_device_probe+0xc3/0x230
 really_probe+0x1b6/0x410
 __driver_probe_device+0x78/0x170
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 ? __pfx___driver_attach+0x10/0x10
 bus_for_each_dev+0x8a/0xd0
 bus_add_driver+0x141/0x230
 driver_register+0x77/0x120
 ? __pfx_init_module+0x10/0x10 [amdgpu]
 do_one_initcall+0x6e/0x350
 do_init_module+0x4a/0x220
 __do_sys_init_module+0x192/0x1c0
 do_syscall_64+0x5b/0x80
 ? asm_exc_page_fault+0x22/0x30
 ? lockdep_hardirqs_on+0x7d/0x100
 entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd58cfcb1be
Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffd1d1065d8 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b0b5aa6d70 RCX: 7fd58cfcb1be
RDX: 55b0b5a96670 RSI: 016b6156 RDI: 7fd589392010
RBP: 7f

Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-24 Thread Mikhail Gavrilov
On Fri, Feb 24, 2023 at 12:13 PM Christian König
 wrote:
>
> Hi Mikhail,
>
> this is pretty clearly a problem with the system and/or its BIOS and
> not the GPU hw or the driver.
>
> The option pci=nocrs makes the kernel ignore additional resource windows
> the BIOS reports through ACPI. This then most likely leads to problems
> with amdgpu because it can't bring up its PCIe resources any more.
>
> The output of "sudo lspci - -s $BUSID_OF_AMDGPU" might help
> understand the problem

I attach lspci output both with and without pci=nocrs.

The differences for Cezanne Radeon Vega Series:
With pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 4: I/O ports at e000 [disabled] [size=256]
Capabilities: [c0] MSI-X: Enable- Count=4 Masked-

Without pci=nocrs:
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Interrupt: pin A routed to IRQ 44
Region 4: I/O ports at e000 [size=256]
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-


The differences for Navi 22 Radeon 6800M:
With pci=nocrs:
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at f8 (64-bit, prefetchable) [disabled] [size=16G]
Region 2: Memory at fc (64-bit, prefetchable) [disabled] [size=256M]
Region 5: Memory at fca0 (32-bit, non-prefetchable) [disabled] [size=1M]
AtomicOpsCtl: ReqEn-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address:   Data: 

Without pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 103
Region 0: Memory at f8 (64-bit, prefetchable) [size=16G]
Region 2: Memory at fc (64-bit, prefetchable) [size=256M]
Region 5: Memory at fca0 (32-bit, non-prefetchable) [size=1M]
AtomicOpsCtl: ReqEn+
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee0  Data: 
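
The flag comparison above can also be done mechanically. A small Python sketch, assuming only the "Name+" / "Name-" flag notation visible in the lspci lines quoted here; the helper names are made up for illustration:

```python
def parse_flags(line):
    """Turn an lspci flag line like 'Control: I/O- Mem+' into {name: enabled}."""
    _, _, rest = line.partition(":")
    flags = {}
    for token in rest.split():
        if token[-1] in "+-":
            flags[token[:-1]] = token.endswith("+")
    return flags


def changed_flags(before, after):
    """Return {name: (before, after)} for flags whose state differs."""
    a, b = parse_flags(before), parse_flags(after)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# Control lines for the Navi 22 GPU, copied from the comparison above.
with_nocrs = ("Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- "
              "ParErr- Stepping- SERR- FastB2B- DisINTx-")
without_nocrs = ("Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- "
                 "ParErr- Stepping- SERR- FastB2B- DisINTx+")

print(changed_flags(with_nocrs, without_nocrs))
# {'Mem': (False, True), 'BusMaster': (False, True), 'DisINTx': (False, True)}
```

For the Navi 22 device this shows that memory decoding, bus mastering and DisINTx are all off when booting with pci=nocrs.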

> but I strongly suggest to try a BIOS update first.

That was the first thing I did. And I am afraid there are no more BIOS updates.
https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/

I also have experience in dealing with manufacturers' tech support.
Usually it ends with "we do not provide drivers for Linux".

-- 
Best Regards,
Mike Gavrilov.
❯ sudo lspci - -s 08:00.0
08:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] 
Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4) (prog-if 00 
[VGA controller])
Subsystem: ASUSTeK Computer Inc. Radeon Vega 8
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort+ SERR- 
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [64] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 
unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit 
Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- 
LTR-
 10BitTagComp+ 10BitTagReq- OBFF Not Supported, ExtFmt+ 
EETLPPrefix+, MaxEETLPPrefixes 1
 EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-
 FRS-
 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 
10BitTagReq- OBFF Disabled,
 AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 
2Retimers+ DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
 Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
 

amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

2023-02-23 Thread Mikhail Gavrilov
Hi,
I have an ASUS ROG Strix G15 Advantage Edition G513QY-HQ007 laptop. But
it is impossible to use without AC power because the system loses the
NVMe drive when I disconnect the power adapter.

Messages from kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0x, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
and report a bug

I tried the recommended parameters
(nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
this issue, but without success.

On the linux-nvme mailing list, the last advice was to try the
"pci=nocrs" parameter.

But with this parameter the amdgpu driver refuses to work and makes
the system unbootable. I can work around the boot problem by
blacklisting the driver, but that is not a good solution because I
don't want to lose the GPU.

Why does amdgpu not work with "pci=nocrs"?
And is it possible to resolve this incompatibility?
This is very important because when I boot the system with "pci=nocrs"
and without the amdgpu driver, the NVMe drive is not lost when I
disconnect the power adapter. So "pci=nocrs" really helps.

Below is what I see in the kernel log when I add the "pci=nocrs" parameter:

amdgpu :03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu :03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
size is 9-bit
amdgpu :03:00.0: amdgpu: VRAM: 12272M 0x0080 -
0x0082FEFF (12272M used)
amdgpu :03:00.0: amdgpu: GART: 512M 0x - 0x1FFF
amdgpu :03:00.0: amdgpu: AGP: 267894784M 0x0084 -
0x
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu :03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu :03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
 failed -12
amdgpu :03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu :03:00.0: amdgpu: Fatal error during GPU init
amdgpu :03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.

-- 
Best Regards,
Mike Gavrilov.


system-log-Fatal-error-during-GPU-init.tar.xz
Description: application/xz


Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2023-02-13 Thread Mikhail Gavrilov
On Thu, Feb 9, 2023 at 10:17 PM Leo Li  wrote:
>
> Hi Mikhail, seems like your report flew past me, thanks for the ping.
>
> This might be a simple issue of not backing off when deadlock was hit.
> drm_atomic_normalize_zpos() can return an error code, and I ignored it
> (oops!)
>
> Can you give this patch a try?
> https://gitlab.freedesktop.org/-/snippets/7414
>
> - Leo
>
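
The failure mode Leo describes, ignoring an error code that asks the caller to back off, can be sketched as follows. This is an illustration in Python, not the kernel code or the actual patch; the function names are hypothetical stand-ins, and lock handling is elided:

```python
import errno


def normalize_zpos(attempt):
    """Hypothetical stand-in for a helper that can hit lock contention.

    Returns 0 on success or -EDEADLK when the caller must back off.
    """
    return -errno.EDEADLK if attempt == 0 else 0


def atomic_check(attempt):
    """Propagate the helper's error instead of ignoring it."""
    ret = normalize_zpos(attempt)
    if ret:
        return ret  # do not carry on after a failed call
    return 0


def commit_with_backoff(max_attempts=3):
    """Retry loop: on -EDEADLK, back off and try again."""
    for attempt in range(max_attempts):
        ret = atomic_check(attempt)
        if ret != -errno.EDEADLK:
            return ret, attempt + 1
        # A real caller would release the locks it holds here before retrying.
    return -errno.EDEADLK, max_attempts


print(commit_with_backoff())  # (0, 2)
```

If `atomic_check` dropped the return value, the retry loop would never learn that it needed to back off, which is the kind of situation the warning in drm_modeset_drop_locks points at.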

Thanks,
I think the testing time was sufficient.
I observed three computers with different GPUs (6800M, 6900XT and
7900XTX) for more than 3 days, and the warning message about
drm_modeset_drop_locks no longer appears.

I hope this patch can be merged into 6.2 before the release.

Tested-by: Mikhail Gavrilov 

-- 
Best Regards,
Mike Gavrilov.


uptime.tar.xz
Description: application/xz


Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2023-02-09 Thread Mikhail Gavrilov
Harry, please don't ignore me.
This issue still happens in 6.1 and 6.2.
Leo, you are the author of the problematic commit, so please don't stand aside.
Is really nobody interested in clean logs without warnings and errors?
I am 100% sure that reverting commit
b261509952bc19d1012cf732f853659be6ebc61e will stop these warnings. I
also attached fresh logs from 6.2.0-0.rc6.
I have started building 6.2-rc7 without commit
b261509952bc19d1012cf732f853659be6ebc61e to avoid these warnings.


On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov wrote:
>
> Hi!
> I bisected an issue of the 6.0 kernel which started happening after
> 6.0-rc7 on all my machines.
>
> The backtrace of this issue looks like this:
>
> [ 2807.339439] [ cut here ]
> [ 2807.339445] WARNING: CPU: 11 PID: 2061 at
> drivers/gpu/drm/drm_modeset_lock.c:276
> drm_modeset_drop_locks+0x63/0x70
> [ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy
> snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
> nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
> qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
> snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof
> snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel
> snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core
> mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd
> mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x
> snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm
> mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm
> snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass
> snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr
> asus_nb_wmi wmi_bmof
> [ 2807.339519]  snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc
> asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
> crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel
> gpu_sched sparse_keymap platform_profile hid_multitouch
> polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic
> drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp
> typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video
> i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse
> [ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1
> acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
> acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
> amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1
> amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1
> amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
> fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1
> amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
> fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1
> amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1
> pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1
> pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
> fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
> acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1
> acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
> acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1
> pcc_cpufreq():1 fjes():1
> [ 2807.339579]  acpi_cpufreq():1 fjes():1 pcc_cpufreq():1
> acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1
> acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1
> acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1
> fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
> fjes():1 fjes():1 fjes():1 fjes():1
> [ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW
>L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16
> [ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
> G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
> [ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70
> [ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48
> 89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc
> cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55
> 41 54
> [ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282
> [ 2807.339606] RAX: 0001 RBX:  RCX: 
> 0002
> [ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: 
> b6ad46e07c00
> [ 2807.339608] RBP: b6ad46e07c00 R08:  R09: 
> 
> [ 2807.339609] R10:  R11: 0001 R12: 
> 
> [ 2807.339610] 

[6.2][regression] looks like commit aab9cf7b6954136f4339136a1a7fc0602a2c4d8b leads to use-after-free and random computer hangs

2022-12-18 Thread Mikhail Gavrilov
Hi,
The 6.2 kernel development cycle has begun.
After the kernel was updated on my Fedora Rawhide system, I started
getting use-after-free errors accompanied by complete computer hangs.
A good reproducer of this behaviour is launching the game
"Marvel's Avengers".

The backtrace of the issue looks like:
[  550.435083] [ cut here ]
[  550.435110] refcount_t: underflow; use-after-free.
[  550.435808] WARNING: CPU: 9 PID: 738 at lib/refcount.c:25
refcount_warn_saturate+0x97/0x110
[  550.435812] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack
[  550.435887] refcount_t: saturated; leaking memory.
[  550.435893]  nf_defrag_ipv6 nf_defrag_ipv4
[  550.435902] WARNING: CPU: 26 PID: 5032 at lib/refcount.c:19
refcount_warn_saturate+0x74/0x110
[  550.435907]  ip_set
[  550.435909] Modules linked in:
[  550.435910]  nf_tables
[  550.435912]  uinput rfcomm
[  550.435918]  nfnetlink
[  550.435919]  snd_seq_dummy snd_hrtimer
[  550.435925]  qrtr
[  550.435926]  netconsole nft_objref
[  550.435931]  bnep
[  550.435933]  nf_conntrack_netbios_ns nf_conntrack_broadcast
[  550.435938]  sunrpc
[  550.435939]  nft_fib_inet
[  550.435941]  binfmt_misc
[  550.435942]  nft_fib_ipv4
[  550.435943]  iwlmvm
[  550.435130] WARNING: CPU: 25 PID: 740 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[  550.435945]  nft_fib_ipv6
[  550.435946]  btusb
[  550.435947]  nft_fib nft_reject_inet
[  550.435954]  btrtl
[  550.435955]  nf_reject_ipv4 nf_reject_ipv6
[  550.435963]  btbcm
[  550.435964]  nft_reject nft_ct
[  550.435969]  btintel
[  550.435971]  nft_chain_nat nf_nat
[  550.435977]  btmtk
[  550.435979]  nf_conntrack nf_defrag_ipv6
[  550.435984]  snd_seq_midi
[  550.435985]  nf_defrag_ipv4 ip_set
[  550.435991]  snd_seq_midi_event
[  550.435992]  nf_tables
[  550.435993]  bluetooth
[  550.435995]  nfnetlink
[  550.435996]  hid_logitech_hidpp
[  550.435142] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc iwlmvm btusb btrtl
btbcm btintel btmtk snd_seq_midi snd_seq_midi_event bluetooth
hid_logitech_hidpp snd_usb_audio iwlwifi xpad ff_memless
snd_usbmidi_lib snd_rawmidi mc ecdh_generic intel_rapl_msr
intel_rapl_common mt76x2u mt76x2_common joydev snd_hda_codec_realtek
mt76x02_usb edac_mce_amd snd_hda_codec_generic mt76_usb
snd_hda_codec_hdmi mt76x02_lib kvm_amd snd_hda_intel snd_intel_dspcfg
snd_intel_sdw_acpi snd_hda_codec mt76 vfat kvm snd_hda_core fat
snd_seq snd_hwdep irqbypass snd_seq_device mac80211 snd_pcm eeepc_wmi
asus_wmi ledtrig_audio sparse_keymap rapl platform_profile wmi_bmof
snd_timer snd pcspkr i2c_piix4
[  550.435997]  qrtr bnep
[  550.436003]  snd_usb_audio
[  550.436004]  sunrpc binfmt_misc
[  550.436010]  iwlwifi
[  550.436012]  iwlmvm btusb
[  550.436018]  xpad
[  550.436019]  btrtl btbcm
[  550.436025]  ff_memless
[  550.436026]  btintel
[  550.436027]  snd_usbmidi_lib
[  550.436029]  btmtk
[  550.436030]  snd_rawmidi
[  550.436031]  snd_seq_midi snd_seq_midi_event
[  550.436037]  mc
[  550.436038]  bluetooth
[  550.436039]  ecdh_generic
[  550.436041]  hid_logitech_hidpp snd_usb_audio
[  550.436046]  intel_rapl_msr
[  550.436048]  iwlwifi xpad
[  550.436054]  intel_rapl_common
[  550.436055]  ff_memless
[  550.436056]  mt76x2u
[  550.436058]  snd_usbmidi_lib snd_rawmidi
[  550.436063]  mt76x2_common
[  550.436064]  mc ecdh_generic
[  550.436070]  joydev
[  550.436071]  intel_rapl_msr intel_rapl_common
[  550.436076]  snd_hda_codec_realtek
[  550.436078]  mt76x2u
[  550.436079]  mt76x02_usb
[  550.436080]  mt76x2_common joydev
[  550.436086]  edac_mce_amd
[  550.436088]  snd_hda_codec_realtek mt76x02_usb
[  550.436094]  snd_hda_codec_generic
[  550.436095]  edac_mce_amd
[  550.436096]  mt76_usb
[  550.436098]  snd_hda_codec_generic mt76_usb
[  550.436104]  snd_hda_codec_hdmi
[  550.436106]  snd_hda_codec_hdmi
[  550.436107]  mt76x02_lib
[  550.435234]  k10temp soundcore libarc4 acpi_cpufreq cfg80211
hid_logitech_dj rfkill zram amdgpu drm_ttm_helper ttm video iommu_v2
gpu_sched drm_buddy crct10dif_pclmul crc32_pclmul crc32c_intel igb
ucsi_ccg drm_display_helper nvme typec_ucsi ghash_clmulni_intel ccp
typec cec sp5100_tco dca sha512_ssse3 nvme_core wmi ip6_tables
ip_tables fuse
[  550.436108]  mt76x02_lib kvm_amd
[  550.436115]  kvm_amd
[  550.436116]  snd_hda_intel snd_intel_dspcfg
[  550.436122]  snd_hda_intel
[  550.436123]  snd_intel_sdw_acpi
[  550.435284] CPU: 25 PID: 740 Comm: sdma2 Tainted: GWL
  6.1.0-rc1-13-aab9cf7b6954136f4339136a1a7fc0602a2c4d

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-11-28 Thread Mikhail Gavrilov
On Tue, Nov 22, 2022 at 12:16 PM Christian König
 wrote:
>
> Ah, thanks a lot for this. I've already pushed the patches into our
> internal branch, but getting this confirmation is still great!
>
> This was quite some fundamental bug in the handling and I hope to get
> this completely reworked at some point since it is currently only mitigated.

It looks like the final version of this patch was successfully merged in 6.1-rc7.
Big thanks, all games work again!

> No idea what that could be. Modesetting is not something I work on.
>
> The best advice I can give you is to maybe ping Harry and our other
> display people, they should know that stuff better than I do.

Unfortunately Harry didn't answer. I hope my email wasn't marked as spam.

-- 
Best Regards,
Mike Gavrilov.


Re: [regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2022-11-22 Thread Mikhail Gavrilov
On Thu, Oct 13, 2022 at 6:36 PM Mikhail Gavrilov
 wrote:
>
> Hi!
> I bisected an issue in the 6.0 kernel that started happening after
> 6.0-rc7 on all my machines.
>
> The backtrace of this issue looks like this:
>
> [ 2807.339439] [ cut here ]
> [ 2807.339445] WARNING: CPU: 11 PID: 2061 at
> drivers/gpu/drm/drm_modeset_lock.c:276
> drm_modeset_drop_locks+0x63/0x70
>
> bisect points to this commit: b261509952bc19d1012cf732f853659be6ebc61e.
>
> After reverting this commit the WARNING messages described here disappeared.
>

Hi Harry, Christian says that you can help with it.

Thanks.

-- 
Best Regards,
Mike Gavrilov.


Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-11-21 Thread Mikhail Gavrilov
On Mon, Nov 14, 2022 at 6:22 PM Christian König
 wrote:
>
> I've found and fixed a few problems around the userptr handling which
> might explain what you see here.
>
> A series of four patches starting with "drm/amdgpu: always register an
> MMU notifier for userptr" is under review now.
>
> Going to give that a bit cleanup later today and will CC you when I send
> that out. Would be nice if you could give that some testing.
>
> Thanks,
> Christian.
>

Christian, I tested all four patches for about a week and can say that
this issue is completely gone.
All the known broken games work now.
Tested-by: Mikhail Gavrilov 

The only thing I don't like is the flood in the kernel logs of the
message "WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276
drm_modeset_drop_locks+0x63/0x70", but this is not related to the
patches being checked.
All kernel logs are uploaded to pastebin [1][2][3][4][5][6][7][8]
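
To see how much of the flood comes from each WARNING site, the saved
logs can be tallied per source location. Below is a minimal Python
sketch, assuming the log was saved as plain dmesg text with each
WARNING header on a single line (the wrapped lines in this mail would
have to be rejoined first). The function name count_warnings is made
up for illustration.

```python
import re
from collections import Counter

# Header line printed for a kernel WARN(), for example:
# "WARNING: CPU: 11 PID: 2061 at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70"
# Group 1 is "file:line", group 2 is "symbol+offset/size".
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+:\d+) (\S+)")


def count_warnings(log_text):
    """Count WARNING headers, keyed by (source location, symbol+offset)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARN_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

For a saved log, count_warnings(open("dmesg.txt").read()).most_common()
lists the noisiest sites first; "dmesg.txt" is only a placeholder file
name.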

I wrote a separate bug report about "drm_modeset_lock" [9]; it's a
pity that no one paid attention to it. I even found the first bad
commit. It is b261509952bc19d1012cf732f853659be6ebc61e.

[1] https://pastebin.com/WZWczupk
[2] https://pastebin.com/f4i9pvjS
[3] https://pastebin.com/rsDWaMR1
[4] https://pastebin.com/tDNEYJq0
[5] https://pastebin.com/xfZVbm1f
[6] https://pastebin.com/Vx9gDyKt
[7] https://pastebin.com/XvRkLckV
[8] https://pastebin.com/pd8WBkgx
[9] https://www.spinics.net/lists/dri-devel/msg367543.html

Thanks.

-- 
Best Regards,
Mike Gavrilov.


Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-11-02 Thread Mikhail Gavrilov
On Tue, Nov 1, 2022 at 10:52 PM Christian König
 wrote:
>
> Let's focus on one problem at a time.
>
> The issue here is that somehow userptr handling became racy after we
> removed the lock, but I don't see why.
>
> We need to fix this ASAP since it is probably a much wider problem and
> the additional lock just hides it somehow.
>
> Going to provide you with an updated patch tomorrow.
>
> Thanks,
> Christian.

Sackboy was recently updated, and now the kernel log contains a
trace very similar to the one in the first post, even with the patch
applied.

[  155.948044] [ cut here ]
[  155.948164] WARNING: CPU: 3 PID: 4850 at
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:678
amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  155.948342] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_hda_codec_realtek
snd_sof_amd_renoir snd_sof_amd_acp snd_hda_codec_generic
snd_hda_codec_hdmi snd_sof_pci sunrpc binfmt_misc snd_sof
snd_hda_intel snd_sof_utils snd_intel_dspcfg mt7921e
snd_intel_sdw_acpi snd_hda_codec mt7921_common snd_soc_core
edac_mce_amd mt76_connac_lib btusb snd_hda_core snd_compress snd_hwdep
mt76 btrtl ac97_bus kvm_amd snd_pcm_dmaengine btbcm snd_rpl_pci_acp6x
snd_pci_acp6x btintel mac80211 btmtk snd_seq snd_seq_device kvm
snd_pcm snd_pci_acp5x libarc4 bluetooth irqbypass vfat snd_timer
snd_rn_pci_acp3x fat rapl snd_acp_config asus_nb_wmi snd cfg80211
snd_soc_acpi wmi_bmof k10temp pcspkr
[  155.948436]  snd_pci_acp3x i2c_piix4 soundcore asus_wireless
amd_pmc joydev zram amdgpu drm_ttm_helper ttm crct10dif_pclmul
hid_asus crc32_pclmul asus_wmi crc32c_intel iommu_v2 ledtrig_audio
polyval_clmulni gpu_sched sparse_keymap polyval_generic
platform_profile drm_buddy drm_display_helper nvme rfkill
ghash_clmulni_intel hid_multitouch ucsi_acpi sha512_ssse3 nvme_core
typec_ucsi serio_raw sp5100_tco r8169 ccp cec nvme_common typec
i2c_hid_acpi i2c_hid video wmi ip6_tables ip_tables fuse
[  155.948540] CPU: 3 PID: 4850 Comm: Sackboy-Win64-T Tainted: G
 WL---  ---
6.1.0-0.rc3.20221101git5aaef24b5c6d.29.fc38.x86_64 #1
[  155.948544] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  155.948547] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  155.948748] Code: 9e f1 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 a8
a3 fd c0 48 c7 c7 88 81 1e c1 e8 af 97 ea f1 eb 8e 66 90 bd f2 ff ff
ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff
ff 48
[  155.948751] RSP: 0018:960b544d3a50 EFLAGS: 00010282
[  155.948756] RAX: 8a4e40d44e00 RBX: 8a4f0e564140 RCX: 0001
[  155.948759] RDX:  RSI: 8a4e40d44e00 RDI: 8a4f4b52b400
[  155.948761] RBP: 8a4e8c979000 R08: 0dc0 R09: 
[  155.948764] R10: 0001 R11:  R12: 8a4e8aaad558
[  155.948767] R13: 3b91 R14: 8a4f0e667180 R15: 8a4f4b52b458
[  155.948770] FS:  7fa13fe006c0() GS:8a5d16e0()
knlGS:36f8
[  155.948772] CS:  0010 DS:  ES:  CR0: 80050033
[  155.948775] CR2: 25c9e1d0 CR3: 00036199 CR4: 00750ee0
[  155.948778] PKRU: 5554
[  155.948780] Call Trace:
[  155.948783]  
[  155.948790]  amdgpu_cs_ioctl+0x9fd/0x2030 [amdgpu]
[  155.948992]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  155.949155]  drm_ioctl_kernel+0xac/0x160
[  155.949165]  drm_ioctl+0x1e7/0x450
[  155.949172]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  155.949344]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[  155.949528]  __x64_sys_ioctl+0x90/0xd0
[  155.949537]  do_syscall_64+0x5b/0x80
[  155.949547]  ? lock_is_held_type+0xe8/0x140
[  155.949559]  ? do_syscall_64+0x67/0x80
[  155.949565]  ? lockdep_hardirqs_on+0x7d/0x100
[  155.949573]  ? do_syscall_64+0x67/0x80
[  155.949579]  ? do_syscall_64+0x67/0x80
[  155.949586]  ? do_syscall_64+0x67/0x80
[  155.949592]  ? lockdep_hardirqs_on+0x7d/0x100
[  155.949597]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  155.949603] RIP: 0033:0x7fa1b7fd912f
[  155.949610] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[  155.949615] RSP: 002b:7fa13fdfe920 EFLAGS: 0246 ORIG_RAX:
0010
[  155.949621] RAX: ffda RBX: 7fa13fdfebe8 RCX: 7fa1b7fd912f
[  155.949625] RDX: 7fa13fdfea10 RSI: c0186444 RDI: 0165
[  155.949629] RBP: 7fa13fdfea10 R08: 7f9ff80018e0 R09: 7fa13fdfe9c0
[  155.949633] R10: 7eb11590 R11: 0246 R12: c0186444
[  15

Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-30 Thread Mikhail Gavrilov
On Wed, Oct 26, 2022 at 12:29 PM Christian König
 wrote:
>
> Attached is the original test patch rebased on current amd-staging-drm-next.
>
> Can you test if this is enough to make sure that the games start without
> crashing by fetching the userptrs?

1. Over the past week the list of games affected by this issue has
grown: The Outlast Trials, Gotham Knights, Sackboy: A Big
Adventure.

2. I tested the patch and it really solves the problem with the launch
of all the listed games and does not create new problems.

3. The only thing I noticed is that in the game Sackboy: A Big
Adventure, when using the kernel built from the commit
b229b6ca5abbd63ff40c1396095b1b36b18139c3 + the attached patch, I can't
connect to a friend's co-op session because the Steam client hangs. The
kernel built from commit 736ec9fadd7a1fde8480df7e5cfac465c07ff6f3
(this is the commit prior to dd80d9c8eecac8c516da5b240d01a35660ba6cb6)
is free of this problem.

I need to spend some more time finding the commit that causes the
Steam client hang [3].

Thanks.

-- 
Best Regards,
Mike Gavrilov.


Re: [6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Mikhail Gavrilov
On Fri, Oct 21, 2022 at 1:33 PM Christian König
 wrote:
>
> Hi,
>
> yes Bas already reported this issue, but I couldn't reproduce it. Need
> to come up with a patch to narrow this down further.
>
> Can I send you something to test?

I would be glad to test any patches and ideas.

-- 
Best Regards,
Mike Gavrilov.


[6.1][regression] after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6 some games (Cyberpunk 2077, Forza Horizon 4/5) hang at start

2022-10-21 Thread Mikhail Gavrilov
Hi!
I found that some games (Cyberpunk 2077, Forza Horizon 4/5) hang at
start after commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6.

dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König 
Date:   Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 -
 3 files changed, 6 insertions(+), 18 deletions(-)


When it happens, the following backtrace appears in the kernel log:
[  231.331210] [ cut here ]
[  231.331262] WARNING: CPU: 11 PID: 6555 at
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:675
amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331424] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek snd_sof
snd_hda_codec_generic snd_hda_codec_hdmi snd_sof_utils mt7921e
snd_hda_intel sunrpc snd_intel_dspcfg mt7921_common binfmt_misc
snd_intel_sdw_acpi snd_hda_codec mt76_connac_lib edac_mce_amd btusb
snd_soc_core mt76 snd_hda_core btrtl snd_hwdep snd_compress kvm_amd
ac97_bus snd_seq btbcm snd_pcm_dmaengine btintel snd_rpl_pci_acp6x
mac80211 btmtk snd_pci_acp6x kvm snd_seq_device snd_pcm snd_pci_acp5x
libarc4 irqbypass bluetooth snd_rn_pci_acp3x snd_timer pcspkr
asus_nb_wmi rapl joydev wmi_bmof snd_acp_config cfg80211 snd_soc_acpi
vfat snd
[  231.331490]  snd_pci_acp3x i2c_piix4 soundcore fat k10temp amd_pmc
asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
iommu_v2 crct10dif_pclmul crc32_pclmul gpu_sched crc32c_intel
ledtrig_audio sparse_keymap polyval_clmulni platform_profile drm_buddy
polyval_generic hid_multitouch drm_display_helper rfkill nvme
ucsi_acpi ghash_clmulni_intel nvme_core video typec_ucsi serio_raw ccp
sha512_ssse3 sp5100_tco r8169 cec nvme_common typec wmi i2c_hid_acpi
i2c_hid ip6_tables ip_tables fuse
[  231.331532] CPU: 11 PID: 6555 Comm: GameThread Tainted: GW
  L---  ---
6.1.0-0.rc1.20221019gitaae703b02f92.17.fc38.x86_64 #1
[  231.331534] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  231.331537] RIP: 0010:amdgpu_ttm_tt_get_user_pages+0x14c/0x190 [amdgpu]
[  231.331654] Code: a8 d0 e9 32 ff ff ff 4c 89 e9 89 ea 48 c7 c6 40
82 f3 c0 48 c7 c7 10 60 14 c1 e8 2f a0 f4 d0 eb 8e 66 90 bd f2 ff ff
ff eb 8d <0f> 0b eb f5 bd fd ff ff ff eb 82 bd f2 ff ff ff e9 62 ff ff
ff 48
[  231.331656] RSP: 0018:aad4c705bae8 EFLAGS: 00010286
[  231.331659] RAX: 8e9cbdbe3200 RBX: 8e997e3f2440 RCX: 
[  231.331661] RDX:  RSI: 8e9cbdbe3200 RDI: 8e9c31208000
[  231.331663] RBP: 0001 R08: 0dc0 R09: 
[  231.331665] R10: 0001 R11:  R12: aad4c705bb90
[  231.331666] R13: 7651 R14: 8e9c89f334e0 R15: 8e991fda8000
[  231.331668] FS:  7c2af6c0() GS:8ea7d8e0()
knlGS:7b2c
[  231.331671] CS:  0010 DS:  ES:  CR0: 80050033
[  231.331673] CR2: 7ff65ffd8000 CR3: 0004f90f CR4: 00750ee0
[  231.331674] PKRU: 5554
[  231.331676] Call Trace:
[  231.331678]  
[  231.331682]  amdgpu_cs_ioctl+0x87e/0x1fc0 [amdgpu]
[  231.331824]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.331981]  drm_ioctl_kernel+0xac/0x160
[  231.331990]  drm_ioctl+0x1e7/0x450
[  231.331994]  ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[  231.332118]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[  231.332233]  __x64_sys_ioctl+0x90/0xd0
[  231.332238]  do_syscall_64+0x5b/0x80
[  231.332243]  ? asm_exc_page_fault+0x22/0x30
[  231.332247]  ? lockdep_hardirqs_on+0x7d/0x100
[  231.332250]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  231.332253] RIP: 0033:0x7ff677c5704f
[  231.332256] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24
10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00
00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28
00 00
[  231.332258] RSP: 002b:7c2ad470 EFLAGS: 0246 ORIG_RAX:
0010
[  231.332261] RAX: ffda RBX: 7c2ad718 RCX: 7ff677c5704f
[  231.332263] RDX: 7c2ad540 RSI: c0186444 

Re: [Bug][5.18-rc0] Between commits ed4643521e6a and 34af78c4e616, appears warning "WARNING: CPU: 31 PID: 51848 at drivers/dma-buf/dma-fence-array.c:191 dma_fence_array_create+0x101/0x120" and some ga

2022-10-17 Thread Mikhail Gavrilov
On Wed, May 11, 2022 at 5:01 PM Christian König
 wrote:
>
>
> We have implemented a workaround, but still don't know the exact root cause.
>
> If anybody wants to look into this it would be rather helpful to be able
> to reproduce the issue.
>
> Regards,
> Christian.

I see that the issue returned after this commit:
dd80d9c8eecac8c516da5b240d01a35660ba6cb6 is the first bad commit
commit dd80d9c8eecac8c516da5b240d01a35660ba6cb6
Author: Christian König 
Date:   Thu Jul 14 10:23:38 2022 +0200

drm/amdgpu: revert "partial revert "remove ctx->lock" v2"

This reverts commit 94f4c4965e5513ba624488f4b601d6b385635aec.

We found that the bo_list is missing a protection for its list entries.
Since that is fixed now this workaround can be removed again.

Signed-off-by: Christian König 
Reviewed-by: Alex Deucher 
Signed-off-by: Alex Deucher 

 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c  | 21 ++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c |  2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.h |  1 -
 3 files changed, 6 insertions(+), 18 deletions(-)

The games Forza Horizon 4 and Cyberpunk 2077 again hang at start.


-- 
Best Regards,
Mike Gavrilov.


[regression][6.0] After commit b261509952bc19d1012cf732f853659be6ebc61e I see WARNING message at drivers/gpu/drm/drm_modeset_lock.c:276 drm_modeset_drop_locks+0x63/0x70

2022-10-13 Thread Mikhail Gavrilov
Hi!
I bisected an issue in the 6.0 kernel that started happening after
6.0-rc7 on all my machines.

The backtrace of this issue looks like this:

[ 2807.339439] [ cut here ]
[ 2807.339445] WARNING: CPU: 11 PID: 2061 at
drivers/gpu/drm/drm_modeset_lock.c:276
drm_modeset_drop_locks+0x63/0x70
[ 2807.339453] Modules linked in: tls uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep intel_rapl_msr intel_rapl_common snd_sof_amd_renoir
snd_sof_amd_acp snd_sof_pci snd_hda_codec_realtek sunrpc snd_sof
snd_hda_codec_hdmi snd_hda_codec_generic snd_sof_utils snd_hda_intel
snd_intel_dspcfg mt7921e snd_intel_sdw_acpi binfmt_misc snd_soc_core
mt7921_common snd_hda_codec snd_compress vfat ac97_bus edac_mce_amd
mt76_connac_lib snd_pcm_dmaengine fat snd_hda_core snd_rpl_pci_acp6x
snd_pci_acp6x mt76 btusb snd_hwdep kvm_amd btrtl snd_seq btbcm
mac80211 snd_seq_device kvm btintel btmtk libarc4 snd_pcm
snd_pci_acp5x bluetooth snd_timer snd_rn_pci_acp3x irqbypass
snd_acp_config snd_soc_acpi cfg80211 rapl snd joydev pcspkr
asus_nb_wmi wmi_bmof
[ 2807.339519]  snd_pci_acp3x soundcore i2c_piix4 k10temp amd_pmc
asus_wireless zram amdgpu drm_ttm_helper ttm hid_asus asus_wmi
crct10dif_pclmul iommu_v2 crc32_pclmul ledtrig_audio crc32c_intel
gpu_sched sparse_keymap platform_profile hid_multitouch
polyval_clmulni nvme ucsi_acpi drm_buddy polyval_generic
drm_display_helper ghash_clmulni_intel serio_raw nvme_core ccp
typec_ucsi rfkill sp5100_tco r8169 cec nvme_common typec wmi video
i2c_hid_acpi i2c_hid ip6_tables ip_tables fuse
[ 2807.339540] Unloaded tainted modules: acpi_cpufreq():1
acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1 acpi_cpufreq():1
amd64_edac():1 acpi_cpufreq():1 acpi_cpufreq():1 amd64_edac():1
amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 fjes():1
amd64_edac():1 acpi_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1 fjes():1
amd64_edac():1 acpi_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
fjes():1 acpi_cpufreq():1 acpi_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 fjes():1 acpi_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 acpi_cpufreq():1 fjes():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
fjes():1 pcc_cpufreq():1 amd64_edac():1 acpi_cpufreq():1
acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 fjes():1
acpi_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
acpi_cpufreq():1 fjes():1 pcc_cpufreq():1 acpi_cpufreq():1
pcc_cpufreq():1 fjes():1
[ 2807.339579]  acpi_cpufreq():1 fjes():1 pcc_cpufreq():1
acpi_cpufreq():1 pcc_cpufreq():1 acpi_cpufreq():1 fjes():1
acpi_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1
acpi_cpufreq():1 fjes():1 acpi_cpufreq():1 fjes():1 fjes():1 fjes():1
fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
fjes():1 fjes():1 fjes():1 fjes():1
[ 2807.339596] CPU: 11 PID: 2061 Comm: gnome-shell Tainted: GW
   L 6.0.0-rc4-07-cb0eca01ad9756e853efec3301203c2b5b45aa9f+ #16
[ 2807.339598] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[ 2807.339600] RIP: 0010:drm_modeset_drop_locks+0x63/0x70
[ 2807.339602] Code: 42 08 48 89 10 48 89 1b 48 8d bb 50 ff ff ff 48
89 5b 08 e8 3f 41 55 00 48 8b 45 78 49 39 c4 75 c6 5b 5d 41 5c c3 cc
cc cc cc <0f> 0b eb ac 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 55
41 54
[ 2807.339604] RSP: 0018:b6ad46e07b80 EFLAGS: 00010282
[ 2807.339606] RAX: 0001 RBX:  RCX: 0002
[ 2807.339607] RDX: 0001 RSI: a6a118b1 RDI: b6ad46e07c00
[ 2807.339608] RBP: b6ad46e07c00 R08:  R09: 
[ 2807.339609] R10:  R11: 0001 R12: 
[ 2807.339610] R13: 9801ca24bb00 R14: 9801ca24bb00 R15: 
[ 2807.339611] FS:  7f57445b0600() GS:981198e0()
knlGS:
[ 2807.339613] CS:  0010 DS:  ES:  CR0: 80050033
[ 2807.339614] CR2: 7f574367f000 CR3: 0001236ae000 CR4: 00750ee0
[ 2807.339615] PKRU: 5554
[ 2807.339616] Call Trace:
[ 2807.339618]  
[ 2807.339621]  drm_mode_atomic_ioctl+0x3b9/0xac0
[ 2807.339627]  ? drm_atomic_set_property+0xb60/0xb60
[ 2807.339629]  drm_ioctl_kernel+0xac/0x160
[ 2807.339633]  drm_ioctl+0x22d/0x410
[ 2807.339635]  ? drm_atomic_set_property+0xb60/0xb60
[ 2807.339639]  amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[ 2807.339834]  __x64_sys_ioctl+0x90/0xd0
[ 2807.339838]  do_syscall_64+0x5b/0x80
[ 2807.339843]  ? rcu_read_lock_sched_held+0x10/0x80
[ 2807.339846]  ? trace_hardirqs_on_prepare+0x55/0xe0
[ 2807.339849]  ? do_syscall_64+0x67/0x80
[ 2807.339851]  ? rcu_read_loc

[regression][6.1] After commit e4dc45b1848bc6bcac31eb1b4ccdd7f6718b3c86 system randomly hangs

2022-10-11 Thread Mikhail Gavrilov
Hi!
The hangs occur randomly, but I found a good reproduction scenario:
running the campaign in the game Halo Infinite.
The backtrace looks like this:

[  147.260971] BUG: kernel NULL pointer dereference, address: 0088
[  147.260987] [ cut here ]
[  147.260988] WARNING: CPU: 3 PID: 0 at kernel/softirq.c:321
__local_bh_disable_ip+0x9e/0xb0
[  147.260993] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer netconsole nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nf_tables nfnetlink qrtr bnep sunrpc snd_sof_amd_renoir intel_rapl_msr
snd_sof_amd_acp intel_rapl_common mt7921e snd_sof_pci mt7921_common
binfmt_misc snd_sof mt76_connac_lib snd_sof_utils vfat
snd_hda_codec_realtek snd_soc_core snd_hda_codec_generic mt76 fat
snd_hda_codec_hdmi snd_hda_intel edac_mce_amd snd_compress ac97_bus
btusb kvm_amd snd_intel_dspcfg snd_pcm_dmaengine btrtl
snd_intel_sdw_acpi btbcm snd_hda_codec snd_pci_acp6x mac80211 kvm
snd_hda_core btintel btmtk irqbypass snd_hwdep snd_seq libarc4
snd_seq_device bluetooth snd_pcm snd_pci_acp5x snd_timer
snd_rn_pci_acp3x cfg80211 rapl pcspkr joydev asus_nb_wmi wmi_bmof
snd_acp_config snd snd_soc_acpi k10temp
[  147.261033]  soundcore i2c_piix4 snd_pci_acp3x asus_wireless
amd_pmc zram amdgpu drm_ttm_helper ttm hid_asus iommu_v2 asus_wmi
gpu_sched ledtrig_audio sparse_keymap drm_buddy platform_profile
drm_display_helper crct10dif_pclmul crc32_pclmul nvme rfkill
crc32c_intel ucsi_acpi hid_multitouch video ghash_clmulni_intel
nvme_core ccp typec_ucsi serio_raw r8169 cec sp5100_tco typec
i2c_hid_acpi wmi i2c_hid ip6_tables ip_tables fuse
[  147.261045] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GWL
   6.0.0-rc2-02-907cc346ff6a69a08b4786c4ed2a78ac0120b9da+ #124
[  147.261046] Hardware name: ASUSTeK COMPUTER INC. ROG Strix
G513QY_G513QY/G513QY, BIOS G513QY.318 03/29/2022
[  147.261047] RIP: 0010:__local_bh_disable_ip+0x9e/0xb0
[  147.261048] Code: 25 00 1e 02 00 48 89 df e8 6f 23 08 00 85 c0 75
0e 48 89 9d 30 1c 00 00 5b 5d c3 cc cc cc cc 31 ff 31 db e8 54 23 08
00 eb e7 <0f> 0b e9 76 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00
[  147.261049] RSP: 0018:a4e1c028c8d8 EFLAGS: 00010006
[  147.261050] RAX: 80010005 RBX: 0201 RCX: 0018
[  147.261051] RDX: 0f440b255950 RSI: 0201 RDI: c1b652e5
[  147.261051] RBP: 93a4eaf00fd8 R08: 0001 R09: 
[  147.261052] R10: 7635d840c31a8942 R11: fcca632b3d1b0d46 R12: 93a4f7831000
[  147.261052] R13: 93a4eaf00ee0 R14: 93a4efd84178 R15: 93a4efd84000
[  147.261053] FS:  () GS:93b396e0()
knlGS:
[  147.261054] CS:  0010 DS:  ES:  CR0: 80050033
[  147.261055] CR2: 0088 CR3: 00012a61 CR4: 00750ee0
[  147.261056] PKRU: 5554
[  147.261056] Call Trace:
[  147.261060]  
[  147.261068]  _raw_spin_lock_bh+0x1d/0x80
[  147.261074]  ieee80211_queue_skb+0x125/0x7a0 [mac80211]
[  147.261113]  ? __skb_get_hash+0x55/0x200
[  147.261117]  ieee80211_tx_8023+0x9c/0x1c0 [mac80211]
[  147.261155]  ieee80211_subif_start_xmit_8023+0x2b5/0x510 [mac80211]
[  147.261191]  netpoll_start_xmit+0x121/0x190
[  147.261199]  netpoll_send_skb+0x1fc/0x300
[  147.261202]  write_msg+0xdc/0xf0 [netconsole]
[  147.261207]  console_emit_next_record.constprop.0+0x17d/0x300
[  147.261214]  console_unlock+0xf3/0x1f0
[  147.261215]  vprintk_emit+0x152/0x350
[  147.261217]  ? plist_add+0xba/0xf0
[  147.261223]  _printk+0x48/0x4e
[  147.261231]  ? rcu_read_lock_sched_held+0x10/0x80
[  147.261235]  page_fault_oops.cold+0xcf/0x1f9
[  147.261240]  ? do_user_addr_fault+0x65/0x6b0
[  147.261243]  ? _raw_spin_unlock_irqrestore+0x40/0x60
[  147.261247]  exc_page_fault+0x7e/0x300
[  147.261249]  asm_exc_page_fault+0x22/0x30
[  147.261252] RIP: 0010:drm_sched_job_done.isra.0+0xc/0x1e0 [gpu_sched]
[  147.261255] Code: 89 d7 e8 87 02 0d f0 e9 54 ff ff ff 48 89 d7 e8
ea 66 37 f0 e9 47 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 53
48 89 fb <48> 8b af 88 00 00 00 f0 ff 8d 70 02 00 00 48 8b 85 a8 03 00
00 f0
[  147.261256] RSP: 0018:a4e1c028cdc8 EFLAGS: 00010093
[  147.261257] RAX: c06dc380 RBX:  RCX: 0018
[  147.261257] RDX: 0efa9afe3594 RSI: 93a7a4c1ec90 RDI: 
[  147.261258] RBP: 93a7a4c1ee10 R08: 0001 R09: 
[  147.261259] R10:  R11: 0001 R12: a4e1c028cde8
[  147.261259] R13: 0086 R14:  R15: 93a4fbed0198
[  147.261261]  ? drm_sched_job_done.isra.0+0x1e0/0x1e0 [gpu_sched]
[  147.261266]  dma_fence_signal_timestamp_locked+0x9e/0x1c0
[  147.261274]  dma_fence_signal+0x36/0x70
[  147.261276]  amdgpu_fence_process+

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-09-19 Thread Mikhail Gavrilov
Hi!
Unfortunately, the use-after-free issue still happens on the 6.0-rc5 kernel.
The issue has become hard to reproduce. I had spent the whole day at the
computer before the use-after-free happened again, while I was playing
the game Tiny Tina's Wonderlands.
So forget about reproducibility; all that remains is to hope for
logs and tracing.
I didn't see anything new in the logs. It seems that we need to
somehow expand the logging so that the next time this happens we have
more information.

Sep 18 20:52:16 primary-ws gnome-shell[2388]:
meta_window_set_stack_position_no_sync: assertion
'window->stack_position >= 0' failed
Sep 18 20:52:27 primary-ws gnome-shell[2388]:
meta_window_set_stack_position_no_sync: assertion
'window->stack_position >= 0' failed
Sep 18 20:53:44 primary-ws gnome-shell[2388]: Window manager warning:
Window 0x4e3 sets an MWM hint indicating it isn't resizable, but
sets min size 1 x 1 and max size 2147483647 x 2147483647; this doesn't
make much sense.
Sep 18 20:53:45 primary-ws kernel: umip_printk: 11 callbacks suppressed
Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by
applications.
Sep 18 20:53:45 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns
the result.
Sep 18 20:53:53 primary-ws gnome-shell[2388]:
meta_window_set_stack_position_no_sync: assertion
'window->stack_position >= 0' failed
Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: SGDT instruction cannot be used by
applications.
Sep 18 20:53:53 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:14ebb0d03 sp:4ee528: For now, expensive software emulation returns
the result.
Sep 18 20:54:15 primary-ws kernel: umip: Wonderlands.exe[214194]
ip:15a270815 sp:6eaef490: SGDT instruction cannot be used by
applications.
Sep 18 20:56:01 primary-ws kernel: umip_printk: 15 callbacks suppressed
Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4ed178: SGDT instruction cannot be used by
applications.
Sep 18 20:56:01 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4ed178: For now, expensive software emulation returns
the result.
Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4edbe8: SGDT instruction cannot be used by
applications.
Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4edbe8: For now, expensive software emulation returns
the result.
Sep 18 20:56:03 primary-ws kernel: umip: Wonderlands.exe[213853]
ip:15e3a82b0 sp:4ebf18: SGDT instruction cannot be used by
applications.
Sep 18 20:57:55 primary-ws kernel: [ cut here ]
Sep 18 20:57:55 primary-ws kernel: refcount_t: underflow; use-after-free.
Sep 18 20:57:55 primary-ws kernel: WARNING: CPU: 22 PID: 235114 at
lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
Sep 18 20:57:55 primary-ws kernel: Modules linked in: tls uinput
rfcomm snd_seq_dummy snd_hrtimer nft_objref nf_conntrack_netbios_ns
nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
nft_reject_inet nf_reject_ipv4 nf_>
Sep 18 20:57:55 primary-ws kernel:  asus_wmi ledtrig_audio
sparse_keymap platform_profile irqbypass rfkill mc rapl snd_timer
video wmi_bmof pcspkr snd k10temp i2c_piix4 soundcore acpi_cpufreq
zram amdgpu drm_ttm_helper ttm iommu_v2 crct1>
Sep 18 20:57:55 primary-ws kernel: Unloaded tainted modules:
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_eda>
Sep 18 20:57:55 primary-ws kernel:  pcc_cpufreq():1 pcc_cpufreq():1
fjes():1 fjes():1 pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1
fjes():1
Sep 18 20:57:55 primary-ws kernel: CPU: 22 PID: 235114 Comm:
kworker/22:0 Tainted: GWL---  ---
6.0.0-0.rc5.20220914git3245cb65fd91.39.fc38.x86_64 #1
Sep 18 20:57:55 primary-ws kernel: Hardware name: System manufacturer
System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
Sep 18 20:57:55 primary-ws kernel: Workqueue: events
drm_sched_entity_kill_jobs_work [gpu_sched]
Sep 18 20:57:55 primary-ws kernel: RIP: 0010:refcount_warn_saturate+0xba/0x110
Sep 18 20:57:55 primary-ws kernel: Code: 01 01 e8 69 6b 6f 00 0f 0b e9
32 38 a5 00 80 3d 4d 7d be 01 00 75 85 48 c7 c7 80 b7 8e 95 c6 05 3d
7d be 01 01 e8 46 6b 6f 00 <0f> 0b e9 0f 38 a5 00 80 3d 28 7d be 01 00
0f 85 5e ff ff ff 48 c7
Sep 18 20:57:55 primary-ws kernel: RSP: 0018:a1a853ccbe60 EFLAGS: 00010286
Sep 18 20:57:55 primary-ws kernel: RAX: 0026 RBX:
8e0e60a96c28 RCX: 
Sep 18 20:57:55 primary-ws kernel: RDX: 0001 RSI:
958d255c RDI: 
Sep 18 20:57:55 primary-ws kernel: RBP: 8e19a83f5600 R08:
 R09: a1a853ccbd10
Sep 18 20:57:55 primary-ws kernel: R10: 0003 R11:
8e19ee2fffe8 R12: 8e19a83fc800
Sep 18 20:

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-24 Thread Mikhail Gavrilov
On Fri, Aug 19, 2022 at 5:13 PM Maíra Canal  wrote:
>
> Hi Mikhail,
>
> Could you please specify the steps to reproduce this use-after-free? I
> will try to reproduce it on the RX5700 XT and bisect the issue.
>

Hi Maíra, thanks for the help.

I'm afraid it will be unrealistic to reproduce, because on a laptop
with a 6800M (also RDNA 2 graphics) the problem does not occur.

Sorry for the long silence, but I was trying to bisect the problem myself.

git bisect start
# status: waiting for both good and bad commits
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
# status: waiting for bad commit, 1 good commit known
# bad: [7ebfc85e2cd7b08f518b526173e9a33b56b3913b] Merge tag
'net-6.0-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect bad 7ebfc85e2cd7b08f518b526173e9a33b56b3913b

# bad: [b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1] Merge tag
'drm-next-2022-08-03' of git://anongit.freedesktop.org/drm/drm
# 001: GPU hangs + use-after-free issue - https://pastebin.com/z86E9ydx
git bisect bad b44f2fd87919b5ae6e1756d4c7ba2cbba22238e1

# good: [526942b8134cc34d25d27f95dfff98b8ce2f6fcd] Merge tag
'ata-5.20-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
# 002: good - https://pastebin.com/9qki65Sj
git bisect good 526942b8134cc34d25d27f95dfff98b8ce2f6fcd

# good: [45490ce2ff833c4ec0de66705e46ba41320860cb] nfp: flower: add
support for tunnel offload without key ID
# 003: good - https://pastebin.com/vHk5eRkw
git bisect good 45490ce2ff833c4ec0de66705e46ba41320860cb

# skip: [e23a5e14aa278858c2e3d81ec34e83aa9a4177c5] Backmerge tag
'v5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux
into drm-next
# 004: GPU not switched in graphic mode - https://pastebin.com/RmqCTMLD
git bisect skip e23a5e14aa278858c2e3d81ec34e83aa9a4177c5

# bad: [b2065fb21d9a789b14f737ea90facedabadeb8a4] drm/amdgpu: fix
i2s_pdata out of bound array access
# 005: GPU hangs + use-after-free issue - https://pastebin.com/Zgw5Hc48
git bisect bad b2065fb21d9a789b14f737ea90facedabadeb8a4

# skip: [344feb7ccf764756937cfd74fa4ac5caba069c99] Merge tag
'amd-drm-next-5.20-2022-07-05' of
https://gitlab.freedesktop.org/agd5f/linux into drm-next
# 006: GPU not switched in graphic mode - https://pastebin.com/b8BUBE7Q
git bisect skip 344feb7ccf764756937cfd74fa4ac5caba069c99

# skip: [869b10ac8d2300327f554d83f4dbab041bf27d49] drm/amdgpu: add dm
ip block for dcn 3.1.4
# 007: GPU not switched in graphic mode - https://pastebin.com/byd7HECH
git bisect skip 869b10ac8d2300327f554d83f4dbab041bf27d49

# skip: [676ad8e997036e2f815c293b76c356fb7cc97a08] drm: rcar-du: Lift
z-pos restriction on primary plane for Gen3
# 008: GPU not switched in graphic mode - https://pastebin.com/3fXCTinb
git bisect skip 676ad8e997036e2f815c293b76c356fb7cc97a08

# skip: [5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d] drm/bridge: lt9211:
Convert to drm_of_get_data_lanes_count
# 009: Build error - https://pastebin.com/rxHe9QRB
git bisect skip 5c57cbc390b166950c2e6c2f0c4edaeb0f47e97d

# skip: [6db5e0c8692e590734a7ec7455365d9cbaa15ef1] Merge tag
'drm-intel-next-2022-07-06' of
git://anongit.freedesktop.org/drm/drm-intel into drm-next
# 010: GPU not switched in graphic mode - https://pastebin.com/rqubSuc8
git bisect skip 6db5e0c8692e590734a7ec7455365d9cbaa15ef1

# skip: [5d763a9955f0fbf2681a2f1fa87c416056bd0c89] drm/amd/display:
Remove compiler warning
# 011: GPU not switched in graphic mode - https://pastebin.com/BrJs6ybP
git bisect skip 5d763a9955f0fbf2681a2f1fa87c416056bd0c89

# skip: [e6c2db2be986158afb9991d9fa8a38fe65a88516] drm/i915: Don't use
DRM_DEBUG_WARN_ON for unexpected l3bank/mslice config
# 012: GPU not switched in graphic mode - https://pastebin.com/yxppyqbD
git bisect skip e6c2db2be986158afb9991d9fa8a38fe65a88516

# bad: [cb6b81b21bd9cf09d72b7fe711be1b55001eb166] Merge tag
'drm-misc-next-fixes-2022-07-21' of
git://anongit.freedesktop.org/drm/drm-misc into drm-next
# 013: GPU hangs without use-after-free issue - https://pastebin.com/iRek4bBy
git bisect bad cb6b81b21bd9cf09d72b7fe711be1b55001eb166

# skip: [48b927770f8ad3f8cf4a024a552abf272af9f592]
drm/exynos/exynos7_drm_decon: free resources when clk_set_parent()
failed.
# 014: GPU not switched in graphic mode - https://pastebin.com/ekp10xhP
git bisect skip 48b927770f8ad3f8cf4a024a552abf272af9f592

# skip: [c5da61cf5bab30059f22ea368702c445ee87171a] drm/amdgpu/display:
add missing FP_START/END checks dcn32_clk_mgr.c
# 015: GPU not switched in graphic mode - https://pastebin.com/YbskKWmA
git bisect skip c5da61cf5bab30059f22ea368702c445ee87171a

# skip: [a77f7c89e62c6dfe405a64995812746f27adc510] drm/edid: convert
drm_gtf_modes_for_range() to drm_edid
# 016: GPU not switched in graphic mode - https://pastebin.com/bA2AwkJ7
git bisect skip a77f7c89e62c6dfe405a64995812746f27adc510

# skip: [6fde8eec71796f3534f0c274066862829813b21f] drm/doc: Add KUnit
documentation
# 017: GPU not switched in graphic mode - https://pasteb
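A long bisect log like the one above is easier to judge once the verdicts are tallied, in particular how many steps had to be skipped. The following is a minimal sketch, assuming the log is available as plain text; the function name `summarize_bisect_log` and the shortened sample are illustrative only and are not part of the original report.

```python
import re
from collections import Counter

# Matches the "# good: [<sha>] <subject>" comment lines that git writes
# into a bisect log; "# status:" lines and wrapped subjects do not match.
VERDICT_RE = re.compile(r"^# (good|bad|skip): \[([0-9a-f]+)\]")

def summarize_bisect_log(text):
    """Return (count per verdict, list of (verdict, short sha)) for a bisect log."""
    steps = []
    for line in text.splitlines():
        m = VERDICT_RE.match(line)
        if m:
            steps.append((m.group(1), m.group(2)[:12]))
    return Counter(verdict for verdict, _ in steps), steps

# Shortened sample built from three entries of the log above.
sample = """\
# good: [3d7cb6b04c3f3115719235cc6866b10326de34cd] Linux 5.19
git bisect good 3d7cb6b04c3f3115719235cc6866b10326de34cd
# bad: [b2065fb21d9a789b14f737ea90facedabadeb8a4] drm/amdgpu: fix
i2s_pdata out of bound array access
git bisect bad b2065fb21d9a789b14f737ea90facedabadeb8a4
# skip: [e23a5e14aa278858c2e3d81ec34e83aa9a4177c5] Backmerge tag
git bisect skip e23a5e14aa278858c2e3d81ec34e83aa9a4177c5
"""

counts, steps = summarize_bisect_log(sample)
print(dict(counts))
for verdict, sha in steps:
    print(f"{verdict:5s} {sha}")
```

A high share of `skip` verdicts, as in this bisect, means git can only narrow the result down to a range of candidate commits rather than a single one.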

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-17 Thread Mikhail Gavrilov
On Wed, Aug 17, 2022 at 11:43 PM Maíra Canal  wrote:
>
> Hi Mikhail,
>
> Looks like 45ecaea738830b9d521c93520c8f201359dcbd95 ("drm/sched: Partial
> revert of 'drm/sched: Keep s_fence->parent pointer'") introduced the
> error. Try reverting it and check if the use-after-free still happens.

Thanks, but unfortunately this did not lead to the expected result.
The use-after-free happens again, in an incomprehensible context.
What is new: a "suspicious RCU usage" warning was added, but it looks
completely unrelated to the use-after-free issue.

[ 215.434115] [ cut here ]
[ 215.434184] refcount_t: underflow; use-after-free.
[ 215.434204] WARNING: CPU: 7 PID: 1258 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[ 215.434214] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event
intel_rapl_msr intel_rapl_common snd_hda_codec_realtek vfat
snd_hda_codec_generic snd_hda_codec_hdmi mt76x2u fat mt76x2_common
snd_hda_intel mt76x02_usb snd_intel_dspcfg snd_intel_sdw_acpi mt76_usb
iwlmvm edac_mce_amd snd_usb_audio snd_hda_codec mt76x02_lib
snd_hda_core snd_usbmidi_lib snd_hwdep snd_rawmidi uvcvideo mt76
kvm_amd snd_seq videobuf2_vmalloc videobuf2_memops snd_seq_device
mac80211 videobuf2_v4l2 videobuf2_common kvm btusb iwlwifi snd_pcm
btrtl videodev libarc4 eeepc_wmi btbcm asus_wmi iwlmei btintel
ledtrig_audio xpad irqbypass sparse_keymap btmtk platform_profile
joydev
[ 215.434436] hid_logitech_hidpp rapl ff_memless mc snd_timer
bluetooth cfg80211 video pcspkr wmi_bmof snd soundcore k10temp
i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram amdgpu
drm_ttm_helper ttm iommu_v2 ucsi_ccg gpu_sched crct10dif_pclmul
crc32_pclmul typec_ucsi drm_buddy crc32c_intel ghash_clmulni_intel ccp
igb sp5100_tco typec drm_display_helper nvme dca nvme_core cec wmi
ip6_tables ip_tables fuse
[ 215.434528] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1
[ 215.434672] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[ 215.434702] CPU: 7 PID: 1258 Comm: kworker/7:3 Tainted: G W L
--- --- 6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
[ 215.434709] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 215.434715] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[ 215.434728] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 215.434734] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d be
7d be 01 00 75 85 48 c7 c7 c0 99 8e 92 c6 05 ae 7d be 01 01 e8 36 59
6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff 48
c7
[ 215.434740] RSP: 0018:9ccb0237fe60 EFLAGS: 00010286
[ 215.434747] RAX: 0026 RBX: 8d531f6f2828 RCX: 
[ 215.434753] RDX: 0001 RSI: 928d07a4 RDI: 
[ 215.434757] RBP: 8d61e47f5600 R08:  R09: 9ccb0237fd10
[ 215.434762] R10: 0003 R11: 8d622e2fffe8 R12: 8d61e47fc800
[ 215.434767] R13: 8d5313e95500 R14: 8d61e47fc805 R15: 8d531f6f2830
[ 215.434772] FS: () GS:8d61e460()
knlGS:
[ 215.434777] CS: 0010 DS:  ES:  CR0: 80050033
[ 215.434782] CR2: 7f0c8b815048 CR3: 0001ab0e8000 CR4: 00350ee0
[ 215.434788] Call Trace:
[ 215.434792] 
[ 215.434797] process_one_work+0x2a0/0x600
[ 215.434819] worker_thread+0x4f/0x3a0
[ 215.434830] ? process_one_work+0x600/0x600
[ 215.434836] kthread+0xf5/0x120
[ 215.434842] ? kthread_complete_and_exit+0x20/0x20
[ 215.434854] ret_from_fork+0x22/0x30
[ 215.434881] 
[ 215.434885] irq event stamp: 134873
[ 215.434890] hardirqs last enabled at (134881): []
__up_console_sem+0x5e/0x70
[ 215.434897] hardirqs l

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-17 Thread Mikhail Gavrilov
On Wed, Aug 17, 2022 at 9:08 PM Melissa Wen  wrote:
>
> Hi Mikhail,
>
> IIUC, you got this second use-after-free by applying the first version
> of Maíra's patch, right? So, that version was adding another unbalanced
> unlock to the cs ioctl flow, but it was solved in the latest version,
> that you can find here: https://patchwork.freedesktop.org/patch/497680/
> If this is the situation, can you check this last version?
>
> Thanks,
>
> Melissa

With the latest version, the "bad unlock balance detected!" warning is gone,
but the use-after-free issue remains.
The trace again shows "Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]".
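
For context, "bad unlock balance detected!" is lockdep's report that a lock was released while it was not held. Elsewhere in this thread the suspected cause is described as an error path that unlocks bo_list_mutex inside amdgpu_cs_vm_handling and then again in amdgpu_cs_parser_fini. The sketch below is only a toy Python analogy of that pattern, not kernel code and not the proposed patch; the function names `vm_handling`, `parser_fini` and `submit` are hypothetical stand-ins.

```python
import threading

bo_list_mutex = threading.Lock()  # stand-in for the real kernel mutex

def vm_handling(fail):
    """Hypothetical callee: on failure it drops the lock itself."""
    if fail:
        bo_list_mutex.release()
        return -1
    return 0

def parser_fini():
    """Hypothetical cleanup that always drops the lock."""
    bo_list_mutex.release()

def submit(fail):
    """Take the lock, run the callee, then always run the cleanup."""
    bo_list_mutex.acquire()
    vm_handling(fail)
    try:
        parser_fini()
    except RuntimeError as exc:
        # Python refuses to release an unlocked lock; lockdep would
        # print "bad unlock balance detected!" in the analogous case.
        return f"unbalanced unlock: {exc}"
    return "ok"

print(submit(False))
print(submit(True))
```

Only the failing path releases twice, which would explain why such a bug can stay hidden until the error path is actually taken.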

[  297.834779] [ cut here ]
[  297.834818] refcount_t: underflow; use-after-free.
[  297.834831] WARNING: CPU: 30 PID: 2377 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[  297.834838] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u
mt76x2_common mt76x02_usb mt76_usb mt76x02_lib snd_hda_codec_realtek
iwlmvm intel_rapl_msr snd_hda_codec_generic snd_hda_codec_hdmi mt76
vfat fat snd_hda_intel intel_rapl_common mac80211 snd_intel_dspcfg
snd_intel_sdw_acpi snd_usb_audio snd_hda_codec snd_usbmidi_lib btusb
edac_mce_amd iwlwifi libarc4 uvcvideo snd_hda_core btrtl snd_rawmidi
snd_hwdep videobuf2_vmalloc btbcm kvm_amd videobuf2_memops snd_seq
iwlmei btintel videobuf2_v4l2 eeepc_wmi snd_seq_device
videobuf2_common btmtk kvm xpad videodev joydev irqbypass snd_pcm
asus_wmi hid_logitech_hidpp ff_memless cfg80211 bluetooth rapl mc
[  297.834932]  ledtrig_audio snd_timer sparse_keymap platform_profile
wmi_bmof snd video pcspkr k10temp i2c_piix4 rfkill soundcore mei
asus_ec_sensors acpi_cpufreq zram amdgpu drm_ttm_helper ttm
crct10dif_pclmul crc32_pclmul crc32c_intel iommu_v2 ucsi_ccg gpu_sched
typec_ucsi drm_buddy ghash_clmulni_intel drm_display_helper ccp igb
typec sp5100_tco nvme cec nvme_core dca wmi ip6_tables ip_tables fuse
[  297.834978] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1 fjes():1
[  297.835055]  pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[  297.835071] CPU: 30 PID: 2377 Comm: kworker/30:6 Tainted: G
WL---  ---
6.0.0-0.rc1.20220817git3cc40a443a04.14.fc38.x86_64 #1
[  297.835075] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[  297.835078] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[  297.835085] RIP: 0010:refcount_warn_saturate+0xba/0x110
[  297.835088] Code: 01 01 e8 59 59 6f 00 0f 0b e9 22 46 a5 00 80 3d
be 7d be 01 00 75 85 48 c7 c7 c0 99 8e aa c6 05 ae 7d be 01 01 e8 36
59 6f 00 <0f> 0b e9 ff 45 a5 00 80 3d 99 7d be 01 00 0f 85 5e ff ff ff
48 c7
[  297.835091] RSP: 0018:bd3506df7e60 EFLAGS: 00010286
[  297.835095] RAX: 0026 RBX: 961b250cbc28 RCX: 
[  297.835097] RDX: 0001 RSI: aa8d07a4 RDI: 
[  297.835100] RBP: 96276a3f5600 R08:  R09: bd3506df7d10
[  297.835102] R10: 0003 R11: 9627ae2fffe8 R12: 96276a3fc800
[  297.835105] R13: 9618c03e6600 R14: 96276a3fc805 R15: 961b250cbc30
[  297.835108] FS:  () GS:96276a20()
knlGS:
[  297.835110] CS:  0010 DS:  ES:  CR0: 80050033
[  297.835113] CR2: 621001e4a000 CR3: 00018d958000 CR4: 00350ee0
[  297.835116] Call Trace:
[  297.835118]  
[  297.835121]  process_one_work+0x2a0/0x600
[  297.835133]  worker_thread+0x4f/0x3a0
[  297.835139]  ? process_one_work+0x600/0x600
[  297.835142]  kthread+0xf5/0x120
[  297.835145]  ? kthread_complete_and_exit+0x20/0x20
[  297.835151]  ret_from_fork+0x22/0x30
[  297.835166]  
[  

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-16 Thread Mikhail Gavrilov
On Mon, Aug 15, 2022 at 3:37 PM Mikhail Gavrilov
 wrote:
>
> Thanks, I tested this patch.
> But with this patch use-after-free problem happening in another place:

Does anyone have an idea why the second use-after-free happened?
From the trace I can't tell which code is involved.
I also don't quite understand what the "Workqueue" entry in the trace means.

[ 408.358737] [ cut here ]
[ 408.358743] refcount_t: underflow; use-after-free.
[ 408.358760] WARNING: CPU: 9 PID: 62 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[ 408.358769] Modules linked in: uinput snd_seq_dummy rfcomm
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc binfmt_misc snd_seq_midi snd_seq_midi_event mt76x2u
mt76x2_common snd_hda_codec_realtek mt76x02_usb snd_hda_codec_generic
iwlmvm snd_hda_codec_hdmi mt76_usb intel_rapl_msr snd_hda_intel
mt76x02_lib intel_rapl_common snd_intel_dspcfg snd_intel_sdw_acpi mt76
snd_hda_codec vfat fat snd_usb_audio snd_hda_core edac_mce_amd
mac80211 snd_usbmidi_lib snd_hwdep snd_rawmidi mc snd_seq btusb
kvm_amd iwlwifi snd_seq_device btrtl btbcm libarc4 btintel eeepc_wmi
snd_pcm iwlmei kvm btmtk asus_wmi ledtrig_audio irqbypass joydev
snd_timer sparse_keymap bluetooth platform_profile rapl cfg80211 snd
video wmi_bmof soundcore i2c_piix4 k10temp rfkill mei
[ 408.358853] asus_ec_sensors acpi_cpufreq zram hid_logitech_hidpp
amdgpu igb dca drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched
crc32_pclmul ucsi_ccg crc32c_intel drm_buddy nvme typec_ucsi
drm_display_helper ghash_clmulni_intel ccp typec nvme_core sp5100_tco
cec wmi ip6_tables ip_tables fuse
[ 408.358880] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
[ 408.358953] pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1
fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[ 408.358967] CPU: 9 PID: 62 Comm: kworker/9:0 Tainted: G W L ---
--- 6.0.0-0.rc1.13.fc38.x86_64+debug #1
[ 408.358971] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[ 408.358982] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 408.358987] Code: 01 01 e8 d9 59 6f 00 0f 0b e9 a2 46 a5 00 80 3d 3e
7e be 01 00 75 85 48 c7 c7 70 99 8e 92 c6 05 2e 7e be 01 01 e8 b6 59
6f 00 <0f> 0b e9 7f 46 a5 00 80 3d 19 7e be 01 00 0f 85 5e ff ff ff 48
c7
[ 408.358990] RSP: 0018:b124003efe60 EFLAGS: 00010286
[ 408.358994] RAX: 0026 RBX: 9987a025d428 RCX: 
[ 408.358997] RDX: 0001 RSI: 928d0754 RDI: 
[ 408.358999] RBP: 9994e4ff5600 R08:  R09: b124003efd10
[ 408.359001] R10: 0003 R11: 99952e2fffe8 R12: 9994e4ffc800
[ 408.359004] R13: 998600228cc0 R14: 9994e4ffc805 R15: 9987a025d430
[ 408.359006] FS: () GS:9994e4e0()
knlGS:
[ 408.359009] CS: 0010 DS:  ES:  CR0: 80050033
[ 408.359012] CR2: 27ac39e78000 CR3: 0001a66d8000 CR4: 00350ee0
[ 408.359015] Call Trace:
[ 408.359017] 
[ 408.359020] process_one_work+0x2a0/0x600
[ 408.359032] worker_thread+0x4f/0x3a0
[ 408.359036] ? process_one_work+0x600/0x600
[ 408.359039] kthread+0xf5/0x120
[ 408.359044] ? kthread_complete_and_exit+0x20/0x20
[ 408.359049] ret_from_fork+0x22/0x30
[ 408.359061] 
[ 408.359063] irq event stamp: 5468
[ 408.359064] hardirqs last enabled at (5467): []
_raw_spin_unlock_irq+0x24/0x50
[ 408.359071] hardirqs last disabled at (5468): []
__schedule+0xe2c/0x16d0
[ 408.359076] softirqs last enabled at (2482): []
rht_deferred_worker+0x708/0xc00
[ 408.359079] softirqs last disabled at (2480): []
rht_deferred_worker+0x1f7/0xc00
[ 408.359082] ---[ end trace  ]---
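
On the "Workqueue" question: in a kernel warning or oops, that line names the workqueue and the work function the kworker thread was executing when the warning fired. Here "events" is the default system workqueue, and the work item is drm_sched_entity_kill_jobs_work from the gpu_sched module. A minimal sketch for pulling these fields out of a log line follows; the helper name `parse_workqueue_line` is hypothetical.

```python
import re

# "Workqueue: <queue name> <work function> [<module>]"; the module part is
# absent when the work function is built into the kernel image.
WORKQUEUE_RE = re.compile(r"Workqueue:\s+(\S+)\s+(\S+)(?:\s+\[(\w+)\])?")

def parse_workqueue_line(line):
    """Return queue, work function and module from a 'Workqueue:' log line."""
    m = WORKQUEUE_RE.search(line)
    if m is None:
        return None
    queue, func, module = m.groups()
    return {"queue": queue, "work_fn": func, "module": module}

line = "[ 408.358974] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]"
info = parse_workqueue_line(line)
print(info)
```

The work function is usually the best starting point for reading the code path, since the call trace of a kworker otherwise shows only generic frames such as process_one_work and worker_thread.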


Full ke

Re: [BUG][5.20] refcount_t: underflow; use-after-free

2022-08-15 Thread Mikhail Gavrilov
On Mon, Aug 15, 2022 at 5:20 AM Maíra Canal  wrote:
>
> Hi Mikhail
>
> Looks like this use-after-free problem was introduced on
> 90af0ca047f3049c4b46e902f432ad6ef1e2ded6. Checking this patch it seems
> like: if amdgpu_cs_vm_handling return r != 0, then it will unlock
> bo_list_mutex inside the function amdgpu_cs_vm_handling and again on
> amdgpu_cs_parser_fini.
>
> Maybe the following patch will help:

Thanks, I tested this patch.
But with this patch the use-after-free now happens in another place:

[  894.012920] [ cut here ]
[  894.012939] refcount_t: underflow; use-after-free.
[  894.012968] WARNING: CPU: 14 PID: 205 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[  894.012999] Modules linked in: tls uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event snd_hda_codec_realtek
mt76x2u mt76x2_common snd_hda_codec_generic snd_hda_codec_hdmi
intel_rapl_msr mt76x02_usb intel_rapl_common snd_hda_intel mt76_usb
snd_intel_dspcfg vfat iwlmvm snd_intel_sdw_acpi mt76x02_lib fat
snd_usb_audio snd_hda_codec mt76 edac_mce_amd snd_usbmidi_lib
snd_hda_core btusb snd_rawmidi snd_hwdep mac80211 mc iwlwifi btrtl
eeepc_wmi asus_wmi btbcm snd_seq kvm_amd libarc4 ledtrig_audio
snd_seq_device btintel iwlmei sparse_keymap btmtk kvm snd_pcm
irqbypass platform_profile snd_timer xpad joydev cfg80211 rapl
hid_logitech_hidpp bluetooth ff_memless wmi_bmof video pcspkr snd
k10temp i2c_piix4
[  894.013086]  soundcore rfkill mei asus_ec_sensors acpi_cpufreq zram
amdgpu drm_ttm_helper ttm iommu_v2 crct10dif_pclmul ucsi_ccg gpu_sched
crc32_pclmul crc32c_intel typec_ucsi drm_buddy typec
drm_display_helper ghash_clmulni_intel igb ccp cec nvme sp5100_tco
nvme_core dca wmi ip6_tables ip_tables fuse
[  894.013322] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
[  894.013455]  pcc_cpufreq():1 pcc_cpufreq():1 fjes():1
pcc_cpufreq():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[  894.013690] CPU: 14 PID: 205 Comm: kworker/14:1 Tainted: GW
   L---  ---
5.20.0-0.rc0.20220812git7ebfc85e2cd7.11.fc38.x86_64 #1
[  894.013725] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[  894.013756] Workqueue: events drm_sched_entity_kill_jobs_work [gpu_sched]
[  894.013779] RIP: 0010:refcount_warn_saturate+0xba/0x110
[  894.013796] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d
de 7e be 01 00 75 85 48 c7 c7 f8 98 8e 9c c6 05 ce 7e be 01 01 e8 56
4a 6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff
48 c7
[  894.013842] RSP: 0018:b48681153e60 EFLAGS: 00010286
[  894.013858] RAX: 0026 RBX: 9bad16f1f028 RCX: 
[  894.013878] RDX: 0001 RSI: 9c8d06dc RDI: 
[  894.013897] RBP: 9bba663f5600 R08:  R09: b48681153d10
[  894.013916] R10: 0003 R11: 9bbaae2fffe8 R12: 9bba663fc800
[  894.013934] R13: 9bab93fcab40 R14: 9bba663fc805 R15: 9bad16f1f030
[  894.013954] FS:  () GS:9bba6620()
knlGS:
[  894.013975] CS:  0010 DS:  ES:  CR0: 80050033
[  894.013991] CR2: 1aa46b2ec008 CR3: 000101516000 CR4: 00350ee0
[  894.014011] Call Trace:
[  894.014022]  
[  894.014030]  process_one_work+0x2a0/0x600
[  894.014051]  worker_thread+0x4f/0x3a0
[  894.014065]  ? process_one_work+0x600/0x600
[  894.014079]  kthread+0xf5/0x120
[  894.014092]  ? kthread_complete_and_exit+0x20/0x20
[  894.014109]  ret_from_fork+0x22/0x30
[  894.014129]  
[  894.014137] irq event stamp: 5802
[  894.014148] hardirqs last  enabled at (5801): []
_raw_spin_unlock_irq+0x24/0x50
[  894.014178] hardirqs last disabled at (5802): []
__schedule+0xe2c/0x16d0
[  894.014206] softirq

[BUG][5.20] refcount_t: underflow; use-after-free

2022-08-14 Thread Mikhail Gavrilov
Hi folks.
I started testing 5.20 today (7ebfc85e2cd7).
I ran into frequent GPU freezes, after which this message appears
in the kernel logs:
[ 220.280990] [ cut here ]
[ 220.281000] refcount_t: underflow; use-after-free.
[ 220.281019] WARNING: CPU: 1 PID: 3746 at lib/refcount.c:28
refcount_warn_saturate+0xba/0x110
[ 220.281029] Modules linked in: uinput rfcomm snd_seq_dummy
snd_hrtimer nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink
qrtr bnep sunrpc snd_seq_midi snd_seq_midi_event vfat intel_rapl_msr
fat intel_rapl_common snd_hda_codec_realtek mt76x2u
snd_hda_codec_generic snd_hda_codec_hdmi mt76x2_common iwlmvm
mt76x02_usb edac_mce_amd mt76_usb snd_hda_intel snd_intel_dspcfg
mt76x02_lib snd_intel_sdw_acpi snd_usb_audio snd_hda_codec mt76
kvm_amd uvcvideo mac80211 snd_hda_core btusb eeepc_wmi snd_usbmidi_lib
videobuf2_vmalloc videobuf2_memops kvm btrtl snd_rawmidi asus_wmi
snd_hwdep videobuf2_v4l2 btbcm iwlwifi ledtrig_audio libarc4 btintel
snd_seq videobuf2_common sparse_keymap btmtk irqbypass videodev
snd_seq_device joydev xpad iwlmei platform_profile bluetooth
ff_memless snd_pcm mc rapl
[ 220.281185] video snd_timer cfg80211 wmi_bmof snd pcspkr soundcore
k10temp i2c_piix4 rfkill mei asus_ec_sensors acpi_cpufreq zram
hid_logitech_hidpp amdgpu igb dca drm_ttm_helper ttm crct10dif_pclmul
iommu_v2 crc32_pclmul gpu_sched crc32c_intel ucsi_ccg drm_buddy nvme
typec_ucsi ghash_clmulni_intel drm_display_helper ccp nvme_core typec
sp5100_tco cec wmi ip6_tables ip_tables fuse
[ 220.281258] Unloaded tainted modules: amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 amd64_edac():1
amd64_edac():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 amd64_edac():1 amd64_edac():1
pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1 amd64_edac():1
pcc_cpufreq():1 pcc_cpufreq():1 amd64_edac():1 pcc_cpufreq():1
amd64_edac():1 pcc_cpufreq():1 pcc_cpufreq():1 pcc_cpufreq():1
[ 220.281388] pcc_cpufreq():1 fjes():1 pcc_cpufreq():1 fjes():1
fjes():1 fjes():1 fjes():1 fjes():1 fjes():1 fjes():1
[ 220.281415] CPU: 1 PID: 3746 Comm: chrome:cs0 Tainted: G W L ---
--- 5.20.0-0.rc0.20220812git7ebfc85e2cd7.10.fc38.x86_64 #1
[ 220.281421] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 220.281426] RIP: 0010:refcount_warn_saturate+0xba/0x110
[ 220.281431] Code: 01 01 e8 79 4a 6f 00 0f 0b e9 42 47 a5 00 80 3d de
7e be 01 00 75 85 48 c7 c7 f8 98 8e 98 c6 05 ce 7e be 01 01 e8 56 4a
6f 00 <0f> 0b e9 1f 47 a5 00 80 3d b9 7e be 01 00 0f 85 5e ff ff ff 48
c7
[ 220.281437] RSP: 0018:b4b0d18d7a80 EFLAGS: 00010282
[ 220.281443] RAX: 0026 RBX: 0003 RCX: 
[ 220.281448] RDX: 0001 RSI: 988d06dc RDI: 
[ 220.281452] RBP:  R08:  R09: b4b0d18d7930
[ 220.281457] R10: 0003 R11: a0672e2fffe8 R12: a058ca360400
[ 220.281461] R13: a05846c50a18 R14: fe00 R15: 0003
[ 220.281465] FS: 7f82683e06c0() GS:a066e2e0()
knlGS:
[ 220.281470] CS: 0010 DS:  ES:  CR0: 80050033
[ 220.281475] CR2: 3590005cc000 CR3: 0001fca46000 CR4: 00350ee0
[ 220.281480] Call Trace:
[ 220.281485] 
[ 220.281490] amdgpu_cs_ioctl+0x4e2/0x2070 [amdgpu]
[ 220.281806] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[ 220.282028] drm_ioctl_kernel+0xa4/0x150
[ 220.282043] drm_ioctl+0x21f/0x420
[ 220.282053] ? amdgpu_cs_find_mapping+0xe0/0xe0 [amdgpu]
[ 220.282275] ? lock_release+0x14f/0x460
[ 220.282282] ? _raw_spin_unlock_irqrestore+0x30/0x60
[ 220.282290] ? _raw_spin_unlock_irqrestore+0x30/0x60
[ 220.282297] ? lockdep_hardirqs_on+0x7d/0x100
[ 220.282305] ? _raw_spin_unlock_irqrestore+0x40/0x60
[ 220.282317] amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
[ 220.282534] __x64_sys_ioctl+0x90/0xd0
[ 220.282545] do_syscall_64+0x5b/0x80
[ 220.282551] ? futex_wake+0x6c/0x150
[ 220.282568] ? lock_is_held_type+0xe8/0x140
[ 220.282580] ? do_syscall_64+0x67/0x80
[ 220.282585] ? lockdep_hardirqs_on+0x7d/0x100
[ 220.282592] ? do_syscall_64+0x67/0x80
[ 220.282597] ? do_syscall_64+0x67/0x80
[ 220.282602] ? lockdep_hardi

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-24 Thread Mikhail Gavrilov
On Thu, 21 Jan 2021 at 18:27, Christian König  wrote:
>
> I still have no idea what's going on here.
>
> The KASAN messages from the DC code are completely unrelated.
>
> Please add the full dmesg to your bug report.
>

I did it.
https://gitlab.freedesktop.org/drm/amd/-/issues/1439#note_776267

-- 
Best Regards,
Mike Gavrilov.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-19 Thread Mikhail Gavrilov
On Fri, 15 Jan 2021 at 03:43, Mikhail Gavrilov
 wrote:
>

In rc4, the number of warnings has dropped dramatically.
The "kasan slab-out-of-bounds" errors and the "DMA-API device
driver failed to check map error" message are gone.
But "sleeping function called from invalid context at
include/linux/sched/mm.h:196" and "BUG: key 88810b0d9148 has not
been registered!" are still not fixed.
The second issue looks Navi-specific, because it started to happen on
the 5.10 kernel after I replaced the Radeon VII with a 6900XT.

1.
BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 500, name: systemd-udevd
1 lock held by systemd-udevd/500:
 #0: 888107690258 (&dev->mutex){}-{3:3}, at:
device_driver_attach+0xa3/0x250
CPU: 9 PID: 500 Comm: systemd-udevd Not tainted
5.11.0-0.rc4.129.fc34.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0xae/0xe5
 ___might_sleep.cold+0x150/0x17e
 ? dcn30_clock_source_create+0x53/0x110 [amdgpu]
 kmem_cache_alloc_trace+0x23f/0x270
 dcn30_clock_source_create+0x53/0x110 [amdgpu]
 dcn30_create_resource_pool+0x998/0x4890 [amdgpu]
 ? dcn30_calc_max_scaled_time+0x40/0x40 [amdgpu]
 ? lock_is_held_type+0xb8/0xf0
 ? unpoison_range+0x3a/0x60
 ? kasan_kmalloc.constprop.0+0x84/0xa0
 ? dc_create_resource_pool+0x26e/0x5e0 [amdgpu]
 dc_create_resource_pool+0x26e/0x5e0 [amdgpu]
 dc_create+0x636/0x1bc0 [amdgpu]
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0x18/0x170
 ? find_held_lock+0x33/0x110
 ? dc_create_state+0xa0/0xa0 [amdgpu]
 ? lock_downgrade+0x6b0/0x6b0
 ? module_assert_mutex_or_preempt+0x3e/0x70
 ? lock_is_held_type+0xb8/0xf0
 ? unpoison_range+0x3a/0x60
 ? kasan_kmalloc.constprop.0+0x84/0xa0
 amdgpu_dm_init.isra.0+0x479/0x640 [amdgpu]
 ? vprintk_emit+0x1c0/0x460
 ? dev_vprintk_emit+0x2d8/0x31a
 ? sched_clock+0x5/0x10
 ? dm_resume+0x13b0/0x13b0 [amdgpu]
 ? dev_attr_show.cold+0x35/0x35
 ? lock_downgrade+0x6b0/0x6b0
 ? dev_printk_emit+0x8c/0xa8
 ? dev_vprintk_emit+0x31a/0x31a
 ? wait_for_completion_io+0x240/0x240
 ? __dev_printk+0x71/0xdf
 ? smu_hw_init.cold+0x16b/0x18a [amdgpu]
 ? smu_suspend+0x240/0x240 [amdgpu]
 ? navi10_ih_irq_init+0xea3/0x2420 [amdgpu]
 dm_hw_init+0xe/0x20 [amdgpu]
 amdgpu_device_init.cold+0x3031/0x4940 [amdgpu]
 ? amdgpu_device_cache_pci_state+0xf0/0xf0 [amdgpu]
 ? pci_bus_read_config_byte+0x140/0x140
 ? do_pci_enable_device+0x1f8/0x260
 ? pci_find_saved_ext_cap+0x110/0x110
 ? pci_enable_bridge+0xf9/0x1e0
 ? pci_dev_check_d3cold+0x107/0x250
 ? pci_enable_device_flags+0x201/0x340
 amdgpu_driver_load_kms+0x167/0x8a0 [amdgpu]
 amdgpu_pci_probe+0x235/0x360 [amdgpu]
 ? amdgpu_pci_remove+0xd0/0xd0 [amdgpu]
 local_pci_probe+0xd8/0x170
 pci_device_probe+0x318/0x5c0
 ? kernfs_create_link+0x16c/0x230
 ? pci_device_remove+0x1d0/0x1d0
 really_probe+0x224/0xc40
 driver_probe_device+0x1f2/0x380
 device_driver_attach+0x1df/0x250
 __driver_attach+0xf6/0x260
 ? device_driver_attach+0x250/0x250
 bus_for_each_dev+0x114/0x180
 ? subsys_dev_iter_exit+0x10/0x10
 bus_add_driver+0x352/0x570
 driver_register+0x20f/0x390
 ? __pci_register_driver+0x13a/0x210
 ? 0xc1d8d000
 do_one_initcall+0xfb/0x530
 ? perf_trace_initcall_level+0x3d0/0x3d0
 ? __memset+0x2b/0x30
 ? unpoison_range+0x3a/0x60
 do_init_module+0x1ce/0x7a0
 load_module+0x9841/0xa380
 ? module_frob_arch_sections+0x20/0x20
 ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
 ? sched_clock_cpu+0x18/0x170
 ? sched_clock+0x5/0x10
 ? lock_acquire+0x2dd/0x7a0
 ? sched_clock+0x5/0x10
 ? lock_is_held_type+0xb8/0xf0
 ? __do_sys_init_module+0x18b/0x220
 __do_sys_init_module+0x18b/0x220
 ? load_module+0xa380/0xa380
 ? ktime_get_coarse_real_ts64+0x12f/0x160
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f2c109da07e
Code: 48 8b 0d f5 1d 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d c2 1d 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffc84d33f88 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 55b87f8260a0 RCX: 7f2c109da07e
RDX: 55b87f834060 RSI: 01e2cbf6 RDI: 7f2c0b7e0010
RBP: 7f2c0b7e0010 R08: 55b87f8281e0 R09: 7ffc84d30a26
R10: 55bd2404cc18 R11: 0246 R12: 55b87f834060
R13: 55b87f831ca0 R14:  R15: 55b87f832640
[drm] Display Core initialized with v3.2.116!
[drm] DMUB hardware initialized: version=0x0201
usb 1-3.2: Device not responding to setup address.
usb 1-3.2: device not accepting address 5, error -71
[drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480


2.
BUG: key 88810b0d9148 has not been registered!
[ cut here ]
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 25 PID: 500 at kernel/locking/lockdep.c:4618
lockdep_init_map_waits+0x592/0x770
Modules li

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-14 Thread Mikhail Gavrilov
On Thu, 14 Jan 2021 at 18:56, Christian König  wrote:
> Unfortunately not of hand.
>
> I also don't see any bug reports from other people and can't reproduce
> the last backtrace you send out TTM here.

Because only the most desperate will install kernels with debug
flags enabled and then load the system by opening a huge number of
programs and tabs. So you shouldn't be surprised that I'm the only one
here.
This is what my desktop looks like every day: https://imgur.com/a/Kxlmrem

> Do you have any local modifications or special setup in your system?
> Like bpf scripts or something like that?

No, I didn't write any bpf scripts, but it looks like my distribution,
Fedora Rawhide, uses some bpf scripts by default out of the box:

# bpftool prog
20: cgroup_device  tag 40ddf486530245f5  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 504B  jited 309B  memlock 4096B
21: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
22: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
23: cgroup_device  tag ca8e50a3c7fb034b  gpl
loaded_at 2021-01-15T01:30:05+0500  uid 0
xlated 496B  jited 307B  memlock 4096B
24: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:05+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
25: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:05+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
26: cgroup_device  tag be31ae23198a0378  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 464B  jited 288B  memlock 4096B
27: cgroup_device  tag ee0e253c78993a24  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 416B  jited 255B  memlock 4096B
28: cgroup_device  tag 438c5618576e5b0c  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 568B  jited 354B  memlock 4096B
29: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
30: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
31: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
32: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:13+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
33: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
34: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
35: cgroup_device  tag ee0e253c78993a24  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 416B  jited 255B  memlock 4096B
38: cgroup_device  tag 3a0ef5414c2f6fca  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 744B  jited 447B  memlock 4096B
39: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
40: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:14+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
41: cgroup_device  tag ee0e253c78993a24  gpl
loaded_at 2021-01-15T01:30:18+0500  uid 0
xlated 416B  jited 255B  memlock 4096B
42: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:18+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
43: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:18+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
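
A listing like the one above is easier to read once it is tallied by
program type. The following is only a minimal sketch: it assumes the
plain-text layout shown in this message ("ID: type  tag HASH  gpl"),
and the helper name count_prog_types is hypothetical, not part of
bpftool.

```python
import re
from collections import Counter

# Matches header lines such as "20: cgroup_device  tag 40ddf486530245f5  gpl".
HEADER = re.compile(r"^(\d+):\s+(\S+)\s+tag\s+([0-9a-f]+)")


def count_prog_types(text):
    """Return a Counter of program types found in `bpftool prog` text output."""
    counts = Counter()
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


# First three entries copied from the listing in this message.
sample = """\
20: cgroup_device  tag 40ddf486530245f5  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 504B  jited 309B  memlock 4096B
21: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
22: cgroup_skb  tag 6deef7357e7b4530  gpl
loaded_at 2021-01-15T01:30:04+0500  uid 0
xlated 64B  jited 54B  memlock 4096B
"""

print(dict(count_prog_types(sample)))
```

Only the header line of each entry is matched, so the "loaded_at" and
"xlated" lines are skipped without any special handling.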

I caught a couple more leaks, but nothing new:
https://pastebin.com/2EgvYJdz

[1] do_detailed_mode+0x7c1/0x13d0 [drm]
[2] drm_mode_duplicate+0x45/0x220 [drm]
[3] do_seccomp+0x215/0x2280
[4] __vmalloc_node_range+0x464/0x7b0
[5] bpf_prog_alloc_no_stats+0xa2/0x2b0
[6] bpf_prog_store_orig_filter+0x7b/0x1c0
[7] kmemdup+0x1a/0x40

Did the following trace message confuse anyone?
==
BUG: KASAN: slab-out-of-bounds in
kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
Read of size 1 at addr 88812a6b4181 by task systemd-udevd/491

CPU: 20 PID: 491 Comm: systemd-udevd Not tainted
5.11.0-0.rc3.20210114git65f0d2414b70.125.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0xae/0xe5
 print_address_description.constprop.0+0x18/0x160
 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 kasan_report.cold+0x7f/0x10e
 ? kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 kfd_create_crat_image_virtual+0x12d2/0x1380 [amdgpu]
 ? kfd_create_crat_image_acpi+0x340/0x340 [amdgpu]
 ? __raw_spin_lock_init+0x39/0x110
 kfd_topology_init+0x2ac/0x400 [amdgpu]
 ? kfd_create_topology_device+0x320/0x320 [amdgpu]
 ? __class_register+0x2ad/0x430
 ? __class_create+0xc5/0x130
 kgd2kfd_init+0x95/0xf0 [amdgpu]
 amdgpu_a

Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-13 Thread Mikhail Gavrilov
On Tue, 12 Jan 2021 at 01:45, Christian König  wrote:
>
> But what you have in your logs so far are only unrelated symptoms, the
> root of the problem is that somebody is leaking memory.
>
> What you could do as well is to try to enable kmemleak

I captured some memleaks.
Do they contain any useful information?

[1] https://pastebin.com/n0FE7Hsu
[2] https://pastebin.com/MUX55L1k
[3] https://pastebin.com/a3FT7DVG
[4] https://pastebin.com/1ALvJKz7

--
Best Regards,
Mike Gavrilov.


Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-11 Thread Mikhail Gavrilov
Hi Christian,

On Tue, 12 Jan 2021 at 01:45, Christian König  wrote:
>
> Hi Mike,
>
> Unfortunately not, that's DC stuff. Easiest is to assign this as a bug
> tracker to our DC team.
Ok

> At least some progress. Any objections that I add your e-mail address as
> tested-by tag?
No objections, feel free to add me.

> I can take a look at this one here. Looks like some missing error
> handling when allocating memory.
> Can you decode to which line number ttm_tt_swapin+0x34 points to?
$ /usr/src/kernels/`uname -r`/scripts/faddr2line
/lib/debug/lib/modules/`uname
-r`/kernel/drivers/gpu/drm/ttm/ttm.ko.debug ttm_tt_swapin+0x34
ttm_tt_swapin+0x34/0xd0:
mapping_gfp_mask at
/usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/./include/linux/pagemap.h:105
(discriminator 2)
(inlined by) ttm_tt_swapin at
/usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c:210
(discriminator 2)

$ cat -s -n 
/usr/src/debug/kernel-20210108gitf5e6c330254a/linux-5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64/drivers/gpu/drm/ttm/ttm_tt.c
| head -220 | tail -20
   201  struct page *from_page;
   202  struct page *to_page;
   203  gfp_t gfp_mask;
   204  int i, ret;
   205
   206  swap_storage = ttm->swap_storage;
   207  BUG_ON(swap_storage == NULL);
   208
   209  swap_space = swap_storage->f_mapping;
   210  gfp_mask = mapping_gfp_mask(swap_space);
   211
   212  for (i = 0; i < ttm->num_pages; ++i) {
   213  from_page = shmem_read_mapping_page_gfp(swap_space, i,
   214  gfp_mask);
   215  if (IS_ERR(from_page)) {
   216  ret = PTR_ERR(from_page);
   217  goto out_err;
   218  }
   219  to_page = ttm->pages[i];
   220  if (unlikely(to_page == NULL)) {
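
When many traces have to be decoded this way, the "symbol+offset/size
[module]" notation can be split mechanically before it is handed to
faddr2line. This is a rough sketch under the assumption that frames
follow the format seen in these logs; parse_frame is a hypothetical
helper name.

```python
import re

# symbol+0xOFFSET/0xSIZE, optionally followed by "[module]".
# Symbols may contain dots, e.g. "amdgpu_dm_init.isra.0".
FRAME = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def parse_frame(text):
    """Split one stack frame into symbol, offset, size and module (or None)."""
    match = FRAME.search(text)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        "offset": int(match.group("offset"), 16),
        "size": int(match.group("size"), 16),
        "module": match.group("module"),
    }


frame = parse_frame("RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]")
print(frame)
```

The offset is counted from the start of the function, so it only maps
to a source line when the debug info matches the running build.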

> Please use this one here:
> https://gitlab.freedesktop.org/drm/amd/-/issues/new
>
> If you can't find the DC guys of hand in the assignee list just assign
> to me and I will forward.
https://gitlab.freedesktop.org/drm/amd/-/issues/1439
Ok, let's continue there.

--
Best Regards,
Mike Gavrilov.


Re: [drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-11 Thread Mikhail Gavrilov
On Mon, 11 Jan 2021 at 19:01, Christian König  wrote:

> Changing the page table attributes while releasing memory might sleep.
> So we can't use a spinlock here.
>
> Thanks for the report, a patch to fix this is on the mailing list now.

Can you also look at the first trace?
It has the same error message, "sleeping function called from invalid
context", and a lot of [amdgpu] code.

BUG: sleeping function called from invalid context at
include/linux/sched/mm.h:196
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 501, name: systemd-udevd
1 lock held by systemd-udevd/501:
 #0: 978e0278d258 (&dev->mutex){}-{3:3}, at:
device_driver_attach+0x3b/0xb0
CPU: 25 PID: 501 Comm: systemd-udevd Not tainted
5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 ? dcn30_clock_source_create+0x34/0xb0 [amdgpu]
 kmem_cache_alloc_trace+0x204/0x230
 dcn30_clock_source_create+0x34/0xb0 [amdgpu]
 dcn30_create_resource_pool+0x1d9/0x13a0 [amdgpu]
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? trace_kmalloc+0xb2/0xe0
 ? __kmalloc+0x191/0x280
 ? dc_create_resource_pool+0x110/0x1d0 [amdgpu]
 dc_create_resource_pool+0x110/0x1d0 [amdgpu]
 dc_create+0x205/0x790 [amdgpu]
 ? trace_kmalloc+0xb2/0xe0
 ? kmem_cache_alloc_trace+0x174/0x230
 amdgpu_dm_init.isra.0+0x1b9/0x250 [amdgpu]
 ? dev_vprintk_emit+0x171/0x195
 ? dev_printk_emit+0x3e/0x40
 dm_hw_init+0xe/0x20 [amdgpu]
 amdgpu_device_init.cold+0x179f/0x1afd [amdgpu]
 ? pci_conf1_read+0xa4/0x100
 amdgpu_driver_load_kms+0x68/0x280 [amdgpu]
 amdgpu_pci_probe+0x129/0x1b0 [amdgpu]
 local_pci_probe+0x42/0x80
 pci_device_probe+0xd9/0x1a0
 really_probe+0x205/0x460
 driver_probe_device+0xe1/0x150
 device_driver_attach+0xa8/0xb0
 __driver_attach+0x8c/0x150
 ? device_driver_attach+0xb0/0xb0
 ? device_driver_attach+0xb0/0xb0
 bus_for_each_dev+0x67/0x90
 bus_add_driver+0x12e/0x1f0
 driver_register+0x8f/0xe0
 ? 0xc0d9c000
 do_one_initcall+0x67/0x320
 ? rcu_read_lock_sched_held+0x3f/0x80
 ? trace_kmalloc+0xb2/0xe0
 ? kmem_cache_alloc_trace+0x174/0x230
 do_init_module+0x5c/0x270
 __do_sys_init_module+0x130/0x190
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f363661deee
Code: 48 8b 0d 85 1f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 52 1f 0c 00 f7 d8 64 89 01 48
RSP: 002b:7ffeb7191588 EFLAGS: 0246 ORIG_RAX: 00af
RAX: ffda RBX: 561b94563170 RCX: 7f363661deee
RDX: 561b94579df0 RSI: 00b8a356 RDI: 7f3633b9e010
RBP: 7f3633b9e010 R08: 561b94565240 R09: 7ffeb718d786
R10: 561ef5ef1595 R11: 0246 R12: 561b94579df0
R13: 561b9457a3e0 R14:  R15: 561b94576530
[drm] Display Core initialized with v3.2.116!
[drm] DMUB hardware initialized: version=0x0201
usb 1-3.2: new high-speed USB device number 5 using xhci_hcd
[drm] REG_WAIT timeout 1us * 10 tries - mpc2_assert_idle_mpcc line:480

> > -12 is just -ENOMEM. Looks like a memory leak to me, maybe caused by
> > the problem above, maybe something completely unrelated.
> >
> > I will take a look.
>
> The looks like a completely unrelated memory leak to me.
>
> Probably best if you open up a bug report for this.

Yes, the monitor still turns off after applying the patch "make the
pool shrinker lock a mutex".
Anyway, the patch fixed the flood of "BUG: sleeping function called
from invalid context at mm/vmalloc.c:1756" messages, so the kernel
log became cleaner.
Now the monitor turning off looks like this in the logs:

DMA-API: cacheline tracking ENOMEM, dma-debug disabled
amdgpu :0b:00.0: amdgpu: 6b791523 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin
framebuffer with error -12
BUG: kernel NULL pointer dereference, address: 0060
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops:  [#1] SMP NOPTI
CPU: 20 PID: 3780 Comm: brave:cs0 Tainted: GW-
---  5.11.0-0.rc2.20210108gitf5e6c330254a.120.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
RIP: 0010:ttm_tt_swapin+0x34/0x1b0 [ttm]
Code: 55 41 54 55 53 48 83 ec 10 48 8b 47 20 48 89 44 24 08 48 85 c0
0f 84 86 01 00 00 48 8b 44 24 08 49 89 fc 4c 8b a8 e0 01 00 00 <41> 8b
45 60 89 44 24 04 8b 47 0c 85 c0 0f 84 df 00 00 00 31 db 65
RSP: 0018:a7400532b9c0 EFLAGS: 00010286
RAX: 978e2ae25800 RBX: 97910ec12058 RCX: 978e12caac70
RDX: 8010 RSI:  RDI: 97912c3d99c0
RBP: 97912c3d99c0 R08:  R09: 70b3a000
R10: 0002 R11:  R12: 97912c3d99c0
R13:  R14: a7400532ba90 R15: 978e182c6350
FS:  7f070bb1b640(00

[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin framebuffer with error -12

2021-01-10 Thread Mikhail Gavrilov
Hi folks,
Today I started testing kernel 5.11 and saw that the kernel log was
flooded with BUG messages:
BUG: sleeping function called from invalid context at mm/vmalloc.c:1756
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 266, name: kswapd0
INFO: lockdep is turned off.
CPU: 15 PID: 266 Comm: kswapd0 Tainted: GW-
---  5.11.0-0.rc2.20210108gitf5e6c330254a.119.fc34.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 2802 10/21/2020
Call Trace:
 dump_stack+0x8b/0xb0
 ___might_sleep.cold+0xb6/0xc6
 vm_unmap_aliases+0x21/0x40
 change_page_attr_set_clr+0x9e/0x190
 set_memory_wb+0x2f/0x80
 ttm_pool_free_page+0x28/0x90 [ttm]
 ttm_pool_shrink+0x45/0xb0 [ttm]
 ttm_pool_shrinker_scan+0xa/0x20 [ttm]
 do_shrink_slab+0x177/0x3a0
 shrink_slab+0x9c/0x290
 shrink_node+0x2e6/0x700
 balance_pgdat+0x2f5/0x650
 kswapd+0x21d/0x4d0
 ? do_wait_intr_irq+0xd0/0xd0
 ? balance_pgdat+0x650/0x650
 kthread+0x13a/0x150
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
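
The second line of such a report carries the context flags that
explain why the call was invalid: in_atomic(): 1 means the task was in
atomic context (for example while holding a spinlock) when it reached a
function that may sleep. As a small sketch, assuming the line format
shown above, the flags can be pulled out like this
(parse_sleep_context is a hypothetical helper):

```python
import re

CONTEXT = re.compile(
    r"in_atomic\(\): (\d+), irqs_disabled\(\): (\d+), non_block: (\d+), "
    r"pid: (\d+), name: (\S+)"
)


def parse_sleep_context(line):
    """Extract the context flags from a 'sleeping function' BUG report line."""
    match = CONTEXT.search(line)
    if match is None:
        return None
    in_atomic, irqs_disabled, non_block, pid = (int(g) for g in match.groups()[:4])
    return {
        "in_atomic": bool(in_atomic),
        "irqs_disabled": bool(irqs_disabled),
        "non_block": bool(non_block),
        "pid": pid,
        "name": match.group(5),
    }


ctx = parse_sleep_context(
    "in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 266, name: kswapd0"
)
print(ctx)
```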

But the most unpleasant thing is that after a while the monitor turns
off and does not come back on until a restart.
This is accompanied by an entry in the kernel log:

amdgpu :0b:00.0: amdgpu: ff7d8b94 pin failed
[drm:dm_plane_helper_prepare_fb [amdgpu]] *ERROR* Failed to pin
framebuffer with error -12

$ grep "Failed to pin framebuffer with error" -Rn .
./drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:5816:
DRM_ERROR("Failed to pin framebuffer with error %d\n", r);

$ git blame -L 5811,5821 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
Blaming lines:   0% (11/9167), done.
5d43be0ccbc2f (Christian König 2017-10-26 18:06:23 +0200 5811)          domain = AMDGPU_GEM_DOMAIN_VRAM;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5812)
7b7c6c81b3a37 (Junwei Zhang    2018-06-25 12:51:14 +0800 5813)  r = amdgpu_bo_pin(rbo, domain);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5814)  if (unlikely(r != 0)) {
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5815)          if (r != -ERESTARTSYS)
30b7c6147d18d (Harry Wentland  2017-10-26 15:35:14 -0400 5816)                  DRM_ERROR("Failed to pin framebuffer with error %d\n", r);
0f257b09531b4 (Chunming Zhou   2019-05-07 19:45:31 +0800 5817)          ttm_eu_backoff_reservation(&ticket, &list);
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5818)          return r;
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5819)  }
e7b07ceef2a65 (Harry Wentland  2017-08-10 13:29:07 -0400 5820)
bb812f1ea87dd (Junwei Zhang    2018-06-25 13:32:24 +0800 5821)  r = amdgpu_ttm_alloc_gart(&rbo->tbo);

Who knows how to fix it?

Full kernel logs are here:
[1] https://pastebin.com/fLasjDHX
[2] https://pastebin.com/g3wR2r9e

--
Best Regards,
Mike Gavrilov.


Re: [bug] Radeon 3900XT not switch to graphic mode on kernel 5.10

2020-12-30 Thread Mikhail Gavrilov
On Tue, 29 Dec 2020 at 20:15, Deucher, Alexander
 wrote:
>
> It looks like the driver is not able to access the firmware for some reason.  
> Please make sure it is available in your initrd or compiled into the kernel 
> depending on your config.

Exactly! Thanks!

# lsinitrd 
/boot/initramfs-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64.img
| grep sienna_cichlid

# ls /usr/lib/firmware/amdgpu | grep sienna_cichlid
sienna_cichlid_ce.bin
sienna_cichlid_dmcub.bin
sienna_cichlid_me.bin
sienna_cichlid_mec2.bin
sienna_cichlid_mec.bin
sienna_cichlid_pfp.bin
sienna_cichlid_rlc.bin
sienna_cichlid_sdma.bin
sienna_cichlid_smc.bin
sienna_cichlid_sos.bin
sienna_cichlid_ta.bin
sienna_cichlid_vcn.bin

# dracut -f --regenerate-all

# lsinitrd 
/boot/initramfs-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64.img
| grep sienna_cichlid
-rw-r--r--   1 root root   263296 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_ce.bin
-rw-r--r--   1 root root80244 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_dmcub.bin
-rw-r--r--   1 root root   263424 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_me.bin
-rw-r--r--   2 root root   268592 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_mec2.bin
-rw-r--r--   2 root root0 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_mec.bin
-rw-r--r--   1 root root   263424 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_pfp.bin
-rw-r--r--   1 root root   128592 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_rlc.bin
-rw-r--r--   1 root root34048 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_sdma.bin
-rw-r--r--   1 root root   247396 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_smc.bin
-rw-r--r--   1 root root   215152 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_sos.bin
-rw-r--r--   1 root root   333568 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_ta.bin
-rw-r--r--   1 root root   504224 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_vcn.bin

# grep '20201204git34816d20f173\|linux-firmware-20201218-116'
/var/log/dnf.rpm.log
2020-12-06T12:40:44+0500 SUBDEBUG Installed:
kernel-core-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-06T12:40:46+0500 SUBDEBUG Installed:
kernel-modules-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-06T12:41:03+0500 SUBDEBUG Installed:
kernel-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-06T12:41:03+0500 SUBDEBUG Installed:
kernel-modules-extra-5.10.0-0.rc6.20201204git34816d20f173.92.fc34.x86_64
2020-12-21T10:52:43+0500 SUBDEBUG Upgrade:
linux-firmware-20201218-116.fc34.noarch

I think every update of linux-firmware should regenerate initramfs.
But my downstream report was closed:
https://bugzilla.redhat.com/show_bug.cgi?id=1911745
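
The manual check above (compare the firmware directory with the
lsinitrd output) can be expressed as a small function. This is only a
sketch: it works on text that has already been captured, the helper
name missing_from_initramfs is hypothetical, and the path prefix is
taken from the listing in this message.

```python
def missing_from_initramfs(firmware_files, initramfs_listing,
                           prefix="usr/lib/firmware/amdgpu/"):
    """Return firmware file names that do not appear in the initramfs listing.

    firmware_files    -- names as printed by `ls /usr/lib/firmware/amdgpu`
    initramfs_listing -- raw text output of `lsinitrd <image>`
    """
    present = set()
    for line in initramfs_listing.splitlines():
        # Scan every token so that wrapped listing lines are handled too.
        for token in line.split():
            if token.startswith(prefix):
                present.add(token[len(prefix):])
    return sorted(set(firmware_files) - present)


on_disk = ["sienna_cichlid_sos.bin", "sienna_cichlid_ta.bin"]
listing = """\
-rw-r--r--   1 root root   333568 Dec 15 14:00
usr/lib/firmware/amdgpu/sienna_cichlid_ta.bin
"""

print(missing_from_initramfs(on_disk, listing))
```

An empty result means every firmware file on disk is also packed into
the image. A non-empty result suggests the initramfs has to be
regenerated.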

--
Best Regards,
Mike Gavrilov.


Re: [bug] Radeon 3900XT not switch to graphic mode on kernel 5.10

2020-12-27 Thread Mikhail Gavrilov
On Sun, 27 Dec 2020 at 21:39, Mikhail Gavrilov
 wrote:
> I suppose the root of cause my problem here:
>
> [3.961326] amdgpu :0b:00.0: Direct firmware load for
> amdgpu/sienna_cichlid_sos.bin failed with error -2
> [3.961359] amdgpu :0b:00.0: amdgpu: failed to init sos firmware
> [3.961433] [drm:psp_sw_init [amdgpu]] *ERROR* Failed to load psp firmware!
> [3.961529] [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* sw_init
> of IP block  failed -2
> [3.961549] amdgpu :0b:00.0: amdgpu: amdgpu_device_ip_init failed
> [3.961569] amdgpu :0b:00.0: amdgpu: Fatal error during GPU init
> [3.961911] amdgpu: probe of :0b:00.0 failed with error -2
>

# dnf provides */sienna_cichlid_sos.bin
Last metadata expiration check: 3:01:27 ago on Sun 27 Dec 2020 06:53:25 PM +05.
linux-firmware-20201218-116.fc34.noarch : Firmware files used by the
Linux kernel
Repo: @System
Matched from:
Filename: /usr/lib/firmware/amdgpu/sienna_cichlid_sos.bin

linux-firmware-20201218-116.fc34.noarch : Firmware files used by the
Linux kernel
Repo: rawhide
Matched from:
Filename: /usr/lib/firmware/amdgpu/sienna_cichlid_sos.bin

# dnf install linux-firmware-20201218-116.fc34.noarch
Last metadata expiration check: 3:02:11 ago on Sun 27 Dec 2020 06:53:25 PM +05.
Package linux-firmware-20201218-116.fc34.noarch is already installed.
Dependencies resolved.
Nothing to do.
Complete!

It looks like the firmware is present, so I don't understand why the
kernel cannot read it.
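
For reference, the "error -2" in the quoted log is a negated errno
value. A short sketch for turning such codes into names, using only
the Python standard library (describe_kernel_error is a hypothetical
helper name):

```python
import errno
import os


def describe_kernel_error(code):
    """Map a negative kernel return code such as -2 to its errno name and text."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


print(describe_kernel_error(-2))
```

ENOENT here means the loader did not find the file at the place where
it looked, which is not necessarily the root filesystem where the
package installed it.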

--
Best Regards,
Mike Gavrilov.


[bugreport] [5.10-rc1] Oops: 0000 [#1] SMP NOPTI bug which always starts as page allocation failure

2020-11-03 Thread Mikhail Gavrilov
Hi folks.
I observed a set of bugs that is hard to reproduce.
It always starts with:
1) kworker/u64:2: page allocation failure: order:5,
mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO),
nodemask=(null),cpuset=/,mems_allowed=0
Continues with:
2) WARNING: CPU: 21 PID: 806649 at
drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:7505
amdgpu_dm_atomic_commit_tail+0x23bd/0x24e0 [amdgpu]
And ends with:
3) BUG: unable to handle page fault for address: 00012488
This is annoying because it leads to a complete computer hang.
Example of one log:

[11561.927250] kworker/u64:10: page allocation failure: order:5,
mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO),
nodemask=(null),cpuset=/,mems_allowed=0
[11561.927472] CPU: 18 PID: 39985 Comm: kworker/u64:10 Not tainted
5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
[11561.927475] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 2802 10/21/2020
[11561.927485] Workqueue: events_unbound commit_work [drm_kms_helper]
[11561.927489] Call Trace:
[11561.927496]  dump_stack+0x8b/0xb0
[11561.927501]  warn_alloc.cold+0x75/0xd9
[11561.927507]  ? _cond_resched+0x16/0x50
[11561.927512]  ? __alloc_pages_direct_compact+0x159/0x180
[11561.927518]  __alloc_pages_slowpath.constprop.0+0x103f/0x1070
[11561.927531]  __alloc_pages_nodemask+0x37d/0x400
[11561.927538]  kmalloc_order+0x33/0xc0
[11561.927542]  kmalloc_order_trace+0x19/0x110
[11561.927614]  dc_create_state+0x26/0x60 [amdgpu]
[11561.927677]  amdgpu_dm_atomic_commit_tail+0x1cee/0x24e0 [amdgpu]
[11561.927686]  ? find_busiest_group+0x33/0x350
[11561.927698]  ? __lock_acquire+0x3b0/0x21f0
[11561.927707]  ? lock_acquire+0xc8/0x400
[11561.927710]  ? wait_for_completion_timeout+0x3b/0xf0
[11561.927715]  ? mark_held_locks+0x50/0x80
[11561.927719]  ? lockdep_hardirqs_on_prepare+0xff/0x180
[11561.927722]  ? _raw_spin_unlock_irq+0x24/0x40
[11561.927726]  ? _raw_spin_unlock_irq+0x24/0x40
[11561.927729]  ? wait_for_completion_timeout+0xdb/0xf0
[11561.927740]  commit_tail+0x94/0x130 [drm_kms_helper]
[11561.927745]  process_one_work+0x27d/0x5b0
[11561.927753]  worker_thread+0x55/0x3c0
[11561.927756]  ? process_one_work+0x5b0/0x5b0
[11561.927760]  kthread+0x13a/0x150
[11561.927763]  ? __kthread_bind_mask+0x60/0x60
[11561.927769]  ret_from_fork+0x22/0x30
[11561.927809] Mem-Info:
[11561.927816] active_anon:933848 inactive_anon:4558268 isolated_anon:118
active_file:154021 inactive_file:80446 isolated_file:0
unevictable:1586 dirty:32469 writeback:700
slab_reclaimable:185330 slab_unreclaimable:176202
mapped:514440 shmem:592199 pagetables:81732 bounce:0
free:99082 free_pcp:2104 free_cma:0
[11561.927820] Node 0 active_anon:3735392kB inactive_anon:18233072kB
active_file:616084kB inactive_file:321784kB unevictable:6344kB
isolated(anon):472kB isolated(file):0kB mapped:2057760kB
dirty:129876kB writeback:2800kB shmem:2368796kB shmem_thp: 0kB
shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:8kB
kernel_stack:96608kB all_unreclaimable? no
[11561.927824] Node 0 DMA free:11800kB min:32kB low:44kB high:56kB
reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB
active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB
present:15992kB managed:15900kB mlocked:0kB pagetables:0kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
[11561.927829] lowmem_reserve[]: 0 3136 31809 31809 31809
[11561.927839] Node 0 DMA32 free:142632kB min:26264kB low:29472kB
high:32680kB reserved_highatomic:0KB active_anon:131568kB
inactive_anon:1625184kB active_file:57556kB inactive_file:13532kB
unevictable:0kB writepending:2428kB present:3317760kB
managed:3317572kB mlocked:0kB pagetables:25624kB bounce:0kB
free_pcp:1764kB local_pcp:0kB free_cma:0kB
[11561.927844] lowmem_reserve[]: 0 0 28673 28673 28673
[11561.927854] Node 0 Normal free:241896kB min:240300kB low:269660kB
high:299020kB reserved_highatomic:2048KB active_anon:3603472kB
inactive_anon:16607812kB active_file:558660kB inactive_file:308056kB
unevictable:6344kB writepending:130596kB present:30133248kB
managed:29370624kB mlocked:6344kB pagetables:301304kB bounce:0kB
free_pcp:6656kB local_pcp:60kB free_cma:0kB
[11561.927859] lowmem_reserve[]: 0 0 0 0 0
[11561.927871] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB
(U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB
(M) = 11800kB
[11561.927900] Node 0 DMA32: 15432*4kB (UME) 4963*8kB (UME) 2169*16kB
(UME) 201*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 142568kB
[11561.927923] Node 0 Normal: 49027*4kB (UMEH) 5656*8kB (MH) 20*16kB
(H) 10*32kB (H) 2*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 242380kB
[11561.927951] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
[11561.927954] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[11561.927956] 847580 total pagecache pages
[11561.927967] 19862 pages in swap cache
[11561.927970
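
An "order:5" failure means the allocator needed 2^5 physically
contiguous pages. The per-zone free lists in the log show whether such
blocks existed at that moment. The sketch below assumes 4 KiB base
pages (as on x86_64) and the free-list line format printed above;
order_to_kb and free_blocks_at_least are hypothetical helper names.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB base pages, as on x86_64

# Matches entries such as "2*128kB (H)" or "0*256kB".
BLOCK = re.compile(r"(\d+)\*(\d+)kB(?:\s+\(([A-Z]+)\))?")


def order_to_kb(order):
    """Size in kB of one contiguous block of the given buddy order."""
    return PAGE_KB << order


def free_blocks_at_least(line, min_kb):
    """Return (count, size_kb, types) for non-empty free lists of at least min_kb."""
    result = []
    for count, size, types in BLOCK.findall(line):
        if int(count) > 0 and int(size) >= min_kb:
            result.append((int(count), int(size), types))
    return result


# Free-list lines copied from the log in this message (re-joined after wrapping).
dma32 = ("Node 0 DMA32: 15432*4kB (UME) 4963*8kB (UME) 2169*16kB "
         "(UME) 201*32kB (UM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB "
         "0*4096kB = 142568kB")
normal = ("Node 0 Normal: 49027*4kB (UMEH) 5656*8kB (MH) 20*16kB "
          "(H) 10*32kB (H) 2*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB "
          "0*2048kB 0*4096kB = 242380kB")

needed = order_to_kb(5)
print(needed, free_blocks_at_least(dma32, needed), free_blocks_at_least(normal, needed))
```

With these inputs the DMA32 zone has no free block of 128 kB or
larger, and the Normal zone has only two, both tagged "H". As far as I
know that tag marks the high-atomic reserve, which an ordinary
GFP_KERNEL request would not use, so the memory is fragmented rather
than exhausted.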

[BUG] general protection fault, probably for non-canonical address 0xfe5d6f0af7831e5e: 0000 [#1] SMP NOPTI (5.7RC4 GIT 79dede78c057)

2020-05-11 Thread Mikhail Gavrilov
Hi folks.
I didn’t do anything unusual, I just restarted the computer after the
update, launched all the applications that I usually launch and went
to drink tea.
When I returned, I found that the monitor was still on (it should have
turned off, since I had set the power-saving timeout to 5 minutes in the DE).
I tried to move the mouse and then realized that the computer was
completely frozen. Even Alt+PrnScr+B did not help to reboot the computer.
I decided to file the bug report here since this is a really serious problem.

general protection fault, probably for non-canonical address
0xfe5d6f0af7831e5e:  [#1] SMP NOPTI
CPU: 16 PID: 6372 Comm: chrome:cs0 Not tainted
5.7.0-0.rc4.20200508git79dede78c057.1.fc33.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 1405 11/19/2019
RIP: 0010:kmem_cache_alloc+0x83/0x310
Code: 02 00 00 4c 8b 45 00 65 49 8b 50 08 65 4c 03 05 5b a3 cc 5e 4d
8b 20 4d 85 e4 0f 84 3e 02 00 00 8b 45 20 48 8b 7d 00 4c 01 e0 <48> 8b
18 48 89 c1 48 33 9d d0 01 00 00 48 0f c9 48 31 cb 40 f6 c7
RSP: 0018:a8398b357b08 EFLAGS: 00010282
RAX: fe5d6f0af7831e5e RBX:  RCX: 
RDX: 62b6 RSI: 0400 RDI: 001f83c0
RBP: 9513740e9200 R08: 95137c3f83c0 R09: 
R10:  R11:  R12: fe5d6f0af7831dee
R13: 0dc0 R14: 9513740e9200 R15: c03a3e92
FS:  7fd77db5c700() GS:95137c20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fea1fe56540 CR3: 00060424a000 CR4: 00340ee0
Call Trace:
 drm_sched_fence_create+0x22/0xc0 [gpu_sched]
 drm_sched_job_init+0x5d/0xa0 [gpu_sched]
 amdgpu_cs_ioctl+0x17d5/0x1eb0 [amdgpu]
 ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
 drm_ioctl_kernel+0x86/0xd0 [drm]
 drm_ioctl+0x206/0x390 [drm]
 ? amdgpu_cs_find_mapping+0xf0/0xf0 [amdgpu]
 amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
 ksys_ioctl+0x82/0xc0
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x5c/0xa0
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7fd7954654bb
Code: 0f 1e fa 48 8b 05 cd b9 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff
ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 9d b9 0c 00 f7 d8 64 89 01 48
RSP: 002b:7fd77db5b628 EFLAGS: 0246 ORIG_RAX: 0010
RAX: ffda RBX: 7fd77db5b690 RCX: 7fd7954654bb
RDX: 7fd77db5b690 RSI: c0186444 RDI: 0016
RBP: c0186444 R08: 7fd77db5b7a0 R09: 7fd77db5b670
R10:  R11: 0246 R12: 3a732f36f000
R13: 0016 R14: 3a732f5122ec R15: 3a732f50a0f8
Modules linked in: snd_seq_dummy snd_hrtimer uinput rfcomm xt_CHECKSUM
xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp
tun nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
ip6table_nat ip6table_mangle ip6table_raw ip6table_security
iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
iptable_mangle iptable_raw iptable_security ip_set nf_tables nfnetlink
ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat
hid_logitech_hidpp xpad ff_memless joydev edac_mce_amd kvm_amd kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel
hid_logitech_dj eeepc_wmi asus_wmi sparse_keymap video snd_usb_audio
btusb btrtl wmi_bmof btbcm snd_usbmidi_lib btintel snd_rawmidi
bluetooth mc ecdh_generic ecc pcspkr sp5100_tco k10temp iwlmvm
i2c_piix4 snd_hda_codec_realtek mac80211 snd_hda_codec_generic
ledtrig_audio
 snd_hda_codec_hdmi libarc4 snd_hda_intel snd_intel_dspcfg
snd_hda_codec iwlwifi snd_hda_core snd_hwdep cfg80211 snd_seq
snd_seq_device snd_pcm rfkill snd_timer snd ccp soundcore acpi_cpufreq
binfmt_misc ip_tables xfs libcrc32c amdgpu amd_iommu_v2 gpu_sched ttm
drm_kms_helper cec drm crc32c_intel igb nvme dca nvme_core
i2c_algo_bit wmi pinctrl_amd br_netfilter bridge stp llc fuse
---[ end trace 4528e591387ed399 ]---
RIP: 0010:kmem_cache_alloc+0x83/0x310
Code: 02 00 00 4c 8b 45 00 65 49 8b 50 08 65 4c 03 05 5b a3 cc 5e 4d
8b 20 4d 85 e4 0f 84 3e 02 00 00 8b 45 20 48 8b 7d 00 4c 01 e0 <48> 8b
18 48 89 c1 48 33 9d d0 01 00 00 48 0f c9 48 31 cb 40 f6 c7
RSP: 0018:a8398b357b08 EFLAGS: 00010282
RAX: fe5d6f0af7831e5e RBX:  RCX: 
RDX: 62b6 RSI: 0400 RDI: 001f83c0
RBP: 9513740e9200 R08: 95137c3f83c0 R09: 
R10:  R11:  R12: fe5d6f0af7831dee
R13: 0dc0 R14: 9513740e9200 R15: c03a3e92
FS:  7fd77db5c700() GS:95137c20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7fea1fe56540 CR3: 00060424a000 CR4: 00340ee0
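For readers wondering why the kernel calls the faulting address "non-canonical": on x86-64 the upper bits of a virtual address must all be copies of the highest implemented address bit. The sketch below checks this for the address from the oops, assuming 4-level paging (48-bit virtual addresses); that width is an assumption, not something stated in the report, and is_canonical is an illustrative name.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of a 64-bit address are all equal."""
    # Keep only bit (va_bits-1) and everything above it.
    top = (addr & 0xFFFFFFFFFFFFFFFF) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    # Canonical means the upper bits are a pure sign extension: all 0 or all 1.
    return top == 0 or top == all_ones

fault_addr = 0xfe5d6f0af7831e5e  # address from the oops header (also the RAX value)
print(is_canonical(fault_addr))  # False
```

Here bit 47 is 0 while the bits above it are 0xfe5d, so the value cannot be a valid pointer under the 48-bit assumption, which is consistent with the general protection fault being raised on the load.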

$ /usr/src/kernels/`uname -r`/scripts/faddr2line
/lib/debug/lib/modules/`uname -r`/vmlinux

Re: BUG: kernel NULL pointer dereference, address: 0000000000000026 after switching to 5.7 kernel

2020-04-18 Thread Mikhail Gavrilov
On Sat, 11 Apr 2020 at 14:56, Christian König
 wrote:
>
> Yeah, that is a known issue.
>
> You could try the attached patch, but please be aware that it is not
> even compile tested because of the Easter holidays here.
>

Looks good to me, so it's a pity that this patch was not included in the
pull request https://patchwork.kernel.org/patch/11492083/

--
Best Regards,
Mike Gavrilov.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


BUG: kernel NULL pointer dereference, address: 0000000000000026 after switching to 5.7 kernel

2020-04-10 Thread Mikhail Gavrilov
Hi folks.
After upgrading the kernel to 5.7, I see the following error messages in
the kernel log on every boot:

[2.569513] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[2.569538] [drm] PSP loading UVD firmware
[2.570038] BUG: kernel NULL pointer dereference, address: 0026
[2.570045] #PF: supervisor read access in kernel mode
[2.570050] #PF: error_code(0x) - not-present page
[2.570055] PGD 0 P4D 0
[2.570060] Oops:  [#1] SMP NOPTI
[2.570065] CPU: 5 PID: 667 Comm: uvd_enc_1.1 Not tainted
5.7.0-0.rc0.git6.1.2.fc33.x86_64 #1
[2.570072] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 1405 11/19/2019
[2.570085] RIP: 0010:__kthread_should_park+0x5/0x30
[2.570090] Code: 00 e9 fe fe ff ff e8 ca 3a 08 00 e9 49 fe ff ff
48 89 df e8 dd 38 08 00 84 c0 0f 84 6a ff ff ff e9 a6 fe ff ff 0f 1f
44 00 00  47 26 20 74 12 48 8b 87 88 09 00 00 48 8b 00 48 c1 e8 02
83 e0
[2.570103] RSP: 0018:ad8141723e50 EFLAGS: 00010246
[2.570107] RAX: 7fff RBX: 8a8d1d116ed8 RCX: 
[2.570112] RDX:  RSI:  RDI: 
[2.570116] RBP: 8a8d28c11300 R08:  R09: 
[2.570120] R10:  R11:  R12: 8a8d1d152e40
[2.570125] R13: 8a8d1d117280 R14: 8a8d1d116ed8 R15: 8a8d1ca68000
[2.570131] FS:  () GS:8a8d3aa0()
knlGS:
[2.570137] CS:  0010 DS:  ES:  CR0: 80050033
[2.570142] CR2: 0026 CR3: 0007e3dc6000 CR4: 003406e0
[2.570147] Call Trace:
[2.570157]  drm_sched_get_cleanup_job+0x42/0x130 [gpu_sched]
[2.570166]  drm_sched_main+0x6f/0x530 [gpu_sched]
[2.570173]  ? lockdep_hardirqs_on+0x11e/0x1b0
[2.570179]  ? drm_sched_get_cleanup_job+0x130/0x130 [gpu_sched]
[2.570185]  kthread+0x131/0x150
[2.570189]  ? __kthread_bind_mask+0x60/0x60
[2.570196]  ret_from_fork+0x27/0x50
[2.570203] Modules linked in: fjes(-) amdgpu(+) amd_iommu_v2
gpu_sched ttm drm_kms_helper drm crc32c_intel igb nvme nvme_core dca
i2c_algo_bit wmi pinctrl_amd br_netfilter bridge stp llc fuse
[2.570223] CR2: 0026
[2.570228] ---[ end trace 80c25d326e1e0d7c ]---
[2.570233] RIP: 0010:__kthread_should_park+0x5/0x30
[2.570238] Code: 00 e9 fe fe ff ff e8 ca 3a 08 00 e9 49 fe ff ff
48 89 df e8 dd 38 08 00 84 c0 0f 84 6a ff ff ff e9 a6 fe ff ff 0f 1f
44 00 00  47 26 20 74 12 48 8b 87 88 09 00 00 48 8b 00 48 c1 e8 02
83 e0
[2.570250] RSP: 0018:ad8141723e50 EFLAGS: 00010246
[2.570255] RAX: 7fff RBX: 8a8d1d116ed8 RCX: 
[2.570260] RDX:  RSI:  RDI: 
[2.570265] RBP: 8a8d28c11300 R08:  R09: 
[2.570271] R10:  R11:  R12: 8a8d1d152e40
[2.570276] R13: 8a8d1d117280 R14: 8a8d1d116ed8 R15: 8a8d1ca68000
[2.570281] FS:  () GS:8a8d3aa0()
knlGS:
[2.570287] CS:  0010 DS:  ES:  CR0: 80050033
[2.570292] CR2: 0026 CR3: 0007e3dc6000 CR4: 003406e0
[2.570299] BUG: sleeping function called from invalid context at
include/linux/percpu-rwsem.h:49
[2.570306] in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid:
667, name: uvd_enc_1.1
[2.570311] INFO: lockdep is turned off.
[2.570315] irq event stamp: 14
[2.570319] hardirqs last  enabled at (13): []
_raw_spin_unlock_irqrestore+0x46/0x60
[2.570330] hardirqs last disabled at (14): []
trace_hardirqs_off_thunk+0x1a/0x1c
[2.570338] softirqs last  enabled at (0): []
copy_process+0x706/0x1bc0
[2.570345] softirqs last disabled at (0): [<>] 0x0
[2.570351] CPU: 5 PID: 667 Comm: uvd_enc_1.1 Tainted: G  D
  5.7.0-0.rc0.git6.1.2.fc33.x86_64 #1
[2.570358] Hardware name: System manufacturer System Product
Name/ROG STRIX X570-I GAMING, BIOS 1405 11/19/2019
[2.570365] Call Trace:
[2.570373]  dump_stack+0x8b/0xc8
[2.570380]  ___might_sleep.cold+0xb6/0xc6
[2.570385]  exit_signals+0x1c/0x2d0
[2.570390]  do_exit+0xb1/0xc30
[2.570395]  ? kthread+0x131/0x150
[2.570400]  rewind_stack_do_exit+0x17/0x20
[2.570559] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[2.570572] [drm] PSP loading VCE firmware
[3.146462] [drm] reserve 0x40 from 0x83fe80 for PSP TMR

$ /usr/src/kernels/`uname -r`/scripts/faddr2line
/lib/debug/lib/modules/`uname -r`/vmlinux __kthread_should_park+0x5
__kthread_should_park+0x5/0x30:
to_kthread at kernel/kthread.c:75
(inlined by) __kthread_should_park at kernel/kthread.c:109
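faddr2line takes its argument in the "symbol+offset" form that appears in the RIP and Call Trace lines. As a small aid for pulling those tokens out of a pasted log, here is a minimal sketch; the Frame type and parse_frame helper are hypothetical names for illustration, and the regular expression only covers the "symbol+0xoff/0xsize [module]" shape seen in this report.

```python
import re
from typing import NamedTuple, Optional

class Frame(NamedTuple):
    symbol: str
    offset: int            # offset of the instruction inside the function
    size: int              # total size of the function
    module: Optional[str]  # None for symbols in the core kernel image

FRAME_RE = re.compile(
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)

def parse_frame(line: str) -> Optional[Frame]:
    """Extract the first symbol+offset/size token from a log line, if any."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return Frame(m["sym"], int(m["off"], 16), int(m["size"], 16), m["mod"])

frame = parse_frame("RIP: 0010:__kthread_should_park+0x5/0x30")
print(f"{frame.symbol}+{frame.offset:#x}")  # __kthread_should_park+0x5
```

The printed string is the form passed to faddr2line in the command above.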

I think this issue is related to the amdgpu driver.
Could anyone look into it?

Thanks.

Full kernel log here:
https://pastebin.com/RrSp6KYL

--
Best Regards,
Mike Gavrilov.

BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 60s!

2020-01-11 Thread Mikhail Gavrilov
Hi folks, I just wanted to share my logs via a paste site but didn't check
how large they were.
I opened the file in Geany, pressed Ctrl + A and Ctrl + C, then switched
to the Chrome tab with pastebin.com open and pressed Ctrl + V. I did
not expect the system's GUI to hang after that.
I connected via ssh and saw the following messages:

[  317.662558] nf_conntrack: default automatic helper assignment has
been turned off for security reasons and CT-based  firewall rule not
found. Use the iptables CT target to attach helpers instead.
[ 2003.644286] GpuWatchdog[4339]: segfault at 0 ip 562357dfa40c sp
7fbc6bdc3500 error 6 in chrome[562353e82000+731f000]
[ 2003.644341] Code: 3d bd 02 47 fb be 01 00 00 00 ba 07 00 00 00 e8
3a 9f a6 fe 48 8d 3d 0f 41 48 fb be 01 00 00 00 ba 03 00 00 00 e8 24
9f a6 fe  04 25 00 00 00 00 37 13 00 00 c6 05 82 a8 bd 03 01 80 7d
87 00
[ 2032.449702] GpuWatchdog[10475]: segfault at 0 ip 55ad62b7240c
sp 7f81bc7ff500 error 6 in chrome[55ad5ebfa000+731f000]
[ 2032.449719] Code: 3d bd 02 47 fb be 01 00 00 00 ba 07 00 00 00 e8
3a 9f a6 fe 48 8d 3d 0f 41 48 fb be 01 00 00 00 ba 03 00 00 00 e8 24
9f a6 fe  04 25 00 00 00 00 37 13 00 00 c6 05 82 a8 bd 03 01 80 7d
87 00
[ 2060.726076] GpuWatchdog[10663]: segfault at 0 ip 558ea234c40c
sp 7f26a3d3e500 error 6 in chrome[558e9e3d4000+731f000]
[ 2060.726093] Code: 3d bd 02 47 fb be 01 00 00 00 ba 07 00 00 00 e8
3a 9f a6 fe 48 8d 3d 0f 41 48 fb be 01 00 00 00 ba 03 00 00 00 e8 24
9f a6 fe  04 25 00 00 00 00 37 13 00 00 c6 05 82 a8 bd 03 01 80 7d
87 00
[ 2253.777053] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0
nice=0 stuck for 60s!
[ 2253.777144] Showing busy workqueues and worker pools:
[ 2253.777149] workqueue events: flags=0x0
[ 2253.777313]   pwq 22: cpus=11 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 2253.777849] in-flight: 10359:key_garbage_collector

[ 2253.777856] ==
[ 2253.777856] WARNING: possible circular locking dependency detected
[ 2253.777857] 5.5.0-0.rc5.git3.2.fc32.x86_64 #1 Not tainted
[ 2253.777857] --
[ 2253.777858] WRRende~ckend#1/6583 is trying to acquire lock:
[ 2253.777858] b866aa40 (console_owner){-.-.}, at:
console_unlock+0x197/0x5c0

[ 2253.777860] but task is already holding lock:
[ 2253.777861] 9a5a3b9ee798 (&(&pool->lock)->rlock){-.-.}, at:
show_workqueue_state.cold+0x7c/0x2d1

[ 2253.777863] which lock already depends on the new lock.


[ 2253.777864] the existing dependency chain (in reverse order) is:

[ 2253.777864] -> #1 (&(&pool->lock)->rlock){-.-.}:
[ 2253.777866]_raw_spin_lock+0x31/0x80
[ 2253.777866]__queue_work+0x36b/0x610
[ 2253.777866]queue_work_on+0x85/0x90
[ 2253.777867]soft_cursor+0x19f/0x220
[ 2253.777867]bit_cursor+0x3d4/0x5f0
[ 2253.777868]hide_cursor+0x2a/0x90
[ 2253.777868]vt_console_print+0x3ef/0x400
[ 2253.777868]console_unlock+0x41a/0x5c0
[ 2253.777869]register_framebuffer+0x28f/0x300
[ 2253.777870]
__drm_fb_helper_initial_config_and_unlock+0x32e/0x4e0 [drm_kms_helper]
[ 2253.777870]amdgpu_fbdev_init+0xbc/0xf0 [amdgpu]
[ 2253.777870]amdgpu_device_init.cold+0x1674/0x1acc [amdgpu]
[ 2253.777871]amdgpu_driver_load_kms+0x53/0x1a0 [amdgpu]
[ 2253.777871]drm_dev_register+0x113/0x150 [drm]
[ 2253.777872]amdgpu_pci_probe+0xec/0x150 [amdgpu]
[ 2253.777872]local_pci_probe+0x42/0x80
[ 2253.777872]pci_device_probe+0x107/0x1a0
[ 2253.777873]really_probe+0x147/0x3c0
[ 2253.777873]driver_probe_device+0xb6/0x100
[ 2253.777874]device_driver_attach+0x53/0x60
[ 2253.777874]__driver_attach+0x8c/0x150
[ 2253.777874]bus_for_each_dev+0x7b/0xc0
[ 2253.777875]bus_add_driver+0x150/0x1f0
[ 2253.777875]driver_register+0x6c/0xc0
[ 2253.777875]do_one_initcall+0x5d/0x2f0
[ 2253.777876]do_init_module+0x5c/0x230
[ 2253.777876]load_module+0x2400/0x2650
[ 2253.777877]__do_sys_init_module+0x181/0x1b0
[ 2253.777877]do_syscall_64+0x5c/0xa0
[ 2253.777877]entry_SYSCALL_64_after_hwframe+0x49/0xbe

[ 2253.777878] -> #0 (console_owner){-.-.}:
[ 2253.777879]__lock_acquire+0xe13/0x1a30
[ 2253.777880]lock_acquire+0xa2/0x1b0
[ 2253.777880]console_unlock+0x1f0/0x5c0
[ 2253.777880]vprintk_emit+0x180/0x350
[ 2253.777881]printk+0x58/0x6f
[ 2253.777881]show_pwq+0x6c/0x298
[ 2253.777882]show_workqueue_state.cold+0x91/0x2d1
[ 2253.777882]wq_watchdog_timer_fn+0x1ba/0x240
[ 2253.777882]call_timer_fn+0xaf/0x2c0
[ 2253.777883]run_timer_softirq+0x3a0/0x5e0
[ 2253.777883]__do_softirq+0xe1/0x45d
[ 2253.777884]irq_exit+0xf7/0x100
[ 2253.777884]smp_apic_timer_interrupt+0xa4/0x230
[ 2253.777884]apic_timer_interrupt+0xf/0x20

[ 2253.777885] other info that might help us de

Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]

2019-09-15 Thread Mikhail Gavrilov
On Mon, 9 Sep 2019 at 14:15, Koenig, Christian  wrote:
>
> I agree with Daniel's analysis.
>
> It looks like the problem is simply that PM turns off a block before all
> work is done on that block.
>
> Have you opened a bug report yet? If not then that would certainly help
> cause it is really hard to extract all necessary information from that
> mail thread.

https://bugs.freedesktop.org/show_bug.cgi?id=111689
Will this do?

--
Best Regards,
Mike Gavrilov.

Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]

2019-09-08 Thread Mikhail Gavrilov
On Thu, 5 Sep 2019 at 12:58, Daniel Vetter  wrote:
>
> I think those fences are only emitted for CS, not display related.
> Adding Christian König.

Here is a fresher kernel log with 5.3RC7; the issue still happens.
https://pastebin.com/tyxkWJYV


--
Best Regards,
Mike Gavrilov.

On Thu, 5 Sep 2019 at 12:58, Daniel Vetter  wrote:
>
> On Thu, Sep 5, 2019 at 12:27 AM Mikhail Gavrilov
>  wrote:
> >
> > On Wed, 4 Sep 2019 at 13:37, Daniel Vetter  wrote:
> > >
> > > Extend your backtrace warning slightly like
> > >
> > > WARN(r, "we're stuck on fence %pS\n", fence->ops);
> > >
> > > Also adding Harry and Alex, I'm not really working on amdgpu ...
> >
> > [ 3511.998320] [ cut here ]
> > [ 3511.998714] we're stuck on fence
> > amdgpu_fence_ops+0x0/0xc220 [amdgpu]
>
> I think those fences are only emitted for CS, not display related.
> Adding Christian König.
> -Daniel
>
> > [ 3511.998991] WARNING: CPU: 10 PID: 1811 at
> > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:332
> > amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu]
> > [ 3511.999009] Modules linked in: rfcomm fuse xt_CHECKSUM
> > xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc
> > nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT
> > nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack
> > ebtable_nat ip6table_nat ip6table_mangle ip6table_raw
> > ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw
> > iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
> > ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables
> > iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd
> > snd_hda_codec_realtek rtwpci snd_hda_codec_generic kvm ledtrig_audio
> > snd_hda_codec_hdmi uvcvideo rtw88 videobuf2_vmalloc snd_hda_intel
> > videobuf2_memops videobuf2_v4l2 irqbypass snd_usb_audio snd_hda_codec
> > videobuf2_common crct10dif_pclmul snd_usbmidi_lib crc32_pclmul
> > mac80211 snd_rawmidi videodev snd_hda_core ghash_clmulni_intel btusb
> > snd_hwdep btrtl snd_seq btbcm btintel snd_seq_device eeepc_wmi
> > bluetooth xpad joydev mc snd_pcm
> > [ 3511.999076]  asus_wmi ff_memless cfg80211 sparse_keymap video
> > wmi_bmof ecdh_generic snd_timer ecc sp5100_tco k10temp snd i2c_piix4
> > ccp rfkill soundcore libarc4 gpio_amdpt gpio_generic acpi_cpufreq
> > binfmt_misc ip_tables hid_logitech_hidpp hid_logitech_dj amdgpu
> > amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel igb dca
> > nvme i2c_algo_bit nvme_core wmi pinctrl_amd
> > [ 3511.999126] CPU: 10 PID: 1811 Comm: Xorg Not tainted
> > 5.3.0-0.rc6.git2.1c.fc32.x86_64 #1
> > [ 3511.999131] Hardware name: System manufacturer System Product
> > Name/ROG STRIX X470-I GAMING, BIOS 2703 08/20/2019
> > [ 3511.999253] RIP: 0010:amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu]
> > [ 3511.999278] Code: fe ff ff 31 c0 c3 48 89 ef e8 36 29 04 cb 84 c0
> > 74 08 48 89 ef e8 8a a9 21 cb 48 8b 75 08 48 c7 c7 2c 16 86 c0 e8 82
> > b8 b9 ca <0f> 0b b8 ea ff ff ff 5d c3 e8 ec 57 c3 ca 84 c0 0f 85 6f ff
> > ff ff
> > [ 3511.999282] RSP: 0018:b9c04170f798 EFLAGS: 00210282
> > [ 3511.999288] RAX:  RBX: 8d2ce5205a80 RCX: 
> > 0006
> > [ 3511.999292] RDX: 0007 RSI: 8d2c5bea4070 RDI: 
> > 8d2cfb5d9e00
> > [ 3511.999296] RBP: 8d28becae480 R08: 0331b36fd503 R09: 
> > 
> > [ 3511.999299] R10:  R11:  R12: 
> > 8d2ce520
> > [ 3511.999303] R13:  R14:  R15: 
> > 8d2ce154
> > [ 3511.999308] FS:  7f59a5bc6f00() GS:8d2cfb40()
> > knlGS:
> > [ 3511.999311] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 3511.999315] CR2: 1108bc475960 CR3: 00075bf32000 CR4: 
> > 003406e0
> > [ 3511.999319] Call Trace:
> > [ 3511.999394]  amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu]
> > [ 3511.999503]  dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
> > [ 3511.999609]  dce12_update_clocks+0xd8/0x110 [amdgpu]
> > [ 3511.999712]  dc_commit_state+0x414/0x590 [amdgpu]
> > [ 3511.999725]  ? find_held_lock+0x32/0x90
> > [ 3511.999832]  amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu]
> > [ 3511.999844]  ? reacquire_held_locks+0xed/0x210
> > [ 3511.999859]  ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm]
> > [ 3511.999866]  ? find_held_lock+0x32/0x90
> > [ 3511.999872]  ? find_held_lock+0x32/0x90
> > [ 3511.

Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]

2019-09-04 Thread Mikhail Gavrilov
On Wed, 4 Sep 2019 at 13:37, Daniel Vetter  wrote:
>
> Extend your backtrace warning slightly like
>
> WARN(r, "we're stuck on fence %pS\n", fence->ops);
>
> Also adding Harry and Alex, I'm not really working on amdgpu ...

[ 3511.998320] [ cut here ]
[ 3511.998714] we're stuck on fence
amdgpu_fence_ops+0x0/0xc220 [amdgpu]
[ 3511.998991] WARNING: CPU: 10 PID: 1811 at
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:332
amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu]
[ 3511.999009] Modules linked in: rfcomm fuse xt_CHECKSUM
xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc
nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT
nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack
ebtable_nat ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw
iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd
snd_hda_codec_realtek rtwpci snd_hda_codec_generic kvm ledtrig_audio
snd_hda_codec_hdmi uvcvideo rtw88 videobuf2_vmalloc snd_hda_intel
videobuf2_memops videobuf2_v4l2 irqbypass snd_usb_audio snd_hda_codec
videobuf2_common crct10dif_pclmul snd_usbmidi_lib crc32_pclmul
mac80211 snd_rawmidi videodev snd_hda_core ghash_clmulni_intel btusb
snd_hwdep btrtl snd_seq btbcm btintel snd_seq_device eeepc_wmi
bluetooth xpad joydev mc snd_pcm
[ 3511.999076]  asus_wmi ff_memless cfg80211 sparse_keymap video
wmi_bmof ecdh_generic snd_timer ecc sp5100_tco k10temp snd i2c_piix4
ccp rfkill soundcore libarc4 gpio_amdpt gpio_generic acpi_cpufreq
binfmt_misc ip_tables hid_logitech_hidpp hid_logitech_dj amdgpu
amd_iommu_v2 gpu_sched ttm drm_kms_helper drm crc32c_intel igb dca
nvme i2c_algo_bit nvme_core wmi pinctrl_amd
[ 3511.999126] CPU: 10 PID: 1811 Comm: Xorg Not tainted
5.3.0-0.rc6.git2.1c.fc32.x86_64 #1
[ 3511.999131] Hardware name: System manufacturer System Product
Name/ROG STRIX X470-I GAMING, BIOS 2703 08/20/2019
[ 3511.999253] RIP: 0010:amdgpu_fence_wait_empty+0x1c6/0x240 [amdgpu]
[ 3511.999278] Code: fe ff ff 31 c0 c3 48 89 ef e8 36 29 04 cb 84 c0
74 08 48 89 ef e8 8a a9 21 cb 48 8b 75 08 48 c7 c7 2c 16 86 c0 e8 82
b8 b9 ca <0f> 0b b8 ea ff ff ff 5d c3 e8 ec 57 c3 ca 84 c0 0f 85 6f ff
ff ff
[ 3511.999282] RSP: 0018:b9c04170f798 EFLAGS: 00210282
[ 3511.999288] RAX:  RBX: 8d2ce5205a80 RCX: 0006
[ 3511.999292] RDX: 0007 RSI: 8d2c5bea4070 RDI: 8d2cfb5d9e00
[ 3511.999296] RBP: 8d28becae480 R08: 0331b36fd503 R09: 
[ 3511.999299] R10:  R11:  R12: 8d2ce520
[ 3511.999303] R13:  R14:  R15: 8d2ce154
[ 3511.999308] FS:  7f59a5bc6f00() GS:8d2cfb40()
knlGS:
[ 3511.999311] CS:  0010 DS:  ES:  CR0: 80050033
[ 3511.999315] CR2: 1108bc475960 CR3: 00075bf32000 CR4: 003406e0
[ 3511.999319] Call Trace:
[ 3511.999394]  amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu]
[ 3511.999503]  dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
[ 3511.999609]  dce12_update_clocks+0xd8/0x110 [amdgpu]
[ 3511.999712]  dc_commit_state+0x414/0x590 [amdgpu]
[ 3511.999725]  ? find_held_lock+0x32/0x90
[ 3511.999832]  amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu]
[ 3511.999844]  ? reacquire_held_locks+0xed/0x210
[ 3511.999859]  ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm]
[ 3511.999866]  ? find_held_lock+0x32/0x90
[ 3511.999872]  ? find_held_lock+0x32/0x90
[ 3511.999881]  ? __lock_acquire+0x247/0x1910
[ 3511.999893]  ? find_held_lock+0x32/0x90
[ 3511.01]  ? mark_held_locks+0x50/0x80
[ 3511.07]  ? _raw_spin_unlock_irq+0x29/0x40
[ 3511.13]  ? lockdep_hardirqs_on+0xf0/0x180
[ 3511.19]  ? _raw_spin_unlock_irq+0x29/0x40
[ 3511.24]  ? wait_for_completion_timeout+0x75/0x190
[ 3511.52]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[ 3511.66]  commit_tail+0x3c/0x70 [drm_kms_helper]
[ 3511.79]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[ 3512.02]  drm_mode_atomic_ioctl+0x793/0x9b0 [drm]
[ 3512.14]  ? __lock_acquire+0x247/0x1910
[ 3512.44]  ? drm_atomic_set_property+0xa50/0xa50 [drm]
[ 3512.66]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 3512.88]  drm_ioctl+0x208/0x390 [drm]
[ 3512.000108]  ? drm_atomic_set_property+0xa50/0xa50 [drm]
[ 3512.000120]  ? lockdep_hardirqs_on+0xf0/0x180
[ 3512.000205]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 3512.000216]  do_vfs_ioctl+0x411/0x750
[ 3512.000229]  ksys_ioctl+0x5e/0x90
[ 3512.000237]  __x64_sys_ioctl+0x16/0x20
[ 3512.000242]  do_syscall_64+0x5c/0xb0
[ 3512.000249]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 3512.000254] RIP: 0033:0x7f59a603d00b
[ 3512.000259] Code: 0f 1e fa 48 8b 05 7d 9e 0c 00 64 c7 00 26 00 00
00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00
00 0f 05 <48> 3d 

Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]

2019-09-03 Thread Mikhail Gavrilov
On Tue, 3 Sep 2019 at 13:21, Hillf Danton  wrote:
>
> Describe the problems you are experiencing please.
> Say is the screen locked up? Machine lockedup?
> Anything abnormal after you see the warning?
>

According to my observations, every "gnome shell stuck warning" appeared
while I was away from the computer and the screen was locked.

I did not notice any problems in the morning (I did not even look at
the kernel logs). I only found out that the problem had happened when I
connected to my computer remotely via ssh from work and happened to look
at the dmesg output.

In the evening after work, I even played "Division" and still
did not notice any problems.

It is now 11:01 pm, and "gnome shell stuck warning" has not appeared since
19:17. So it looks like the issue happens only when the computer is locked
and the monitor is in power-save mode.


$ dmesg -T | grep gnome

---> I went to sleep
[Tue Sep  3 01:00:10 2019] gnome shell stuck warning
[Tue Sep  3 01:00:55 2019] gnome shell stuck warning
[Tue Sep  3 06:54:50 2019] gnome shell stuck warning
<--- I woke up at 8:00 am and sat at the computer again
---> I left for work at 9:30
[Tue Sep  3 10:00:05 2019] gnome shell stuck warning
[Tue Sep  3 10:10:01 2019] gnome shell stuck warning
[Tue Sep  3 10:13:43 2019] gnome shell stuck warning
[Tue Sep  3 10:23:37 2019] gnome shell stuck warning
[Tue Sep  3 10:42:07 2019] gnome shell stuck warning
[Tue Sep  3 10:42:57 2019] gnome shell stuck warning
[Tue Sep  3 10:59:25 2019] gnome shell stuck warning
[Tue Sep  3 11:08:35 2019] gnome shell stuck warning
[Tue Sep  3 11:13:19 2019] gnome shell stuck warning
[Tue Sep  3 11:15:20 2019] gnome shell stuck warning
[Tue Sep  3 11:26:20 2019] gnome shell stuck warning
[Tue Sep  3 11:26:20 2019] gnome shell stuck warning
[Tue Sep  3 11:36:30 2019] gnome shell stuck warning
[Tue Sep  3 11:46:08 2019] gnome shell stuck warning
[Tue Sep  3 11:53:52 2019] gnome shell stuck warning
[Tue Sep  3 11:56:36 2019] gnome shell stuck warning
[Tue Sep  3 12:17:10 2019] gnome shell stuck warning
[Tue Sep  3 12:20:20 2019] gnome shell stuck warning
[Tue Sep  3 12:20:20 2019] gnome shell stuck warning
[Tue Sep  3 12:30:46 2019] gnome shell stuck warning
[Tue Sep  3 12:40:52 2019] gnome shell stuck warning
[Tue Sep  3 12:55:30 2019] gnome shell stuck warning
[Tue Sep  3 12:57:52 2019] gnome shell stuck warning
[Tue Sep  3 13:04:00 2019] gnome shell stuck warning
[Tue Sep  3 13:12:38 2019] gnome shell stuck warning
[Tue Sep  3 13:14:32 2019] gnome shell stuck warning
[Tue Sep  3 13:53:12 2019] gnome shell stuck warning
[Tue Sep  3 14:12:52 2019] gnome shell stuck warning
[Tue Sep  3 14:15:54 2019] gnome shell stuck warning
[Tue Sep  3 14:17:04 2019] gnome shell stuck warning
[Tue Sep  3 14:21:57 2019] gnome shell stuck warning
[Tue Sep  3 14:22:10 2019] gnome shell stuck warning
[Tue Sep  3 14:37:42 2019] gnome shell stuck warning
[Tue Sep  3 14:41:51 2019] gnome shell stuck warning
[Tue Sep  3 14:42:52 2019] gnome shell stuck warning
[Tue Sep  3 14:46:35 2019] gnome shell stuck warning
[Tue Sep  3 15:03:18 2019] gnome shell stuck warning
[Tue Sep  3 15:16:50 2019] gnome shell stuck warning
[Tue Sep  3 15:27:30 2019] gnome shell stuck warning
[Tue Sep  3 15:27:41 2019] gnome shell stuck warning
[Tue Sep  3 16:08:06 2019] gnome shell stuck warning
[Tue Sep  3 16:24:16 2019] gnome shell stuck warning
[Tue Sep  3 16:33:04 2019] gnome shell stuck warning
[Tue Sep  3 16:52:10 2019] gnome shell stuck warning
[Tue Sep  3 17:18:27 2019] gnome shell stuck warning
[Tue Sep  3 17:25:30 2019] gnome shell stuck warning
[Tue Sep  3 17:41:16 2019] gnome shell stuck warning
[Tue Sep  3 17:43:32 2019] gnome shell stuck warning
[Tue Sep  3 17:51:10 2019] gnome shell stuck warning
[Tue Sep  3 18:41:44 2019] gnome shell stuck warning
[Tue Sep  3 18:44:18 2019] gnome shell stuck warning
[Tue Sep  3 19:03:07 2019] gnome shell stuck warning
[Tue Sep  3 19:17:58 2019] gnome shell stuck warning
<--- Returned home and sat at the computer again
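
To make the timing pattern in the listing above easier to see, the warnings can be counted per hour of the day. This is only a rough sketch that assumes the "dmesg -T" line layout shown here ("[Day Mon D HH:MM:SS YYYY] message"); the function name warnings_per_hour is invented for illustration.

```python
from collections import Counter

def warnings_per_hour(lines):
    """Count 'gnome shell stuck warning' lines per hour (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Skip annotations and anything that is not a timestamped warning.
        if not line.startswith("[") or "gnome shell stuck warning" not in line:
            continue
        stamp = line[1:line.index("]")]   # e.g. "Tue Sep  3 01:00:10 2019"
        clock = stamp.split()[3]          # e.g. "01:00:10"
        counts[int(clock.split(":")[0])] += 1
    return counts

sample = [
    "[Tue Sep  3 01:00:10 2019] gnome shell stuck warning",
    "[Tue Sep  3 01:00:55 2019] gnome shell stuck warning",
    "[Tue Sep  3 06:54:50 2019] gnome shell stuck warning",
    "a free-form note between log lines",
]
for hour, n in sorted(warnings_per_hour(sample).items()):
    print(f"{hour:02d}:00  {n}")
```

Fed with the full "dmesg -T | grep gnome" output, this gives one row per hour, so gaps during the hours I was at the computer stand out directly.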

--
Best Regards,
Mike Gavrilov.

Re: gnome-shell stuck because of amdgpu driver [5.3 RC5]

2019-09-02 Thread Mikhail Gavrilov
On Fri, 30 Aug 2019 at 08:30, Hillf Danton  wrote:
>
> Add a warning to show if it makes sense in field: neither regression nor
> problem will have been observed with the warning printed.
>

I caught the problem.

[21793.094289] [ cut here ]
[21793.094296] gnome shell stuck warning
[21793.094391] WARNING: CPU: 14 PID: 1768 at
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:332
amdgpu_fence_wait_empty+0x1c2/0x230 [amdgpu]
[21793.094394] Modules linked in: rfcomm fuse xt_CHECKSUM
xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc
nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT
nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack
ebtable_nat ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw
iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c
ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter cmac bnep sunrpc vfat fat edac_mce_amd kvm_amd
snd_hda_codec_realtek rtwpci rtw88 snd_hda_codec_generic snd_usb_audio
kvm ledtrig_audio snd_hda_codec_hdmi snd_hda_intel mac80211
snd_hda_codec snd_usbmidi_lib irqbypass uvcvideo snd_rawmidi
snd_hda_core videobuf2_vmalloc videobuf2_memops crct10dif_pclmul btusb
videobuf2_v4l2 snd_hwdep crc32_pclmul btrtl videobuf2_common snd_seq
eeepc_wmi btbcm xpad asus_wmi btintel snd_seq_device
ghash_clmulni_intel cfg80211 sparse_keymap
[21793.094426]  ff_memless joydev bluetooth videodev video snd_pcm
wmi_bmof mc ecdh_generic snd_timer ecc snd ccp rfkill libarc4
soundcore sp5100_tco k10temp i2c_piix4 gpio_amdpt gpio_generic
acpi_cpufreq binfmt_misc ip_tables hid_logitech_hidpp hid_logitech_dj
amdgpu amd_iommu_v2 gpu_sched ttm drm_kms_helper igb drm nvme dca
crc32c_intel i2c_algo_bit nvme_core wmi pinctrl_amd
[21793.094449] CPU: 14 PID: 1768 Comm: Xorg Tainted: GW
 5.3.0-0.rc6.git2.1b.fc32.x86_64 #1
[21793.094452] Hardware name: System manufacturer System Product
Name/ROG STRIX X470-I GAMING, BIOS 2406 06/21/2019
[21793.094499] RIP: 0010:amdgpu_fence_wait_empty+0x1c2/0x230 [amdgpu]
[21793.094502] Code: b5 f4 e9 c1 fe ff ff 31 c0 c3 48 89 ef e8 36 69
f8 f4 84 c0 74 08 48 89 ef e8 8a e9 15 f5 48 c7 c7 2c d6 91 c0 e8 86
f8 ad f4 <0f> 0b b8 ea ff ff ff 5d c3 e8 f0 97 b7 f4 84 c0 0f 85 73 ff
ff ff
[21793.094505] RSP: 0018:ae13418c3798 EFLAGS: 00010282
[21793.094508] RAX:  RBX: 8aa065f85a80 RCX: 0006
[21793.094511] RDX: 0007 RSI: 8a9fe32ec070 RDI: 8aa07bdd9e00
[21793.094513] RBP: 8aa069469d00 R08: 13d219a4ead6 R09: 
[21793.094516] R10:  R11:  R12: 8aa065f8
[21793.094518] R13:  R14:  R15: 8aa065fb
[21793.094521] FS:  7f586201cf00() GS:8aa07bc0()
knlGS:
[21793.094524] CS:  0010 DS:  ES:  CR0: 80050033
[21793.094526] CR2: 7f57fc5b5000 CR3: 00076334 CR4: 003406e0
[21793.094528] Call Trace:
[21793.094580]  amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu]
[21793.094655]  dm_pp_apply_display_requirements+0x1a8/0x1c0 [amdgpu]
[21793.094728]  dce12_update_clocks+0xd8/0x110 [amdgpu]
[21793.094799]  dc_commit_state+0x414/0x590 [amdgpu]
[21793.094807]  ? find_held_lock+0x32/0x90
[21793.094880]  amdgpu_dm_atomic_commit_tail+0xd18/0x1cf0 [amdgpu]
[21793.094888]  ? reacquire_held_locks+0xed/0x210
[21793.094898]  ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm]
[21793.094903]  ? find_held_lock+0x32/0x90
[21793.094906]  ? find_held_lock+0x32/0x90
[21793.094912]  ? __lock_acquire+0x247/0x1910
[21793.094920]  ? find_held_lock+0x32/0x90
[21793.094925]  ? mark_held_locks+0x50/0x80
[21793.094929]  ? _raw_spin_unlock_irq+0x29/0x40
[21793.094933]  ? lockdep_hardirqs_on+0xf0/0x180
[21793.094937]  ? _raw_spin_unlock_irq+0x29/0x40
[21793.094941]  ? wait_for_completion_timeout+0x75/0x190
[21793.094954]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[21793.094962]  commit_tail+0x3c/0x70 [drm_kms_helper]
[21793.094971]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[21793.094986]  drm_mode_atomic_ioctl+0x793/0x9b0 [drm]
[21793.094994]  ? __lock_acquire+0x247/0x1910
[21793.095013]  ? drm_atomic_set_property+0xa50/0xa50 [drm]
[21793.095025]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[21793.095039]  drm_ioctl+0x208/0x390 [drm]
[21793.095053]  ? drm_atomic_set_property+0xa50/0xa50 [drm]
[21793.095060]  ? lockdep_hardirqs_on+0xf0/0x180
[21793.095108]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[21793.095114]  do_vfs_ioctl+0x411/0x750
[21793.095121]  ksys_ioctl+0x5e/0x90
[21793.095126]  __x64_sys_ioctl+0x16/0x20
[21793.095130]  do_syscall_64+0x5c/0xb0
[21793.095135]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[21793.095138] RIP: 0033:0x7f586249300b
[21793.095142] Code: 0f 1e fa 48 8b 05 7d 9e 0c 00 64 c7 00 26 00 00
00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4d 9e 0c 00 f7 d8 64 8

gnome-shell stuck because of amdgpu driver [5.3 RC5]

2019-08-25 Thread Mikhail Gavrilov
Hi folks,
I left gnome-shell unlocked at noon, and when I returned in the
evening I discovered that the monitor was not sleeping and was showing
the open GNOME Activities view. At first, I thought that some
application was preventing the system from going to sleep. But when I
tried to move the mouse, I realized that the system had hung. So I
connected via ssh and tried to investigate the problem. I did not see
anything strange in the kernel logs. My last idea before trying to kill
the gnome-shell process was to dump the tasks that are in the
uninterruptible (blocked) state.

After [Alt + PrnScr + W] I saw this:

[32840.701909] sysrq: Show Blocked State
[32840.701976]   taskPC stack   pid father
[32840.702407] gnome-shell D11240  1900   1830 0x
[32840.702438] Call Trace:
[32840.702446]  ? __schedule+0x352/0x900
[32840.702453]  schedule+0x3a/0xb0
[32840.702457]  schedule_timeout+0x289/0x3c0
[32840.702461]  ? find_held_lock+0x32/0x90
[32840.702464]  ? find_held_lock+0x32/0x90
[32840.702469]  ? mark_held_locks+0x50/0x80
[32840.702473]  ? _raw_spin_unlock_irqrestore+0x4b/0x60
[32840.702478]  dma_fence_default_wait+0x1f5/0x340
[32840.702482]  ? dma_fence_free+0x20/0x20
[32840.702487]  dma_fence_wait_timeout+0x182/0x1e0
[32840.702533]  amdgpu_fence_wait_empty+0xe7/0x210 [amdgpu]
[32840.702577]  amdgpu_pm_compute_clocks+0x70/0x5f0 [amdgpu]
[32840.702641]  dm_pp_apply_display_requirements+0x19e/0x1c0 [amdgpu]
[32840.702705]  dce12_update_clocks+0xd8/0x110 [amdgpu]
[32840.702766]  dc_commit_state+0x414/0x590 [amdgpu]
[32840.702834]  amdgpu_dm_atomic_commit_tail+0xd1e/0x1cf0 [amdgpu]
[32840.702840]  ? reacquire_held_locks+0xed/0x210
[32840.702848]  ? ttm_eu_backoff_reservation+0xa5/0x160 [ttm]
[32840.702853]  ? find_held_lock+0x32/0x90
[32840.702855]  ? find_held_lock+0x32/0x90
[32840.702860]  ? __lock_acquire+0x247/0x1910
[32840.702867]  ? find_held_lock+0x32/0x90
[32840.702871]  ? mark_held_locks+0x50/0x80
[32840.702874]  ? _raw_spin_unlock_irq+0x29/0x40
[32840.702877]  ? lockdep_hardirqs_on+0xf0/0x180
[32840.702881]  ? _raw_spin_unlock_irq+0x29/0x40
[32840.702884]  ? wait_for_completion_timeout+0x75/0x190
[32840.702895]  ? commit_tail+0x3c/0x70 [drm_kms_helper]
[32840.702902]  commit_tail+0x3c/0x70 [drm_kms_helper]
[32840.702909]  drm_atomic_helper_commit+0xe3/0x150 [drm_kms_helper]
[32840.702922]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[32840.702936]  set_property_atomic+0xcc/0x140 [drm]
[32840.702955]  drm_mode_obj_set_property_ioctl+0xcb/0x1c0 [drm]
[32840.702968]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[32840.702978]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[32840.702990]  drm_ioctl+0x208/0x390 [drm]
[32840.703003]  ? drm_mode_obj_find_prop_id+0x40/0x40 [drm]
[32840.703007]  ? sched_clock_cpu+0xc/0xc0
[32840.703012]  ? lockdep_hardirqs_on+0xf0/0x180
[32840.703053]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[32840.703058]  do_vfs_ioctl+0x411/0x750
[32840.703065]  ksys_ioctl+0x5e/0x90
[32840.703069]  __x64_sys_ioctl+0x16/0x20
[32840.703072]  do_syscall_64+0x5c/0xb0
[32840.703076]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[32840.703079] RIP: 0033:0x7f8bcab0f00b
[32840.703084] Code: Bad RIP value.
[32840.703086] RSP: 002b:7ffe76c62338 EFLAGS: 0246 ORIG_RAX:
0010
[32840.703089] RAX: ffda RBX: 7ffe76c62370 RCX: 7f8bcab0f00b
[32840.703092] RDX: 7ffe76c62370 RSI: c01864ba RDI: 0009
[32840.703094] RBP: c01864ba R08: 0003 R09: c0c0c0c0
[32840.703096] R10: 56476c86a018 R11: 0246 R12: 56476c8ad940
[32840.703098] R13: 0009 R14: 0002 R15: 0003
[root@localhost ~]#
[root@localhost ~]# ps aux | grep gnome-shell
mikhail 1900  0.3  1.1 6447496 378696 tty2   Dl+  Aug24   2:10
/usr/bin/gnome-shell
mikhail 2099  0.0  0.0 519984 23392 ?Ssl  Aug24   0:00
/usr/libexec/gnome-shell-calendar-server
mikhail12214  0.0  0.0 399484 29660 pts/2Sl+  Aug24   0:00
/usr/bin/python3 /usr/bin/chrome-gnome-shell
chrome-extension://gphhapmejobijbbhgpjhcjognlahblep/
root   22957  0.0  0.0 216120  2456 pts/10   S+   03:59   0:00
grep --color=auto gnome-shell
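
As a cross-check of the trace above: the user-space registers at the end
show ORIG_RAX 0010 (syscall 16, ioctl on x86-64), so RSI should hold the
ioctl request number, c01864ba. The sketch below splits such a number
into its fields. It assumes the generic Linux ioctl encoding used on
x86-64 (8-bit nr, 8-bit type, 14-bit size, 2-bit direction) and that the
value printed in the log is the full 32-bit request; the struct and
function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields packed into a Linux ioctl request number (generic layout). */
struct ioctl_fields {
    unsigned dir;   /* 2 bits: 1 = write, 2 = read, 3 = read|write */
    unsigned size;  /* 14 bits: size of the argument structure */
    unsigned type;  /* 8 bits: driver "magic" character */
    unsigned nr;    /* 8 bits: command number within the driver */
};

/* Split a request number into its four fields. */
struct ioctl_fields decode_ioctl(uint32_t cmd)
{
    struct ioctl_fields f;

    f.nr   = cmd & 0xff;
    f.type = (cmd >> 8) & 0xff;
    f.size = (cmd >> 16) & 0x3fff;
    f.dir  = (cmd >> 30) & 0x3;
    return f;
}

/* Worked example with the RSI value taken from the trace. */
void decode_example(void)
{
    struct ioctl_fields f = decode_ioctl(0xc01864baU);

    assert(f.type == 'd');  /* 0x64 is the DRM ioctl magic */
    printf("dir=%u size=%u type='%c' nr=0x%02x\n",
           f.dir, f.size, (int)f.type, f.nr);
}
```

For c01864ba this gives a read|write request of type 'd' (DRM), command
0xba, with a 24-byte argument, which matches
drm_mode_obj_set_property_ioctl in the call trace.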

After that, I tried to kill the gnome-shell process with signal 9, but
the process would not terminate even after several attempts.
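
The `D` in the STAT column of the ps output above means uninterruptible
sleep. A task in that state does not act on signals, signal 9 included,
until it wakes up, which is consistent with the failed kill attempts.
Below is a minimal sketch for listing such tasks from an ssh session
without SysRq. It assumes a Linux /proc filesystem; the function names
are hypothetical, and only thread-group leaders are checked (per-thread
state lives under /proc/<pid>/task/).

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one /proc/<pid>/stat line: "pid (comm) state ppid ...".
 * comm may itself contain spaces or ')' characters, so the state letter
 * is taken from just after the LAST ')' in the line.
 * Returns the state character, or 0 if the line is malformed.
 */
char parse_stat_line(const char *line, int *pid_out,
                     char *comm_out, size_t comm_len)
{
    assert(line && pid_out && comm_out);

    const char *open = strchr(line, '(');
    const char *close = strrchr(line, ')');

    if (!open || !close || close < open || comm_len == 0)
        return 0;
    if (close[1] != ' ' || close[2] == '\0')
        return 0;

    size_t n = (size_t)(close - open - 1);
    if (n >= comm_len)
        n = comm_len - 1;       /* truncate over-long names */
    memcpy(comm_out, open + 1, n);
    comm_out[n] = '\0';

    *pid_out = atoi(line);      /* leading decimal pid */
    return close[2];
}

/*
 * Walk /proc and print "pid<TAB>comm" for every task in state 'D'.
 * Returns the number of such tasks, or -1 if /proc cannot be opened.
 */
int list_blocked_tasks(FILE *out)
{
    DIR *proc = opendir("/proc");
    struct dirent *de;
    int count = 0;

    if (!proc)
        return -1;

    while ((de = readdir(proc)) != NULL) {
        char path[280], line[512], comm[64];
        int pid;
        FILE *f;

        if (!isdigit((unsigned char)de->d_name[0]))
            continue;           /* not a pid directory */
        snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;           /* task exited in the meantime */
        if (fgets(line, sizeof(line), f) &&
            parse_stat_line(line, &pid, comm, sizeof(comm)) == 'D') {
            fprintf(out, "%d\t%s\n", pid, comm);
            count++;
        }
        fclose(f);
    }
    closedir(proc);
    return count;
}
```

Calling list_blocked_tasks(stdout) from a small main() would print one
line per blocked task. The kernel stack of each task still has to come
from SysRq-W, as in the trace above.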

Only [Alt + PrnScr + B] helped reboot the hung system.
I am writing here because I hope some amdgpu hackers can look at the
trace and understand what is happening.

Sorry, I don’t know how to reproduce this bug. But the problem itself
is very annoying.

Thanks.

GPU: AMD Radeon VII
Kernel: 5.3 RC5


--
Best Regards,
Mike Gavrilov.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

2019-08-10 Thread Mikhail Gavrilov
On Fri, 9 Aug 2019 at 23:55, Mikhail Gavrilov
 wrote:
> Finally, the initial problem "gnome-shell: page allocation failure:
> order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),
> nodemask=(null),cpuset=/,mems_allowed=0" did not happen anymore with
> the latest version of the patch (I tested for more than 23 hours)
>
> But I hit a new problem:
>
> [73808.088801] ------------[ cut here ]------------
> [73808.088806] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)
> [73808.088813] WARNING: CPU: 8 PID: 1348877 at
> kernel/locking/mutex.c:757 __ww_mutex_lock.constprop.0+0xb0f/0x10c0

[pruned]

> So should I report it separately (in another thread), or do we continue here?

Today, after a reboot, the "DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)"
issue happened again.

--
Best Regards,
Mike Gavrilov.

[ 5406.584851] ------------[ cut here ]------------
[ 5406.584855] DEBUG_LOCKS_WARN_ON(ww_ctx->contending_lock)
[ 5406.584862] WARNING: CPU: 2 PID: 4865 at kernel/locking/mutex.c:757 
__ww_mutex_lock.constprop.0+0xb0f/0x10c0
[ 5406.584865] Modules linked in: macvtap macvlan tap rfcomm xt_CHECKSUM 
xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc 
nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 
ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ip6table_nat 
ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat 
iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat 
snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd ledtrig_audio kvm_amd 
snd_hda_codec_hdmi snd_hda_intel kvm rtwpci snd_hda_codec rtw88 irqbypass 
snd_hda_core snd_usb_audio mac80211 snd_usbmidi_lib crct10dif_pclmul uvcvideo 
snd_hwdep snd_rawmidi crc32_pclmul btusb videobuf2_vmalloc videobuf2_memops 
snd_seq videobuf2_v4l2 btrtl btbcm ghash_clmulni_intel snd_seq_device btintel 
videobuf2_common xpad eeepc_wmi joydev ff_memless
[ 5406.584895]  bluetooth cfg80211 snd_pcm asus_wmi videodev snd_timer 
sparse_keymap video wmi_bmof snd ecdh_generic mc ecc soundcore ccp k10temp 
sp5100_tco rfkill libarc4 i2c_piix4 gpio_amdpt gpio_generic acpi_cpufreq 
binfmt_misc ip_tables hid_logitech_hidpp amdgpu crc32c_intel amd_iommu_v2 
gpu_sched ttm drm_kms_helper igb drm nvme dca hid_logitech_dj i2c_algo_bit 
nvme_core wmi pinctrl_amd
[ 5406.584915] CPU: 2 PID: 4865 Comm: firefox:cs0 Not tainted 
5.3.0-0.rc3.git1.2.fc31.x86_64 #1
[ 5406.584917] Hardware name: System manufacturer System Product Name/ROG STRIX 
X470-I GAMING, BIOS 2406 06/21/2019
[ 5406.584920] RIP: 0010:__ww_mutex_lock.constprop.0+0xb0f/0x10c0
[ 5406.584922] Code: 28 00 74 28 e8 42 29 a6 ff 85 c0 74 1f 8b 05 f8 6a e0 00 
85 c0 75 15 48 c7 c6 70 35 32 92 48 c7 c7 f0 67 30 92 e8 e9 84 5c ff <0f> 0b 4d 
89 74 24 28 b8 dd ff ff ff 65 48 8b 14 25 40 8e 01 00 48
[ 5406.584924] RSP: 0018:b738cca4f760 EFLAGS: 00010286
[ 5406.584926] RAX:  RBX: 8e1732e13300 RCX: 
[ 5406.584927] RDX: 0002 RSI: 0001 RDI: 0246
[ 5406.584929] RBP: b738cca4f820 R08:  R09: 
[ 5406.584931] R10: 93d3f740 R11: 93d3f373 R12: b738cca4fb90
[ 5406.584932] R13: b738cca4f7c0 R14: 8e172e0fb258 R15: 8e172e0fb260
[ 5406.584934] FS:  7fc2d5c6b700() GS:8e18ba40() 
knlGS:
[ 5406.584935] CS:  0010 DS:  ES:  CR0: 80050033
[ 5406.584937] CR2: 7ff54bbd CR3: 0005ad12a000 CR4: 003406e0
[ 5406.584938] Call Trace:
[ 5406.584943]  ? _raw_spin_unlock_irq+0x29/0x40
[ 5406.584951]  ? ttm_mem_evict_first+0x1ed/0x4f0 [ttm]
[ 5406.584955]  ? ww_mutex_lock_interruptible+0x43/0xb0
[ 5406.584957]  ww_mutex_lock_interruptible+0x43/0xb0
[ 5406.584961]  ttm_mem_evict_first+0x1ed/0x4f0 [ttm]
[ 5406.584969]  ttm_bo_mem_space+0x229/0x2c0 [ttm]
[ 5406.584974]  ttm_bo_validate+0xe5/0x190 [ttm]
[ 5406.584979]  ? lockdep_hardirqs_on+0xf0/0x180
[ 5406.585033]  amdgpu_cs_bo_validate+0xaa/0x1b0 [amdgpu]
[ 5406.585082]  amdgpu_cs_validate+0x3b/0x260 [amdgpu]
[ 5406.585131]  amdgpu_cs_list_validate+0x110/0x180 [amdgpu]
[ 5406.585184]  amdgpu_cs_ioctl+0x5a9/0x1d10 [amdgpu]
[ 5406.585189]  ? sched_clock+0x5/0x10
[ 5406.585247]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[ 5406.585260]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[ 5406.585271]  drm_ioctl+0x208/0x390 [drm]
[ 5406.585316]  ? amdgpu_cs_find_mapping+0x120/0x120 [amdgpu]
[ 5406.585319]  ? sched_clock_cpu+0xc/0xc0
[ 5406.585322]  ? lockdep_hardirqs_on+0xf0/0x180
[ 5406.585366]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[ 5406.585371]  do_vfs_ioctl+0x411/0x750
[ 5406.585375]  ksys_ioctl+0x5e/0x90
[ 5406.585378]  __x64_sys_ioctl+0x16/0x20
[ 5406.585381]  do_syscall_64+0x5c/0xb0
[ 5406.585385]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 5406.585387] RIP:

Re: The issue with page allocation 5.3 rc1-rc2 (seems drm culprit here)

2019-08-05 Thread Mikhail Gavrilov
On Mon, 5 Aug 2019 at 08:21, Hillf Danton  wrote:
>
>
>
> Try to fix the failure above using vmalloc + kmalloc.
>
> --- a/drivers/gpu/drm/amd/display/dc/core/dc.c
> +++ b/drivers/gpu/drm/amd/display/dc/core/dc.c
> @@ -1174,8 +1174,12 @@ struct dc_state *dc_create_state(struct
> struct dc_state *context = kzalloc(sizeof(struct dc_state),
>GFP_KERNEL);
>
> -   if (!context)
> -   return NULL;
> +   if (!context) {
> +   context = kvzalloc(sizeof(struct dc_state),
> +  GFP_KERNEL);
> +   if (!context)
> +   return NULL;
> +   }
> /* Each context must have their own instance of VBA and in order to
>  * initialize and obtain IP and SOC the base DML instance from DC is
>  * initially copied into every context
> @@ -1195,8 +1199,13 @@ struct dc_state *dc_copy_state(struct dc
> struct dc_state *new_ctx = kmemdup(src_ctx,
> sizeof(struct dc_state), GFP_KERNEL);
>
> -   if (!new_ctx)
> -   return NULL;
> +   if (!new_ctx) {
> +   new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
> +   if (new_ctx)
> +   *new_ctx = *src_ctx;
> +   else
> +   return NULL;
> +   }
>
> for (i = 0; i < MAX_PIPES; i++) {
> struct pipe_ctx *cur_pipe = 
> &new_ctx->res_ctx.pipe_ctx[i];
> @@ -1230,7 +1239,7 @@ static void dc_state_free(struct kref *k
>  {
> struct dc_state *context = container_of(kref, struct dc_state, 
> refcount);
> dc_resource_state_destruct(context);
> -   kfree(context);
> +   kvfree(context);
>  }
>
>  void dc_release_state(struct dc_state *context)
> --

Unfortunately, I couldn't test this patch because, with it applied, the
kernel did not compile.
Here are the compile error messages:

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function
'dc_create_state':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1178:13: error:
implicit declaration of function 'kvzalloc'; did you mean 'kzalloc'?
[-Werror=implicit-function-declaration]
 1178 |   context = kvzalloc(sizeof(struct dc_state),
  | ^~~~
  | kzalloc
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1178:11: warning:
assignment to 'struct dc_state *' from 'int' makes pointer from
integer without a cast [-Wint-conversion]
 1178 |   context = kvzalloc(sizeof(struct dc_state),
  |   ^
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_copy_state':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1203:13: error:
implicit declaration of function 'kvmalloc'; did you mean 'kmalloc'?
[-Werror=implicit-function-declaration]
 1203 |   new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
  | ^~~~
  | kmalloc
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1203:11: warning:
assignment to 'struct dc_state *' from 'int' makes pointer from
integer without a cast [-Wint-conversion]
 1203 |   new_ctx = kvmalloc(sizeof(*new_ctx), GFP_KERNEL);
  |   ^
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c: In function 'dc_state_free':
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.c:1242:2: error:
implicit declaration of function 'kvfree'; did you mean 'kzfree'?
[-Werror=implicit-function-declaration]
 1242 |  kvfree(context);
  |  ^~
  |  kzfree
cc1: some warnings being treated as errors
make[4]: *** [scripts/Makefile.build:274:
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc.o] Error 1
make[4]: *** Waiting for unfinished jobs
make[3]: *** [scripts/Makefile.build:490: drivers/gpu/drm/amd/amdgpu] Error 2
make[3]: *** Waiting for unfinished jobs
make: *** [Makefile:1084: drivers] Error 2


--
Best Regards,
Mike Gavrilov.