On Fri, Feb 17, 2023 at 8:30 PM Alex Deucher <alexdeuc...@gmail.com> wrote:
>
> On Fri, Feb 17, 2023 at 1:10 AM Mikhail Gavrilov
> <mikhail.v.gavri...@gmail.com> wrote:
> >
> > On Fri, Dec 9, 2022 at 7:37 PM Leo Liu <leo....@amd.com> wrote:
> > >
> > > Please try the latest AMDGPU driver:
> > >
> > > https://gitlab.freedesktop.org/agd5f/linux/-/commits/amd-staging-drm-next/
> > >
> >
> > Sorry Leo, I miss your message.
> > This issue is still actual for 6.2-rc8.
> >
> > In my first message I was mistaken.
> >
> > > Before kernel 5.16 this only led to an artifact in the form of
> > > a green bar at the top of the screen, then starting from 5.17
> > > the GPU began to freeze.
> >
> > The real behaviour before 5.18:
> > - vlc could plays video with small artifacts in the form of a green
> > bar on top of the video
> > - after playing video process vlc correctly exiting
> >
> > On 5.18 this behaviour changed:
> > - vlc show black screen instead of playing video
> > - after playing the process not exiting
> > - if I tries kill vlc process with 'kill -9' vlc became zombi process
> > and many other processes start hangs (in kernel log appears follow
> > lines after 2 minutes)
> >
> > INFO: task vlc:sh8:5248 blocked for more than 122 seconds.
> >       Tainted: G        W    L   --------  ---  5.18.0-60.fc37.x86_64+debug #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > task:vlc:sh8         state:D stack:13616 pid: 5248 ppid:  1934 flags:0x00004006
> > Call Trace:
> >  <TASK>
> >  __schedule+0x492/0x1650
> >  ? _raw_spin_unlock_irqrestore+0x40/0x60
> >  ? debug_check_no_obj_freed+0x12d/0x250
> >  schedule+0x4e/0xb0
> >  schedule_timeout+0xe1/0x120
> >  ? lock_release+0x215/0x460
> >  ? trace_hardirqs_on+0x1a/0xf0
> >  ? _raw_spin_unlock_irqrestore+0x40/0x60
> >  dma_fence_default_wait+0x197/0x240
> >  ? __bpf_trace_dma_fence+0x10/0x10
> >  dma_fence_wait_timeout+0x229/0x260
> >  drm_sched_entity_fini+0x101/0x270 [gpu_sched]
> >  amdgpu_vm_fini+0x2b5/0x460 [amdgpu]
> >  ? idr_destroy+0x70/0xb0
> >  ? mutex_destroy+0x1e/0x50
> >  amdgpu_driver_postclose_kms+0x1ec/0x2c0 [amdgpu]
> >  drm_file_free.part.0+0x20d/0x260
> >  drm_release+0x6a/0x120
> >  __fput+0xab/0x270
> >  task_work_run+0x5c/0xa0
> >  do_exit+0x394/0xc40
> >  ? rcu_read_lock_sched_held+0x10/0x70
> >  do_group_exit+0x33/0xb0
> >  get_signal+0xbbc/0xbc0
> >  arch_do_signal_or_restart+0x30/0x770
> >  ? do_futex+0xfd/0x190
> >  ? __x64_sys_futex+0x63/0x190
> >  exit_to_user_mode_prepare+0x172/0x270
> >  syscall_exit_to_user_mode+0x16/0x50
> >  do_syscall_64+0x67/0x80
> >  ? do_syscall_64+0x67/0x80
> >  ? rcu_read_lock_sched_held+0x10/0x70
> >  ? trace_hardirqs_on_prepare+0x5e/0x110
> >  ? do_syscall_64+0x67/0x80
> >  ? rcu_read_lock_sched_held+0x10/0x70
> >  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > RIP: 0033:0x7f82c2364529
> > RSP: 002b:00007f8210ff8c00 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
> > RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f82c2364529
> > RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007f823022542c
> > RBP: 00007f8210ff8c30 R08: 0000000000000000 R09: 00000000ffffffff
> > R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > R13: 0000000000000000 R14: 0000000000000001 R15: 00007f823022542c
> >  </TASK>
> > INFO: lockdep is turned off.
> >
> > I bisected this issue and problematic commit is
> >
> > ❯ git bisect bad
> > 5f3854f1f4e211f494018160b348a1c16e58013f is the first bad commit
> > commit 5f3854f1f4e211f494018160b348a1c16e58013f
> > Author: Alex Deucher <alexander.deuc...@amd.com>
> > Date:   Thu Mar 24 18:04:00 2022 -0400
> >
> >     drm/amdgpu: add more cases to noretry=1
> >
> >     Port current list from amd-staging-drm-next.
> >
> >     Signed-off-by: Alex Deucher <alexander.deuc...@amd.com>
> >
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > Unfortunately I couldn't simply revert this commit on 6.2-rc8 for
> > checking, because it leads to conflicts.
> >
> > Alex, you as author of this commit could help me with it?
>
> append amdgpu.noretry=0 to the kernel command line in grub.

Thanks, I tested "amdgpu.noretry=0": after the page fault occurs,
vlc can still play the video, with minor artifacts.
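For anyone repeating this test, it may help to confirm that the option
really took effect after the reboot. Below is a minimal sketch, not an
official tool: the helper names are made up for illustration, and it
assumes that the last "amdgpu.noretry=" token on the command line wins
and that the driver exposes the parameter at
/sys/module/amdgpu/parameters/noretry.

```python
from pathlib import Path
from typing import Optional


def noretry_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the value of amdgpu.noretry from a kernel command line.

    Assumption: if the option is given several times, the last one wins.
    Returns None when the option is absent.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == "amdgpu.noretry":
            value = val
    return value


def read_text_or_none(path: str) -> Optional[str]:
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("cmdline value:", noretry_from_cmdline(cmdline))
    # Assumed sysfs location of the module parameter; prints None if the
    # amdgpu module is not loaded.
    print("sysfs value:",
          read_text_or_none("/sys/module/amdgpu/parameters/noretry"))
```

If both values are "0", the run was really made with retries enabled.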

So I have some questions:

1. Why were retries disabled by default if they are still needed for
recoverable page faults? As Christian answered me before here:
https://lore.kernel.org/all/f253ff1f-3c5c-c785-1272-e4fe69a36...@amd.com/T/#m73a0a6eb7b2531eacf24fd498e8d2eec675f05a6

Page faults (not to be confused with a kernel panic) are a perfectly
normal phenomenon for buggy userspace. And if they are "normal", I would
prefer that they not affect system reliability. But as we can see,
they lead to zombie processes followed by hangs.

2. If recoverable page faults are not an option, is it possible to
fix this issue some other way, or not?

P.S. I also see page faults in other scenarios (for example, when
playing "Division 2" or "The Callisto Protocol"; I have attached my
kernel log to show this), but they do not lead to zombie processes.

-- 
Best Regards,
Mike Gavrilov.

Attachment: dmesg.tar.xz
Description: application/xz
