[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-19 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

Andrey Grodzovsky  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|dri-devel@lists.freedesktop |andrey.grodzov...@amd.com
                   |.org                        |

--- Comment #27 from Andrey Grodzovsky  ---
Created attachment 140715
  --> https://bugs.freedesktop.org/attachment.cgi?id=140715&action=edit
0001-drm-amdgpu-Fix-S3-resume-failre.patch

Please try the attached patch for the S3 issue. It might still not be the
final fix, but it is worth a try. It is not a fix for your CPU page table update
fault.

-- 
You are receiving this mail because:
You are the assignee for the bug.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-16 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #26 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #25)
> Created attachment 140634 [details]
> dmesg before and after S3 sleep with commit "updating plane ..." reverted

Reverting the patch makes the TTM eviction failure and the subsequent driver
resume failure go away, so that is one issue. Another issue is that you still
experience a fault related to page table updates during S3. I can't reproduce
that issue.

I am currently looking into how this patch broke S3; this is the more urgent
issue, as other people are experiencing it too. Later I will try to give you a
debug printk patch to sort out your page fault issue.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #25 from dwagner  ---
Created attachment 140634
  --> https://bugs.freedesktop.org/attachment.cgi?id=140634&action=edit
dmesg before and after S3 sleep with commit "updating plane ..." reverted



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-14 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #24 from dwagner  ---
> > Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
> > drivers" for me only changes that after S3 resume, the very picture that was
> > visible before S3 sleep is displayed again - but the kernel crash at
> > "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as
> > frozen as the system is dead.
> 
> Can you attach the dmesg from the system with the patch reverted?

Sure, will do.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #23 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #22)
> (In reply to Andrey Grodzovsky from comment #21)
> > I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
> > atomic drivers
> > Not sure yet what's going on there and not sure it will fix your issue with
> > amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observed it here.
> > Still worth a try on your side to revert it and see what happens.
> Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
> drivers" for me only changes that after S3 resume, the very picture that was
> visible before S3 sleep is displayed again - but the kernel crash at
> "amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as
> frozen as the system is dead.

Can you attach the dmesg from the system with the patch reverted?



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #22 from dwagner  ---
(In reply to Andrey Grodzovsky from comment #21)
> I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
> atomic drivers
> Not sure yet what's going on there and not sure it will fix your issue with
> amdgpu_vm_cpu_set_ptes page fault after S3 since I haven't observed it here.
> Still worth a try on your side to revert it and see what happens.

Reverting the commit "drm: Stop updating plane->crtc/fb/old_fb on atomic
drivers" for me only changes that after S3 resume, the very picture that was
visible before S3 sleep is displayed again - but the kernel crash at
"amdgpu_vm_cpu_set_ptes+0x76" still happens, so the "resumed picture" is as
frozen as the system is dead.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-13 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #21 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #20)
> (In reply to Andrey Grodzovsky from comment #19)
> > I was able to reproduce this instantly without even using page tables CPU
> > update mode. Looks like a regression since S3 was working fine for a long
> > time. Were you able to find a regression point for this?
> 
> Not for the exact symptom described in this report, but for an older S3
> resume issue that was partially resolved -
> https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find the
> regression caused by the "drm/amd/display: Match actual state during S3
> resume" commit.
> 
> Unfortunately, the many changes that followed no longer allow bisecting
> that symptom to one specific commit, but given that it still
> occurs if I use the option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I
> think there is still some bug in the order of things done during
> re-initialization upon S3 resume, and setting a fixed EDID seems to
> expose it as a crash.

I found the offending patch - drm: Stop updating plane->crtc/fb/old_fb on
atomic drivers.
I am not sure yet what's going on there, and not sure reverting it will fix your
issue with the amdgpu_vm_cpu_set_ptes page fault after S3, since I haven't
observed it here. It is still worth a try on your side to revert it and see what
happens.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #20 from dwagner  ---
(In reply to Andrey Grodzovsky from comment #19)
> I was able to reproduce this instantly without even using page tables CPU
> update mode. Looks like a regression since S3 was working fine for a long
> time. Were you able to find a regression point for this?

Not for the exact symptom described in this report, but for an older S3 resume
issue that was partially resolved -
https://bugs.freedesktop.org/show_bug.cgi?id=103277 - I did once find the
regression caused by the "drm/amd/display: Match actual state during S3 resume"
commit.

Unfortunately, the many changes that followed no longer allow bisecting that
symptom to one specific commit, but given that it still occurs if I use the
option "drm.edid_firmware=edid/LG_EG9609_edid.bin", I think there is still some
bug in the order of things done during re-initialization upon S3 resume, and
setting a fixed EDID seems to expose it as a crash.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-11 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #19 from Andrey Grodzovsky  ---
(In reply to Andrey Grodzovsky from comment #18)
> (In reply to dwagner from comment #17)
> > Interesting observation: If I first switch from the X11 display to the
> > console display (with Alt-F2), and then enter "echo mem >/sys/power/state"
> > on the console, above described crashes upon S3 resume do not occur, and I
> > do not see the "[TTM] Buffer eviction failed" in the kernel log, neither
> > with vm_update_mode=0, nor with vm_update_mode=3.
> > 
> > Switching back to the X11 display after a successful S3 resume to the
> > console also works fine.
> > 
> > What could be the relevant difference here?
> 
> Well, there is no acceleration involved when in console mode. So maybe this
> has something to do with it.
> 
> Anyway, I am sidetracked a bit by an internal requirement, but once I finish
> I will get back to this issue, especially because I got another report with
> the same failure as you describe.

I was able to reproduce this instantly without even using page tables CPU
update mode. It looks like a regression, since S3 was working fine for a long
time. Were you able to find a regression point for this?



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-09 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #18 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #17)
> Interesting observation: If I first switch from the X11 display to the
> console display (with Alt-F2), and then enter "echo mem >/sys/power/state"
> on the console, above described crashes upon S3 resume do not occur, and I
> do not see the "[TTM] Buffer eviction failed" in the kernel log, neither
> with vm_update_mode=0, nor with vm_update_mode=3.
> 
> Switching back to the X11 display after a successful S3 resume to the
> console also works fine.
> 
> What could be the relevant difference here?

Well, there is no acceleration involved when in console mode. So maybe this has
something to do with it.

Anyway, I am sidetracked a bit by an internal requirement, but once I finish I
will get back to this issue, especially because I got another report with the
same failure as you describe.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-06 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #17 from dwagner  ---
Interesting observation: If I first switch from the X11 display to the console
display (with Alt-F2), and then enter "echo mem >/sys/power/state" on the
console, the crashes described above do not occur upon S3 resume, and I do not
see the "[TTM] Buffer eviction failed" message in the kernel log, neither with
vm_update_mode=0 nor with vm_update_mode=3.

Switching back to the X11 display after a successful S3 resume to the console
also works fine.

What could be the relevant difference here?



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-04 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #16 from dwagner  ---
(In reply to Andrey Grodzovsky from comment #15)
> We have only minor differences but I can't reproduce it. Maybe the resume
> failure is indeed due to the eviction failure during suspend. Is the S3 failure
> happening only when you switch to CPU update mode?

No, when I boot amd-staging-drm-next with amdgpu.vm_update_mode=0
and suspend to S3 then resuming does also crash, but with different
messages - _not_ with
 "BUG: unable to handle kernel paging request at 2000"
like in the vm_update_mode=3 case.

In the journal, I can see the following after a vm_update_mode=0 S3 resume attempt:

Jul 05 00:41:59 ryzen kernel: [TTM] Buffer eviction failed
Jul 05 00:41:59 ryzen kernel: ACPI: Preparing to enter system sleep state S3
...
Jul 05 00:42:00 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR*
amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]]
*ERROR* resume of IP block  failed -22
Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR*
amdgpu_device_ip_resume failed (-22).
Jul 05 00:42:00 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0
returns -22
Jul 05 00:42:00 ryzen kernel: PM: Device :0a:00.0 failed to resume async:
error -22
...
Jul 05 00:42:00 ryzen kernel: amdgpu :0a:00.0: couldn't schedule ib on ring

Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error
scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu :0a:00.0: couldn't schedule ib on ring

Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error
scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu :0a:00.0: couldn't schedule ib on ring

Jul 05 00:42:00 ryzen kernel: [drm:amdgpu_job_run [amdgpu]] *ERROR* Error
scheduling IBs (-22)
Jul 05 00:42:00 ryzen kernel: amdgpu :0a:00.0: couldn't schedule ib on ring

... many more of these... but no kernel BUG or Oops.
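
For readers decoding the log above: the recurring -22 is a negated errno
value, and on Linux errno 22 is EINVAL ("Invalid argument"). A minimal
user-space sketch in C follows; the helper name describe_kernel_ret is made up
for illustration and is not part of the amdgpu driver.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not from the amdgpu driver): turn a kernel-style
 * negative return code such as -22 into the matching errno description.
 * Kernel functions conventionally return -errno on failure.
 */
static const char *describe_kernel_ret(int ret)
{
    if (ret >= 0)
        return "success";
    return strerror(-ret); /* for -22 this is strerror(EINVAL) */
}
```

Calling describe_kernel_ret(-22) returns the C library's text for EINVAL. This
only names the error class; it does not say which argument the resume path
rejected.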



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #15 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #14)
> (In reply to Andrey Grodzovsky from comment #13)
> > What ASIC are you using? I also tested with
> > a gfx8 ASIC and haven't observed any issues with resume. Did you update the
> > firmware for this ASIC to the latest?
> 
> The GPU is an RX460 "POLARIS11 0x1002:0x67EF 0x1682:0x9460 0xCF",
> with the latest firmware from the kernel git, you can see the
> details from https://bugs.freedesktop.org/attachment.cgi?id=140383
> uploaded earlier.

We have only minor differences, but I can't reproduce it. Maybe the resume
failure is indeed due to the eviction failure during suspend. Is the S3 failure
happening only when you switch to CPU update mode?



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-03 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #14 from dwagner  ---
(In reply to Andrey Grodzovsky from comment #13)
> What ASIC are you using? I also tested with
> a gfx8 ASIC and haven't observed any issues with resume. Did you update the
> firmware for this ASIC to the latest?

The GPU is an RX460 "POLARIS11 0x1002:0x67EF 0x1682:0x9460 0xCF",
with the latest firmware from the kernel git, you can see the
details from https://bugs.freedesktop.org/attachment.cgi?id=140383
uploaded earlier.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-02 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #13 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #12)
> (In reply to Andrey Grodzovsky from comment #10)
> > Created attachment 140418 [details] [review] [review]
> > drm/amdgpu: Verify root PD is mapped into kernel address space.
> > 
> > dwagner, please try this patch. Fixes the issue for me and I observed no
> > suspend/resume issues.
> 
> While I can start X11 with this patch applied to current
> amd-staging-drm-next, attempts to resume from S3 fail consistently.
> 
> The following related output is emitted right before the suspend:
> 
> Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ...
> (elapsed 0.000 seconds) done.
> Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend
> to debug)
> Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
> Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
> Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
> Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
> Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...
> 
> (I wonder if that "[TTM] Buffer eviction failed" is a bad sign - as I have
> seen it some other times in conjunction with heavy uses of the amdgpu
> driver.)
> 
> 
> Then, upon resume, the following messages are emitted:
> 
> Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
> Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at
> 0x00F40030).
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 146 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 148 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 145 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 146 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 189 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 306 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 5e ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 18a ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 145 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 146 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 148 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 145 ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    last message was failed ret is 0
> Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay]
>    failed to send message 146 ret is 0
> Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR*
> amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]]
> *ERROR* resume of IP block  failed -22
> Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR*
> amdgpu_device_ip_resume failed (-22).
> Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): 

[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-02 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #12 from dwagner  ---
(In reply to Andrey Grodzovsky from comment #10)
> Created attachment 140418 [details] [review]
> drm/amdgpu: Verify root PD is mapped into kernel address space.
> 
> dwagner, please try this patch. Fixes the issue for me and I observed no
> suspend/resume issues.

While I can start X11 with this patch applied to current amd-staging-drm-next,
attempts to resume from S3 fail consistently.

The following related output is emitted right before the suspend:

Jul 02 21:31:32 ryzen kernel: Freezing remaining freezable tasks ... (elapsed
0.000 seconds) done.
Jul 02 21:31:32 ryzen kernel: Suspending console(s) (use no_console_suspend to
debug)
Jul 02 21:31:32 ryzen kernel: sd 9:0:0:0: [sda] Synchronizing SCSI cache
Jul 02 21:31:32 ryzen kernel: [TTM] Buffer eviction failed
Jul 02 21:31:32 ryzen kernel: ACPI: Preparing to enter system sleep state S3
Jul 02 21:31:32 ryzen kernel: PM: Saving platform NVS memory
Jul 02 21:31:32 ryzen kernel: Disabling non-boot CPUs ...

(I wonder if that "[TTM] Buffer eviction failed" is a bad sign - as I have seen
it some other times in conjunction with heavy uses of the amdgpu driver.)


Then, upon resume, the following messages are emitted:

Jul 02 21:31:33 ryzen kernel: ACPI: Low-level resume complete
Jul 02 21:31:33 ryzen kernel: [drm] PCIE GART of 256M enabled (table at
0x00F40030).
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 148 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 145 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 189 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 306 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 5e ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 18a ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 145 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 148 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 145 ret is 0 
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   last message was failed ret is 0
Jul 02 21:31:33 ryzen kernel: amdgpu: [powerplay] 
   failed to send message 146 ret is 0 
Jul 02 21:31:33 ryzen kernel: [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR*
amdgpu: ring 0 test failed (scratch(0xC040)=0xC>
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]]
*ERROR* resume of IP block  failed -22
Jul 02 21:31:33 ryzen kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR*
amdgpu_device_ip_resume failed (-22).
Jul 02 21:31:33 ryzen kernel: dpm_run_callback(): pci_pm_resume+0x0/0xa0
returns -22
Jul 02 21:31:33 ryzen kernel: PM: Device :0a:00.0 failed to resume async:
error -22
Jul 02 21:31:33 ryzen kernel: OOM killer enabled.
Jul 02 21:31:33 ryzen kernel: Restarting tasks ... done.
Jul 02 

[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-02 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #11 from Christian König  ---
(In reply to Andrey Grodzovsky from comment #10)
> Created attachment 140418 [details] [review]
> drm/amdgpu: Verify root PD is mapped into kernel address space.
> 
> dwagner, please try this patch. Fixes the issue for me and I observed no
> suspend/resume issues.
> 
> Christian, please take a look at the patch. The problem was that in
> amdgpu_vm_update_directories the parent BO didn't have a kernel mapping, so
> later inside amdgpu_vm_cpu_set_ptes
> pe += (unsigned long)amdgpu_bo_kptr(bo); would equal 2000, since
> amdgpu_bo_kptr would return NULL for the parent. The parent was the root PD.
> 
> This was still working in 67b8d5c (Linus Torvalds, 7 weeks ago, Linux
> 4.17-rc5, tag: v4.17-rc5), but I wasn't able to pinpoint exactly which
> change broke it. I am not sure my fix is the right one, so please advise.

No idea when that broke either; CPU-based updates are not something we usually
test.

Anyway it's a good catch, but I would rather add that to
amdgpu_vm_bo_base_init() (with the appropriate checks).

That would also allow us to remove the duplicated code from
amdgpu_vm_alloc_levels().



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-07-01 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #10 from Andrey Grodzovsky  ---
Created attachment 140418
  --> https://bugs.freedesktop.org/attachment.cgi?id=140418&action=edit
drm/amdgpu: Verify root PD is mapped into kernel address space.

dwagner, please try this patch. Fixes the issue for me and I observed no
suspend/resume issues.

Christian, please take a look at the patch. The problem was that in
amdgpu_vm_update_directories the parent BO didn't have a kernel mapping, so
later inside amdgpu_vm_cpu_set_ptes
pe += (unsigned long)amdgpu_bo_kptr(bo); would equal 2000, since
amdgpu_bo_kptr would return NULL for the parent. The parent was the root PD.

This was still working in 67b8d5c (Linus Torvalds, 7 weeks ago, Linux
4.17-rc5, tag: v4.17-rc5), but I wasn't able to pinpoint exactly which change
broke it. I am not sure my fix is the right one, so please advise.
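
To illustrate the failure mode described above, here is a minimal user-space
sketch in C. It is not driver code: struct fake_bo, fake_bo_kptr and both
pte_cpu_address helpers are hypothetical stand-ins, and the 0x2000 offset is
taken from the faulting address in the bug title. It shows why adding an
offset to a NULL mapping yields a small bogus address, and what a guard
against the unmapped case could look like.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer object; not the real struct amdgpu_bo. */
struct fake_bo {
    void *kptr; /* CPU mapping of the BO, or NULL if it was never mapped */
};

/* Plays the role of amdgpu_bo_kptr(): returns the mapping, possibly NULL. */
static void *fake_bo_kptr(const struct fake_bo *bo)
{
    return bo->kptr;
}

/*
 * Unchecked variant, modelled on the statement quoted in the thread:
 * 'pe' starts as a byte offset and the mapping base is added to it.
 * With a NULL mapping (0 on mainstream platforms) the result is just the
 * offset, e.g. 0x2000, which is not a valid kernel address.
 */
static uintptr_t pte_cpu_address(const struct fake_bo *bo, uintptr_t pe)
{
    pe += (uintptr_t)fake_bo_kptr(bo);
    return pe;
}

/* Guarded variant: report an error instead of producing a bogus address. */
static int pte_cpu_address_checked(const struct fake_bo *bo, uintptr_t pe,
                                   uintptr_t *out)
{
    if (fake_bo_kptr(bo) == NULL)
        return -EINVAL; /* BO has no kernel mapping */
    *out = pte_cpu_address(bo, pe);
    return 0;
}
```

With an unmapped BO, pte_cpu_address(&bo, 0x2000) returns 0x2000, while the
checked variant returns -EINVAL. The attached patch takes a different route
and makes sure the root PD is mapped, rather than rejecting the update.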



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-06-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #9 from Andrey Grodzovsky  ---
(In reply to Andrey Grodzovsky from comment #8)
> (In reply to dwagner from comment #7)
> > (In reply to Andrey Grodzovsky from comment #6)
> > > So with Arch Linux kernel it happens only during S3 but with
> > > amd-staging-drm-next it happens once you start X ?
> > 
> > Yes. I know it sounds strange, but it's currently 100% reproducible to me:
> > 
> > Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0:
> >  X11 starts fine, but system crashes after minutes of firefox browsing
> > 
> > Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3:
> >  X11 starts fine, system does not crash (for at least hours of use)
> >  but crashes as above if resumed from S3 sleep
> > 
> > Booting linux compiled from amd-staging-drm-next, as of commit
> > 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with
> > amdgpu.vm_update_mode=0:
> >  X11 starts fine, but system crashes after minutes of firefox browsing
> > 
> > Booting linux compiled from amd-staging-drm-next, as of commit
> > 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with
> > amdgpu.vm_update_mode=3:
> >  X11 does not start, crashes immediately with the same above pasted kernel
> > BUG message and backtrace
> > 
> > 
> > So something with CPU-based vm_update_mode is broken, but in a different way
> > than the SDMA-based method.
> > 
> > I will change the subject of this report to reflect that this crash is not
> > necessarily S3-resume-related.
> 
> I am going to try to reproduce the crash with CPU update mode here. Please
> describe exactly which ASIC you are using.

Got it already.



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-06-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

--- Comment #8 from Andrey Grodzovsky  ---
(In reply to dwagner from comment #7)
> (In reply to Andrey Grodzovsky from comment #6)
> > So with Arch Linux kernel it happens only during S3 but with
> > amd-staging-drm-next it happens once you start X ?
> 
> Yes. I know it sounds strange, but it's currently 100% reproducible to me:
> 
> Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=0:
>  X11 starts fine, but system crashes after minutes of firefox browsing
> 
> Booting linux-4.17.2-ARCH with amdgpu.vm_update_mode=3:
>  X11 starts fine, system does not crash (for at least hours of use)
>  but crashes as above if resumed from S3 sleep
> 
> Booting linux compiled from amd-staging-drm-next, as of commit
> 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with
> amdgpu.vm_update_mode=0:
>  X11 starts fine, but system crashes after minutes of firefox browsing
> 
> Booting linux compiled from amd-staging-drm-next, as of commit
> 527d6e839a0e52b744fd092453544e4f58977334 from yesterday, with
> amdgpu.vm_update_mode=3:
>  X11 does not start, crashes immediately with the same above pasted kernel
> BUG message and backtrace
> 
> 
> So something with CPU-based vm_update_mode is broken, but in a different way
> than the SDMA-based method.
> 
> I will change the subject of this report to reflect that this crash is not
> necessarily S3-resume-related.

I am going to try to reproduce the crash with CPU update mode here. Please
describe exactly which ASIC you are using.
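
For context on the vm_update_mode values compared above: the amdgpu module
parameter is documented as a bitmask (0 = never use the CPU, 1 = graphics
only, 2 = compute only, 3 = both). So 0 leaves page table updates to SDMA and
3 lets the CPU write them for both graphics and compute. A minimal sketch of
decoding that mask follows; the macro and function names are illustrative
stand-ins and are not copied from the driver.

```c
#include <stdbool.h>

/* Illustrative bit names for the two documented flags (assumed layout). */
#define VM_USE_CPU_FOR_GFX     (1u << 0)
#define VM_USE_CPU_FOR_COMPUTE (1u << 1)

/* True if graphics VMs would have their page tables written by the CPU. */
static bool vm_gfx_uses_cpu(unsigned int vm_update_mode)
{
    return (vm_update_mode & VM_USE_CPU_FOR_GFX) != 0;
}

/* True if compute VMs would have their page tables written by the CPU. */
static bool vm_compute_uses_cpu(unsigned int vm_update_mode)
{
    return (vm_update_mode & VM_USE_CPU_FOR_COMPUTE) != 0;
}

/* Human-readable summary of the four documented settings. */
static const char *vm_update_mode_name(unsigned int vm_update_mode)
{
    switch (vm_update_mode & 3u) {
    case 0:
        return "SDMA for graphics and compute";
    case 1:
        return "CPU for graphics only";
    case 2:
        return "CPU for compute only";
    default:
        return "CPU for graphics and compute";
    }
}
```

Under this reading, the vm_update_mode=3 boots in the report are the ones that
go through amdgpu_vm_cpu_set_ptes, which matches where the paging fault is
reported.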



[Bug 107065] "BUG: unable to handle kernel paging request at 0000000000002000" in amdgpu_vm_cpu_set_ptes at amdgpu_vm.c:921

2018-06-29 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=107065

dwagner  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|"BUG: unable to handle      |"BUG: unable to handle
                   |kernel paging request at    |kernel paging request at
                   |2000" at                    |2000" in
                   |amdgpu_vm_cpu_set_ptes at   |amdgpu_vm_cpu_set_ptes at
                   |S3 resume                   |amdgpu_vm.c:921
