Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-24 Thread Christian König

On 23.08.2018 at 22:21, Marek Olšák wrote:

On Thu, Aug 23, 2018 at 2:51 AM Christian König
 wrote:

On 22.08.2018 at 21:32, Marek Olšák wrote:

On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher  wrote:

On Wed, Aug 22, 2018 at 6:05 AM Christian König
 wrote:

Instead of hammering hard on the GPU try a soft recovery first.

v2: reorder code a bit

Signed-off-by: Christian König 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
   3 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 265ff90f4e01..d93e31a5c4e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
  struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
  struct amdgpu_job *job = to_amdgpu_job(s_job);

+   if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+   DRM_ERROR("ring %s timeout, but soft recovered\n",
+ s_job->sched->name);
+   return;
+   }

I think we should still bubble up the error to userspace even if we
can recover.  Data is lost when the wave is killed.  We should treat
it like a GPU reset.

Yes, please increment gpu_reset_counter, so that we are compliant with
OpenGL. Being able to recover from infinite loops is great, but test
suites also expect this to be properly reported to userspace via the
per-context query.

Sure that shouldn't be a problem.
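
A minimal sketch of what that could look like in amdgpu_job_timedout(),
assuming the existing adev->gpu_reset_counter that the full reset path
already increments (illustration only, not part of the patch as posted):

   struct amdgpu_device *adev = ring->adev;

   if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
           /* Count the soft recovery like a reset so the per-context
            * query reports it, but skip the full GPU reset.
            */
           atomic_inc(&adev->gpu_reset_counter);
           DRM_ERROR("ring %s timeout, but soft recovered\n",
                     s_job->sched->name);
           return;
   }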


Also please bump the deadline to 1 second. Even if you kill all
shaders, the IB can also contain CP DMA, which may take longer than 1
ms.

Is there any way we can get feedback from the SQ on whether the kill was
successful?

I don't think so. The kill should be finished pretty quickly, but more
waves with infinite loops may be waiting to be launched, so you still
need to repeat the kill command. And we should ideally repeat it for 1
second.

The reason is that vertex shader waves take a lot of time to launch. A
very very very large draw call can keep launching new waves for 1
second with the same infinite loop. You would have to soft-reset all
VGTs to stop that.


1 second is way too long, since in the case of a blocked MC we need to
start the hard reset relatively fast.

10 seconds have already passed.


That is a good argument, but I still find 1 second way too long.

Keep in mind that we hammer on the SQ here to kill the waves as soon as 
they are launched.
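
A longer retry window would only touch the deadline computation in
amdgpu_ring_soft_recovery(); a sketch of the helper with the window widened
(10 ms is an arbitrary example value between the current 1 ms and the
proposed 1 s):

   bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
                                  struct dma_fence *fence)
   {
           /* Illustration only: keep re-issuing the kill while new waves
            * may still be launching.
            */
           ktime_t deadline = ktime_add_ms(ktime_get(), 10);

           if (!ring->funcs->soft_recovery)
                   return false;

           while (!dma_fence_is_signaled(fence) &&
                  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
                   ring->funcs->soft_recovery(ring, vmid);

           return dma_fence_is_signaled(fence);
   }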



I think that some hangs from corrupted descriptors may still be
recoverable just by killing waves.


Agreed, especially since Vega10 the memory paths have become much more robust
when it comes to error handling.


Christian.



Marek


Regards,
Christian.


Marek

Marek


Alex


+
  DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
ring->fence_drv.sync_seq);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5dfd26be1eec..c045a4e38ad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
amdgpu_ring *ring,
  amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
   }

+/**
+ * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
+ *
+ * @ring: ring to try the recovery on
+ * @vmid: VMID we try to get going again
+ * @fence: timedout fence
+ *
+ * Tries to get a ring proceeding again when it is stuck.
+ */
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence)
+{
+   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
+
+   if (!ring->funcs->soft_recovery)
+   return false;
+
+   while (!dma_fence_is_signaled(fence) &&
+  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
+   ring->funcs->soft_recovery(ring, vmid);
+
+   return dma_fence_is_signaled(fence);
+}
+
   /*
* Debugfs info
*/
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 409fdd9b9710..9cc239968e40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
  /* priority functions */
  void (*set_priority) (struct amdgpu_ring *ring,
enum drm_sched_priority priority);
+   /* Try to soft recover the ring to make the fence signal */
+   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
   };

   struct amdgpu_ring {
@@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
   void 

Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-23 Thread Marek Olšák
On Thu, Aug 23, 2018 at 2:51 AM Christian König
 wrote:
>
> On 22.08.2018 at 21:32, Marek Olšák wrote:
> > On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher  wrote:
> >> On Wed, Aug 22, 2018 at 6:05 AM Christian König
> >>  wrote:
> >>> Instead of hammering hard on the GPU try a soft recovery first.
> >>>
> >>> v2: reorder code a bit
> >>>
> >>> Signed-off-by: Christian König 
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
> >>>   3 files changed, 34 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> index 265ff90f4e01..d93e31a5c4e7 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >>> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
> >>> *s_job)
> >>>  struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> >>>  struct amdgpu_job *job = to_amdgpu_job(s_job);
> >>>
> >>> +   if (amdgpu_ring_soft_recovery(ring, job->vmid, 
> >>> s_job->s_fence->parent)) {
> >>> +   DRM_ERROR("ring %s timeout, but soft recovered\n",
> >>> + s_job->sched->name);
> >>> +   return;
> >>> +   }
> >> I think we should still bubble up the error to userspace even if we
> >> can recover.  Data is lost when the wave is killed.  We should treat
> >> it like a GPU reset.
> > Yes, please increment gpu_reset_counter, so that we are compliant with
> > OpenGL. Being able to recover from infinite loops is great, but test
> > suites also expect this to be properly reported to userspace via the
> > per-context query.
>
> Sure that shouldn't be a problem.
>
> > Also please bump the deadline to 1 second. Even if you kill all
> > shaders, the IB can also contain CP DMA, which may take longer than 1
> > ms.
>
> Is there any way we can get feedback from the SQ on whether the kill was
> successful?

I don't think so. The kill should be finished pretty quickly, but more
waves with infinite loops may be waiting to be launched, so you still
need to repeat the kill command. And we should ideally repeat it for 1
second.

The reason is that vertex shader waves take a lot of time to launch. A
very very very large draw call can keep launching new waves for 1
second with the same infinite loop. You would have to soft-reset all
VGTs to stop that.

>
> 1 second is way too long, since in the case of a blocked MC we need to
> start the hard reset relatively fast.

10 seconds have already passed.
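
The 10 seconds presumably refer to the scheduler job timeout that has to
expire before amdgpu_job_timedout() is called in the first place; it is
tunable via the driver's lockup_timeout module parameter (milliseconds),
for example:

   # /etc/modprobe.d/amdgpu.conf (example value only)
   options amdgpu lockup_timeout=10000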

I think that some hangs from corrupted descriptors may still be
recoverable just by killing waves.

Marek

>
> Regards,
> Christian.
>
> >
> > Marek
> >
> > Marek
> >
> >> Alex
> >>
> >>> +
> >>>  DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> >>>job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
> >>>ring->fence_drv.sync_seq);
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> index 5dfd26be1eec..c045a4e38ad1 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> >>> @@ -383,6 +383,30 @@ void 
> >>> amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> >>>  amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
> >>>   }
> >>>
> >>> +/**
> >>> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> >>> + *
> >>> + * @ring: ring to try the recovery on
> >>> + * @vmid: VMID we try to get going again
> >>> + * @fence: timedout fence
> >>> + *
> >>> + * Tries to get a ring proceeding again when it is stuck.
> >>> + */
> >>> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int 
> >>> vmid,
> >>> +  struct dma_fence *fence)
> >>> +{
> >>> +   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> >>> +
> >>> +   if (!ring->funcs->soft_recovery)
> >>> +   return false;
> >>> +
> >>> +   while (!dma_fence_is_signaled(fence) &&
> >>> +  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> >>> +   ring->funcs->soft_recovery(ring, vmid);
> >>> +
> >>> +   return dma_fence_is_signaled(fence);
> >>> +}
> >>> +
> >>>   /*
> >>>* Debugfs info
> >>>*/
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> index 409fdd9b9710..9cc239968e40 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> >>> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> >>>  /* priority functions */
> >>>  void (*set_priority) (struct amdgpu_ring *ring,
> >>>enum drm_sched_priority priority);
> >>> +   /* Try 

Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-23 Thread Christian König

On 23.08.2018 at 09:17, Huang Rui wrote:

On Wed, Aug 22, 2018 at 12:55:43PM -0400, Alex Deucher wrote:

On Wed, Aug 22, 2018 at 6:05 AM Christian König
 wrote:

Instead of hammering hard on the GPU try a soft recovery first.

v2: reorder code a bit

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
  3 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 265ff90f4e01..d93e31a5c4e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
 struct amdgpu_job *job = to_amdgpu_job(s_job);

+   if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+   DRM_ERROR("ring %s timeout, but soft recovered\n",
+ s_job->sched->name);
+   return;
+   }

I think we should still bubble up the error to userspace even if we
can recover.  Data is lost when the wave is killed.  We should treat
it like a GPU reset.


May I know what the wavefront stands for? Why can we do the "light"
recovery here rather than a reset?


Wavefront means a running shader in the SQ.

Basically this only covers the case when the application sends down a 
shader with an endless loop to the hardware. Here we just kill the 
shader and try to continue.


When you run into a hang because of a corrupted resource descriptor you
usually need a full ASIC reset to get out of that again.
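
The actual wave kill is done by the per-ring soft_recovery callback, which
gets wired up per ASIC elsewhere in this series; a rough sketch of what a
GFX implementation might look like (the register and bit-field names below
are assumptions for illustration, not the exact code from the series):

   static void gfx_ring_soft_recovery(struct amdgpu_ring *ring, unsigned vmid)
   {
           struct amdgpu_device *adev = ring->adev;
           uint32_t value = 0;

           /* Broadcast a kill command to the SQ, restricted to the VMID of
            * the hanging job so unrelated work keeps running.
            */
           value = REG_SET_FIELD(value, SQ_CMD, CMD, 0x03);
           value = REG_SET_FIELD(value, SQ_CMD, MODE, 0x01);
           value = REG_SET_FIELD(value, SQ_CMD, CHECK_VMID, 1);
           value = REG_SET_FIELD(value, SQ_CMD, VM_ID, vmid);
           WREG32(mmSQ_CMD, value);
   }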


Regards,
Christian.



Thanks,
Ray


Alex


+
 DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
    job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
   ring->fence_drv.sync_seq);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5dfd26be1eec..c045a4e38ad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
amdgpu_ring *ring,
 amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
  }

+/**
+ * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
+ *
+ * @ring: ring to try the recovery on
+ * @vmid: VMID we try to get going again
+ * @fence: timedout fence
+ *
+ * Tries to get a ring proceeding again when it is stuck.
+ */
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence)
+{
+   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
+
+   if (!ring->funcs->soft_recovery)
+   return false;
+
+   while (!dma_fence_is_signaled(fence) &&
+  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
+   ring->funcs->soft_recovery(ring, vmid);
+
+   return dma_fence_is_signaled(fence);
+}
+
  /*
   * Debugfs info
   */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 409fdd9b9710..9cc239968e40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
 /* priority functions */
 void (*set_priority) (struct amdgpu_ring *ring,
   enum drm_sched_priority priority);
+   /* Try to soft recover the ring to make the fence signal */
+   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
  };

  struct amdgpu_ring {
@@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
  void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
 uint32_t reg0, uint32_t val0,
 uint32_t reg1, uint32_t val1);
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence);

  static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
  {
--
2.14.1



Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-23 Thread Huang Rui
On Wed, Aug 22, 2018 at 12:55:43PM -0400, Alex Deucher wrote:
> On Wed, Aug 22, 2018 at 6:05 AM Christian König
>  wrote:
> >
> > Instead of hammering hard on the GPU try a soft recovery first.
> >
> > v2: reorder code a bit
> >
> > Signed-off-by: Christian König 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
> >  3 files changed, 34 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 265ff90f4e01..d93e31a5c4e7 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
> > *s_job)
> > struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> > struct amdgpu_job *job = to_amdgpu_job(s_job);
> >
> > +   if (amdgpu_ring_soft_recovery(ring, job->vmid, 
> > s_job->s_fence->parent)) {
> > +   DRM_ERROR("ring %s timeout, but soft recovered\n",
> > + s_job->sched->name);
> > +   return;
> > +   }
> 
> I think we should still bubble up the error to userspace even if we
> can recover.  Data is lost when the wave is killed.  We should treat
> it like a GPU reset.
> 

May I know what the wavefront stands for? Why can we do the "light"
recovery here rather than a reset?

Thanks,
Ray

> Alex
> 
> > +
> > DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> >   job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
> >   ring->fence_drv.sync_seq);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > index 5dfd26be1eec..c045a4e38ad1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
> > amdgpu_ring *ring,
> > amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
> >  }
> >
> > +/**
> > + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> > + *
> > + * @ring: ring to try the recovery on
> > + * @vmid: VMID we try to get going again
> > + * @fence: timedout fence
> > + *
> > + * Tries to get a ring proceeding again when it is stuck.
> > + */
> > +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> > +  struct dma_fence *fence)
> > +{
> > +   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> > +
> > +   if (!ring->funcs->soft_recovery)
> > +   return false;
> > +
> > +   while (!dma_fence_is_signaled(fence) &&
> > +  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> > +   ring->funcs->soft_recovery(ring, vmid);
> > +
> > +   return dma_fence_is_signaled(fence);
> > +}
> > +
> >  /*
> >   * Debugfs info
> >   */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index 409fdd9b9710..9cc239968e40 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> > /* priority functions */
> > void (*set_priority) (struct amdgpu_ring *ring,
> >   enum drm_sched_priority priority);
> > +   /* Try to soft recover the ring to make the fence signal */
> > +   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
> >  };
> >
> >  struct amdgpu_ring {
> > @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
> >  void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> > uint32_t reg0, uint32_t 
> > val0,
> > uint32_t reg1, uint32_t 
> > val1);
> > +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> > +  struct dma_fence *fence);
> >
> >  static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
> >  {
> > --
> > 2.14.1
> >


Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-23 Thread Christian König

On 22.08.2018 at 21:32, Marek Olšák wrote:

On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher  wrote:

On Wed, Aug 22, 2018 at 6:05 AM Christian König
 wrote:

Instead of hammering hard on the GPU try a soft recovery first.

v2: reorder code a bit

Signed-off-by: Christian König 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
  3 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 265ff90f4e01..d93e31a5c4e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
 struct amdgpu_job *job = to_amdgpu_job(s_job);

+   if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+   DRM_ERROR("ring %s timeout, but soft recovered\n",
+ s_job->sched->name);
+   return;
+   }

I think we should still bubble up the error to userspace even if we
can recover.  Data is lost when the wave is killed.  We should treat
it like a GPU reset.

Yes, please increment gpu_reset_counter, so that we are compliant with
OpenGL. Being able to recover from infinite loops is great, but test
suites also expect this to be properly reported to userspace via the
per-context query.


Sure that shouldn't be a problem.
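
For completeness, the per-context query mentioned above is what userspace
(e.g. Mesa) polls to report the context loss to OpenGL; a sketch of how a
client could observe the bumped counter through libdrm, assuming the
existing amdgpu_cs_query_reset_state entry point:

   #include <stdio.h>
   #include <amdgpu.h>
   #include <amdgpu_drm.h>

   /* Report whether the given context has seen a (soft) reset since it
    * was created; state and hangs come straight from the kernel query.
    */
   static void check_context_reset(amdgpu_context_handle ctx)
   {
           uint32_t state = 0, hangs = 0;

           if (amdgpu_cs_query_reset_state(ctx, &state, &hangs) == 0 &&
               state != AMDGPU_CTX_NO_RESET)
                   fprintf(stderr, "context lost: state %u, hangs %u\n",
                           state, hangs);
   }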


Also please bump the deadline to 1 second. Even if you kill all
shaders, the IB can also contain CP DMA, which may take longer than 1
ms.


Is there any way we can get feedback from the SQ on whether the kill was
successful?


1 second is way too long, since in the case of a blocked MC we need to
start the hard reset relatively fast.


Regards,
Christian.



Marek

Marek


Alex


+
 DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
    job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
   ring->fence_drv.sync_seq);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5dfd26be1eec..c045a4e38ad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
amdgpu_ring *ring,
 amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
  }

+/**
+ * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
+ *
+ * @ring: ring to try the recovery on
+ * @vmid: VMID we try to get going again
+ * @fence: timedout fence
+ *
+ * Tries to get a ring proceeding again when it is stuck.
+ */
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence)
+{
+   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
+
+   if (!ring->funcs->soft_recovery)
+   return false;
+
+   while (!dma_fence_is_signaled(fence) &&
+  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
+   ring->funcs->soft_recovery(ring, vmid);
+
+   return dma_fence_is_signaled(fence);
+}
+
  /*
   * Debugfs info
   */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 409fdd9b9710..9cc239968e40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
 /* priority functions */
 void (*set_priority) (struct amdgpu_ring *ring,
   enum drm_sched_priority priority);
+   /* Try to soft recover the ring to make the fence signal */
+   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
  };

  struct amdgpu_ring {
@@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
  void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
 uint32_t reg0, uint32_t val0,
 uint32_t reg1, uint32_t val1);
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence);

  static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
  {
--
2.14.1



Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-22 Thread Marek Olšák
On Wed, Aug 22, 2018 at 12:56 PM Alex Deucher  wrote:
>
> On Wed, Aug 22, 2018 at 6:05 AM Christian König
>  wrote:
> >
> > Instead of hammering hard on the GPU try a soft recovery first.
> >
> > v2: reorder code a bit
> >
> > Signed-off-by: Christian König 
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
> >  3 files changed, 34 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 265ff90f4e01..d93e31a5c4e7 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
> > *s_job)
> > struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> > struct amdgpu_job *job = to_amdgpu_job(s_job);
> >
> > +   if (amdgpu_ring_soft_recovery(ring, job->vmid, 
> > s_job->s_fence->parent)) {
> > +   DRM_ERROR("ring %s timeout, but soft recovered\n",
> > + s_job->sched->name);
> > +   return;
> > +   }
>
> I think we should still bubble up the error to userspace even if we
> can recover.  Data is lost when the wave is killed.  We should treat
> it like a GPU reset.

Yes, please increment gpu_reset_counter, so that we are compliant with
OpenGL. Being able to recover from infinite loops is great, but test
suites also expect this to be properly reported to userspace via the
per-context query.

Also please bump the deadline to 1 second. Even if you kill all
shaders, the IB can also contain CP DMA, which may take longer than 1
ms.

Marek

Marek

>
> Alex
>
> > +
> > DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
> >   job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
> >   ring->fence_drv.sync_seq);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > index 5dfd26be1eec..c045a4e38ad1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> > @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
> > amdgpu_ring *ring,
> > amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
> >  }
> >
> > +/**
> > + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> > + *
> > + * @ring: ring to try the recovery on
> > + * @vmid: VMID we try to get going again
> > + * @fence: timedout fence
> > + *
> > + * Tries to get a ring proceeding again when it is stuck.
> > + */
> > +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> > +  struct dma_fence *fence)
> > +{
> > +   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> > +
> > +   if (!ring->funcs->soft_recovery)
> > +   return false;
> > +
> > +   while (!dma_fence_is_signaled(fence) &&
> > +  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> > +   ring->funcs->soft_recovery(ring, vmid);
> > +
> > +   return dma_fence_is_signaled(fence);
> > +}
> > +
> >  /*
> >   * Debugfs info
> >   */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index 409fdd9b9710..9cc239968e40 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> > /* priority functions */
> > void (*set_priority) (struct amdgpu_ring *ring,
> >   enum drm_sched_priority priority);
> > +   /* Try to soft recover the ring to make the fence signal */
> > +   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
> >  };
> >
> >  struct amdgpu_ring {
> > @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
> >  void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> > uint32_t reg0, uint32_t 
> > val0,
> > uint32_t reg1, uint32_t 
> > val1);
> > +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> > +  struct dma_fence *fence);
> >
> >  static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
> >  {
> > --
> > 2.14.1
> >

Re: [PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-22 Thread Alex Deucher
On Wed, Aug 22, 2018 at 6:05 AM Christian König
 wrote:
>
> Instead of hammering hard on the GPU try a soft recovery first.
>
> v2: reorder code a bit
>
> Signed-off-by: Christian König 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
>  3 files changed, 34 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 265ff90f4e01..d93e31a5c4e7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
> *s_job)
> struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> struct amdgpu_job *job = to_amdgpu_job(s_job);
>
> +   if (amdgpu_ring_soft_recovery(ring, job->vmid, 
> s_job->s_fence->parent)) {
> +   DRM_ERROR("ring %s timeout, but soft recovered\n",
> + s_job->sched->name);
> +   return;
> +   }

I think we should still bubble up the error to userspace even if we
can recover.  Data is lost when the wave is killed.  We should treat
it like a GPU reset.

Alex

> +
> DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>   job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>   ring->fence_drv.sync_seq);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> index 5dfd26be1eec..c045a4e38ad1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
> @@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
> amdgpu_ring *ring,
> amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
>  }
>
> +/**
> + * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
> + *
> + * @ring: ring to try the recovery on
> + * @vmid: VMID we try to get going again
> + * @fence: timedout fence
> + *
> + * Tries to get a ring proceeding again when it is stuck.
> + */
> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> +  struct dma_fence *fence)
> +{
> +   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
> +
> +   if (!ring->funcs->soft_recovery)
> +   return false;
> +
> +   while (!dma_fence_is_signaled(fence) &&
> +  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
> +   ring->funcs->soft_recovery(ring, vmid);
> +
> +   return dma_fence_is_signaled(fence);
> +}
> +
>  /*
>   * Debugfs info
>   */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 409fdd9b9710..9cc239968e40 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
> /* priority functions */
> void (*set_priority) (struct amdgpu_ring *ring,
>   enum drm_sched_priority priority);
> +   /* Try to soft recover the ring to make the fence signal */
> +   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
>  };
>
>  struct amdgpu_ring {
> @@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
>  void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
> uint32_t reg0, uint32_t val0,
> uint32_t reg1, uint32_t val1);
> +bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
> +  struct dma_fence *fence);
>
>  static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
>  {
> --
> 2.14.1
>


[PATCH 2/5] drm/amdgpu: add ring soft recovery v2

2018-08-22 Thread Christian König
Instead of hammering hard on the GPU try a soft recovery first.

v2: reorder code a bit

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c  |  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c | 24 
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  4 
 3 files changed, 34 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 265ff90f4e01..d93e31a5c4e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -33,6 +33,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
struct amdgpu_job *job = to_amdgpu_job(s_job);
 
+   if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
+   DRM_ERROR("ring %s timeout, but soft recovered\n",
+ s_job->sched->name);
+   return;
+   }
+
DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
  ring->fence_drv.sync_seq);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
index 5dfd26be1eec..c045a4e38ad1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c
@@ -383,6 +383,30 @@ void amdgpu_ring_emit_reg_write_reg_wait_helper(struct 
amdgpu_ring *ring,
amdgpu_ring_emit_reg_wait(ring, reg1, mask, mask);
 }
 
+/**
+ * amdgpu_ring_soft_recovery - try to soft recover a ring lockup
+ *
+ * @ring: ring to try the recovery on
+ * @vmid: VMID we try to get going again
+ * @fence: timedout fence
+ *
+ * Tries to get a ring proceeding again when it is stuck.
+ */
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence)
+{
+   ktime_t deadline = ktime_add_us(ktime_get(), 1000);
+
+   if (!ring->funcs->soft_recovery)
+   return false;
+
+   while (!dma_fence_is_signaled(fence) &&
+  ktime_to_ns(ktime_sub(deadline, ktime_get())) > 0)
+   ring->funcs->soft_recovery(ring, vmid);
+
+   return dma_fence_is_signaled(fence);
+}
+
 /*
  * Debugfs info
  */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 409fdd9b9710..9cc239968e40 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -168,6 +168,8 @@ struct amdgpu_ring_funcs {
/* priority functions */
void (*set_priority) (struct amdgpu_ring *ring,
  enum drm_sched_priority priority);
+   /* Try to soft recover the ring to make the fence signal */
+   void (*soft_recovery)(struct amdgpu_ring *ring, unsigned vmid);
 };
 
 struct amdgpu_ring {
@@ -260,6 +262,8 @@ void amdgpu_ring_fini(struct amdgpu_ring *ring);
 void amdgpu_ring_emit_reg_write_reg_wait_helper(struct amdgpu_ring *ring,
uint32_t reg0, uint32_t val0,
uint32_t reg1, uint32_t val1);
+bool amdgpu_ring_soft_recovery(struct amdgpu_ring *ring, unsigned int vmid,
+  struct dma_fence *fence);
 
 static inline void amdgpu_ring_clear_ring(struct amdgpu_ring *ring)
 {
-- 
2.14.1
