[PATCH v5 10/16] drm/panfrost: Make sure job interrupts are masked before resetting

2021-06-29 Thread Boris Brezillon
This is not strictly needed yet because we let active jobs be killed
during the reset and we don't really bother making sure they can be
restarted. But once we start adding soft-stop support, controlling when
we deal with the remaining interrupts and making sure those are handled
before the reset is issued gets tricky if we keep job interrupts active.

Let's prepare for that and mask+flush job IRQs before issuing a reset.

v4:
* Add a comment explaining why we WARN_ON(!job) in the irq handler
* Keep taking the job_lock when evicting stalled jobs

v3:
* New patch

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 98193a557a2d..4bd4d11377b7 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -34,6 +34,7 @@ struct panfrost_queue_state {
 struct panfrost_job_slot {
struct panfrost_queue_state queue[NUM_JOB_SLOTS];
spinlock_t job_lock;
+   int irq;
 };
 
 static struct panfrost_job *
@@ -400,6 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
if (bad)
drm_sched_increase_karma(bad);
 
+   /* Mask job interrupts and synchronize to make sure we won't be
+* interrupted during our reset.
+*/
+   job_write(pfdev, JOB_INT_MASK, 0);
+   synchronize_irq(pfdev->js->irq);
+
+   /* Schedulers are stopped and interrupts are masked+flushed, we don't
+* need to protect the 'evict unfinished jobs' loop with the job_lock.
+*/
spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
@@ -502,7 +512,14 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
struct panfrost_job *job;
 
job = pfdev->jobs[j];
-   /* Only NULL if job timeout occurred */
+   /* The only reason this job could be NULL is if the
+* job IRQ handler is called just after the
+* in-flight job eviction in the reset path, and
+* this shouldn't happen because the job IRQ has
+* been masked and synchronized when this eviction
+* happens.
+*/
+   WARN_ON(!job);
if (job) {
pfdev->jobs[j] = NULL;
 
@@ -562,7 +579,7 @@ static void panfrost_reset_work(struct work_struct *work)
 int panfrost_job_init(struct panfrost_device *pfdev)
 {
struct panfrost_job_slot *js;
-   int ret, j, irq;
+   int ret, j;
 
INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
 
@@ -572,11 +589,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
spin_lock_init(&js->job_lock);
 
-   irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
-   if (irq <= 0)
+   js->irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
+   if (js->irq <= 0)
return -ENODEV;
 
-   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   ret = devm_request_threaded_irq(pfdev->dev, js->irq,
panfrost_job_irq_handler,
panfrost_job_irq_handler_thread,
IRQF_SHARED, KBUILD_MODNAME "-job",
-- 
2.31.1



[PATCH v5 16/16] drm/panfrost: Increase the AS_ACTIVE polling timeout

2021-06-29 Thread Boris Brezillon
Experience has shown that 1ms is sometimes not enough, even when the GPU
is running at its maximum frequency, not to mention that an MMU operation
might take longer if the GPU is running at a lower frequency, which is
likely to be the case if devfreq is active.

Let's pick a significantly bigger timeout value (1ms -> 100ms) to be on
the safe side.

v5:
* New patch

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 5267c3a1f02f..a32a3df5358e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -34,7 +34,7 @@ static int wait_ready(struct panfrost_device *pfdev, u32 as_nr)
/* Wait for the MMU status to indicate there is no active command, in
 * case one is pending. */
ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
-   val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
+   val, !(val & AS_STATUS_AS_ACTIVE), 10, 100000);
 
if (ret) {
/* The GPU hung, let's trigger a reset */
-- 
2.31.1



[PATCH v5 04/16] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition

2021-06-29 Thread Boris Brezillon
Exception types will be defined as an enum.

v4:
* Fix typo in the commit message

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
index eddaa62ad8b0..151cfebd80a0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -261,9 +261,6 @@
 #define JS_COMMAND_SOFT_STOP_1 0x06 /* Execute SOFT_STOP if JOB_CHAIN_FLAG is 1 */
 #define JS_COMMAND_HARD_STOP_1 0x07 /* Execute HARD_STOP if JOB_CHAIN_FLAG is 1 */
 
-#define JS_STATUS_EVENT_ACTIVE 0x08
-
-
 /* MMU regs */
 #define MMU_INT_RAWSTAT0x2000
 #define MMU_INT_CLEAR  0x2004
-- 
2.31.1



[PATCH v5 13/16] drm/panfrost: Don't reset the GPU on job faults unless we really have to

2021-06-29 Thread Boris Brezillon
If we can recover from a fault without a reset there's no reason to
issue one.

v3:
* Drop the mention of Valhall requiring a reset on JOB_BUS_FAULT
* Set the fence error to -EINVAL instead of having per-exception
  error codes

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.c |  9 +++++++++
 drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_job.c    | 16 ++++++++++++++--
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index 736854542b05..f4e42009526d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -379,6 +379,15 @@ const char *panfrost_exception_name(u32 exception_code)
return panfrost_exception_infos[exception_code].name;
 }
 
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code)
+{
+   /* Right now, none of the GPUs we support need a reset, but this
+* might change.
+*/
+   return false;
+}
+
 void panfrost_device_reset(struct panfrost_device *pfdev)
 {
panfrost_gpu_soft_reset(pfdev);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 2dc8c0d1d987..d91f71366214 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -244,6 +244,8 @@ enum drm_panfrost_exception_type {
 };
 
 const char *panfrost_exception_name(u32 exception_code);
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code);
 
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 4bd4d11377b7..b0f4857ca084 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -498,14 +498,26 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
job_write(pfdev, JOB_INT_CLEAR, mask);
 
if (status & JOB_INT_MASK_ERR(j)) {
+   u32 js_status = job_read(pfdev, JS_STATUS(j));
+
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
+   panfrost_exception_name(js_status),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
-   drm_sched_fault(&pfdev->js->queue[j].sched);
+
+   /* If we need a reset, signal it to the timeout
+* handler, otherwise, update the fence error field and
+* signal the job fence.
+*/
+   if (panfrost_exception_needs_reset(pfdev, js_status)) {
+   drm_sched_fault(&pfdev->js->queue[j].sched);
+   } else {
+   dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
+   status |= JOB_INT_MASK_DONE(j);
+   }
}
 
if (status & JOB_INT_MASK_DONE(j)) {
-- 
2.31.1



[PATCH v5 15/16] drm/panfrost: Queue jobs on the hardware

2021-06-29 Thread Boris Brezillon
From: Steven Price 

The hardware has a set of '_NEXT' registers that can hold a second job
while the first is executing. Make use of these registers to enqueue a
second job per slot.

v5:
* Fix a comment in panfrost_job_init()

v3:
* Fix the done/err job dequeuing logic to get a valid active state
* Only enable the second slot on GPUs supporting jobchain disambiguation
* Split interrupt handling in sub-functions

Signed-off-by: Steven Price 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    | 468 ++++++++++++++++++++++++--------
 2 files changed, 351 insertions(+), 119 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index d2ee6e5fe5d8..9f1f2a603208 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -101,7 +101,7 @@ struct panfrost_device {
 
struct panfrost_job_slot *js;
 
-   struct panfrost_job *jobs[NUM_JOB_SLOTS];
+   struct panfrost_job *jobs[NUM_JOB_SLOTS][2];
struct list_head scheduled_jobs;
 
struct panfrost_perfcnt *perfcnt;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index d8e1bc227455..e89e6adef481 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -140,9 +141,52 @@ static void panfrost_job_write_affinity(struct panfrost_device *pfdev,
job_write(pfdev, JS_AFFINITY_NEXT_HI(js), affinity >> 32);
 }
 
+static u32
+panfrost_get_job_chain_flag(const struct panfrost_job *job)
+{
+   struct panfrost_fence *f = to_panfrost_fence(job->done_fence);
+
+   if (!panfrost_has_hw_feature(job->pfdev, HW_FEATURE_JOBCHAIN_DISAMBIGUATION))
+   return 0;
+
+   return (f->seqno & 1) ? JS_CONFIG_JOB_CHAIN_FLAG : 0;
+}
+
+static struct panfrost_job *
+panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
+{
+   struct panfrost_job *job = pfdev->jobs[slot][0];
+
+   WARN_ON(!job);
+   pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
+   pfdev->jobs[slot][1] = NULL;
+
+   return job;
+}
+
+static unsigned int
+panfrost_enqueue_job(struct panfrost_device *pfdev, int slot,
+struct panfrost_job *job)
+{
+   if (WARN_ON(!job))
+   return 0;
+
+   if (!pfdev->jobs[slot][0]) {
+   pfdev->jobs[slot][0] = job;
+   return 0;
+   }
+
+   WARN_ON(pfdev->jobs[slot][1]);
+   pfdev->jobs[slot][1] = job;
+   WARN_ON(panfrost_get_job_chain_flag(job) ==
+   panfrost_get_job_chain_flag(pfdev->jobs[slot][0]));
+   return 1;
+}
+
 static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 {
struct panfrost_device *pfdev = job->pfdev;
+   unsigned int subslot;
u32 cfg;
u64 jc_head = job->jc;
int ret;
@@ -168,7 +212,8 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 * start */
cfg |= JS_CONFIG_THREAD_PRI(8) |
JS_CONFIG_START_FLUSH_CLEAN_INVALIDATE |
-   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE;
+   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE |
+   panfrost_get_job_chain_flag(job);
 
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_FLUSH_REDUCTION))
cfg |= JS_CONFIG_ENABLE_FLUSH_REDUCTION;
@@ -182,10 +227,17 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
job_write(pfdev, JS_FLUSH_ID_NEXT(js), job->flush_id);
 
/* GO ! */
-   dev_dbg(pfdev->dev, "JS: Submitting atom %p to js[%d] with head=0x%llx",
-   job, js, jc_head);
 
-   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   spin_lock(&pfdev->js->job_lock);
+   subslot = panfrost_enqueue_job(pfdev, js, job);
+   /* Don't queue the job if a reset is in progress */
+   if (!atomic_read(&pfdev->reset.pending)) {
+   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   dev_dbg(pfdev->dev,
+   "JS: Submitting atom %p to js[%d][%d] with head=0x%llx AS %d",
+   job, js, subslot, jc_head, cfg & 0xf);
+   }
+   spin_unlock(&pfdev->js->job_lock);
 }
 
 static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
@@ -343,7 +395,11 @@ static struct dma_fence *panfrost_job_run(struct drm_sched_job *sched_job)
if (unlikely(job->base.s_fence->finished.error))
return NULL;
 
-   pfdev->jobs[slot] = job;
+   /* Nothing to execute: can happen if the job has finished while
+* we were resetting the GPU.
+*/
+ 

[PATCH v5 14/16] drm/panfrost: Kill in-flight jobs on FD close

2021-06-29 Thread Boris Brezillon
If the process that submitted these jobs decided to close the FD before
the jobs are done, it probably means it doesn't care about the result.

v5:
* Add a panfrost_exception_is_fault() helper and the
  DRM_PANFROST_EXCEPTION_MAX_NON_FAULT value

v4:
* Don't disable/restore irqs when taking the job_lock (not needed since
  this lock is never taken from an interrupt context)

v3:
* Set fence error to ECANCELED when a TERMINATED exception is received

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |  7 +++++++
 drivers/gpu/drm/panfrost/panfrost_job.c    | 42 ++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index d91f71366214..d2ee6e5fe5d8 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -183,6 +183,7 @@ enum drm_panfrost_exception_type {
DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
+   DRM_PANFROST_EXCEPTION_MAX_NON_FAULT = 0x3f,
DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
@@ -243,6 +244,12 @@ enum drm_panfrost_exception_type {
DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
 };
 
+static inline bool
+panfrost_exception_is_fault(u32 exception_code)
+{
+   return exception_code > DRM_PANFROST_EXCEPTION_MAX_NON_FAULT;
+}
+
 const char *panfrost_exception_name(u32 exception_code);
 bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
u32 exception_code);
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index b0f4857ca084..d8e1bc227455 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -499,14 +499,21 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 
if (status & JOB_INT_MASK_ERR(j)) {
u32 js_status = job_read(pfdev, JS_STATUS(j));
+   const char *exception_name = panfrost_exception_name(js_status);
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
-   dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
-   j,
-   panfrost_exception_name(js_status),
-   job_read(pfdev, JS_HEAD_LO(j)),
-   job_read(pfdev, JS_TAIL_LO(j)));
+   if (!panfrost_exception_is_fault(js_status)) {
+   dev_dbg(pfdev->dev, "js interrupt, js=%d, status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   } else {
+   dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   }
 
/* If we need a reset, signal it to the timeout
 * handler, otherwise, update the fence error field and
@@ -515,7 +522,16 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
if (panfrost_exception_needs_reset(pfdev, js_status)) {
drm_sched_fault(&pfdev->js->queue[j].sched);
} else {
-   dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
+   int error = 0;
+
+   if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED)
+   error = -ECANCELED;
+   else if (panfrost_exception_is_fault(js_status))
+   error = -EINVAL;
+
+   if (error)
+   dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
+
status |= JOB_INT_MASK_DONE(j);
}
}
@@ -681,10 +697,24 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+   struct panfrost_device *pfdev = panfrost_priv->pfdev;
int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
drm_sch

[PATCH v5 11/16] drm/panfrost: Disable the AS on unhandled page faults

2021-06-29 Thread Boris Brezillon
If we don't do that, we have to wait for the job timeout to expire
before the faulty jobs get killed.

v3:
* Make sure the AS is re-enabled when new jobs are submitted to the
  context

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c    | 34 +++++++++++++++++++++++++++++++---
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 59a487e8aba3..2dc8c0d1d987 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -96,6 +96,7 @@ struct panfrost_device {
spinlock_t as_lock;
unsigned long as_in_use_mask;
unsigned long as_alloc_mask;
+   unsigned long as_faulty_mask;
struct list_head as_lru_list;
 
struct panfrost_job_slot *js;
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index b4f0c673cd7f..65e98c51cb66 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -154,6 +154,7 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, struct panfrost_mmu *mmu)
as = mmu->as;
if (as >= 0) {
int en = atomic_inc_return(&mmu->as_count);
+   u32 mask = BIT(as) | BIT(16 + as);
 
/*
 * AS can be retained by active jobs or a perfcnt context,
@@ -162,6 +163,18 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, struct panfrost_mmu *mmu)
WARN_ON(en >= (NUM_JOB_SLOTS + 1));
 
list_move(&mmu->list, &pfdev->as_lru_list);
+
+   if (pfdev->as_faulty_mask & mask) {
+   /* Unhandled pagefault on this AS, the MMU was
+* disabled. We need to re-enable the MMU after
+* clearing+unmasking the AS interrupts.
+*/
+   mmu_write(pfdev, MMU_INT_CLEAR, mask);
+   mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+   pfdev->as_faulty_mask &= ~mask;
+   panfrost_mmu_enable(pfdev, mmu);
+   }
+
goto out;
}
 
@@ -211,6 +224,7 @@ void panfrost_mmu_reset(struct panfrost_device *pfdev)
spin_lock(&pfdev->as_lock);
 
pfdev->as_alloc_mask = 0;
+   pfdev->as_faulty_mask = 0;
 
list_for_each_entry_safe(mmu, mmu_tmp, &pfdev->as_lru_list, list) {
mmu->as = -1;
@@ -662,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 0xC0)
ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
 
-   if (ret)
+   if (ret) {
/* terminal fault, print info about the fault */
dev_err(pfdev->dev,
"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -680,14 +694,28 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
access_type, access_type_name(pfdev, fault_status),
source_id);
 
+   spin_lock(&pfdev->as_lock);
+   /* Ignore MMU interrupts on this AS until it's been
+* re-enabled.
+*/
+   pfdev->as_faulty_mask |= mask;
+
+   /* Disable the MMU to kill jobs on this AS. */
+   panfrost_mmu_disable(pfdev, as);
+   spin_unlock(&pfdev->as_lock);
+   }
+
status &= ~mask;
 
/* If we received new MMU interrupts, process them before returning. */
if (!status)
-   status = mmu_read(pfdev, MMU_INT_RAWSTAT);
+   status = mmu_read(pfdev, MMU_INT_RAWSTAT) & ~pfdev->as_faulty_mask;
}
 
-   mmu_write(pfdev, MMU_INT_MASK, ~0);
+   spin_lock(&pfdev->as_lock);
+   mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+   spin_unlock(&pfdev->as_lock);
+
return IRQ_HANDLED;
 };
 
-- 
2.31.1



[PATCH v5 06/16] drm/panfrost: Do the exception -> string translation using a table

2021-06-29 Thread Boris Brezillon
Do the exception -> string translation using a table. This way we get
rid of those magic numbers and can easily add new fields if we need
to attach extra information to exception types.

v4:
* Don't expose exception type to userspace
* Merge the enum definition and the enum -> string table declaration
  in the same patch

v3:
* Drop the error field

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 130 ++++++++++++++++---------
 drivers/gpu/drm/panfrost/panfrost_device.h |  69 ++++++++++++++
 2 files changed, 152 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index bce6b0aff05e..736854542b05 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,55 +292,91 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(u32 exception_code)
-{
-   switch (exception_code) {
-   /* Non-Fault Status code */
-   case 0x00: return "NOT_STARTED/IDLE/OK";
-   case 0x01: return "DONE";
-   case 0x02: return "INTERRUPTED";
-   case 0x03: return "STOPPED";
-   case 0x04: return "TERMINATED";
-   case 0x08: return "ACTIVE";
-   /* Job exceptions */
-   case 0x40: return "JOB_CONFIG_FAULT";
-   case 0x41: return "JOB_POWER_FAULT";
-   case 0x42: return "JOB_READ_FAULT";
-   case 0x43: return "JOB_WRITE_FAULT";
-   case 0x44: return "JOB_AFFINITY_FAULT";
-   case 0x48: return "JOB_BUS_FAULT";
-   case 0x50: return "INSTR_INVALID_PC";
-   case 0x51: return "INSTR_INVALID_ENC";
-   case 0x52: return "INSTR_TYPE_MISMATCH";
-   case 0x53: return "INSTR_OPERAND_FAULT";
-   case 0x54: return "INSTR_TLS_FAULT";
-   case 0x55: return "INSTR_BARRIER_FAULT";
-   case 0x56: return "INSTR_ALIGN_FAULT";
-   case 0x58: return "DATA_INVALID_FAULT";
-   case 0x59: return "TILE_RANGE_FAULT";
-   case 0x5A: return "ADDR_RANGE_FAULT";
-   case 0x60: return "OUT_OF_MEMORY";
-   /* GPU exceptions */
-   case 0x80: return "DELAYED_BUS_FAULT";
-   case 0x88: return "SHAREABILITY_FAULT";
-   /* MMU exceptions */
-   case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
-   case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
-   case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
-   case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
-   case 0xC8: return "PERMISSION_FAULT";
-   case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
-   case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
-   case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
-   case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
-   case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
-   case 0xD8: return "ACCESS_FLAG";
-   case 0xD9 ... 0xDF: return "ACCESS_FLAG";
-   case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
-   case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
+#define PANFROST_EXCEPTION(id) \
+   [DRM_PANFROST_EXCEPTION_ ## id] = { \
+   .name = #id, \
}
 
-   return "UNKNOWN";
+struct panfrost_exception_info {
+   const char *name;
+};
+
+static const struct panfrost_exception_info panfrost_exception_infos[] = {
+   PANFROST_EXCEPTION(OK),
+   PANFROST_EXCEPTION(DONE),
+   PANFROST_EXCEPTION(INTERRUPTED),
+   PANFROST_EXCEPTION(STOPPED),
+   PANFROST_EXCEPTION(TERMINATED),
+   PANFROST_EXCEPTION(KABOOM),
+   PANFROST_EXCEPTION(EUREKA),
+   PANFROST_EXCEPTION(ACTIVE),
+   PANFROST_EXCEPTION(JOB_CONFIG_FAULT),
+   PANFROST_EXCEPTION(JOB_POWER_FAULT),
+   PANFROST_EXCEPTION(JOB_READ_FAULT),
+   PANFROST_EXCEPTION(JOB_WRITE_FAULT),
+   PANFROST_EXCEPTION(JOB_AFFINITY_FAULT),
+   PANFROST_EXCEPTION(JOB_BUS_FAULT),
+   PANFROST_EXCEPTION(INSTR_INVALID_PC),
+   PANFROST_EXCEPTION(INSTR_INVALID_ENC),
+   PANFROST_EXCEPTION(INSTR_TYPE_MISMATCH),
+   PANFROST_EXCEPTION(INSTR_OPERAND_FAULT),
+   PANFROST_EXCEPTION(INSTR_TLS_FAULT),
+   PANFROST_EXCEPTION(INSTR_BARRIER_FAULT),
+   PANFROST_EXCEPTION(INSTR_ALIGN_FAULT),
+   PANFROST_EXCEPTION(DATA_INVALID_FAULT),
+   PANFROST_EXCEPTION(TILE_RANGE_FAULT),
+   PANFROST_EXCEPTION(ADDR_RANGE_FAULT),
+   PANFROST_EXCEPTION(IMPRECISE_FAULT),
+   PANFROST_EXCEPTION(OOM),
+   PANFROST_EXCEPTION(OOM_AFBC),
+   PANFROST_EXCEP

[PATCH v5 05/16] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()

2021-06-29 Thread Boris Brezillon
Currently unused. We'll add it back if we need per-GPU definitions.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 2 +-
 drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index fbcf5edbe367..bce6b0aff05e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,7 +292,7 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 exception_code)
+const char *panfrost_exception_name(u32 exception_code)
 {
switch (exception_code) {
/* Non-Fault Status code */
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 4c6bdea5537b..ade8a1974ee9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -172,6 +172,6 @@ void panfrost_device_reset(struct panfrost_device *pfdev);
 int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 exception_code);
+const char *panfrost_exception_name(u32 exception_code);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 2aae636f1cf5..ec59f15940fb 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -33,7 +33,7 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data)
address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
 
dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
-fault_status & 0xFF, panfrost_exception_name(pfdev, fault_status),
+fault_status & 0xFF, panfrost_exception_name(fault_status),
 address);
 
if (state & GPU_IRQ_MULTIPLE_FAULT)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index d6c9698bca3b..3cd1aec6c261 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -500,7 +500,7 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(pfdev, job_read(pfdev, JS_STATUS(j))),
+   panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d76dff201ea6..b4f0c673cd7f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -676,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
"TODO",
fault_status,
(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
-   exception_type, panfrost_exception_name(pfdev, exception_type),
+   exception_type, panfrost_exception_name(exception_type),
access_type, access_type_name(pfdev, fault_status),
source_id);
 
-- 
2.31.1



[PATCH v5 08/16] drm/panfrost: Use a threaded IRQ for job interrupts

2021-06-29 Thread Boris Brezillon
This should avoid switching to interrupt context when the GPU is under
heavy use.

v3:
* Don't take the job_lock in panfrost_job_handle_irq()

Signed-off-by: Boris Brezillon 
Acked-by: Alyssa Rosenzweig 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index be8f68f63974..e0c479e67304 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops panfrost_sched_ops = {
.free_job = panfrost_job_free
 };
 
-static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 {
-   struct panfrost_device *pfdev = data;
-   u32 status = job_read(pfdev, JOB_INT_STAT);
int j;
 
dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
 
-   if (!status)
-   return IRQ_NONE;
-
-   pm_runtime_mark_last_busy(pfdev->dev);
-
for (j = 0; status; j++) {
u32 mask = MK_JS_MASK(j);
 
@@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
if (status & JOB_INT_MASK_DONE(j)) {
struct panfrost_job *job;
 
-   spin_lock(&pfdev->js->job_lock);
job = pfdev->jobs[j];
/* Only NULL if job timeout occurred */
if (job) {
@@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
dma_fence_signal_locked(job->done_fence);
pm_runtime_put_autosuspend(pfdev->dev);
}
-   spin_unlock(&pfdev->js->job_lock);
}
 
status &= ~mask;
}
+}
 
+static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
+
+   while (status) {
+   pm_runtime_mark_last_busy(pfdev->dev);
+
+   spin_lock(&pfdev->js->job_lock);
+   panfrost_job_handle_irq(pfdev, status);
+   spin_unlock(&pfdev->js->job_lock);
+   status = job_read(pfdev, JOB_INT_RAWSTAT);
+   }
+
+   job_write(pfdev, JOB_INT_MASK,
+ GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
+ GENMASK(NUM_JOB_SLOTS - 1, 0));
return IRQ_HANDLED;
 }
 
+static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_STAT);
+
+   if (!status)
+   return IRQ_NONE;
+
+   job_write(pfdev, JOB_INT_MASK, 0);
+   return IRQ_WAKE_THREAD;
+}
+
 static void panfrost_reset(struct work_struct *work)
 {
struct panfrost_device *pfdev = container_of(work,
 struct panfrost_device,
 reset.work);
-   unsigned long flags;
unsigned int i;
bool cookie;
 
@@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
/* All timers have been stopped, we can safely reset the pending state. */
atomic_set(&pfdev->reset.pending, 0);
 
-   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
pm_runtime_put_noidle(pfdev->dev);
@@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
pfdev->jobs[i] = NULL;
}
}
-   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
+   spin_unlock(&pfdev->js->job_lock);
 
panfrost_device_reset(pfdev);
 
@@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
if (irq <= 0)
return -ENODEV;
 
-   ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
-  IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
+   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   panfrost_job_irq_handler,
+   panfrost_job_irq_handler_thread,
+   IRQF_SHARED, KBUILD_MODNAME "-job",
+   pfdev);
if (ret) {
dev_err(pfdev->dev, "failed to request job irq");
return ret;
-- 
2.31.1



[PATCH v5 12/16] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck

2021-06-29 Thread Boris Brezillon
Things are unlikely to resolve until we reset the GPU. Let's not wait
for other faults/timeouts to happen before triggering this reset.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 65e98c51cb66..5267c3a1f02f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -36,8 +36,11 @@ static int wait_ready(struct panfrost_device *pfdev, u32 
as_nr)
ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
 
-   if (ret)
+   if (ret) {
+   /* The GPU hung, let's trigger a reset */
+   panfrost_device_schedule_reset(pfdev);
dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+   }
 
return ret;
 }
-- 
2.31.1



[PATCH v5 07/16] drm/panfrost: Expose a helper to trigger a GPU reset

2021-06-29 Thread Boris Brezillon
Expose a helper to trigger a GPU reset so we can easily trigger reset
operations outside the job timeout handler.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_device.h | 8 
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 +---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 4c876476268f..f2190f90be75 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -243,4 +243,12 @@ enum drm_panfrost_exception_type {
 
 const char *panfrost_exception_name(u32 exception_code);
 
+static inline void
+panfrost_device_schedule_reset(struct panfrost_device *pfdev)
+{
+   /* Schedule a reset if there's no reset in progress. */
+   if (!atomic_xchg(&pfdev->reset.pending, 1))
+   schedule_work(&pfdev->reset.work);
+}
+
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 3cd1aec6c261..be8f68f63974 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -458,9 +458,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct 
drm_sched_job
if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
return DRM_GPU_SCHED_STAT_NOMINAL;
 
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   panfrost_device_schedule_reset(pfdev);
 
return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-- 
2.31.1



[PATCH v5 09/16] drm/panfrost: Simplify the reset serialization logic

2021-06-29 Thread Boris Brezillon
Now that we can pass our own workqueue to drm_sched_init(), we can use
an ordered workqueue for both the scheduler timeout tdr and our own
reset work (which we use when the reset is not caused by a fault/timeout
on a specific job, like when the AS_ACTIVE bit is stuck). This
guarantees that the timeout handlers and the reset handler can't run
concurrently, which drastically simplifies the locking.

v4:
* Actually pass the reset workqueue to drm_sched_init()
* Don't call cancel_work_sync() in panfrost_reset(). It will deadlock
  since it might be called from the reset work, which is executing and
  cancel_work_sync() will wait for the handler to return. Checking the
  reset pending status should avoid spurious resets

v3:
* New patch

Suggested-by: Daniel Vetter 
Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   6 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 187 -
 2 files changed, 72 insertions(+), 121 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index f2190f90be75..59a487e8aba3 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -108,6 +108,7 @@ struct panfrost_device {
struct mutex sched_lock;
 
struct {
+   struct workqueue_struct *wq;
struct work_struct work;
atomic_t pending;
} reset;
@@ -246,9 +247,8 @@ const char *panfrost_exception_name(u32 exception_code);
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
 {
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   atomic_set(&pfdev->reset.pending, 1);
+   queue_work(pfdev->reset.wq, &pfdev->reset.work);
 }
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index e0c479e67304..98193a557a2d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -25,17 +25,8 @@
 #define job_write(dev, reg, data) writel(data, dev->iomem + (reg))
 #define job_read(dev, reg) readl(dev->iomem + (reg))
 
-enum panfrost_queue_status {
-   PANFROST_QUEUE_STATUS_ACTIVE,
-   PANFROST_QUEUE_STATUS_STOPPED,
-   PANFROST_QUEUE_STATUS_STARTING,
-   PANFROST_QUEUE_STATUS_FAULT_PENDING,
-};
-
 struct panfrost_queue_state {
struct drm_gpu_scheduler sched;
-   atomic_t status;
-   struct mutex lock;
u64 fence_context;
u64 emit_seqno;
 };
@@ -379,57 +370,72 @@ void panfrost_job_enable_interrupts(struct 
panfrost_device *pfdev)
job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
-static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
-   struct drm_sched_job *bad)
+static void panfrost_reset(struct panfrost_device *pfdev,
+  struct drm_sched_job *bad)
 {
-   enum panfrost_queue_status old_status;
-   bool stopped = false;
+   unsigned int i;
+   bool cookie;
 
-   mutex_lock(&queue->lock);
-   old_status = atomic_xchg(&queue->status,
-PANFROST_QUEUE_STATUS_STOPPED);
-   if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
-   goto out;
+   if (!atomic_read(&pfdev->reset.pending))
+   return;
+
+   /* Stop the schedulers.
+*
+* FIXME: We temporarily get out of the dma_fence_signalling section
+* because the cleanup path generates lockdep splats when taking locks
+* to release job resources. We should rework the code to follow this
+* pattern:
+*
+*  try_lock
+*  if (locked)
+*  release
+*  else
+*  schedule_work_to_release_later
+*/
+   for (i = 0; i < NUM_JOB_SLOTS; i++)
+   drm_sched_stop(&pfdev->js->queue[i].sched, bad);
+
+   cookie = dma_fence_begin_signalling();
 
-   WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
-   drm_sched_stop(&queue->sched, bad);
if (bad)
drm_sched_increase_karma(bad);
 
-   stopped = true;
+   spin_lock(&pfdev->js->job_lock);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   if (pfdev->jobs[i]) {
+   pm_runtime_put_noidle(pfdev->dev);
+   panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+   pfdev->jobs[i] = NULL;
+   }
+   }
+   spin_unlock(&pfdev->js->job_lock);
 
-   /*
-* Set the timeout to max so the timer doesn't get started
-* when we return from the timeout handler (restore

[PATCH v5 02/16] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

2021-06-29 Thread Boris Brezillon
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.

v5:
* Add a new paragraph to the timedout_job() method

v3:
* New patch

v4:
* Actually use the timeout_wq to queue the timeout work

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Lucas Stach 
Cc: Qiang Yu 
Cc: Emma Anholt 
Cc: Alex Deucher 
Cc: "Christian König" 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c| 14 +-
 drivers/gpu/drm/v3d/v3d_sched.c   | 10 +-
 include/drm/gpu_scheduler.h   | 23 ++-
 7 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..532636ea20bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
   num_hw_submission, amdgpu_job_hang_limit,
-  timeout, sched_score, ring->name);
+  timeout, NULL, sched_score, ring->name);
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
  ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 19826e504efc..feb6da1b6ceb 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 
ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
 etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
+msecs_to_jiffies(500), NULL, NULL,
+dev_name(gpu->dev));
if (ret)
return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c 
b/drivers/gpu/drm/lima/lima_sched.c
index ecf3267334ff..dba8329937a3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, 
const char *name)
INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
- lima_job_hang_limit, msecs_to_jiffies(timeout),
+ lima_job_hang_limit,
+ msecs_to_jiffies(timeout), NULL,
  NULL, name);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 682f2161b999..8ff79fd49577 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
ret = drm_sched_init(&js->queue[j].sched,
 &panfrost_sched_ops,
-1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
+1, 0,
+msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
 NULL, "pan_js");
if (ret) {
			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index c0a2f8f8d472..3e180f0d4305 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -232,7 +232,7 @@ static void drm_sched_start_timeout(struct 
drm_gpu_scheduler *sched)
 {
if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
!list_empty(&sched->pending_list))
-   schedule_delayed_work(&sched->work_tdr, sched->timeout);
+   queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
 /**
@@ -244,7 +244,7 @@ static void drm_sched_start_timeout(struct 
drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct 

[PATCH v5 03/16] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate

2021-06-29 Thread Boris Brezillon
If the fence creation fail, we can return the error pointer directly.
The core will update the fence error accordingly.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 8ff79fd49577..d6c9698bca3b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -355,7 +355,7 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
 
fence = panfrost_fence_create(pfdev, slot);
if (IS_ERR(fence))
-   return NULL;
+   return fence;
 
if (job->done_fence)
dma_fence_put(job->done_fence);
-- 
2.31.1



[PATCH v5 00/16] drm/panfrost: Misc improvements

2021-06-29 Thread Boris Brezillon


Hello,

This is a merge of [1] and [2] since the second series depends on
patches in the preparatory series.

Main changes in this v5:
* Document what's expected in the ->timedout_job() hook
* Add a patch increasing the AS_ACTIVE polling timeout
* Fix a few minor things here and there (see each commit for a detailed
  changelog) and collect R-b/A-b tags

Regards,

Boris

Boris Brezillon (15):
  drm/sched: Document what the timedout_job method should do
  drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  drm/panfrost: Drop the pfdev argument passed to
panfrost_exception_name()
  drm/panfrost: Do the exception -> string translation using a table
  drm/panfrost: Expose a helper to trigger a GPU reset
  drm/panfrost: Use a threaded IRQ for job interrupts
  drm/panfrost: Simplify the reset serialization logic
  drm/panfrost: Make sure job interrupts are masked before resetting
  drm/panfrost: Disable the AS on unhandled page faults
  drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  drm/panfrost: Don't reset the GPU on job faults unless we really have
to
  drm/panfrost: Kill in-flight jobs on FD close
  drm/panfrost: Increase the AS_ACTIVE polling timeout

Steven Price (1):
  drm/panfrost: Queue jobs on the hardware

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c|   3 +-
 drivers/gpu/drm/lima/lima_sched.c  |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.c | 139 +++--
 drivers/gpu/drm/panfrost/panfrost_device.h |  91 ++-
 drivers/gpu/drm/panfrost/panfrost_gpu.c|   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 630 +++--
 drivers/gpu/drm/panfrost/panfrost_mmu.c|  43 +-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |   3 -
 drivers/gpu/drm/scheduler/sched_main.c |  14 +-
 drivers/gpu/drm/v3d/v3d_sched.c|  10 +-
 include/drm/gpu_scheduler.h|  37 +-
 12 files changed, 721 insertions(+), 256 deletions(-)

-- 
2.31.1



[PATCH v5 01/16] drm/sched: Document what the timedout_job method should do

2021-06-29 Thread Boris Brezillon
The documentation is a bit vague and doesn't really describe what the
->timedout_job() hook is expected to do. Let's add a few more details.

v5:
* New patch

Suggested-by: Daniel Vetter 
Signed-off-by: Boris Brezillon 
---
 include/drm/gpu_scheduler.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 10225a0a35d0..65700511e074 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -239,6 +239,20 @@ struct drm_sched_backend_ops {
 * @timedout_job: Called when a job has taken too long to execute,
 * to trigger GPU recovery.
 *
+* This method is called in a workqueue context.
+*
+* Drivers typically issue a reset to recover from GPU hangs, and this
+* procedure usually follows the following workflow:
+*
+* 1. Stop the scheduler using drm_sched_stop(). This will park the
+*scheduler thread and cancel the timeout work, guaranteeing that
+*nothing is queued while we reset the hardware queue
+* 2. Try to gracefully stop non-faulty jobs (optional)
+* 3. Issue a GPU reset (driver-specific)
+* 4. Re-submit jobs using drm_sched_resubmit_jobs()
+* 5. Restart the scheduler using drm_sched_start(). At that point, new
+*jobs can be queued, and the scheduler thread is unblocked
+*
 * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
 * and the underlying driver has started or completed recovery.
 *
-- 
2.31.1



Re: [PATCH] drm/sched: Declare entity idle only after HW submission

2021-06-28 Thread Boris Brezillon
On Mon, 28 Jun 2021 11:46:08 +0200
Lucas Stach  wrote:

> Am Donnerstag, dem 24.06.2021 um 16:08 +0200 schrieb Boris Brezillon:
> > The panfrost driver tries to kill in-flight jobs on FD close after
> > destroying the FD scheduler entities. For this to work properly, we
> > need to make sure the jobs popped from the scheduler entities have
> > been queued at the HW level before declaring the entity idle, otherwise
> > we might iterate over a list that doesn't contain those jobs.
> > 
> > Suggested-by: Lucas Stach 
> > Signed-off-by: Boris Brezillon 
> > Cc: Lucas Stach   
> 
> Not sure how much it's worth to review my own suggestion, but the
> implementation looks correct to me.
> I don't see any downsides for the existing drivers and it solves the
> race window for drivers that want to cancel jobs on the HW submission
> queue, without introducing yet another synchronization point.
> 
> Reviewed-by: Lucas Stach 

Queued to drm-misc-next.

Thanks,

Boris

> 
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 7 ---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index 81496ae2602e..aa776ebe326a 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -811,10 +811,10 @@ static int drm_sched_main(void *param)
> >  
> > sched_job = drm_sched_entity_pop_job(entity);
> >  
> > -   complete(&entity->entity_idle);
> > -
> > -   if (!sched_job)
> > +   if (!sched_job) {
> > +   complete(&entity->entity_idle);
> > continue;
> > +   }
> >  
> > s_fence = sched_job->s_fence;
> >  
> > @@ -823,6 +823,7 @@ static int drm_sched_main(void *param)
> >  
> > trace_drm_run_job(sched_job, entity);
> > fence = sched->ops->run_job(sched_job);
> > +   complete(&entity->entity_idle);
> > drm_sched_fence_scheduled(s_fence);
> >  
> > if (!IS_ERR_OR_NULL(fence)) {  
> 
> 



Re: [PATCH v4 07/14] drm/panfrost: Use a threaded IRQ for job interrupts

2021-06-28 Thread Boris Brezillon
On Mon, 28 Jun 2021 10:26:39 +0100
Steven Price  wrote:

> On 28/06/2021 08:42, Boris Brezillon wrote:
> > This should avoid switching to interrupt context when the GPU is under
> > heavy use.
> > 
> > v3:
> > * Don't take the job_lock in panfrost_job_handle_irq()
> > 
> > Signed-off-by: Boris Brezillon 
> > Acked-by: Alyssa Rosenzweig   
> 
> I thought I'd already reviewed this one, but anyway:
> 
> Reviewed-by: Steven Price 
> 

Oops, indeed, I overlooked that one.


[PATCH v4 13/14] drm/panfrost: Kill in-flight jobs on FD close

2021-06-28 Thread Boris Brezillon
If the process that submitted these jobs closes the FD before the jobs
are done, it probably means it doesn't care about the result.

v4:
* Don't disable/restore irqs when taking the job_lock (not needed since
  this lock is never taken from an interrupt context)

v3:
* Set fence error to ECANCELED when a TERMINATED exception is received

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 42 +
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index b0f4857ca084..979108dbc323 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -499,14 +499,21 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 
if (status & JOB_INT_MASK_ERR(j)) {
u32 js_status = job_read(pfdev, JS_STATUS(j));
+   const char *exception_name = panfrost_exception_name(js_status);
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
-   dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
-   j,
-   panfrost_exception_name(js_status),
-   job_read(pfdev, JS_HEAD_LO(j)),
-   job_read(pfdev, JS_TAIL_LO(j)));
+   if (js_status < DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
+   dev_dbg(pfdev->dev, "js interrupt, js=%d, status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   } else {
+   dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   }
 
/* If we need a reset, signal it to the timeout
 * handler, otherwise, update the fence error field and
@@ -515,7 +522,16 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
if (panfrost_exception_needs_reset(pfdev, js_status)) {
drm_sched_fault(&pfdev->js->queue[j].sched);
} else {
-   dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
+   int error = 0;
+
+   if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED)
+   error = -ECANCELED;
+   else if (js_status >= DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
+   error = -EINVAL;
+
+   if (error)
+   dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
+
status |= JOB_INT_MASK_DONE(j);
}
}
@@ -681,10 +697,24 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+   struct panfrost_device *pfdev = panfrost_priv->pfdev;
int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+
+   /* Kill in-flight jobs */
+   spin_lock(&pfdev->js->job_lock);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   struct drm_sched_entity *entity = &panfrost_priv->sched_entity[i];
+   struct panfrost_job *job = pfdev->jobs[i];
+
+   if (!job || job->base.entity != entity)
+   continue;
+
+   job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
+   }
+   spin_unlock(&pfdev->js->job_lock);
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)
-- 
2.31.1



[PATCH v4 04/14] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()

2021-06-28 Thread Boris Brezillon
Currently unused. We'll add it back if we need per-GPU definitions.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 2 +-
 drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index fbcf5edbe367..bce6b0aff05e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,7 +292,7 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 exception_code)
+const char *panfrost_exception_name(u32 exception_code)
 {
switch (exception_code) {
/* Non-Fault Status code */
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 4c6bdea5537b..ade8a1974ee9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -172,6 +172,6 @@ void panfrost_device_reset(struct panfrost_device *pfdev);
 int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 exception_code);
+const char *panfrost_exception_name(u32 exception_code);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c 
b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 2aae636f1cf5..ec59f15940fb 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -33,7 +33,7 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data)
address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
 
dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
-fault_status & 0xFF, panfrost_exception_name(pfdev, fault_status),
+fault_status & 0xFF, panfrost_exception_name(fault_status),
 address);
 
if (state & GPU_IRQ_MULTIPLE_FAULT)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index d6c9698bca3b..3cd1aec6c261 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -500,7 +500,7 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(pfdev, job_read(pfdev, JS_STATUS(j))),
+   panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d76dff201ea6..b4f0c673cd7f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -676,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
"TODO",
fault_status,
(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
-   exception_type, panfrost_exception_name(pfdev, exception_type),
+   exception_type, panfrost_exception_name(exception_type),
access_type, access_type_name(pfdev, fault_status),
source_id);
 
-- 
2.31.1



[PATCH v4 06/14] drm/panfrost: Expose a helper to trigger a GPU reset

2021-06-28 Thread Boris Brezillon
Expose a helper to trigger a GPU reset so we can easily trigger reset
operations outside the job timeout handler.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_device.h | 8 
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 +---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 4c876476268f..f2190f90be75 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -243,4 +243,12 @@ enum drm_panfrost_exception_type {
 
 const char *panfrost_exception_name(u32 exception_code);
 
+static inline void
+panfrost_device_schedule_reset(struct panfrost_device *pfdev)
+{
+   /* Schedule a reset if there's no reset in progress. */
+   if (!atomic_xchg(&pfdev->reset.pending, 1))
+   schedule_work(&pfdev->reset.work);
+}
+
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 3cd1aec6c261..be8f68f63974 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -458,9 +458,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct 
drm_sched_job
if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
return DRM_GPU_SCHED_STAT_NOMINAL;
 
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   panfrost_device_schedule_reset(pfdev);
 
return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-- 
2.31.1



[PATCH v4 01/14] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

2021-06-28 Thread Boris Brezillon
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.

v3:
* New patch

v4:
* Actually use the timeout_wq to queue the timeout work

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c| 14 +-
 drivers/gpu/drm/v3d/v3d_sched.c   | 10 +-
 include/drm/gpu_scheduler.h   |  5 -
 7 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..532636ea20bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
   num_hw_submission, amdgpu_job_hang_limit,
-  timeout, sched_score, ring->name);
+  timeout, NULL, sched_score, ring->name);
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
  ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 19826e504efc..feb6da1b6ceb 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 
ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
 etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
+msecs_to_jiffies(500), NULL, NULL,
+dev_name(gpu->dev));
if (ret)
return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c 
b/drivers/gpu/drm/lima/lima_sched.c
index ecf3267334ff..dba8329937a3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, 
const char *name)
INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
- lima_job_hang_limit, msecs_to_jiffies(timeout),
+ lima_job_hang_limit,
+ msecs_to_jiffies(timeout), NULL,
  NULL, name);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 682f2161b999..8ff79fd49577 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
ret = drm_sched_init(&js->queue[j].sched,
 &panfrost_sched_ops,
-1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
+1, 0,
+msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
 NULL, "pan_js");
if (ret) {
			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index c0a2f8f8d472..3e180f0d4305 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -232,7 +232,7 @@ static void drm_sched_start_timeout(struct 
drm_gpu_scheduler *sched)
 {
if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
!list_empty(&sched->pending_list))
-   schedule_delayed_work(&sched->work_tdr, sched->timeout);
+   queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
 /**
@@ -244,7 +244,7 @@ static void drm_sched_start_timeout(struct 
drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-   mod_delayed_work(system_wq, &sched->work_tdr, 0);
+   mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);
 
@@ -270

[PATCH v4 10/14] drm/panfrost: Disable the AS on unhandled page faults

2021-06-28 Thread Boris Brezillon
If we don't do that, we have to wait for the job timeout to expire
before the faulty jobs get killed.

v3:
* Make sure the AS is re-enabled when new jobs are submitted to the
  context

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 34 --
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 59a487e8aba3..2dc8c0d1d987 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -96,6 +96,7 @@ struct panfrost_device {
spinlock_t as_lock;
unsigned long as_in_use_mask;
unsigned long as_alloc_mask;
+   unsigned long as_faulty_mask;
struct list_head as_lru_list;
 
struct panfrost_job_slot *js;
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index b4f0c673cd7f..65e98c51cb66 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -154,6 +154,7 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, 
struct panfrost_mmu *mmu)
as = mmu->as;
if (as >= 0) {
int en = atomic_inc_return(&mmu->as_count);
+   u32 mask = BIT(as) | BIT(16 + as);
 
/*
 * AS can be retained by active jobs or a perfcnt context,
@@ -162,6 +163,18 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, 
struct panfrost_mmu *mmu)
WARN_ON(en >= (NUM_JOB_SLOTS + 1));
 
list_move(&mmu->list, &pfdev->as_lru_list);
+
+   if (pfdev->as_faulty_mask & mask) {
+   /* Unhandled pagefault on this AS, the MMU was
+* disabled. We need to re-enable the MMU after
+* clearing+unmasking the AS interrupts.
+*/
+   mmu_write(pfdev, MMU_INT_CLEAR, mask);
+   mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+   pfdev->as_faulty_mask &= ~mask;
+   panfrost_mmu_enable(pfdev, mmu);
+   }
+
goto out;
}
 
@@ -211,6 +224,7 @@ void panfrost_mmu_reset(struct panfrost_device *pfdev)
spin_lock(&pfdev->as_lock);
 
pfdev->as_alloc_mask = 0;
+   pfdev->as_faulty_mask = 0;
 
list_for_each_entry_safe(mmu, mmu_tmp, &pfdev->as_lru_list, list) {
mmu->as = -1;
@@ -662,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 
0xC0)
ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
 
-   if (ret)
+   if (ret) {
/* terminal fault, print info about the fault */
dev_err(pfdev->dev,
"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -680,14 +694,28 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
irq, void *data)
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
+   spin_lock(&pfdev->as_lock);
+   /* Ignore MMU interrupts on this AS until it's been
+* re-enabled.
+*/
+   pfdev->as_faulty_mask |= mask;
+
+   /* Disable the MMU to kill jobs on this AS. */
+   panfrost_mmu_disable(pfdev, as);
+   spin_unlock(&pfdev->as_lock);
+   }
+
status &= ~mask;
 
/* If we received new MMU interrupts, process them before 
returning. */
if (!status)
-   status = mmu_read(pfdev, MMU_INT_RAWSTAT);
+   status = mmu_read(pfdev, MMU_INT_RAWSTAT) & 
~pfdev->as_faulty_mask;
}
 
-   mmu_write(pfdev, MMU_INT_MASK, ~0);
+   spin_lock(&pfdev->as_lock);
+   mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+   spin_unlock(&pfdev->as_lock);
+
return IRQ_HANDLED;
 };
 
-- 
2.31.1



[PATCH v4 12/14] drm/panfrost: Don't reset the GPU on job faults unless we really have to

2021-06-28 Thread Boris Brezillon
If we can recover from a fault without a reset there's no reason to
issue one.

v3:
* Drop the mention of Valhall requiring a reset on JOB_BUS_FAULT
* Set the fence error to -EINVAL instead of having per-exception
  error codes
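The decision described above can be modelled in a few lines of userspace C.
`needs_reset()` stands in for panfrost_exception_needs_reset() (which
currently always returns false), and the fence/reset side effects are reduced
to counters; this is a sketch of the control flow, not driver code:

```c
#include <assert.h>
#include <errno.h>

static int fence_error;      /* stands in for dma_fence_set_error() */
static int resets_requested; /* stands in for drm_sched_fault() */

/* Mirrors panfrost_exception_needs_reset(): no supported GPU needs one yet. */
static int needs_reset(unsigned int exception_code)
{
	(void)exception_code;
	return 0;
}

static void handle_job_fault(unsigned int js_status)
{
	if (needs_reset(js_status)) {
		resets_requested++;    /* let the timeout handler do the reset */
	} else {
		fence_error = -EINVAL; /* recoverable: fail just this job */
	}
}
```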

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c |  9 +
 drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_job.c| 16 ++--
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index 736854542b05..f4e42009526d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -379,6 +379,15 @@ const char *panfrost_exception_name(u32 exception_code)
return panfrost_exception_infos[exception_code].name;
 }
 
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code)
+{
+   /* Right now, none of the GPU we support need a reset, but this
+* might change.
+*/
+   return false;
+}
+
 void panfrost_device_reset(struct panfrost_device *pfdev)
 {
panfrost_gpu_soft_reset(pfdev);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 2dc8c0d1d987..d91f71366214 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -244,6 +244,8 @@ enum drm_panfrost_exception_type {
 };
 
 const char *panfrost_exception_name(u32 exception_code);
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code);
 
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 4bd4d11377b7..b0f4857ca084 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -498,14 +498,26 @@ static void panfrost_job_handle_irq(struct 
panfrost_device *pfdev, u32 status)
job_write(pfdev, JOB_INT_CLEAR, mask);
 
if (status & JOB_INT_MASK_ERR(j)) {
+   u32 js_status = job_read(pfdev, JS_STATUS(j));
+
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(job_read(pfdev, 
JS_STATUS(j))),
+   panfrost_exception_name(js_status),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
-   drm_sched_fault(&pfdev->js->queue[j].sched);
+
+   /* If we need a reset, signal it to the timeout
+* handler, otherwise, update the fence error field and
+* signal the job fence.
+*/
+   if (panfrost_exception_needs_reset(pfdev, js_status)) {
+   drm_sched_fault(&pfdev->js->queue[j].sched);
+   } else {
+   dma_fence_set_error(pfdev->jobs[j]->done_fence, 
-EINVAL);
+   status |= JOB_INT_MASK_DONE(j);
+   }
}
 
if (status & JOB_INT_MASK_DONE(j)) {
-- 
2.31.1



[PATCH v4 14/14] drm/panfrost: Queue jobs on the hardware

2021-06-28 Thread Boris Brezillon
From: Steven Price 

The hardware has a set of '_NEXT' registers that can hold a second job
while the first is executing. Make use of these registers to enqueue a
second job per slot.

v3:
* Fix the done/err job dequeuing logic to get a valid active state
* Only enable the second slot on GPUs supporting jobchain disambiguation
* Split interrupt handling in sub-functions
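A userspace model of the two-deep queue described above; `enqueue_job()` and
`dequeue_job()` mirror the helpers added by the patch, with jobs reduced to
plain ints (0 = empty). This is an illustrative sketch of the promotion
logic, not the driver code:

```c
#include <assert.h>

#define SLOT_DEPTH 2

struct slot {
	int jobs[SLOT_DEPTH]; /* jobs[0] is running, jobs[1] is the _NEXT entry */
};

/* Returns the subslot the job landed in, or -1 if both entries are busy. */
static int enqueue_job(struct slot *s, int job)
{
	if (!s->jobs[0]) {
		s->jobs[0] = job;
		return 0;
	}
	if (!s->jobs[1]) {
		s->jobs[1] = job;
		return 1;
	}
	return -1;
}

/* Pops the finished head job and promotes the _NEXT entry, as the IRQ
 * handler does when a job completes. */
static int dequeue_job(struct slot *s)
{
	int job = s->jobs[0];

	s->jobs[0] = s->jobs[1];
	s->jobs[1] = 0;
	return job;
}
```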

Signed-off-by: Steven Price 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 468 +++--
 2 files changed, 351 insertions(+), 119 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index d91f71366214..81f81fc8650e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -101,7 +101,7 @@ struct panfrost_device {
 
struct panfrost_job_slot *js;
 
-   struct panfrost_job *jobs[NUM_JOB_SLOTS];
+   struct panfrost_job *jobs[NUM_JOB_SLOTS][2];
struct list_head scheduled_jobs;
 
struct panfrost_perfcnt *perfcnt;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 979108dbc323..b965669a1fed 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -140,9 +141,52 @@ static void panfrost_job_write_affinity(struct 
panfrost_device *pfdev,
job_write(pfdev, JS_AFFINITY_NEXT_HI(js), affinity >> 32);
 }
 
+static u32
+panfrost_get_job_chain_flag(const struct panfrost_job *job)
+{
+   struct panfrost_fence *f = to_panfrost_fence(job->done_fence);
+
+   if (!panfrost_has_hw_feature(job->pfdev, 
HW_FEATURE_JOBCHAIN_DISAMBIGUATION))
+   return 0;
+
+   return (f->seqno & 1) ? JS_CONFIG_JOB_CHAIN_FLAG : 0;
+}
+
+static struct panfrost_job *
+panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
+{
+   struct panfrost_job *job = pfdev->jobs[slot][0];
+
+   WARN_ON(!job);
+   pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
+   pfdev->jobs[slot][1] = NULL;
+
+   return job;
+}
+
+static unsigned int
+panfrost_enqueue_job(struct panfrost_device *pfdev, int slot,
+struct panfrost_job *job)
+{
+   if (WARN_ON(!job))
+   return 0;
+
+   if (!pfdev->jobs[slot][0]) {
+   pfdev->jobs[slot][0] = job;
+   return 0;
+   }
+
+   WARN_ON(pfdev->jobs[slot][1]);
+   pfdev->jobs[slot][1] = job;
+   WARN_ON(panfrost_get_job_chain_flag(job) ==
+   panfrost_get_job_chain_flag(pfdev->jobs[slot][0]));
+   return 1;
+}
+
 static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 {
struct panfrost_device *pfdev = job->pfdev;
+   unsigned int subslot;
u32 cfg;
u64 jc_head = job->jc;
int ret;
@@ -168,7 +212,8 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
 * start */
cfg |= JS_CONFIG_THREAD_PRI(8) |
JS_CONFIG_START_FLUSH_CLEAN_INVALIDATE |
-   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE;
+   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE |
+   panfrost_get_job_chain_flag(job);
 
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_FLUSH_REDUCTION))
cfg |= JS_CONFIG_ENABLE_FLUSH_REDUCTION;
@@ -182,10 +227,17 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
job_write(pfdev, JS_FLUSH_ID_NEXT(js), job->flush_id);
 
/* GO ! */
-   dev_dbg(pfdev->dev, "JS: Submitting atom %p to js[%d] with head=0x%llx",
-   job, js, jc_head);
 
-   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   spin_lock(&pfdev->js->job_lock);
+   subslot = panfrost_enqueue_job(pfdev, js, job);
+   /* Don't queue the job if a reset is in progress */
+   if (!atomic_read(&pfdev->reset.pending)) {
+   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   dev_dbg(pfdev->dev,
+   "JS: Submitting atom %p to js[%d][%d] with head=0x%llx 
AS %d",
+   job, js, subslot, jc_head, cfg & 0xf);
+   }
+   spin_unlock(&pfdev->js->job_lock);
 }
 
 static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
@@ -343,7 +395,11 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
if (unlikely(job->base.s_fence->finished.error))
return NULL;
 
-   pfdev->jobs[slot] = job;
+   /* Nothing to execute: can happen if the job has finished while
+* we were resetting the GPU.
+*/
+   if (!job->jc

[PATCH v4 08/14] drm/panfrost: Simplify the reset serialization logic

2021-06-28 Thread Boris Brezillon
Now that we can pass our own workqueue to drm_sched_init(), we can use
an ordered workqueue for both the scheduler timeout handler (tdr) and our own
reset work (which we use when the reset is not caused by a fault/timeout
on a specific job, like when the AS_ACTIVE bit is stuck). This
guarantees that the timeout handlers and the reset handler can't run
concurrently, which drastically simplifies the locking.

v4:
* Actually pass the reset workqueue to drm_sched_init()
* Don't call cancel_work_sync() in panfrost_reset(). It will deadlock
  since it might be called from the reset work, which is executing and
  cancel_work_sync() will wait for the handler to return. Checking the
  reset pending status should avoid spurious resets
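The "check the reset pending status" idea from the v4 note can be sketched
with a plain flag standing in for the atomic and the workqueue left out: a
work item queued more than once for the same fault performs only one reset.
A userspace model, not driver code:

```c
#include <assert.h>

static int reset_pending; /* stands in for atomic_t reset.pending */
static int resets_done;

static void schedule_reset(void)
{
	reset_pending = 1;
	/* queue_work(pfdev->reset.wq, &pfdev->reset.work) would go here; an
	 * ordered workqueue serializes this with the scheduler tdr work. */
}

static void reset_work(void)
{
	if (!reset_pending)
		return; /* spurious run: the reset was already handled */
	/* ... stop schedulers, mask IRQs, reset the hardware ... */
	reset_pending = 0;
	resets_done++;
}
```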

v3:
* New patch

Suggested-by: Daniel Vetter 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   6 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 187 -
 2 files changed, 72 insertions(+), 121 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index f2190f90be75..59a487e8aba3 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -108,6 +108,7 @@ struct panfrost_device {
struct mutex sched_lock;
 
struct {
+   struct workqueue_struct *wq;
struct work_struct work;
atomic_t pending;
} reset;
@@ -246,9 +247,8 @@ const char *panfrost_exception_name(u32 exception_code);
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
 {
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   atomic_set(&pfdev->reset.pending, 1);
+   queue_work(pfdev->reset.wq, &pfdev->reset.work);
 }
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index e0c479e67304..98193a557a2d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -25,17 +25,8 @@
 #define job_write(dev, reg, data) writel(data, dev->iomem + (reg))
 #define job_read(dev, reg) readl(dev->iomem + (reg))
 
-enum panfrost_queue_status {
-   PANFROST_QUEUE_STATUS_ACTIVE,
-   PANFROST_QUEUE_STATUS_STOPPED,
-   PANFROST_QUEUE_STATUS_STARTING,
-   PANFROST_QUEUE_STATUS_FAULT_PENDING,
-};
-
 struct panfrost_queue_state {
struct drm_gpu_scheduler sched;
-   atomic_t status;
-   struct mutex lock;
u64 fence_context;
u64 emit_seqno;
 };
@@ -379,57 +370,72 @@ void panfrost_job_enable_interrupts(struct 
panfrost_device *pfdev)
job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
-static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
-   struct drm_sched_job *bad)
+static void panfrost_reset(struct panfrost_device *pfdev,
+  struct drm_sched_job *bad)
 {
-   enum panfrost_queue_status old_status;
-   bool stopped = false;
+   unsigned int i;
+   bool cookie;
 
-   mutex_lock(&queue->lock);
-   old_status = atomic_xchg(&queue->status,
-PANFROST_QUEUE_STATUS_STOPPED);
-   if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
-   goto out;
+   if (!atomic_read(&pfdev->reset.pending))
+   return;
+
+   /* Stop the schedulers.
+*
+* FIXME: We temporarily get out of the dma_fence_signalling section
+* because the cleanup path generate lockdep splats when taking locks
+* to release job resources. We should rework the code to follow this
+* pattern:
+*
+*  try_lock
+*  if (locked)
+*  release
+*  else
+*  schedule_work_to_release_later
+*/
+   for (i = 0; i < NUM_JOB_SLOTS; i++)
+   drm_sched_stop(&pfdev->js->queue[i].sched, bad);
+
+   cookie = dma_fence_begin_signalling();
 
-   WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
-   drm_sched_stop(&queue->sched, bad);
if (bad)
drm_sched_increase_karma(bad);
 
-   stopped = true;
+   spin_lock(&pfdev->js->job_lock);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   if (pfdev->jobs[i]) {
+   pm_runtime_put_noidle(pfdev->dev);
+   panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+   pfdev->jobs[i] = NULL;
+   }
+   }
+   spin_unlock(&pfdev->js->job_lock);
 
-   /*
-* Set the timeout to max so the timer doesn't get started
-* when we return from the timeout handler (restored in
-* panfrost_

[PATCH v4 11/14] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck

2021-06-28 Thread Boris Brezillon
Things are unlikely to resolve until we reset the GPU. Let's not wait
for other faults/timeouts to happen to trigger this reset.
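The pattern here is: poll AS_STATUS until AS_ACTIVE clears, and on timeout
schedule a reset instead of merely reporting the error. A userspace model
with the register read abstracted as a callback (helper names and the
iteration budget are ours, for illustration):

```c
#include <assert.h>
#include <errno.h>

#define AS_STATUS_AS_ACTIVE 0x1u

static int resets_scheduled; /* stands in for panfrost_device_schedule_reset() */

static unsigned int reads_left;

/* Fake AS_STATUS register: stays busy for reads_left reads, then clears. */
static unsigned int fake_status(void)
{
	if (reads_left) {
		reads_left--;
		return AS_STATUS_AS_ACTIVE;
	}
	return 0;
}

static int wait_ready(unsigned int (*read_status)(void))
{
	/* readl_relaxed_poll_timeout_atomic() in the driver; a bounded loop here */
	for (int i = 0; i < 100; i++) {
		if (!(read_status() & AS_STATUS_AS_ACTIVE))
			return 0;
	}
	resets_scheduled++; /* the GPU hung, trigger a reset */
	return -ETIMEDOUT;
}
```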

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 65e98c51cb66..5267c3a1f02f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -36,8 +36,11 @@ static int wait_ready(struct panfrost_device *pfdev, u32 
as_nr)
ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
 
-   if (ret)
+   if (ret) {
+   /* The GPU hung, let's trigger a reset */
+   panfrost_device_schedule_reset(pfdev);
dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+   }
 
return ret;
 }
-- 
2.31.1



[PATCH v4 09/14] drm/panfrost: Make sure job interrupts are masked before resetting

2021-06-28 Thread Boris Brezillon
This is not yet needed because we let active jobs be killed during
the reset and we don't really bother making sure they can be restarted.
But once we start adding soft-stop support, controlling when we deal
with the remaining interrupts and making sure those are handled before
the reset is issued gets tricky if we keep job interrupts active.

Let's prepare for that and mask+flush job IRQs before issuing a reset.

v4:
* Add a comment explaining why we WARN_ON(!job) in the irq handler
* Keep taking the job_lock when evicting stalled jobs

v3:
* New patch
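The mask-before-reset ordering can be modelled in userspace with the IRQ
mask as a plain variable: once JOB_INT_MASK is zeroed (and, in the driver,
synchronize_irq() has flushed any handler already running), a late interrupt
is simply ignored and the job eviction loop can no longer race the handler.
A sketch, with names ours:

```c
#include <assert.h>

static unsigned int job_int_mask = ~0u;
static int jobs_handled;

static void job_irq_handler(unsigned int rawstat)
{
	unsigned int status = rawstat & job_int_mask;

	if (!status)
		return; /* masked: nothing to do */
	jobs_handled++;
}

static void prepare_reset(void)
{
	job_int_mask = 0; /* job_write(pfdev, JOB_INT_MASK, 0) */
	/* synchronize_irq(pfdev->js->irq) would wait here for any handler
	 * that already started; after that, evicting pfdev->jobs[] is safe. */
}
```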

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 27 -
 1 file changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 98193a557a2d..4bd4d11377b7 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -34,6 +34,7 @@ struct panfrost_queue_state {
 struct panfrost_job_slot {
struct panfrost_queue_state queue[NUM_JOB_SLOTS];
spinlock_t job_lock;
+   int irq;
 };
 
 static struct panfrost_job *
@@ -400,6 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
if (bad)
drm_sched_increase_karma(bad);
 
+   /* Mask job interrupts and synchronize to make sure we won't be
+* interrupted during our reset.
+*/
+   job_write(pfdev, JOB_INT_MASK, 0);
+   synchronize_irq(pfdev->js->irq);
+
+   /* Schedulers are stopped and interrupts are masked+flushed, we don't
+* need to protect the 'evict unfinished jobs' lock with the job_lock.
+*/
spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
@@ -502,7 +512,14 @@ static void panfrost_job_handle_irq(struct panfrost_device 
*pfdev, u32 status)
struct panfrost_job *job;
 
job = pfdev->jobs[j];
-   /* Only NULL if job timeout occurred */
+   /* The only reason this job could be NULL is if the
+* job IRQ handler is called just after the
+* in-flight job eviction in the reset path, and
+* this shouldn't happen because the job IRQ has
+* been masked and synchronized when this eviction
+* happens.
+*/
+   WARN_ON(!job);
if (job) {
pfdev->jobs[j] = NULL;
 
@@ -562,7 +579,7 @@ static void panfrost_reset_work(struct work_struct *work)
 int panfrost_job_init(struct panfrost_device *pfdev)
 {
struct panfrost_job_slot *js;
-   int ret, j, irq;
+   int ret, j;
 
INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
 
@@ -572,11 +589,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
spin_lock_init(&js->job_lock);
 
-   irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
-   if (irq <= 0)
+   js->irq = platform_get_irq_byname(to_platform_device(pfdev->dev), 
"job");
+   if (js->irq <= 0)
return -ENODEV;
 
-   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   ret = devm_request_threaded_irq(pfdev->dev, js->irq,
panfrost_job_irq_handler,
panfrost_job_irq_handler_thread,
IRQF_SHARED, KBUILD_MODNAME "-job",
-- 
2.31.1



[PATCH v4 03/14] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition

2021-06-28 Thread Boris Brezillon
Exception types will be defined as an enum.

v4:
* Fix typo in the commit message

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h 
b/drivers/gpu/drm/panfrost/panfrost_regs.h
index eddaa62ad8b0..151cfebd80a0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -261,9 +261,6 @@
 #define JS_COMMAND_SOFT_STOP_1 0x06/* Execute SOFT_STOP if 
JOB_CHAIN_FLAG is 1 */
 #define JS_COMMAND_HARD_STOP_1 0x07/* Execute HARD_STOP if 
JOB_CHAIN_FLAG is 1 */
 
-#define JS_STATUS_EVENT_ACTIVE 0x08
-
-
 /* MMU regs */
 #define MMU_INT_RAWSTAT0x2000
 #define MMU_INT_CLEAR  0x2004
-- 
2.31.1



[PATCH v4 05/14] drm/panfrost: Do the exception -> string translation using a table

2021-06-28 Thread Boris Brezillon
Do the exception -> string translation using a table. This way we get
rid of those magic numbers and can easily add new fields if we need
to attach extra information to exception types.

v4:
* Don't expose exception type to userspace
* Merge the enum definition and the enum -> string table declaration
  in the same patch

v3:
* Drop the error field
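The table approach can be sketched with designated initializers and a
stringifying macro; holes and out-of-range codes fall back to "UNKNOWN" via a
NULL-name check. The enum values below are invented for illustration and do
not match the driver's real exception codes:

```c
#include <assert.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

enum exc_type {
	EXC_OK = 0x00,
	EXC_DONE = 0x01,
	EXC_JOB_CONFIG_FAULT = 0x40,
};

struct exc_info {
	const char *name;
};

/* Designated initializer keyed by the enum, name derived by stringification. */
#define EXC(id) [EXC_ ## id] = { .name = #id }

static const struct exc_info exc_infos[] = {
	EXC(OK),
	EXC(DONE),
	EXC(JOB_CONFIG_FAULT),
};

static const char *exc_name(unsigned int code)
{
	if (code >= ARRAY_SIZE(exc_infos) || !exc_infos[code].name)
		return "UNKNOWN";
	return exc_infos[code].name;
}
```

Extra per-exception fields (like the reset policy added later in the series)
become one more member of struct exc_info instead of another switch.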

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 130 +
 drivers/gpu/drm/panfrost/panfrost_device.h |  69 +++
 2 files changed, 152 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index bce6b0aff05e..736854542b05 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,55 +292,91 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(u32 exception_code)
-{
-   switch (exception_code) {
-   /* Non-Fault Status code */
-   case 0x00: return "NOT_STARTED/IDLE/OK";
-   case 0x01: return "DONE";
-   case 0x02: return "INTERRUPTED";
-   case 0x03: return "STOPPED";
-   case 0x04: return "TERMINATED";
-   case 0x08: return "ACTIVE";
-   /* Job exceptions */
-   case 0x40: return "JOB_CONFIG_FAULT";
-   case 0x41: return "JOB_POWER_FAULT";
-   case 0x42: return "JOB_READ_FAULT";
-   case 0x43: return "JOB_WRITE_FAULT";
-   case 0x44: return "JOB_AFFINITY_FAULT";
-   case 0x48: return "JOB_BUS_FAULT";
-   case 0x50: return "INSTR_INVALID_PC";
-   case 0x51: return "INSTR_INVALID_ENC";
-   case 0x52: return "INSTR_TYPE_MISMATCH";
-   case 0x53: return "INSTR_OPERAND_FAULT";
-   case 0x54: return "INSTR_TLS_FAULT";
-   case 0x55: return "INSTR_BARRIER_FAULT";
-   case 0x56: return "INSTR_ALIGN_FAULT";
-   case 0x58: return "DATA_INVALID_FAULT";
-   case 0x59: return "TILE_RANGE_FAULT";
-   case 0x5A: return "ADDR_RANGE_FAULT";
-   case 0x60: return "OUT_OF_MEMORY";
-   /* GPU exceptions */
-   case 0x80: return "DELAYED_BUS_FAULT";
-   case 0x88: return "SHAREABILITY_FAULT";
-   /* MMU exceptions */
-   case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
-   case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
-   case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
-   case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
-   case 0xC8: return "PERMISSION_FAULT";
-   case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
-   case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
-   case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
-   case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
-   case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
-   case 0xD8: return "ACCESS_FLAG";
-   case 0xD9 ... 0xDF: return "ACCESS_FLAG";
-   case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
-   case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
+#define PANFROST_EXCEPTION(id) \
+   [DRM_PANFROST_EXCEPTION_ ## id] = { \
+   .name = #id, \
}
 
-   return "UNKNOWN";
+struct panfrost_exception_info {
+   const char *name;
+};
+
+static const struct panfrost_exception_info panfrost_exception_infos[] = {
+   PANFROST_EXCEPTION(OK),
+   PANFROST_EXCEPTION(DONE),
+   PANFROST_EXCEPTION(INTERRUPTED),
+   PANFROST_EXCEPTION(STOPPED),
+   PANFROST_EXCEPTION(TERMINATED),
+   PANFROST_EXCEPTION(KABOOM),
+   PANFROST_EXCEPTION(EUREKA),
+   PANFROST_EXCEPTION(ACTIVE),
+   PANFROST_EXCEPTION(JOB_CONFIG_FAULT),
+   PANFROST_EXCEPTION(JOB_POWER_FAULT),
+   PANFROST_EXCEPTION(JOB_READ_FAULT),
+   PANFROST_EXCEPTION(JOB_WRITE_FAULT),
+   PANFROST_EXCEPTION(JOB_AFFINITY_FAULT),
+   PANFROST_EXCEPTION(JOB_BUS_FAULT),
+   PANFROST_EXCEPTION(INSTR_INVALID_PC),
+   PANFROST_EXCEPTION(INSTR_INVALID_ENC),
+   PANFROST_EXCEPTION(INSTR_TYPE_MISMATCH),
+   PANFROST_EXCEPTION(INSTR_OPERAND_FAULT),
+   PANFROST_EXCEPTION(INSTR_TLS_FAULT),
+   PANFROST_EXCEPTION(INSTR_BARRIER_FAULT),
+   PANFROST_EXCEPTION(INSTR_ALIGN_FAULT),
+   PANFROST_EXCEPTION(DATA_INVALID_FAULT),
+   PANFROST_EXCEPTION(TILE_RANGE_FAULT),
+   PANFROST_EXCEPTION(ADDR_RANGE_FAULT),
+   PANFROST_EXCEPTION(IMPRECISE_FAULT),
+   PANFROST_EXCEPTION(OOM),
+   PANFROST_EXCEPTION(OOM_AFBC),
+   PANFROST_EXCEP

[PATCH v4 07/14] drm/panfrost: Use a threaded IRQ for job interrupts

2021-06-28 Thread Boris Brezillon
This should avoid switching to interrupt context when the GPU is under
heavy use.

v3:
* Don't take the job_lock in panfrost_job_handle_irq()
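The hard-IRQ/thread split added below can be modelled in userspace: the hard
handler only masks and defers, the thread drains RAWSTAT in a loop and
unmasks once it is empty. Registers become plain variables; this is a sketch
of the handshake, not the driver code:

```c
#include <assert.h>

#define IRQ_NONE 0
#define IRQ_WAKE_THREAD 1
#define IRQ_HANDLED 2

static unsigned int int_stat;    /* pending & unmasked interrupts */
static unsigned int int_rawstat; /* pending interrupts, mask ignored */
static unsigned int int_mask = ~0u;
static int jobs_handled;

static int hard_handler(void)
{
	if (!(int_stat & int_mask))
		return IRQ_NONE;
	int_mask = 0; /* mask everything, let the thread do the work */
	return IRQ_WAKE_THREAD;
}

static int thread_handler(void)
{
	while (int_rawstat) {
		jobs_handled++;
		int_rawstat = 0; /* "handle" and clear the pending bits */
		int_stat = 0;
	}
	int_mask = ~0u; /* re-enable job interrupts */
	return IRQ_HANDLED;
}
```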

Signed-off-by: Boris Brezillon 
Acked-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++---
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index be8f68f63974..e0c479e67304 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops 
panfrost_sched_ops = {
.free_job = panfrost_job_free
 };
 
-static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 {
-   struct panfrost_device *pfdev = data;
-   u32 status = job_read(pfdev, JOB_INT_STAT);
int j;
 
dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
 
-   if (!status)
-   return IRQ_NONE;
-
-   pm_runtime_mark_last_busy(pfdev->dev);
-
for (j = 0; status; j++) {
u32 mask = MK_JS_MASK(j);
 
@@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
if (status & JOB_INT_MASK_DONE(j)) {
struct panfrost_job *job;
 
-   spin_lock(&pfdev->js->job_lock);
job = pfdev->jobs[j];
/* Only NULL if job timeout occurred */
if (job) {
@@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
dma_fence_signal_locked(job->done_fence);
pm_runtime_put_autosuspend(pfdev->dev);
}
-   spin_unlock(&pfdev->js->job_lock);
}
 
status &= ~mask;
}
+}
 
+static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
+
+   while (status) {
+   pm_runtime_mark_last_busy(pfdev->dev);
+
+   spin_lock(&pfdev->js->job_lock);
+   panfrost_job_handle_irq(pfdev, status);
+   spin_unlock(&pfdev->js->job_lock);
+   status = job_read(pfdev, JOB_INT_RAWSTAT);
+   }
+
+   job_write(pfdev, JOB_INT_MASK,
+ GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
+ GENMASK(NUM_JOB_SLOTS - 1, 0));
return IRQ_HANDLED;
 }
 
+static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_STAT);
+
+   if (!status)
+   return IRQ_NONE;
+
+   job_write(pfdev, JOB_INT_MASK, 0);
+   return IRQ_WAKE_THREAD;
+}
+
 static void panfrost_reset(struct work_struct *work)
 {
struct panfrost_device *pfdev = container_of(work,
 struct panfrost_device,
 reset.work);
-   unsigned long flags;
unsigned int i;
bool cookie;
 
@@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
/* All timers have been stopped, we can safely reset the pending state. 
*/
atomic_set(&pfdev->reset.pending, 0);
 
-   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
pm_runtime_put_noidle(pfdev->dev);
@@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
pfdev->jobs[i] = NULL;
}
}
-   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
+   spin_unlock(&pfdev->js->job_lock);
 
panfrost_device_reset(pfdev);
 
@@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
if (irq <= 0)
return -ENODEV;
 
-   ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
-  IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
+   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   panfrost_job_irq_handler,
+   panfrost_job_irq_handler_thread,
+   IRQF_SHARED, KBUILD_MODNAME "-job",
+   pfdev);
if (ret) {
dev_err(pfdev->dev, "failed to request job irq");
return ret;
-- 
2.31.1



[PATCH v4 02/14] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate

2021-06-28 Thread Boris Brezillon
If the fence creation fails, we can return the error pointer directly.
The core will update the fence error accordingly.
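The ERR_PTR() convention lets a pointer-returning function carry a negative
errno in the pointer value instead of collapsing every failure to NULL. A
userspace sketch of the kernel helpers and of the run-job error path
(`fence_create` is a hypothetical stand-in for panfrost_fence_create()):

```c
#include <assert.h>
#include <errno.h>

#define MAX_ERRNO 4095

static void *ERR_PTR(long error)
{
	return (void *)error;
}

static long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static int IS_ERR(const void *ptr)
{
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical fence constructor that fails with -ENOMEM. */
static void *fence_create(void)
{
	return ERR_PTR(-ENOMEM);
}

static void *run_job(void)
{
	void *fence = fence_create();

	if (IS_ERR(fence))
		return fence; /* was "return NULL" before the patch */

	/* ... program the job slot ... */
	return fence;
}
```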

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Alyssa Rosenzweig 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 8ff79fd49577..d6c9698bca3b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -355,7 +355,7 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
 
fence = panfrost_fence_create(pfdev, slot);
if (IS_ERR(fence))
-   return NULL;
+   return fence;
 
if (job->done_fence)
dma_fence_put(job->done_fence);
-- 
2.31.1



[PATCH v4 00/14] drm/panfrost: Misc improvements

2021-06-28 Thread Boris Brezillon
Hello,

This is a merge of [1] and [2] since the second series depends on
patches in the preparatory series.

main changes in this v4:
* fixing the reset serialization
* fixing a deadlock in the reset path
* moving the exception enum to a private header

Regards,

Boris

Boris Brezillon (13):
  drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  drm/panfrost: Drop the pfdev argument passed to
panfrost_exception_name()
  drm/panfrost: Do the exception -> string translation using a table
  drm/panfrost: Expose a helper to trigger a GPU reset
  drm/panfrost: Use a threaded IRQ for job interrupts
  drm/panfrost: Simplify the reset serialization logic
  drm/panfrost: Make sure job interrupts are masked before resetting
  drm/panfrost: Disable the AS on unhandled page faults
  drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  drm/panfrost: Don't reset the GPU on job faults unless we really have
to
  drm/panfrost: Kill in-flight jobs on FD close

Steven Price (1):
  drm/panfrost: Queue jobs on the hardware

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c|   3 +-
 drivers/gpu/drm/lima/lima_sched.c  |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.c | 139 +++--
 drivers/gpu/drm/panfrost/panfrost_device.h |  84 ++-
 drivers/gpu/drm/panfrost/panfrost_gpu.c|   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 630 +++--
 drivers/gpu/drm/panfrost/panfrost_mmu.c|  41 +-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |   3 -
 drivers/gpu/drm/scheduler/sched_main.c |  14 +-
 drivers/gpu/drm/v3d/v3d_sched.c|  10 +-
 include/drm/gpu_scheduler.h|   5 +-
 12 files changed, 681 insertions(+), 255 deletions(-)

-- 
2.31.1



Re: [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 15:33:21 +0200
Boris Brezillon  wrote:


> @@ -379,57 +370,73 @@ void panfrost_job_enable_interrupts(struct 
> panfrost_device *pfdev)
>   job_write(pfdev, JOB_INT_MASK, irq_mask);
>  }
>  
> -static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
> - struct drm_sched_job *bad)
> +static void panfrost_reset(struct panfrost_device *pfdev,
> +struct drm_sched_job *bad)
>  {
> - enum panfrost_queue_status old_status;
> - bool stopped = false;
> + unsigned int i;
> + bool cookie;
>  
> - mutex_lock(&queue->lock);
> - old_status = atomic_xchg(&queue->status,
> -  PANFROST_QUEUE_STATUS_STOPPED);
> - if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
> - goto out;
> + if (WARN_ON(!atomic_read(&pfdev->reset.pending)))
> + return;
> +
> + /* Stop the schedulers.
> +  *
> +  * FIXME: We temporarily get out of the dma_fence_signalling section
> +  * because the cleanup path generate lockdep splats when taking locks
> +  * to release job resources. We should rework the code to follow this
> +  * pattern:
> +  *
> +  *  try_lock
> +  *  if (locked)
> +  *  release
> +  *  else
> +  *  schedule_work_to_release_later
> +  */
> + for (i = 0; i < NUM_JOB_SLOTS; i++)
> + drm_sched_stop(&pfdev->js->queue[i].sched, bad);
> +
> + cookie = dma_fence_begin_signalling();
>  
> - WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
> - drm_sched_stop(&queue->sched, bad);
>   if (bad)
>   drm_sched_increase_karma(bad);
>  
> - stopped = true;
> + spin_lock(&pfdev->js->job_lock);
> + for (i = 0; i < NUM_JOB_SLOTS; i++) {
> + if (pfdev->jobs[i]) {
> + pm_runtime_put_noidle(pfdev->dev);
> + panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
> + pfdev->jobs[i] = NULL;
> + }
> + }
> + spin_unlock(&pfdev->js->job_lock);
>  
> - /*
> -  * Set the timeout to max so the timer doesn't get started
> -  * when we return from the timeout handler (restored in
> -  * panfrost_scheduler_start()).
> + panfrost_device_reset(pfdev);
> +
> + /* GPU has been reset, we can cancel timeout/fault work that may have
> +  * been queued in the meantime and clear the reset pending bit.
>*/
> - queue->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> + atomic_set(&pfdev->reset.pending, 0);
> + cancel_work_sync(&pfdev->reset.work);

This is introducing a deadlock since panfrost_reset() might be called
from the reset handler, and cancel_work_sync() waits for the handler to
return. Unfortunately there's no cancel_work() variant, so I'll just
remove the

WARN_ON(!atomic_read(&pfdev->reset.pending))

and return directly when the pending bit is cleared.

> + for (i = 0; i < NUM_JOB_SLOTS; i++)
> + cancel_delayed_work(&pfdev->js->queue[i].sched.work_tdr);
>  
> -out:
> - mutex_unlock(&queue->lock);
>  
> - return stopped;
> -}
> + /* Now resubmit jobs that were previously queued but didn't have a
> +  * chance to finish.
> +  * FIXME: We temporarily get out of the DMA fence signalling section
> +  * while resubmitting jobs because the job submission logic will
> +  * allocate memory with the GFP_KERNEL flag which can trigger memory
> +  * reclaim and exposes a lock ordering issue.
> +  */
> + dma_fence_end_signalling(cookie);
> + for (i = 0; i < NUM_JOB_SLOTS; i++)
> + drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
> + cookie = dma_fence_begin_signalling();
>  
> -static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
> -{
> - enum panfrost_queue_status old_status;
> + for (i = 0; i < NUM_JOB_SLOTS; i++)
> + drm_sched_start(&pfdev->js->queue[i].sched, true);
>  
> - mutex_lock(&queue->lock);
> - old_status = atomic_xchg(&queue->status,
> -  PANFROST_QUEUE_STATUS_STARTING);
> - WARN_ON(old_status != PANFROST_QUEUE_STATUS_STOPPED);
> -
> - /* Restore the original timeout before starting the scheduler. */
> - queue->sched.timeout = msecs_to_jiffies(JOB_TIMEOUT_MS);
> - drm_sched_resubmit_jobs(&queue->sched);
> - drm_sched_start(&queue->sched, true);
> - old_status = atomic_xchg(&queue->status,
> -  PANFROST_QUEUE_STATUS_ACTIVE);
> - if (old_status == PANFROST_QUEUE_STATUS_FAULT_PENDING)
> - drm_sched_fault(&queue->sched);
> -
> - mutex_unlock(&queue->lock);
> + dma_fence_end_signalling(cookie);
>  }
>  


Re: [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 16:55:12 +0100
Steven Price  wrote:

> On 25/06/2021 14:33, Boris Brezillon wrote:
> > This is not yet needed because we let active jobs be killed by the
> > reset and we don't really bother making sure they can be restarted.
> > But once we start adding soft-stop support, controlling when we deal
> > with the remaining interrupts and making sure those are handled before
> > the reset is issued gets tricky if we keep job interrupts active.
> > 
> > Let's prepare for that and mask+flush job IRQs before issuing a reset.
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_job.c | 21 +++--
> >  1 file changed, 15 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> > b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index 88d34fd781e8..0566e2f7e84a 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -34,6 +34,7 @@ struct panfrost_queue_state {
> >  struct panfrost_job_slot {
> > struct panfrost_queue_state queue[NUM_JOB_SLOTS];
> > spinlock_t job_lock;
> > +   int irq;
> >  };
> >  
> >  static struct panfrost_job *
> > @@ -400,7 +401,15 @@ static void panfrost_reset(struct panfrost_device 
> > *pfdev,
> > if (bad)
> > drm_sched_increase_karma(bad);
> >  
> > -   spin_lock(&pfdev->js->job_lock);  
> 
> I'm not sure it's safe to remove this lock as this protects the
> pfdev->jobs array: I can't see what would prevent panfrost_job_close()
> running at the same time without the lock. Am I missing something?

Ah, you're right, I'll add it back.

> 
> > +   /* Mask job interrupts and synchronize to make sure we won't be
> > +* interrupted during our reset.
> > +*/
> > +   job_write(pfdev, JOB_INT_MASK, 0);
> > +   synchronize_irq(pfdev->js->irq);
> > +
> > +   /* Schedulers are stopped and interrupts are masked+flushed, we don't
> > +* need to protect the 'evict unfinished jobs' lock with the job_lock.
> > +*/
> > for (i = 0; i < NUM_JOB_SLOTS; i++) {
> > if (pfdev->jobs[i]) {
> > pm_runtime_put_noidle(pfdev->dev);
> > @@ -408,7 +417,6 @@ static void panfrost_reset(struct panfrost_device 
> > *pfdev,
> > pfdev->jobs[i] = NULL;
> > }
> > }
> > -   spin_unlock(&pfdev->js->job_lock);
> >  
> > panfrost_device_reset(pfdev);
> >  
> > @@ -504,6 +512,7 @@ static void panfrost_job_handle_irq(struct 
> > panfrost_device *pfdev, u32 status)
> >  
> > job = pfdev->jobs[j];
> > /* Only NULL if job timeout occurred */
> > +   WARN_ON(!job);  
> 
> Was this WARN_ON intentional?

Yes, now that we mask and synchronize the IRQ in the reset path, I don't
see any reason why we would end up with an event but no job to attach
that event to, but maybe I missed something.


Re: [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 16:32:27 +0100
Steven Price  wrote:

> On 25/06/2021 15:21, Boris Brezillon wrote:
> > On Fri, 25 Jun 2021 09:42:08 -0400
> > Alyssa Rosenzweig  wrote:
> >   
> >> I'm not convinced. Right now most of our UABI is pleasantly
> >> GPU-agnostic. With this suddenly there's divergence between Midgard and
> >> Bifrost uABI.  
> > 
> > Hm, I don't see why. I mean the exception types seem to be the same,
> > there are just some that are not used on Midgard and some that are not
> > used on Bifrost. Are there any collisions I didn't notice?  
> 
> I think the real question is: why are we exporting them if user space
> doesn't want them ;) Should this be in an internal header file at least
> until someone actually requests they be available to user space?

Alright, I'll move it to panfrost_device.h (or panfrost_regs.h) then.


Re: [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 16:07:03 +0100
Steven Price  wrote:

> On 25/06/2021 14:33, Boris Brezillon wrote:
> > Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> > reset. This leads to extra complexity when we need to synchronize timeout
> > works with the reset work. One solution to address that is to have an
> > ordered workqueue at the driver level that will be used by the different
> > schedulers to queue their timeout work. Thanks to the serialization
> > provided by the ordered workqueue we are guaranteed that timeout
> > handlers are executed sequentially, and can thus easily reset the GPU
> > from the timeout handler without extra synchronization.
> > 
> > Signed-off-by: Boris Brezillon   
> 
> I feel like I'm missing something here - I can't see where
> sched->timeout_wq is ever actually used in this series. There's clearly
> no point passing it into the drm core if the drm core never accesses it.
> AFAICT the changes are all in patch 9 and that doesn't depend on this one.

Oops, indeed, I forgot to patch sched_main.c to use the timeout_wq (below
is a version doing that). We really need a way to trigger this sort of
race...

--->8---
From 18bb739da5a5fc3e36d2c4378408c6938198993c Mon Sep 17 00:00:00 2001
From: Boris Brezillon 
Date: Wed, 23 Jun 2021 16:14:01 +0200
Subject: [PATCH] drm/sched: Allow using a dedicated workqueue for the
 timeout/fault tdr

Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c| 14 +-
 drivers/gpu/drm/v3d/v3d_sched.c   | 10 +-
 include/drm/gpu_scheduler.h   |  5 -
 7 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..532636ea20bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
   num_hw_submission, amdgpu_job_hang_limit,
-  timeout, sched_score, ring->name);
+  timeout, NULL, sched_score, ring->name);
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
  ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 19826e504efc..feb6da1b6ceb 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 
ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
 etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
+msecs_to_jiffies(500), NULL, NULL,
+dev_name(gpu->dev));
if (ret)
return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c 
b/drivers/gpu/drm/lima/lima_sched.c
index ecf3267334ff..dba8329937a3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, 
const char *name)
INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
- lima_job_hang_limit, msecs_to_jiffies(timeout),
+ lima_job_hang_limit,
+ msecs_to_jiffies(timeout), NULL,
  NULL, name);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 682f2161b999..8ff79fd49577 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
ret =

Re: [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 15:43:45 +0200
Lucas Stach  wrote:

> Am Freitag, dem 25.06.2021 um 15:33 +0200 schrieb Boris Brezillon:
> > If the process who submitted these jobs decided to close the FD before
> > the jobs are done it probably means it doesn't care about the result.
> > 
> > v3:
> > * Set fence error to ECANCELED when a TERMINATED exception is received
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_job.c | 43 +
> >  1 file changed, 37 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> > b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index 948bd174ff99..aa1e6542adde 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -498,14 +498,21 @@ static void panfrost_job_handle_irq(struct 
> > panfrost_device *pfdev, u32 status)
> >  
> > if (status & JOB_INT_MASK_ERR(j)) {
> > u32 js_status = job_read(pfdev, JS_STATUS(j));
> > +   const char *exception_name = 
> > panfrost_exception_name(js_status);
> >  
> > job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
> >  
> > -   dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
> > head=0x%x, tail=0x%x",
> > -   j,
> > -   panfrost_exception_name(js_status),
> > -   job_read(pfdev, JS_HEAD_LO(j)),
> > -   job_read(pfdev, JS_TAIL_LO(j)));
> > +   if (js_status < 
> > DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
> > +   dev_dbg(pfdev->dev, "js interrupt, js=%d, 
> > status=%s, head=0x%x, tail=0x%x",
> > +   j, exception_name,
> > +   job_read(pfdev, JS_HEAD_LO(j)),
> > +   job_read(pfdev, JS_TAIL_LO(j)));
> > +   } else {
> > +   dev_err(pfdev->dev, "js fault, js=%d, 
> > status=%s, head=0x%x, tail=0x%x",
> > +   j, exception_name,
> > +   job_read(pfdev, JS_HEAD_LO(j)),
> > +   job_read(pfdev, JS_TAIL_LO(j)));
> > +   }
> >  
> > /* If we need a reset, signal it to the timeout
> >  * handler, otherwise, update the fence error field and
> > @@ -514,7 +521,16 @@ static void panfrost_job_handle_irq(struct 
> > panfrost_device *pfdev, u32 status)
> > if (panfrost_exception_needs_reset(pfdev, js_status)) {
> > drm_sched_fault(&pfdev->js->queue[j].sched);
> > } else {
> > -   dma_fence_set_error(pfdev->jobs[j]->done_fence, 
> > -EINVAL);
> > +   int error = 0;
> > +
> > +   if (js_status == 
> > DRM_PANFROST_EXCEPTION_TERMINATED)
> > +   error = -ECANCELED;
> > +   else if (js_status >= 
> > DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
> > +   error = -EINVAL;
> > +
> > +   if (error)
> > +   
> > dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
> > +
> > status |= JOB_INT_MASK_DONE(j);
> > }
> > }
> > @@ -673,10 +689,25 @@ int panfrost_job_open(struct panfrost_file_priv 
> > *panfrost_priv)
> >  
> >  void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
> >  {
> > +   struct panfrost_device *pfdev = panfrost_priv->pfdev;
> > +   unsigned long flags;
> > int i;
> >  
> > for (i = 0; i < NUM_JOB_SLOTS; i++)
> > drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
> > +
> > +   /* Kill in-flight jobs */
> > +   spin_lock_irqsave(&pfdev->js->job_lock, flags);  
> 
> Micro-optimization, but this code is never called from IRQ context, so
> a spin_lock_irq would do here, no need to save/restore flags.

Ah, right, I moved patches around. This patch was before the 'move to
threaded-irq' one in v2, but now that it's coming after, we can use a
regular lock here.

> 
> Regards,
> Lucas
> 
> > +   for (i = 0; i < NUM_JOB_SLOTS; i++) {
> > +   struct drm_sched_entity *entity = 
> > &panfrost_priv->sched_entity[i];
> > +   struct panfrost_job *job = pfdev->jobs[i];
> > +
> > +   if (!job || job->base.entity != entity)
> > +   continue;
> > +
> > +   job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
> > +   }
> > +   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> >  }
> >  
> >  int panfrost_job_is_idle(struct panfrost_device *pfdev)  
> 
> 



Re: [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 09:47:59 -0400
Alyssa Rosenzweig  wrote:

> A-b, but could you explain the context? Thanks

The rationale behind this change is the complexity added to the
interrupt handler in patch 15. After that patch we might spend more time
in interrupt context and block other things on the system while we
dequeue job IRQs. Moving things to a thread also helps performance when
the GPU is faster at executing jobs than the CPU is at queueing them; in
that case we keep switching back and forth between interrupt and
non-interrupt context, which has a cost.

One drawback is increased latency when receiving job events and the
thread is idle, since you need to wake up the thread in that case.

> 
> On Fri, Jun 25, 2021 at 03:33:20PM +0200, Boris Brezillon wrote:
> > This should avoid switching to interrupt context when the GPU is under
> > heavy use.
> > 
> > v3:
> > * Don't take the job_lock in panfrost_job_handle_irq()
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++---
> >  1 file changed, 38 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> > b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index be8f68f63974..e0c479e67304 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops 
> > panfrost_sched_ops = {
> > .free_job = panfrost_job_free
> >  };
> >  
> > -static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> > +static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 
> > status)
> >  {
> > -   struct panfrost_device *pfdev = data;
> > -   u32 status = job_read(pfdev, JOB_INT_STAT);
> > int j;
> >  
> > dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
> >  
> > -   if (!status)
> > -   return IRQ_NONE;
> > -
> > -   pm_runtime_mark_last_busy(pfdev->dev);
> > -
> > for (j = 0; status; j++) {
> > u32 mask = MK_JS_MASK(j);
> >  
> > @@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, 
> > void *data)
> > if (status & JOB_INT_MASK_DONE(j)) {
> > struct panfrost_job *job;
> >  
> > -   spin_lock(&pfdev->js->job_lock);
> > job = pfdev->jobs[j];
> > /* Only NULL if job timeout occurred */
> > if (job) {
> > @@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, 
> > void *data)
> > dma_fence_signal_locked(job->done_fence);
> > pm_runtime_put_autosuspend(pfdev->dev);
> > }
> > -   spin_unlock(&pfdev->js->job_lock);
> > }
> >  
> > status &= ~mask;
> > }
> > +}
> >  
> > +static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
> > +{
> > +   struct panfrost_device *pfdev = data;
> > +   u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
> > +
> > +   while (status) {
> > +   pm_runtime_mark_last_busy(pfdev->dev);
> > +
> > +   spin_lock(&pfdev->js->job_lock);
> > +   panfrost_job_handle_irq(pfdev, status);
> > +   spin_unlock(&pfdev->js->job_lock);
> > +   status = job_read(pfdev, JOB_INT_RAWSTAT);
> > +   }
> > +
> > +   job_write(pfdev, JOB_INT_MASK,
> > + GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
> > + GENMASK(NUM_JOB_SLOTS - 1, 0));
> > return IRQ_HANDLED;
> >  }
> >  
> > +static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> > +{
> > +   struct panfrost_device *pfdev = data;
> > +   u32 status = job_read(pfdev, JOB_INT_STAT);
> > +
> > +   if (!status)
> > +   return IRQ_NONE;
> > +
> > +   job_write(pfdev, JOB_INT_MASK, 0);
> > +   return IRQ_WAKE_THREAD;
> > +}
> > +
> >  static void panfrost_reset(struct work_struct *work)
> >  {
> > struct panfrost_device *pfdev = container_of(work,
> >  struct panfrost_device,
> >  reset.work);
> > -   unsigned long flags;
> > unsigned int i;
> > bool cookie;
> >  
> > @@ -

Re: [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace

2021-06-25 Thread Boris Brezillon
On Fri, 25 Jun 2021 09:42:08 -0400
Alyssa Rosenzweig  wrote:

> I'm not convinced. Right now most of our UABI is pleasantly
> GPU-agnostic. With this suddenly there's divergence between Midgard and
> Bifrost uABI.

Hm, I don't see why. I mean the exception types seem to be the same,
there are just some that are not used on Midgard and some that are not
used on Bifrost. Are there any collisions I didn't notice?

> With that drawback in mind, could you explain the benefit?

Well, I thought having these definitions in a central place would be a
good thing given they're not expected to change even if they might
be per-GPU. I don't know if that changes with CSF, maybe the exception
codes are no longer set in stone and can change with FW update...

> 
> On Fri, Jun 25, 2021 at 03:33:17PM +0200, Boris Brezillon wrote:
> > Job headers contain an exception type field which might be read and
> > converted to a human readable string by tracing tools. Let's expose
> > the exception type as an enum so we share the same definition.
> > 
> > v3:
> > * Add missing values
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  include/uapi/drm/panfrost_drm.h | 71 +
> >  1 file changed, 71 insertions(+)
> > 
> > diff --git a/include/uapi/drm/panfrost_drm.h 
> > b/include/uapi/drm/panfrost_drm.h
> > index ec19db1eead8..899cd6d952d4 100644
> > --- a/include/uapi/drm/panfrost_drm.h
> > +++ b/include/uapi/drm/panfrost_drm.h
> > @@ -223,6 +223,77 @@ struct drm_panfrost_madvise {
> > __u32 retained;   /* out, whether backing store still exists */
> >  };
> >  
> > +/* The exception types */
> > +
> > +enum drm_panfrost_exception_type {
> > +   DRM_PANFROST_EXCEPTION_OK = 0x00,
> > +   DRM_PANFROST_EXCEPTION_DONE = 0x01,
> > +   DRM_PANFROST_EXCEPTION_INTERRUPTED = 0x02,
> > +   DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
> > +   DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
> > +   DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
> > +   DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
> > +   DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
> > +   DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
> > +   DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
> > +   DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
> > +   DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
> > +   DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
> > +   DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
> > +   DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
> > +   DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
> > +   DRM_PANFROST_EXCEPTION_INSTR_TYPE_MISMATCH = 0x52,
> > +   DRM_PANFROST_EXCEPTION_INSTR_OPERAND_FAULT = 0x53,
> > +   DRM_PANFROST_EXCEPTION_INSTR_TLS_FAULT = 0x54,
> > +   DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
> > +   DRM_PANFROST_EXCEPTION_INSTR_ALIGN_FAULT = 0x56,
> > +   DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
> > +   DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
> > +   DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
> > +   DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
> > +   DRM_PANFROST_EXCEPTION_OOM = 0x60,
> > +   DRM_PANFROST_EXCEPTION_OOM_AFBC = 0x61,
> > +   DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
> > +   DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
> > +   DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
> > +   DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
> > +   DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
> > +   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
> > +   DRM_PANFRO

[PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults

2021-06-25 Thread Boris Brezillon
If we don't do that, we have to wait for the job timeout to expire
before the faulting jobs get killed.

v3:
* Make sure the AS is re-enabled when new jobs are submitted to the
  context

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 34 --
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index bfe32907ba6b..efe9a675b614 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -96,6 +96,7 @@ struct panfrost_device {
spinlock_t as_lock;
unsigned long as_in_use_mask;
unsigned long as_alloc_mask;
+   unsigned long as_faulty_mask;
struct list_head as_lru_list;
 
struct panfrost_job_slot *js;
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index b4f0c673cd7f..65e98c51cb66 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -154,6 +154,7 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, 
struct panfrost_mmu *mmu)
as = mmu->as;
if (as >= 0) {
int en = atomic_inc_return(&mmu->as_count);
+   u32 mask = BIT(as) | BIT(16 + as);
 
/*
 * AS can be retained by active jobs or a perfcnt context,
@@ -162,6 +163,18 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, 
struct panfrost_mmu *mmu)
WARN_ON(en >= (NUM_JOB_SLOTS + 1));
 
list_move(&mmu->list, &pfdev->as_lru_list);
+
+   if (pfdev->as_faulty_mask & mask) {
+   /* Unhandled pagefault on this AS, the MMU was
+* disabled. We need to re-enable the MMU after
+* clearing+unmasking the AS interrupts.
+*/
+   mmu_write(pfdev, MMU_INT_CLEAR, mask);
+   mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+   pfdev->as_faulty_mask &= ~mask;
+   panfrost_mmu_enable(pfdev, mmu);
+   }
+
goto out;
}
 
@@ -211,6 +224,7 @@ void panfrost_mmu_reset(struct panfrost_device *pfdev)
spin_lock(&pfdev->as_lock);
 
pfdev->as_alloc_mask = 0;
+   pfdev->as_faulty_mask = 0;
 
list_for_each_entry_safe(mmu, mmu_tmp, &pfdev->as_lru_list, list) {
mmu->as = -1;
@@ -662,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 
0xC0)
ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
 
-   if (ret)
+   if (ret) {
/* terminal fault, print info about the fault */
dev_err(pfdev->dev,
"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -680,14 +694,28 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
irq, void *data)
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
+   spin_lock(&pfdev->as_lock);
+   /* Ignore MMU interrupts on this AS until it's been
+* re-enabled.
+*/
+   pfdev->as_faulty_mask |= mask;
+
+   /* Disable the MMU to kill jobs on this AS. */
+   panfrost_mmu_disable(pfdev, as);
+   spin_unlock(&pfdev->as_lock);
+   }
+
status &= ~mask;
 
/* If we received new MMU interrupts, process them before 
returning. */
if (!status)
-   status = mmu_read(pfdev, MMU_INT_RAWSTAT);
+   status = mmu_read(pfdev, MMU_INT_RAWSTAT) & 
~pfdev->as_faulty_mask;
}
 
-   mmu_write(pfdev, MMU_INT_MASK, ~0);
+   spin_lock(&pfdev->as_lock);
+   mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+   spin_unlock(&pfdev->as_lock);
+
return IRQ_HANDLED;
 };
 
-- 
2.31.1



[PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate

2021-06-25 Thread Boris Brezillon
If the fence creation fail, we can return the error pointer directly.
The core will update the fence error accordingly.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 8ff79fd49577..d6c9698bca3b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -355,7 +355,7 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
 
fence = panfrost_fence_create(pfdev, slot);
if (IS_ERR(fence))
-   return NULL;
+   return fence;
 
if (job->done_fence)
dma_fence_put(job->done_fence);
-- 
2.31.1



[PATCH v3 05/15] drm/panfrost: Expose exception types to userspace

2021-06-25 Thread Boris Brezillon
Job headers contain an exception type field which might be read and
converted to a human readable string by tracing tools. Let's expose
the exception type as an enum so we share the same definition.

v3:
* Add missing values

Signed-off-by: Boris Brezillon 
---
 include/uapi/drm/panfrost_drm.h | 71 +
 1 file changed, 71 insertions(+)

diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index ec19db1eead8..899cd6d952d4 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -223,6 +223,77 @@ struct drm_panfrost_madvise {
__u32 retained;   /* out, whether backing store still exists */
 };
 
+/* The exception types */
+
+enum drm_panfrost_exception_type {
+   DRM_PANFROST_EXCEPTION_OK = 0x00,
+   DRM_PANFROST_EXCEPTION_DONE = 0x01,
+   DRM_PANFROST_EXCEPTION_INTERRUPTED = 0x02,
+   DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
+   DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
+   DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
+   DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
+   DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
+   DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
+   DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
+   DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
+   DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
+   DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
+   DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
+   DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
+   DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
+   DRM_PANFROST_EXCEPTION_INSTR_TYPE_MISMATCH = 0x52,
+   DRM_PANFROST_EXCEPTION_INSTR_OPERAND_FAULT = 0x53,
+   DRM_PANFROST_EXCEPTION_INSTR_TLS_FAULT = 0x54,
+   DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
+   DRM_PANFROST_EXCEPTION_INSTR_ALIGN_FAULT = 0x56,
+   DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
+   DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
+   DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
+   DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
+   DRM_PANFROST_EXCEPTION_OOM = 0x60,
+   DRM_PANFROST_EXCEPTION_OOM_AFBC = 0x61,
+   DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
+   DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
+   DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
+   DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
+   DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.31.1



[PATCH v3 15/15] drm/panfrost: Queue jobs on the hardware

2021-06-25 Thread Boris Brezillon
From: Steven Price 

The hardware has a set of '_NEXT' registers that can hold a second job
while the first is executing. Make use of these registers to enqueue a
second job per slot.

v3:
* Fix the done/err job dequeuing logic to get a valid active state
* Only enable the second slot on GPUs supporting jobchain disambiguation
* Split interrupt handling in sub-functions

Signed-off-by: Steven Price 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 473 -
 2 files changed, 357 insertions(+), 118 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index ecbc79ad0006..65a7b9b08f3a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -101,7 +101,7 @@ struct panfrost_device {
 
struct panfrost_job_slot *js;
 
-   struct panfrost_job *jobs[NUM_JOB_SLOTS];
+   struct panfrost_job *jobs[NUM_JOB_SLOTS][2];
struct list_head scheduled_jobs;
 
struct panfrost_perfcnt *perfcnt;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index aa1e6542adde..0d0011cbe864 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -140,9 +141,52 @@ static void panfrost_job_write_affinity(struct 
panfrost_device *pfdev,
job_write(pfdev, JS_AFFINITY_NEXT_HI(js), affinity >> 32);
 }
 
+static u32
+panfrost_get_job_chain_flag(const struct panfrost_job *job)
+{
+   struct panfrost_fence *f = to_panfrost_fence(job->done_fence);
+
+   if (!panfrost_has_hw_feature(job->pfdev, 
HW_FEATURE_JOBCHAIN_DISAMBIGUATION))
+   return 0;
+
+   return (f->seqno & 1) ? JS_CONFIG_JOB_CHAIN_FLAG : 0;
+}
+
+static struct panfrost_job *
+panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
+{
+   struct panfrost_job *job = pfdev->jobs[slot][0];
+
+   WARN_ON(!job);
+   pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
+   pfdev->jobs[slot][1] = NULL;
+
+   return job;
+}
+
+static unsigned int
+panfrost_enqueue_job(struct panfrost_device *pfdev, int slot,
+struct panfrost_job *job)
+{
+   if (WARN_ON(!job))
+   return 0;
+
+   if (!pfdev->jobs[slot][0]) {
+   pfdev->jobs[slot][0] = job;
+   return 0;
+   }
+
+   WARN_ON(pfdev->jobs[slot][1]);
+   pfdev->jobs[slot][1] = job;
+   WARN_ON(panfrost_get_job_chain_flag(job) ==
+   panfrost_get_job_chain_flag(pfdev->jobs[slot][0]));
+   return 1;
+}
+
 static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 {
struct panfrost_device *pfdev = job->pfdev;
+   unsigned int subslot;
u32 cfg;
u64 jc_head = job->jc;
int ret;
@@ -168,7 +212,8 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
 * start */
cfg |= JS_CONFIG_THREAD_PRI(8) |
JS_CONFIG_START_FLUSH_CLEAN_INVALIDATE |
-   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE;
+   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE |
+   panfrost_get_job_chain_flag(job);
 
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_FLUSH_REDUCTION))
cfg |= JS_CONFIG_ENABLE_FLUSH_REDUCTION;
@@ -182,10 +227,17 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
job_write(pfdev, JS_FLUSH_ID_NEXT(js), job->flush_id);
 
/* GO ! */
-   dev_dbg(pfdev->dev, "JS: Submitting atom %p to js[%d] with head=0x%llx",
-   job, js, jc_head);
 
-   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   spin_lock(&pfdev->js->job_lock);
+   subslot = panfrost_enqueue_job(pfdev, js, job);
+   /* Don't queue the job if a reset is in progress */
+   if (!atomic_read(&pfdev->reset.pending)) {
+   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   dev_dbg(pfdev->dev,
+   "JS: Submitting atom %p to js[%d][%d] with head=0x%llx 
AS %d",
+   job, js, subslot, jc_head, cfg & 0xf);
+   }
+   spin_unlock(&pfdev->js->job_lock);
 }
 
 static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
@@ -343,7 +395,11 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
if (unlikely(job->base.s_fence->finished.error))
return NULL;
 
-   pfdev->jobs[slot] = job;
+   /* Nothing to execute: can happen if the job has finished while
+* we were resetting the GPU.
+*/
+   if (!job->jc

[PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table

2021-06-25 Thread Boris Brezillon
Do the exception -> string translation using a table. This way we get
rid of those magic numbers and can easily add new fields if we need
to attach extra information to exception types.

v3:
* Drop the error field

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 130 +
 1 file changed, 83 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index bce6b0aff05e..736854542b05 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,55 +292,91 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(u32 exception_code)
-{
-   switch (exception_code) {
-   /* Non-Fault Status code */
-   case 0x00: return "NOT_STARTED/IDLE/OK";
-   case 0x01: return "DONE";
-   case 0x02: return "INTERRUPTED";
-   case 0x03: return "STOPPED";
-   case 0x04: return "TERMINATED";
-   case 0x08: return "ACTIVE";
-   /* Job exceptions */
-   case 0x40: return "JOB_CONFIG_FAULT";
-   case 0x41: return "JOB_POWER_FAULT";
-   case 0x42: return "JOB_READ_FAULT";
-   case 0x43: return "JOB_WRITE_FAULT";
-   case 0x44: return "JOB_AFFINITY_FAULT";
-   case 0x48: return "JOB_BUS_FAULT";
-   case 0x50: return "INSTR_INVALID_PC";
-   case 0x51: return "INSTR_INVALID_ENC";
-   case 0x52: return "INSTR_TYPE_MISMATCH";
-   case 0x53: return "INSTR_OPERAND_FAULT";
-   case 0x54: return "INSTR_TLS_FAULT";
-   case 0x55: return "INSTR_BARRIER_FAULT";
-   case 0x56: return "INSTR_ALIGN_FAULT";
-   case 0x58: return "DATA_INVALID_FAULT";
-   case 0x59: return "TILE_RANGE_FAULT";
-   case 0x5A: return "ADDR_RANGE_FAULT";
-   case 0x60: return "OUT_OF_MEMORY";
-   /* GPU exceptions */
-   case 0x80: return "DELAYED_BUS_FAULT";
-   case 0x88: return "SHAREABILITY_FAULT";
-   /* MMU exceptions */
-   case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
-   case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
-   case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
-   case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
-   case 0xC8: return "PERMISSION_FAULT";
-   case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
-   case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
-   case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
-   case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
-   case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
-   case 0xD8: return "ACCESS_FLAG";
-   case 0xD9 ... 0xDF: return "ACCESS_FLAG";
-   case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
-   case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
+#define PANFROST_EXCEPTION(id) \
+   [DRM_PANFROST_EXCEPTION_ ## id] = { \
+   .name = #id, \
}
 
-   return "UNKNOWN";
+struct panfrost_exception_info {
+   const char *name;
+};
+
+static const struct panfrost_exception_info panfrost_exception_infos[] = {
+   PANFROST_EXCEPTION(OK),
+   PANFROST_EXCEPTION(DONE),
+   PANFROST_EXCEPTION(INTERRUPTED),
+   PANFROST_EXCEPTION(STOPPED),
+   PANFROST_EXCEPTION(TERMINATED),
+   PANFROST_EXCEPTION(KABOOM),
+   PANFROST_EXCEPTION(EUREKA),
+   PANFROST_EXCEPTION(ACTIVE),
+   PANFROST_EXCEPTION(JOB_CONFIG_FAULT),
+   PANFROST_EXCEPTION(JOB_POWER_FAULT),
+   PANFROST_EXCEPTION(JOB_READ_FAULT),
+   PANFROST_EXCEPTION(JOB_WRITE_FAULT),
+   PANFROST_EXCEPTION(JOB_AFFINITY_FAULT),
+   PANFROST_EXCEPTION(JOB_BUS_FAULT),
+   PANFROST_EXCEPTION(INSTR_INVALID_PC),
+   PANFROST_EXCEPTION(INSTR_INVALID_ENC),
+   PANFROST_EXCEPTION(INSTR_TYPE_MISMATCH),
+   PANFROST_EXCEPTION(INSTR_OPERAND_FAULT),
+   PANFROST_EXCEPTION(INSTR_TLS_FAULT),
+   PANFROST_EXCEPTION(INSTR_BARRIER_FAULT),
+   PANFROST_EXCEPTION(INSTR_ALIGN_FAULT),
+   PANFROST_EXCEPTION(DATA_INVALID_FAULT),
+   PANFROST_EXCEPTION(TILE_RANGE_FAULT),
+   PANFROST_EXCEPTION(ADDR_RANGE_FAULT),
+   PANFROST_EXCEPTION(IMPRECISE_FAULT),
+   PANFROST_EXCEPTION(OOM),
+   PANFROST_EXCEPTION(OOM_AFBC),
+   PANFROST_EXCEPTION(UNKNOWN),
+   PANFROST_EXCEPTION(DELAYED_BUS_FAULT),
+   PANFROST_EXCEPTION(GPU_SHAREABILITY_FAULT),
+   PANFROST_EXCEPTION(SYS_SHAREABILITY_FAULT),
+   PANFROST_EXCEPTION(GPU_CACHEABILITY_FAULT),
+   PANFROST_EXCEPTION(TRANSLATION_FAULT_0),
+   PANFRO

[PATCH v3 13/15] drm/panfrost: Don't reset the GPU on job faults unless we really have to

2021-06-25 Thread Boris Brezillon
If we can recover from a fault without a reset, there's no reason to
issue one.

v3:
* Drop the mention of Valhall requiring a reset on JOB_BUS_FAULT
* Set the fence error to -EINVAL instead of having per-exception
  error codes

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c |  9 +
 drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_job.c| 16 ++--
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index 736854542b05..f4e42009526d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -379,6 +379,15 @@ const char *panfrost_exception_name(u32 exception_code)
return panfrost_exception_infos[exception_code].name;
 }
 
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code)
+{
+   /* Right now, none of the GPU we support need a reset, but this
+* might change.
+*/
+   return false;
+}
+
 void panfrost_device_reset(struct panfrost_device *pfdev)
 {
panfrost_gpu_soft_reset(pfdev);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index efe9a675b614..ecbc79ad0006 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -175,6 +175,8 @@ int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code);
 
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 0566e2f7e84a..948bd174ff99 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -497,14 +497,26 @@ static void panfrost_job_handle_irq(struct 
panfrost_device *pfdev, u32 status)
job_write(pfdev, JOB_INT_CLEAR, mask);
 
if (status & JOB_INT_MASK_ERR(j)) {
+   u32 js_status = job_read(pfdev, JS_STATUS(j));
+
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(job_read(pfdev, 
JS_STATUS(j))),
+   panfrost_exception_name(js_status),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
-   drm_sched_fault(&pfdev->js->queue[j].sched);
+
+   /* If we need a reset, signal it to the timeout
+* handler, otherwise, update the fence error field and
+* signal the job fence.
+*/
+   if (panfrost_exception_needs_reset(pfdev, js_status)) {
+   drm_sched_fault(&pfdev->js->queue[j].sched);
+   } else {
+   dma_fence_set_error(pfdev->jobs[j]->done_fence, 
-EINVAL);
+   status |= JOB_INT_MASK_DONE(j);
+   }
}
 
if (status & JOB_INT_MASK_DONE(j)) {
-- 
2.31.1



[PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting

2021-06-25 Thread Boris Brezillon
This is not yet needed because we let active jobs be killed by the
reset and we don't really bother making sure they can be restarted.
But once we start adding soft-stop support, controlling when we deal
with the remaining interrupts and making sure those are handled before
the reset is issued gets tricky if we keep job interrupts active.

Let's prepare for that and mask+flush job IRQs before issuing a reset.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 88d34fd781e8..0566e2f7e84a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -34,6 +34,7 @@ struct panfrost_queue_state {
 struct panfrost_job_slot {
struct panfrost_queue_state queue[NUM_JOB_SLOTS];
spinlock_t job_lock;
+   int irq;
 };
 
 static struct panfrost_job *
@@ -400,7 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
if (bad)
drm_sched_increase_karma(bad);
 
-   spin_lock(&pfdev->js->job_lock);
+   /* Mask job interrupts and synchronize to make sure we won't be
+* interrupted during our reset.
+*/
+   job_write(pfdev, JOB_INT_MASK, 0);
+   synchronize_irq(pfdev->js->irq);
+
+   /* Schedulers are stopped and interrupts are masked+flushed, we don't
+* need to protect the 'evict unfinished jobs' lock with the job_lock.
+*/
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
pm_runtime_put_noidle(pfdev->dev);
@@ -408,7 +417,6 @@ static void panfrost_reset(struct panfrost_device *pfdev,
pfdev->jobs[i] = NULL;
}
}
-   spin_unlock(&pfdev->js->job_lock);
 
panfrost_device_reset(pfdev);
 
@@ -504,6 +512,7 @@ static void panfrost_job_handle_irq(struct panfrost_device 
*pfdev, u32 status)
 
job = pfdev->jobs[j];
/* Only NULL if job timeout occurred */
+   WARN_ON(!job);
if (job) {
pfdev->jobs[j] = NULL;
 
@@ -563,7 +572,7 @@ static void panfrost_reset_work(struct work_struct *work)
 int panfrost_job_init(struct panfrost_device *pfdev)
 {
struct panfrost_job_slot *js;
-   int ret, j, irq;
+   int ret, j;
 
INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
 
@@ -573,11 +582,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
spin_lock_init(&js->job_lock);
 
-   irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
-   if (irq <= 0)
+   js->irq = platform_get_irq_byname(to_platform_device(pfdev->dev), 
"job");
+   if (js->irq <= 0)
return -ENODEV;
 
-   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   ret = devm_request_threaded_irq(pfdev->dev, js->irq,
panfrost_job_irq_handler,
panfrost_job_irq_handler_thread,
IRQF_SHARED, KBUILD_MODNAME "-job",
-- 
2.31.1



[PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts

2021-06-25 Thread Boris Brezillon
This should avoid switching to interrupt context when the GPU is under
heavy use.

v3:
* Don't take the job_lock in panfrost_job_handle_irq()

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++---
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index be8f68f63974..e0c479e67304 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops 
panfrost_sched_ops = {
.free_job = panfrost_job_free
 };
 
-static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 {
-   struct panfrost_device *pfdev = data;
-   u32 status = job_read(pfdev, JOB_INT_STAT);
int j;
 
dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
 
-   if (!status)
-   return IRQ_NONE;
-
-   pm_runtime_mark_last_busy(pfdev->dev);
-
for (j = 0; status; j++) {
u32 mask = MK_JS_MASK(j);
 
@@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
if (status & JOB_INT_MASK_DONE(j)) {
struct panfrost_job *job;
 
-   spin_lock(&pfdev->js->job_lock);
job = pfdev->jobs[j];
/* Only NULL if job timeout occurred */
if (job) {
@@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
dma_fence_signal_locked(job->done_fence);
pm_runtime_put_autosuspend(pfdev->dev);
}
-   spin_unlock(&pfdev->js->job_lock);
}
 
status &= ~mask;
}
+}
 
+static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
+
+   while (status) {
+   pm_runtime_mark_last_busy(pfdev->dev);
+
+   spin_lock(&pfdev->js->job_lock);
+   panfrost_job_handle_irq(pfdev, status);
+   spin_unlock(&pfdev->js->job_lock);
+   status = job_read(pfdev, JOB_INT_RAWSTAT);
+   }
+
+   job_write(pfdev, JOB_INT_MASK,
+ GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
+ GENMASK(NUM_JOB_SLOTS - 1, 0));
return IRQ_HANDLED;
 }
 
+static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_STAT);
+
+   if (!status)
+   return IRQ_NONE;
+
+   job_write(pfdev, JOB_INT_MASK, 0);
+   return IRQ_WAKE_THREAD;
+}
+
 static void panfrost_reset(struct work_struct *work)
 {
struct panfrost_device *pfdev = container_of(work,
 struct panfrost_device,
 reset.work);
-   unsigned long flags;
unsigned int i;
bool cookie;
 
@@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
/* All timers have been stopped, we can safely reset the pending state. 
*/
atomic_set(&pfdev->reset.pending, 0);
 
-   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
pm_runtime_put_noidle(pfdev->dev);
@@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
pfdev->jobs[i] = NULL;
}
}
-   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
+   spin_unlock(&pfdev->js->job_lock);
 
panfrost_device_reset(pfdev);
 
@@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
if (irq <= 0)
return -ENODEV;
 
-   ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
-  IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
+   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   panfrost_job_irq_handler,
+   panfrost_job_irq_handler_thread,
+   IRQF_SHARED, KBUILD_MODNAME "-job",
+   pfdev);
if (ret) {
dev_err(pfdev->dev, "failed to request job irq");
return ret;
-- 
2.31.1



[PATCH v3 12/15] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck

2021-06-25 Thread Boris Brezillon
Things are unlikely to resolve until we reset the GPU. Let's not wait
for other faults/timeouts to trigger this reset.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 65e98c51cb66..5267c3a1f02f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -36,8 +36,11 @@ static int wait_ready(struct panfrost_device *pfdev, u32 
as_nr)
ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
 
-   if (ret)
+   if (ret) {
+   /* The GPU hung, let's trigger a reset */
+   panfrost_device_schedule_reset(pfdev);
dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+   }
 
return ret;
 }
-- 
2.31.1



[PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()

2021-06-25 Thread Boris Brezillon
Currently unused. We'll add it back if we need per-GPU definitions.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 2 +-
 drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index fbcf5edbe367..bce6b0aff05e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,7 +292,7 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 
exception_code)
+const char *panfrost_exception_name(u32 exception_code)
 {
switch (exception_code) {
/* Non-Fault Status code */
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 4c6bdea5537b..ade8a1974ee9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -172,6 +172,6 @@ void panfrost_device_reset(struct panfrost_device *pfdev);
 int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 
exception_code);
+const char *panfrost_exception_name(u32 exception_code);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c 
b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 2aae636f1cf5..ec59f15940fb 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -33,7 +33,7 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void 
*data)
address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
 
dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
-fault_status & 0xFF, panfrost_exception_name(pfdev, 
fault_status),
+fault_status & 0xFF, 
panfrost_exception_name(fault_status),
 address);
 
if (state & GPU_IRQ_MULTIPLE_FAULT)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index d6c9698bca3b..3cd1aec6c261 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -500,7 +500,7 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(pfdev, job_read(pfdev, 
JS_STATUS(j))),
+   panfrost_exception_name(job_read(pfdev, 
JS_STATUS(j))),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d76dff201ea6..b4f0c673cd7f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -676,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
"TODO",
fault_status,
(fault_status & (1 << 10) ? "DECODER FAULT" : 
"SLAVE FAULT"),
-   exception_type, panfrost_exception_name(pfdev, 
exception_type),
+   exception_type, 
panfrost_exception_name(exception_type),
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
-- 
2.31.1



[PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition

2021-06-25 Thread Boris Brezillon
Exception types will be defined as an enum in panfrost_drm.h so userspace
can use the same definitions if needed.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h 
b/drivers/gpu/drm/panfrost/panfrost_regs.h
index eddaa62ad8b0..151cfebd80a0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -261,9 +261,6 @@
 #define JS_COMMAND_SOFT_STOP_1 0x06/* Execute SOFT_STOP if 
JOB_CHAIN_FLAG is 1 */
 #define JS_COMMAND_HARD_STOP_1 0x07/* Execute HARD_STOP if 
JOB_CHAIN_FLAG is 1 */
 
-#define JS_STATUS_EVENT_ACTIVE 0x08
-
-
 /* MMU regs */
 #define MMU_INT_RAWSTAT0x2000
 #define MMU_INT_CLEAR  0x2004
-- 
2.31.1



[PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset

2021-06-25 Thread Boris Brezillon
Expose a helper to trigger a GPU reset so we can easily trigger reset
operations outside the job timeout handler.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_device.h | 8 
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 +---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index ade8a1974ee9..6024eaf34ba0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -174,4 +174,12 @@ int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
 
+static inline void
+panfrost_device_schedule_reset(struct panfrost_device *pfdev)
+{
+   /* Schedule a reset if there's no reset in progress. */
+   if (!atomic_xchg(&pfdev->reset.pending, 1))
+   schedule_work(&pfdev->reset.work);
+}
+
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 3cd1aec6c261..be8f68f63974 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -458,9 +458,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct 
drm_sched_job
if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
return DRM_GPU_SCHED_STAT_NOMINAL;
 
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   panfrost_device_schedule_reset(pfdev);
 
return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-- 
2.31.1



[PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close

2021-06-25 Thread Boris Brezillon
If the process that submitted these jobs closes the FD before the jobs
are done, it probably means it doesn't care about the result.

v3:
* Set fence error to ECANCELED when a TERMINATED exception is received

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 43 +
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 948bd174ff99..aa1e6542adde 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -498,14 +498,21 @@ static void panfrost_job_handle_irq(struct 
panfrost_device *pfdev, u32 status)
 
if (status & JOB_INT_MASK_ERR(j)) {
u32 js_status = job_read(pfdev, JS_STATUS(j));
+   const char *exception_name = 
panfrost_exception_name(js_status);
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
-   dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
-   j,
-   panfrost_exception_name(js_status),
-   job_read(pfdev, JS_HEAD_LO(j)),
-   job_read(pfdev, JS_TAIL_LO(j)));
+   if (js_status < 
DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
+   dev_dbg(pfdev->dev, "js interrupt, js=%d, 
status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   } else {
+   dev_err(pfdev->dev, "js fault, js=%d, 
status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   }
 
/* If we need a reset, signal it to the timeout
 * handler, otherwise, update the fence error field and
@@ -514,7 +521,16 @@ static void panfrost_job_handle_irq(struct panfrost_device 
*pfdev, u32 status)
if (panfrost_exception_needs_reset(pfdev, js_status)) {
drm_sched_fault(&pfdev->js->queue[j].sched);
} else {
-   dma_fence_set_error(pfdev->jobs[j]->done_fence, 
-EINVAL);
+   int error = 0;
+
+   if (js_status == 
DRM_PANFROST_EXCEPTION_TERMINATED)
+   error = -ECANCELED;
+   else if (js_status >= 
DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
+   error = -EINVAL;
+
+   if (error)
+   
dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
+
status |= JOB_INT_MASK_DONE(j);
}
}
@@ -673,10 +689,25 @@ int panfrost_job_open(struct panfrost_file_priv 
*panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+   struct panfrost_device *pfdev = panfrost_priv->pfdev;
+   unsigned long flags;
int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+
+   /* Kill in-flight jobs */
+   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   struct drm_sched_entity *entity = 
&panfrost_priv->sched_entity[i];
+   struct panfrost_job *job = pfdev->jobs[i];
+
+   if (!job || job->base.entity != entity)
+   continue;
+
+   job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
+   }
+   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)
-- 
2.31.1



[PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic

2021-06-25 Thread Boris Brezillon
Now that we can pass our own workqueue to drm_sched_init(), we can use
an ordered workqueue for both the scheduler timeout handler (tdr) and our
own reset work (which we use when the reset is not caused by a
fault/timeout on a specific job, like when the AS_ACTIVE bit is stuck).
This guarantees that the timeout handlers and the reset handler can't run
concurrently, which drastically simplifies the locking.

Suggested-by: Daniel Vetter 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   6 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 185 -
 2 files changed, 71 insertions(+), 120 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 6024eaf34ba0..bfe32907ba6b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -108,6 +108,7 @@ struct panfrost_device {
struct mutex sched_lock;
 
struct {
+   struct workqueue_struct *wq;
struct work_struct work;
atomic_t pending;
} reset;
@@ -177,9 +178,8 @@ const char *panfrost_exception_name(u32 exception_code);
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
 {
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   atomic_set(&pfdev->reset.pending, 1);
+   queue_work(pfdev->reset.wq, &pfdev->reset.work);
 }
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index e0c479e67304..88d34fd781e8 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -25,17 +25,8 @@
 #define job_write(dev, reg, data) writel(data, dev->iomem + (reg))
 #define job_read(dev, reg) readl(dev->iomem + (reg))
 
-enum panfrost_queue_status {
-   PANFROST_QUEUE_STATUS_ACTIVE,
-   PANFROST_QUEUE_STATUS_STOPPED,
-   PANFROST_QUEUE_STATUS_STARTING,
-   PANFROST_QUEUE_STATUS_FAULT_PENDING,
-};
-
 struct panfrost_queue_state {
struct drm_gpu_scheduler sched;
-   atomic_t status;
-   struct mutex lock;
u64 fence_context;
u64 emit_seqno;
 };
@@ -379,57 +370,73 @@ void panfrost_job_enable_interrupts(struct 
panfrost_device *pfdev)
job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
-static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
-   struct drm_sched_job *bad)
+static void panfrost_reset(struct panfrost_device *pfdev,
+  struct drm_sched_job *bad)
 {
-   enum panfrost_queue_status old_status;
-   bool stopped = false;
+   unsigned int i;
+   bool cookie;
 
-   mutex_lock(&queue->lock);
-   old_status = atomic_xchg(&queue->status,
-PANFROST_QUEUE_STATUS_STOPPED);
-   if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
-   goto out;
+   if (WARN_ON(!atomic_read(&pfdev->reset.pending)))
+   return;
+
+   /* Stop the schedulers.
+*
+* FIXME: We temporarily get out of the dma_fence_signalling section
+* because the cleanup path generate lockdep splats when taking locks
+* to release job resources. We should rework the code to follow this
+* pattern:
+*
+*  try_lock
+*  if (locked)
+*  release
+*  else
+*  schedule_work_to_release_later
+*/
+   for (i = 0; i < NUM_JOB_SLOTS; i++)
+   drm_sched_stop(&pfdev->js->queue[i].sched, bad);
+
+   cookie = dma_fence_begin_signalling();
 
-   WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
-   drm_sched_stop(&queue->sched, bad);
if (bad)
drm_sched_increase_karma(bad);
 
-   stopped = true;
+   spin_lock(&pfdev->js->job_lock);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   if (pfdev->jobs[i]) {
+   pm_runtime_put_noidle(pfdev->dev);
+   panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+   pfdev->jobs[i] = NULL;
+   }
+   }
+   spin_unlock(&pfdev->js->job_lock);
 
-   /*
-* Set the timeout to max so the timer doesn't get started
-* when we return from the timeout handler (restored in
-* panfrost_scheduler_start()).
+   panfrost_device_reset(pfdev);
+
+   /* GPU has been reset, we can cancel timeout/fault work that may have
+* been queued in the meantime and clear the reset pending bit.
 */
-   queue->sched.timeout = MAX_SCHEDULE_TIMEOUT;
+   atomic_set(&pfdev->reset.pending, 0);
+  

[PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr

2021-06-25 Thread Boris Brezillon
Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize the timeout
work items with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.
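
The serialization guarantee described above can be modeled in userspace. This is an editorial sketch, not kernel code: `OrderedWorkqueue` and the handler names are invented for illustration, standing in for `alloc_ordered_workqueue()` plus the per-driver timeout work.

```python
# Conceptual model: an ordered workqueue executes queued work items
# strictly one at a time, in submission order, so a GPU reset issued
# from one slot's timeout handler can never race with another slot's
# timeout handler.
import queue
import threading


class OrderedWorkqueue:
    """Single worker thread: work items run sequentially, in order."""

    def __init__(self):
        self._q = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._q.get()
            if item is None:
                break
            item()

    def queue_work(self, fn):
        self._q.put(fn)

    def destroy(self):
        self._q.put(None)
        self._worker.join()


def demo():
    """Queue three 'timeout handlers' and report the peak concurrency."""
    wq = OrderedWorkqueue()
    active = 0
    max_active = 0
    lock = threading.Lock()
    done = threading.Event()

    def timeout_handler(slot, last):
        nonlocal active, max_active
        with lock:
            active += 1
            max_active = max(max_active, active)
        # ...reset the GPU here; no other handler runs concurrently...
        with lock:
            active -= 1
        if last:
            done.set()

    # Three job slots time out "simultaneously".
    for slot in range(3):
        wq.queue_work(lambda s=slot: timeout_handler(s, s == 2))
    done.wait()
    wq.destroy()
    return max_active
```

With a single worker thread, `demo()` always observes at most one handler active at a time, which is the property the patch relies on to reset the GPU without extra locking.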

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c|  6 +-
 drivers/gpu/drm/v3d/v3d_sched.c   | 10 +-
 include/drm/gpu_scheduler.h   |  5 -
 7 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..532636ea20bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
   num_hw_submission, amdgpu_job_hang_limit,
-  timeout, sched_score, ring->name);
+  timeout, NULL, sched_score, ring->name);
if (r) {
DRM_ERROR("Failed to create scheduler on ring %s.\n",
  ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 19826e504efc..feb6da1b6ceb 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 
ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
 etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
+msecs_to_jiffies(500), NULL, NULL,
+dev_name(gpu->dev));
if (ret)
return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c 
b/drivers/gpu/drm/lima/lima_sched.c
index ecf3267334ff..dba8329937a3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, 
const char *name)
INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
- lima_job_hang_limit, msecs_to_jiffies(timeout),
+ lima_job_hang_limit,
+ msecs_to_jiffies(timeout), NULL,
  NULL, name);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 682f2161b999..8ff79fd49577 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
ret = drm_sched_init(&js->queue[j].sched,
 &panfrost_sched_ops,
-1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
+1, 0,
+msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
 NULL, "pan_js");
if (ret) {
dev_err(pfdev->dev, "Failed to create scheduler: %d.", 
ret);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index c0a2f8f8d472..a937d0529944 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -837,6 +837,8 @@ static int drm_sched_main(void *param)
  * @hw_submission: number of hw submissions that can be in flight
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
+ * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
+ * used
  * @score: optional score atomic shared with other schedulers
  * @name: name used for debugging
  *
@@ -844,7 +846,8 @@ static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
   const struct drm_sched_backend_ops *ops,
-  unsigned hw_submission, unsigned hang_limit, long timeout,
+  unsigned hw_submission, unsigned hang_limit,
+  long timeout, struct workqu

[PATCH v3 00/15] drm/panfrost: Misc improvements

2021-06-25 Thread Boris Brezillon
Hello,

This is a merge of [1] and [2] since the second series depends on
patches in the preparatory series.

The main change in this v3 is the addition of patches 1 and 9, simplifying
the reset synchronisation as suggested by Daniel.

Also addressed Steve's comments, and IGT tests are now passing reliably
(which doesn't guarantee much, but that's still an improvement since
pan-reset was unreliable with v2).

Regards,

Boris

Boris Brezillon (14):
  drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  drm/panfrost: Drop the pfdev argument passed to
panfrost_exception_name()
  drm/panfrost: Expose exception types to userspace
  drm/panfrost: Do the exception -> string translation using a table
  drm/panfrost: Expose a helper to trigger a GPU reset
  drm/panfrost: Use a threaded IRQ for job interrupts
  drm/panfrost: Simplify the reset serialization logic
  drm/panfrost: Make sure job interrupts are masked before resetting
  drm/panfrost: Disable the AS on unhandled page faults
  drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  drm/panfrost: Don't reset the GPU on job faults unless we really have
to
  drm/panfrost: Kill in-flight jobs on FD close

Steven Price (1):
  drm/panfrost: Queue jobs on the hardware

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c|   3 +-
 drivers/gpu/drm/lima/lima_sched.c  |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.c | 139 +++--
 drivers/gpu/drm/panfrost/panfrost_device.h |  15 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c|   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 630 +++--
 drivers/gpu/drm/panfrost/panfrost_mmu.c|  41 +-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |   3 -
 drivers/gpu/drm/scheduler/sched_main.c |   6 +-
 drivers/gpu/drm/v3d/v3d_sched.c|  10 +-
 include/drm/gpu_scheduler.h|   5 +-
 include/uapi/drm/panfrost_drm.h|  71 +++
 13 files changed, 679 insertions(+), 251 deletions(-)

-- 
2.31.1



[PATCH] drm/sched: Declare entity idle only after HW submission

2021-06-24 Thread Boris Brezillon
The panfrost driver tries to kill in-flight jobs on FD close after
destroying the FD scheduler entities. For this to work properly, we
need to make sure the jobs popped from the scheduler entities have
been queued at the HW level before declaring the entity idle, otherwise
we might iterate over a list that doesn't contain those jobs.

Suggested-by: Lucas Stach 
Signed-off-by: Boris Brezillon 
Cc: Lucas Stach 
---
 drivers/gpu/drm/scheduler/sched_main.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 81496ae2602e..aa776ebe326a 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -811,10 +811,10 @@ static int drm_sched_main(void *param)
 
sched_job = drm_sched_entity_pop_job(entity);
 
-   complete(&entity->entity_idle);
-
-   if (!sched_job)
+   if (!sched_job) {
+   complete(&entity->entity_idle);
continue;
+   }
 
s_fence = sched_job->s_fence;
 
@@ -823,6 +823,7 @@ static int drm_sched_main(void *param)
 
trace_drm_run_job(sched_job, entity);
fence = sched->ops->run_job(sched_job);
+   complete(&entity->entity_idle);
drm_sched_fence_scheduled(s_fence);
 
if (!IS_ERR_OR_NULL(fence)) {
-- 
2.31.1
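
The ordering fix in the patch above can be sketched in userspace. This is an editorial model (names like `hw_queue` are invented): the point is that the job must be queued at the "HW" level before `entity_idle` is completed, so a waiter that wakes on idle is guaranteed to see the job in the in-flight list.

```python
# Model of the corrected ordering in drm_sched_main(): run_job() first,
# then complete(&entity->entity_idle). A thread that waits for idle and
# then walks the in-flight list (panfrost's FD-close path) cannot miss
# the job anymore.
import threading


class Entity:
    def __init__(self):
        self.entity_idle = threading.Event()
        self.pending = ["job0"]


hw_queue = []


def sched_main(entity):
    job = entity.pending.pop(0)
    hw_queue.append(job)          # ops->run_job(): job now queued at HW level
    entity.entity_idle.set()      # complete(&entity->entity_idle) *after* that


def kill_in_flight(entity):
    entity.entity_idle.wait()     # FD-close path waits for the entity to idle
    return list(hw_queue)         # ...then iterates the in-flight jobs


entity = Entity()
t = threading.Thread(target=sched_main, args=(entity,))
t.start()
seen = kill_in_flight(entity)
t.join()
```

Had the completion been signalled before the append (the pre-patch order), `seen` could legitimately come back empty.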



Re: [PATCH v2 2/2] drm/panfrost: Queue jobs on the hardware

2021-06-24 Thread Boris Brezillon
On Thu, 24 Jun 2021 10:23:51 +0100
Steven Price  wrote:

> >  static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 
> > status)
> >  {
> > -   int j;
> > +   struct panfrost_job *done[NUM_JOB_SLOTS][2] = {};
> > +   struct panfrost_job *failed[NUM_JOB_SLOTS] = {};
> > +   u32 js_state, js_events = 0;
> > +   unsigned int i, j;
> >  
> > -   dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
> > +   while (status) {
> > +   for (j = 0; j < NUM_JOB_SLOTS; j++) {
> > +   if (status & JOB_INT_MASK_DONE(j)) {
> > +   if (done[j][0]) {
> > +   done[j][1] = 
> > panfrost_dequeue_job(pfdev, j);
> > +   WARN_ON(!done[j][1]);
> > +   } else {
> > +   done[j][0] = 
> > panfrost_dequeue_job(pfdev, j);
> > +   WARN_ON(!done[j][0]);  
> 
> NIT: I'd be tempted to move this WARN_ON into panfrost_dequeue_job() as
> it's relevant for any call to the function.

Makes sense. I'll move those WARN_ON()s.

> 
> > +   }
> > +   }
> >  
> > -   for (j = 0; status; j++) {
> > -   u32 mask = MK_JS_MASK(j);
> > +   if (status & JOB_INT_MASK_ERR(j)) {
> > +   /* Cancel the next submission. Will be submitted
> > +* after we're done handling this failure if
> > +* there's no reset pending.
> > +*/
> > +   job_write(pfdev, JS_COMMAND_NEXT(j), 
> > JS_COMMAND_NOP);
> > +   failed[j] = panfrost_dequeue_job(pfdev, j);
> > +   }
> > +   }
> >  
> > -   if (!(status & mask))
> > +   /* JS_STATE is sampled when JOB_INT_CLEAR is written.
> > +* For each BIT(slot) or BIT(slot + 16) bit written to
> > +* JOB_INT_CLEAR, the corresponding bits in JS_STATE
> > +* (BIT(slot) and BIT(slot + 16)) are updated, but this
> > +* is racy. If we only have one job done at the time we
> > +* read JOB_INT_RAWSTAT but the second job fails before we
> > +* clear the status, we end up with a status containing
> > +* only the DONE bit and consider both jobs as DONE since
> > +* JS_STATE reports both NEXT and CURRENT as inactive.
> > +* To prevent that, let's repeat this clear+read steps
> > +* until status is 0.
> > +*/
> > +   job_write(pfdev, JOB_INT_CLEAR, status);
> > +   js_state = job_read(pfdev, JOB_INT_JS_STATE);  
> 
> This seems a bit dodgy. The spec says that JOB_INT_JS_STATE[1] is
> updated only for the job slots which have bits set in the JOB_INT_CLEAR.
> So there's potentially two problems:
> 
>  * The spec makes no gaurentee about the values of the bits for other
> slots. But we're not masking off those bits.
> 
>  * If we loop (e.g. because the other slot finishes while handling the
> first interrupt) then we may lose the state for the first slot.
> 
> I'm not sure what the actual hardware returns in the bits which are
> unrelated to the previous JOB_INT_CLEAR - kbase is careful only to
> consider the bits relating to the slot it's currently dealing with.

Hm, I see. How about something like this?

struct panfrost_job *done[NUM_JOB_SLOTS][2] = {};
struct panfrost_job *failed[NUM_JOB_SLOTS] = {};
u32 js_state = 0, js_events = 0;
unsigned int i, j;

while (status) {
u32 js_state_mask = 0;

for (j = 0; j < NUM_JOB_SLOTS; j++) {
if (status & MK_JS_MASK(j))
js_state_mask |= MK_JS_MASK(j);

if (status & JOB_INT_MASK_DONE(j)) {
if (done[j][0]) {
done[j][1] = 
panfrost_dequeue_job(pfdev, j);
WARN_ON(!done[j][1]);
} else {
done[j][0] = 
panfrost_dequeue_job(pfdev, j);
WARN_ON(!done[j][0]);
}
}

if (status & JOB_INT_MASK_ERR(j)) {
/* Cancel the next submission. Will be submitted
 * after we're done handling this failure if
 * there's no reset pending.
 */
job_write(pfdev, JS_COMMAND_NEXT(j), 
JS_COMMAND_NOP);
failed[j] = panfrost_dequeue_job(pfdev, j);
}
}

/* JS_STATE is sampled when JOB_INT_CLEAR is written.
 * For each BIT(slot) or

Re: [PATCH v2 01/12] drm/panfrost: Make sure MMU context lifetime is not bound to panfrost_priv

2021-06-24 Thread Boris Brezillon
On Mon, 21 Jun 2021 15:38:56 +0200
Boris Brezillon  wrote:

> Jobs can be in-flight when the file descriptor is closed (either because
> the process did not terminate properly, or because it didn't wait for
> all GPU jobs to be finished), and apparently panfrost_job_close() does
> not cancel already running jobs. Let's refcount the MMU context object
> so its lifetime is no longer bound to the FD lifetime and running jobs
> can finish properly without generating spurious page faults.
> 
> Reported-by: Icecream95 
> Fixes: 7282f7645d06 ("drm/panfrost: Implement per FD address spaces")
> Cc: 
> Signed-off-by: Boris Brezillon 

Queued this patch to drm-misc-next. I'll respin the rest of this series.


Re: [PATCH 04/15] drm/panfrost: Shrink sched_lock

2021-06-23 Thread Boris Brezillon
On Tue, 22 Jun 2021 18:55:00 +0200
Daniel Vetter  wrote:

> drm/scheduler requires a lock between _init and _push_job, but the
> reservation lock dance doesn't. So shrink the critical section a
> notch.
> 
> v2: Lucas pointed out how this should really work, I got it all wrong
> in v1.
> 
> Signed-off-by: Daniel Vetter 
> Cc: Lucas Stach 
> Cc: Rob Herring 
> Cc: Tomeu Vizoso 
> Cc: Steven Price 
> Cc: Alyssa Rosenzweig 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 2df3e999a38d..38f8580c19f1 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -224,14 +224,13 @@ int panfrost_job_push(struct panfrost_job *job)
>   struct ww_acquire_ctx acquire_ctx;
>   int ret = 0;
>  
> - mutex_lock(&pfdev->sched_lock);
>  
>   ret = drm_gem_lock_reservations(job->bos, job->bo_count,
>   &acquire_ctx);
> - if (ret) {
> - mutex_unlock(&pfdev->sched_lock);
> + if (ret)
>   return ret;
> - }
> +
> + mutex_lock(&pfdev->sched_lock);
>  
>   ret = drm_sched_job_init(&job->base, entity, NULL);
>   if (ret) {



Re: [PATCH 05/15] drm/panfrost: Use xarray and helpers for depedency tracking

2021-06-23 Thread Boris Brezillon
On Tue, 22 Jun 2021 18:55:01 +0200
Daniel Vetter  wrote:

> More consistency and prep work for the next patch.
> 
> Aside: I wonder whether we shouldn't just move this entire xarray
> business into the scheduler so that not everyone has to reinvent the
> same wheels. Cc'ing some scheduler people for this too.
> 
> v2: Correctly handle sched_lock since Lucas pointed out it's needed.
> 
> v3: Rebase, dma_resv_get_excl_unlocked got renamed
> 
> v4: Don't leak job references on failure (Steven).

Hehe, I had pretty much the same patch here [1].

Reviewed-by: Boris Brezillon 

[1]https://patchwork.kernel.org/project/dri-devel/patch/20210311092539.2405596-3-boris.brezil...@collabora.com/

> 
> Cc: Lucas Stach 
> Cc: "Christian König" 
> Cc: Luben Tuikov 
> Cc: Alex Deucher 
> Cc: Lee Jones 
> Cc: Steven Price 
> Cc: Rob Herring 
> Cc: Tomeu Vizoso 
> Cc: Alyssa Rosenzweig 
> Cc: Sumit Semwal 
> Cc: linux-me...@vger.kernel.org
> Cc: linaro-mm-...@lists.linaro.org
> Signed-off-by: Daniel Vetter 
> ---
>  drivers/gpu/drm/panfrost/panfrost_drv.c | 41 +++-
>  drivers/gpu/drm/panfrost/panfrost_job.c | 65 +++--
>  drivers/gpu/drm/panfrost/panfrost_job.h |  8 ++-
>  3 files changed, 49 insertions(+), 65 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
> b/drivers/gpu/drm/panfrost/panfrost_drv.c
> index 075ec0ef746c..3ee828f1e7a5 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> @@ -138,12 +138,6 @@ panfrost_lookup_bos(struct drm_device *dev,
>   if (!job->bo_count)
>   return 0;
>  
> - job->implicit_fences = kvmalloc_array(job->bo_count,
> -   sizeof(struct dma_fence *),
> -   GFP_KERNEL | __GFP_ZERO);
> - if (!job->implicit_fences)
> - return -ENOMEM;
> -
>   ret = drm_gem_objects_lookup(file_priv,
>(void __user *)(uintptr_t)args->bo_handles,
>job->bo_count, &job->bos);
> @@ -174,7 +168,7 @@ panfrost_lookup_bos(struct drm_device *dev,
>  }
>  
>  /**
> - * panfrost_copy_in_sync() - Sets up job->in_fences[] with the sync objects
> + * panfrost_copy_in_sync() - Sets up job->deps with the sync objects
>   * referenced by the job.
>   * @dev: DRM device
>   * @file_priv: DRM file for this fd
> @@ -194,22 +188,14 @@ panfrost_copy_in_sync(struct drm_device *dev,
>  {
>   u32 *handles;
>   int ret = 0;
> - int i;
> + int i, in_fence_count;
>  
> - job->in_fence_count = args->in_sync_count;
> + in_fence_count = args->in_sync_count;
>  
> - if (!job->in_fence_count)
> + if (!in_fence_count)
>   return 0;
>  
> - job->in_fences = kvmalloc_array(job->in_fence_count,
> - sizeof(struct dma_fence *),
> - GFP_KERNEL | __GFP_ZERO);
> - if (!job->in_fences) {
> - DRM_DEBUG("Failed to allocate job in fences\n");
> - return -ENOMEM;
> - }
> -
> - handles = kvmalloc_array(job->in_fence_count, sizeof(u32), GFP_KERNEL);
> + handles = kvmalloc_array(in_fence_count, sizeof(u32), GFP_KERNEL);
>   if (!handles) {
>   ret = -ENOMEM;
>   DRM_DEBUG("Failed to allocate incoming syncobj handles\n");
> @@ -218,16 +204,23 @@ panfrost_copy_in_sync(struct drm_device *dev,
>  
>   if (copy_from_user(handles,
>  (void __user *)(uintptr_t)args->in_syncs,
> -job->in_fence_count * sizeof(u32))) {
> +in_fence_count * sizeof(u32))) {
>   ret = -EFAULT;
>   DRM_DEBUG("Failed to copy in syncobj handles\n");
>   goto fail;
>   }
>  
> - for (i = 0; i < job->in_fence_count; i++) {
> + for (i = 0; i < in_fence_count; i++) {
> + struct dma_fence *fence;
> +
>   ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0,
> -  &job->in_fences[i]);
> - if (ret == -EINVAL)
> +  &fence);
> + if (ret)
> + goto fail;
> +
> + ret = drm_gem_fence_array_add(&job->deps, fence);
> +
> + if (ret)
>   goto fail;
>   }
>  
> @@ -265,6 +258,8 @@ static int panfrost_ioctl_submit(struct drm_device *dev, 
> void *data,
> 

Re: [PATCH 06/15] drm/panfrost: Fix implicit sync

2021-06-23 Thread Boris Brezillon
On Tue, 22 Jun 2021 18:55:02 +0200
Daniel Vetter  wrote:

> Currently this has no practial relevance I think because there's not
> many who can pull off a setup with panfrost and another gpu in the
> same system. But the rules are that if you're setting an exclusive
> fence, indicating a gpu write access in the implicit fencing system,
> then you need to wait for all fences, not just the previous exclusive
> fence.
> 
> panfrost against itself has no problem, because it always sets the
> exclusive fence (but that's probably something that will need to be
> fixed for vulkan and/or multi-engine gpus, or you'll suffer badly).
> Also no problem with that against display.
> 
> With the prep work done to switch over to the dependency helpers this
> is now a oneliner.
> 
> Signed-off-by: Daniel Vetter 
> Cc: Rob Herring 
> Cc: Tomeu Vizoso 
> Cc: Steven Price 
> Cc: Alyssa Rosenzweig 
> Cc: Sumit Semwal 

Reviewed-by: Boris Brezillon 

> Cc: "Christian König" 
> Cc: linux-me...@vger.kernel.org
> Cc: linaro-mm-...@lists.linaro.org
> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 71cd43fa1b36..ef004d587dc4 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -203,9 +203,8 @@ static int panfrost_acquire_object_fences(struct 
> drm_gem_object **bos,
>   int i, ret;
>  
>   for (i = 0; i < bo_count; i++) {
> - struct dma_fence *fence = 
> dma_resv_get_excl_unlocked(bos[i]->resv);
> -
> - ret = drm_gem_fence_array_add(deps, fence);
> + /* panfrost always uses write mode in its current uapi */
> + ret = drm_gem_fence_array_add_implicit(deps, bos[i], true);
>   if (ret)
>   return ret;
>   }



Re: [PATCH v2 2/2] drm/panfrost: Queue jobs on the hardware

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 17:08:21 +0100
Steven Price  wrote:

> On 21/06/2021 15:02, Boris Brezillon wrote:
> > From: Steven Price 
> > 
> > The hardware has a set of '_NEXT' registers that can hold a second job
> > while the first is executing. Make use of these registers to enqueue a
> > second job per slot.
> > 
> > v2:
> > * Make sure non-faulty jobs get properly paused/resumed on GPU reset
> > 
> > Signed-off-by: Steven Price 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
> >  drivers/gpu/drm/panfrost/panfrost_job.c| 311 -
> >  2 files changed, 242 insertions(+), 71 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
> > b/drivers/gpu/drm/panfrost/panfrost_device.h
> > index 95e6044008d2..a87917b9e714 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> > +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> > @@ -101,7 +101,7 @@ struct panfrost_device {
> >  
> > struct panfrost_job_slot *js;
> >  
> > -   struct panfrost_job *jobs[NUM_JOB_SLOTS];
> > +   struct panfrost_job *jobs[NUM_JOB_SLOTS][2];
> > struct list_head scheduled_jobs;
> >  
> > struct panfrost_perfcnt *perfcnt;
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> > b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index 1b5c636794a1..888eceed227f 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -4,6 +4,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -41,6 +42,7 @@ struct panfrost_queue_state {
> >  };
> >  
> >  struct panfrost_job_slot {
> > +   int irq;
> > struct panfrost_queue_state queue[NUM_JOB_SLOTS];
> > spinlock_t job_lock;
> >  };
> > @@ -148,9 +150,43 @@ static void panfrost_job_write_affinity(struct 
> > panfrost_device *pfdev,
> > job_write(pfdev, JS_AFFINITY_NEXT_HI(js), affinity >> 32);
> >  }
> >  
> > +static struct panfrost_job *
> > +panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
> > +{
> > +   struct panfrost_job *job = pfdev->jobs[slot][0];
> > +
> > +   pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
> > +   pfdev->jobs[slot][1] = NULL;
> > +
> > +   return job;
> > +}
> > +
> > +static unsigned int
> > +panfrost_enqueue_job(struct panfrost_device *pfdev, int slot,
> > +struct panfrost_job *job)
> > +{
> > +   if (!pfdev->jobs[slot][0]) {
> > +   pfdev->jobs[slot][0] = job;
> > +   return 0;
> > +   }
> > +
> > +   WARN_ON(pfdev->jobs[slot][1]);
> > +   pfdev->jobs[slot][1] = job;
> > +   return 1;
> > +}
> > +
> > +static u32
> > +panfrost_get_job_chain_flag(const struct panfrost_job *job)
> > +{
> > +   struct panfrost_fence *f = to_panfrost_fence(job->done_fence);
> > +
> > +   return (f->seqno & 1) ? JS_CONFIG_JOB_CHAIN_FLAG : 0;  
> 
> Is the seqno going to reliably toggle like this? We need to ensure that
> when there are two jobs on the hardware they have different "job chain
> disambiguation" flags.

f->seqno is assigned the queue->emit_seqno which increases
monotonically at submission time. Since nothing can fail after the
fence creation in the submission path, 2 consecutive jobs on a given
queue should have different (f->seqno & 1) values.
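
That parity argument can be checked mechanically. A small editorial sketch (the flag's bit position is an assumption here, not taken from panfrost_regs.h):

```python
JS_CONFIG_JOB_CHAIN_FLAG = 1 << 8  # illustrative bit value (assumption)


def job_chain_flag(seqno):
    # Mirrors panfrost_get_job_chain_flag(): odd and even fence seqnos
    # map to different job chain disambiguation flags.
    return JS_CONFIG_JOB_CHAIN_FLAG if (seqno & 1) else 0


# seqno increases monotonically per queue, so any two consecutive jobs
# submitted to the same slot always carry different flags.
flags = [job_chain_flag(s) for s in range(10)]
assert all(flags[i] != flags[i + 1] for i in range(9))
```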

> 
> Also that feature was only introduced in t76x. So relying on that would
> sadly kill off support for t60x, t62x and t72x (albeit I'm not sure how
> 'supported' these are with Mesa anyway).
> 
> It is possible to implement without the disambiguation flag - but it's
> a bit fiddly: it requires clearing out the _NEXT register, checking that
> you actually cleared it successfully (i.e. the hardware didn't just
> start the job before you cleared it) and then doing the action if still
> necessary. And of course then recovering from having cleared out _NEXT.
> There's a reason for adding the feature! ;)

As mentioned in my previous reply, I think I'll just disable this
feature on t72x-.

> 
> I'll try to review the rest and give it a spin later - although it's of
> course it looks quite familiar ;)

Thank you for your valuable feedback.

Regards,

Boris


Re: [PATCH v2 2/2] drm/panfrost: Queue jobs on the hardware

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 14:20:31 -0400
Alyssa Rosenzweig  wrote:

> > Also that feature was only introduced in t76x. So relying on that would
> > sadly kill off support for t60x, t62x and t72x (albeit I'm not sure how
> > 'supported' these are with Mesa anyway).  
> 
> t60x and t62x are not supported, but t720 very much is (albeit GLES2
> only, versus t760+ getting GLES3.1 and soon Vulkan)... t720 has
> deqp-gles2 in CI and is ~close to passing everything... Please don't
> break t720 :)

Okay, I think I'll just disable this feature on t72x then.


Re: [PATCH v2 08/12] drm/panfrost: Do the exception -> string translation using a table

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 16:19:38 +0100
Steven Price  wrote:

> On 21/06/2021 14:39, Boris Brezillon wrote:
> > Do the exception -> string translation using a table so we can add extra
> > fields if we need to. While at it add an error field to ease the
> > exception -> error conversion which we'll need if we want to set the
> > fence error to something that reflects the exception code.
> > 
> > TODO: fix the error codes.  
> 
> TODO: Do the TODO ;)

Yeah, I was kinda expecting help with that :-).

> 
> I'm not sure how useful translating the hardware error codes to Linux
> ones are. E.g. 'OOM' means something quite different from a normal
> -ENOMEM. One is running out of a space in a predefined buffer, the other
> is Linux not able to allocate memory.

Okay, then I can just unconditionally set the fence error to -EINVAL
and drop this error field.

> 
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_device.c | 134 +
> >  drivers/gpu/drm/panfrost/panfrost_device.h |   1 +
> >  2 files changed, 88 insertions(+), 47 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
> > b/drivers/gpu/drm/panfrost/panfrost_device.c
> > index f7f5ca94f910..2de011cee258 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_device.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
> > @@ -292,55 +292,95 @@ void panfrost_device_fini(struct panfrost_device 
> > *pfdev)
> > panfrost_clk_fini(pfdev);
> >  }
> >  
> > -const char *panfrost_exception_name(u32 exception_code)
> > -{
> > -   switch (exception_code) {
> > -   /* Non-Fault Status code */
> > -   case 0x00: return "NOT_STARTED/IDLE/OK";
> > -   case 0x01: return "DONE";
> > -   case 0x02: return "INTERRUPTED";
> > -   case 0x03: return "STOPPED";
> > -   case 0x04: return "TERMINATED";
> > -   case 0x08: return "ACTIVE";
> > -   /* Job exceptions */
> > -   case 0x40: return "JOB_CONFIG_FAULT";
> > -   case 0x41: return "JOB_POWER_FAULT";
> > -   case 0x42: return "JOB_READ_FAULT";
> > -   case 0x43: return "JOB_WRITE_FAULT";
> > -   case 0x44: return "JOB_AFFINITY_FAULT";
> > -   case 0x48: return "JOB_BUS_FAULT";
> > -   case 0x50: return "INSTR_INVALID_PC";
> > -   case 0x51: return "INSTR_INVALID_ENC";
> > -   case 0x52: return "INSTR_TYPE_MISMATCH";
> > -   case 0x53: return "INSTR_OPERAND_FAULT";
> > -   case 0x54: return "INSTR_TLS_FAULT";
> > -   case 0x55: return "INSTR_BARRIER_FAULT";
> > -   case 0x56: return "INSTR_ALIGN_FAULT";
> > -   case 0x58: return "DATA_INVALID_FAULT";
> > -   case 0x59: return "TILE_RANGE_FAULT";
> > -   case 0x5A: return "ADDR_RANGE_FAULT";
> > -   case 0x60: return "OUT_OF_MEMORY";
> > -   /* GPU exceptions */
> > -   case 0x80: return "DELAYED_BUS_FAULT";
> > -   case 0x88: return "SHAREABILITY_FAULT";
> > -   /* MMU exceptions */
> > -   case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
> > -   case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
> > -   case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
> > -   case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
> > -   case 0xC8: return "PERMISSION_FAULT";
> > -   case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
> > -   case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
> > -   case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
> > -   case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
> > -   case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
> > -   case 0xD8: return "ACCESS_FLAG";
> > -   case 0xD9 ... 0xDF: return "ACCESS_FLAG";
> > -   case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
> > -   case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
> > +#define PANFROST_EXCEPTION(id, err) \
> > +   [DRM_PANFROST_EXCEPTION_ ## id] = { \
> > +   .name = #id, \
> > +   .error = err, \
> > }
> >  
> > -   return "UNKNOWN";
> > +struct panfrost_exception_info {
> > +   const char *name;
> > +   int error;
> > +};
> > +
> > +static const struct panfrost_exception_info panfrost_exception_infos[] = {
> > +   PANFROST_EXCEPTION(OK, 0),
> > +   PANFROST

Re: [PATCH v2 05/12] drm/panfrost: Disable the AS on unhandled page faults

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 16:09:32 +0100
Steven Price  wrote:

> On 21/06/2021 14:39, Boris Brezillon wrote:
> > If we don't do that, we have to wait for the job timeout to expire
> > before the fault jobs gets killed.
> > 
> > Signed-off-by: Boris Brezillon   
> 
> Don't we need to do something here to allow recovery of the MMU context
> in the future? panfrost_mmu_disable() will zero out the MMU registers on
> the hardware, but AFAICS panfrost_mmu_enable() won't be called to
> restore the values until something evicts the address space (GPU power
> down/reset or just too many other processes).
> 
> The ideal would be to block submission of new jobs from this context and
> then wait until existing jobs have completed at which point the MMU
> state can be restored and jobs allowed again.

Uh, I assumed it'd be okay to let subsequent jobs coming from
this context fail with a BUS_FAULT until the context is closed. But
what you suggest seems more robust.

> 
> But at a minimum I think we should have something like an 'MMU poisoned'
> bit that panfrost_mmu_as_get() can check.
> 
> Steve
> 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_mmu.c | 6 +-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
> > b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> > index 2a9bf30edc9d..d5c624e776f1 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> > @@ -661,7 +661,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
> > irq, void *data)
> > if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 
> > 0xC0)
> > ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
> >  
> > -   if (ret)
> > +   if (ret) {
> > /* terminal fault, print info about the fault */
> > dev_err(pfdev->dev,
> > "Unhandled Page fault in AS%d at VA 0x%016llX\n"
> > @@ -679,6 +679,10 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
> > irq, void *data)
> > access_type, access_type_name(pfdev, 
> > fault_status),
> > source_id);
> >  
> > +   /* Disable the MMU to stop jobs on this AS immediately 
> > */
> > +   panfrost_mmu_disable(pfdev, as);
> > +   }
> > +
> > status &= ~mask;
> >  
> > /* If we received new MMU interrupts, process them before 
> > returning. */
> >   
> 



Re: [PATCH v2 05/12] drm/panfrost: Disable the AS on unhandled page faults

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 15:39:00 +0200
Boris Brezillon  wrote:

> If we don't do that, we have to wait for the job timeout to expire
> before the fault jobs gets killed.

 ^ faulty

> 
> Signed-off-by: Boris Brezillon 
> ---
>  drivers/gpu/drm/panfrost/panfrost_mmu.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
> b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index 2a9bf30edc9d..d5c624e776f1 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -661,7 +661,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
> irq, void *data)
>   if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 
> 0xC0)
>   ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
>  
> - if (ret)
> + if (ret) {
>   /* terminal fault, print info about the fault */
>   dev_err(pfdev->dev,
>   "Unhandled Page fault in AS%d at VA 0x%016llX\n"
> @@ -679,6 +679,10 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
> irq, void *data)
>   access_type, access_type_name(pfdev, 
> fault_status),
>   source_id);
>  
> + /* Disable the MMU to stop jobs on this AS immediately 
> */
> + panfrost_mmu_disable(pfdev, as);
> + }
> +
>   status &= ~mask;
>  
>   /* If we received new MMU interrupts, process them before 
> returning. */



Re: [PATCH v2 04/12] drm/panfrost: Expose exception types to userspace

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 15:49:14 +0100
Steven Price  wrote:

> On 21/06/2021 14:38, Boris Brezillon wrote:
> > Job headers contain an exception type field which might be read and
> > converted to a human readable string by tracing tools. Let's expose
> > the exception type as an enum so we share the same definition.
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  include/uapi/drm/panfrost_drm.h | 65 +
> >  1 file changed, 65 insertions(+)
> > 
> > diff --git a/include/uapi/drm/panfrost_drm.h 
> > b/include/uapi/drm/panfrost_drm.h
> > index 061e700dd06c..9a05d57d0118 100644
> > --- a/include/uapi/drm/panfrost_drm.h
> > +++ b/include/uapi/drm/panfrost_drm.h
> > @@ -224,6 +224,71 @@ struct drm_panfrost_madvise {
> > __u32 retained;   /* out, whether backing store still exists */
> >  };
> >  
> > +/* The exception types */
> > +
> > +enum drm_panfrost_exception_type {
> > +   DRM_PANFROST_EXCEPTION_OK = 0x00,
> > +   DRM_PANFROST_EXCEPTION_DONE = 0x01,  
> 
> Any reason to miss INTERRUPTED? Although I don't think you'll ever see it.

Oops, that one is marked 'reserved' on Bifrost. I'll add it.

> 
> > +   DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
> > +   DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
> > +   DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
> > +   DRM_PANFROST_EXCEPTION_EUREKA = 0x06,  
> 
> Interestingly KABOOM/EUREKA are missing from panfrost_exception_name()

Addressed in patch 8.

> 
> > +   DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
> > +   DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
> > +   DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
> > +   DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
> > +   DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
> > +   DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
> > +   DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
> > +   DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
> > +   DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,  
> 
> 0x52: INSTR_TYPE_MISMATCH
> 0x53: INSTR_OPERAND_FAULT
> 0x54: INSTR_TLS_FAULT
> 
> > +   DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,  
> 
> 0x56: INSTR_ALIGN_FAULT
> 
> By the looks of it this is probably the Bifrost list and missing those
> codes which are Midgard only, whereas panfrost_exception_name() looks
> like it's missing some Bifrost status codes.

Yep, I'll add the missing ones.

> 
> Given this is UAPI there is some argument for missing e.g. INTERRUPTED
> (I'm not sure it was ever actually implemented in hardware and the term
> INTERRUPTED might be reused in future), but it seems a bit wrong just to
> have Bifrost values here.

Definitely, I just didn't notice Midgard and Bifrost had different sets
of exceptions.

> 
> Steve
> 
> > +   DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
> > +   DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
> > +   DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
> > +   DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
> > +   DRM_PANFROST_EXCEPTION_OOM = 0x60,
> > +   DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
> > +   DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
> > +   DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
> > +   DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
> > +   DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
> > +   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
> > +   DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
> > +   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
> > +   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
> > +   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
> > +   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
> > +   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAU

Re: [PATCH v2 02/12] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 15:34:35 +0100
Steven Price  wrote:

> On 21/06/2021 14:38, Boris Brezillon wrote:
> > Exception types will be defined as an enum in panfrost_drm.h so userspace
> > and use the same definitions if needed.  
> 
> s/and/can/ ?
> 
> While it is (currently) unused in the kernel, this is a hardware value
> so I'm not sure why it's worth removing this and not the other
> (currently) unused values here. This is the value returned from the
> JS_STATUS register when the slot is actively processing a job.

Hm, what's the point of having the same value defined in 2 places
(DRM_PANFROST_EXCEPTION_ACTIVE defined in patch 3 vs
JS_STATUS_EVENT_ACTIVE here)? I mean, values defined in the
drm_panfrost_exception_type enum apply to the JS_STATUS registers too,
right?

> 
> Steve
> 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
> >  1 file changed, 3 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h 
> > b/drivers/gpu/drm/panfrost/panfrost_regs.h
> > index dc9df5457f1c..1940ff86e49a 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_regs.h
> > +++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
> > @@ -262,9 +262,6 @@
> >  #define JS_COMMAND_SOFT_STOP_1 0x06/* Execute SOFT_STOP if 
> > JOB_CHAIN_FLAG is 1 */
> >  #define JS_COMMAND_HARD_STOP_1 0x07/* Execute HARD_STOP if 
> > JOB_CHAIN_FLAG is 1 */
> >  
> > -#define JS_STATUS_EVENT_ACTIVE 0x08
> > -
> > -
> >  /* MMU regs */
> >  #define MMU_INT_RAWSTAT0x2000
> >  #define MMU_INT_CLEAR  0x2004
> >   
> 



Re: [PATCH v2 01/12] drm/panfrost: Make sure MMU context lifetime is not bound to panfrost_priv

2021-06-21 Thread Boris Brezillon
On Mon, 21 Jun 2021 15:29:55 +0100
Steven Price  wrote:

> On 21/06/2021 14:57, Alyssa Rosenzweig wrote:
> >> Jobs can be in-flight when the file descriptor is closed (either because
> >> the process did not terminate properly, or because it didn't wait for
> >> all GPU jobs to be finished), and apparently panfrost_job_close() does
> >> not cancel already running jobs. Let's refcount the MMU context object
> >> so its lifetime is no longer bound to the FD lifetime and running jobs
> >> can finish properly without generating spurious page faults.  
> > 
> > Remind me - why can't we hard stop in-flight jobs when the fd is closed?
> > I've seen cases where kill -9'ing a badly behaved process doesn't end
> > the fault storm, or unfreeze the desktop.
> >   
> 
> Hard-stopping the in-flight jobs would also make sense. But unless we
> want to actually hang the close() then there will be a period between
> issuing the hard-stop and actually having completed all jobs in the context.

Patch 10 is doing that, I just didn't want to backport all the
dependencies, so I kept it split in 2 halves: one patch fixing the
use-after-free bug, and the other part killing in-flight jobs.

> 
> But equally to be fair I've been cherry-picking this patch myself for
> quite some time, so we should just merge it and improve from there. So
> you can have my:
> 
> Reviewed-by: Steven Price 



[PATCH v2 1/2] drm/panfrost: Use a threaded IRQ for job interrupts

2021-06-21 Thread Boris Brezillon
This should avoid unnecessary interrupt-context switches when the GPU
is passed a lot of short jobs.

v2:
* New patch

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 54 +
 1 file changed, 38 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index cf6abe0fdf47..1b5c636794a1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -473,19 +473,12 @@ static const struct drm_sched_backend_ops 
panfrost_sched_ops = {
.free_job = panfrost_job_free
 };
 
-static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 {
-   struct panfrost_device *pfdev = data;
-   u32 status = job_read(pfdev, JOB_INT_STAT);
int j;
 
dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
 
-   if (!status)
-   return IRQ_NONE;
-
-   pm_runtime_mark_last_busy(pfdev->dev);
-
for (j = 0; status; j++) {
u32 mask = MK_JS_MASK(j);
 
@@ -558,16 +551,43 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
 
status &= ~mask;
}
+}
 
+static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
+
+   while (status) {
+   pm_runtime_mark_last_busy(pfdev->dev);
+
+   spin_lock(&pfdev->js->job_lock);
+   panfrost_job_handle_irq(pfdev, status);
+   spin_unlock(&pfdev->js->job_lock);
+   status = job_read(pfdev, JOB_INT_RAWSTAT);
+   }
+
+   job_write(pfdev, JOB_INT_MASK, ~0);
return IRQ_HANDLED;
 }
 
+static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+{
+   struct panfrost_device *pfdev = data;
+   u32 status = job_read(pfdev, JOB_INT_STAT);
+
+   if (!status)
+   return IRQ_NONE;
+
+   job_write(pfdev, JOB_INT_MASK, 0);
+   return IRQ_WAKE_THREAD;
+}
+
 static void panfrost_reset(struct work_struct *work)
 {
struct panfrost_device *pfdev = container_of(work,
 struct panfrost_device,
 reset.work);
-   unsigned long flags;
unsigned int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++) {
@@ -595,7 +615,7 @@ static void panfrost_reset(struct work_struct *work)
/* All timers have been stopped, we can safely reset the pending state. 
*/
atomic_set(&pfdev->reset.pending, 0);
 
-   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
if (pfdev->jobs[i]) {
pm_runtime_put_noidle(pfdev->dev);
@@ -603,7 +623,7 @@ static void panfrost_reset(struct work_struct *work)
pfdev->jobs[i] = NULL;
}
}
-   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
+   spin_unlock(&pfdev->js->job_lock);
 
panfrost_device_reset(pfdev);
 
@@ -628,8 +648,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
if (irq <= 0)
return -ENODEV;
 
-   ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
-  IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
+   ret = devm_request_threaded_irq(pfdev->dev, irq,
+   panfrost_job_irq_handler,
+   panfrost_job_irq_handler_thread,
+   IRQF_SHARED, KBUILD_MODNAME "-job",
+   pfdev);
if (ret) {
dev_err(pfdev->dev, "failed to request job irq");
return ret;
@@ -696,14 +719,13 @@ int panfrost_job_open(struct panfrost_file_priv 
*panfrost_priv)
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
struct panfrost_device *pfdev = panfrost_priv->pfdev;
-   unsigned long flags;
int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
 
/* Kill in-flight jobs */
-   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   spin_lock(&pfdev->js->job_lock);
for (i = 0; i < NUM_JOB_SLOTS; i++) {
struct drm_sched_entity *entity = 
&panfrost_priv->sched_entity[i];
struct panfrost_job *job = pfdev->jobs[i];
@@ -713,7 +735,7 @@ void panfrost_job_close(struct panfrost_file_priv 
*panfrost_priv)
 
job_write(pfd

[PATCH v2 2/2] drm/panfrost: Queue jobs on the hardware

2021-06-21 Thread Boris Brezillon
From: Steven Price 

The hardware has a set of '_NEXT' registers that can hold a second job
while the first is executing. Make use of these registers to enqueue a
second job per slot.

v2:
* Make sure non-faulty jobs get properly paused/resumed on GPU reset

Signed-off-by: Steven Price 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 311 -
 2 files changed, 242 insertions(+), 71 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 95e6044008d2..a87917b9e714 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -101,7 +101,7 @@ struct panfrost_device {
 
struct panfrost_job_slot *js;
 
-   struct panfrost_job *jobs[NUM_JOB_SLOTS];
+   struct panfrost_job *jobs[NUM_JOB_SLOTS][2];
struct list_head scheduled_jobs;
 
struct panfrost_perfcnt *perfcnt;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 1b5c636794a1..888eceed227f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,6 +42,7 @@ struct panfrost_queue_state {
 };
 
 struct panfrost_job_slot {
+   int irq;
struct panfrost_queue_state queue[NUM_JOB_SLOTS];
spinlock_t job_lock;
 };
@@ -148,9 +150,43 @@ static void panfrost_job_write_affinity(struct 
panfrost_device *pfdev,
job_write(pfdev, JS_AFFINITY_NEXT_HI(js), affinity >> 32);
 }
 
+static struct panfrost_job *
+panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
+{
+   struct panfrost_job *job = pfdev->jobs[slot][0];
+
+   pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
+   pfdev->jobs[slot][1] = NULL;
+
+   return job;
+}
+
+static unsigned int
+panfrost_enqueue_job(struct panfrost_device *pfdev, int slot,
+struct panfrost_job *job)
+{
+   if (!pfdev->jobs[slot][0]) {
+   pfdev->jobs[slot][0] = job;
+   return 0;
+   }
+
+   WARN_ON(pfdev->jobs[slot][1]);
+   pfdev->jobs[slot][1] = job;
+   return 1;
+}
+
+static u32
+panfrost_get_job_chain_flag(const struct panfrost_job *job)
+{
+   struct panfrost_fence *f = to_panfrost_fence(job->done_fence);
+
+   return (f->seqno & 1) ? JS_CONFIG_JOB_CHAIN_FLAG : 0;
+}
+
 static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 {
struct panfrost_device *pfdev = job->pfdev;
+   unsigned int subslot;
u32 cfg;
u64 jc_head = job->jc;
int ret;
@@ -176,7 +212,8 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
 * start */
cfg |= JS_CONFIG_THREAD_PRI(8) |
JS_CONFIG_START_FLUSH_CLEAN_INVALIDATE |
-   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE;
+   JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE |
+   panfrost_get_job_chain_flag(job);
 
if (panfrost_has_hw_feature(pfdev, HW_FEATURE_FLUSH_REDUCTION))
cfg |= JS_CONFIG_ENABLE_FLUSH_REDUCTION;
@@ -190,10 +227,17 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
job_write(pfdev, JS_FLUSH_ID_NEXT(js), job->flush_id);
 
/* GO ! */
-   dev_dbg(pfdev->dev, "JS: Submitting atom %p to js[%d] with head=0x%llx",
-   job, js, jc_head);
 
-   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   spin_lock(&pfdev->js->job_lock);
+   subslot = panfrost_enqueue_job(pfdev, js, job);
+   /* Don't queue the job if a reset is in progress */
+   if (!atomic_read(&pfdev->reset.pending)) {
+   job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+   dev_dbg(pfdev->dev,
+   "JS: Submitting atom %p to js[%d][%d] with head=0x%llx 
AS %d",
+   job, js, subslot, jc_head, cfg & 0xf);
+   }
+   spin_unlock(&pfdev->js->job_lock);
 }
 
 static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
@@ -351,7 +395,11 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
if (unlikely(job->base.s_fence->finished.error))
return NULL;
 
-   pfdev->jobs[slot] = job;
+   /* Nothing to execute: can happen if the job has finished while
+* we were resetting the GPU.
+*/
+   if (!job->jc)
+   return NULL;
 
fence = panfrost_fence_create(pfdev, slot);
if (IS_ERR(fence))
@@ -475,25 +523,67 @@ static const struct drm_sched_backend_ops 
panfrost_sched_ops = {
 
 static void panfrost_job_handle_irq(struct panfrost_devic

[PATCH v2 0/2] drm/panfrost: Queue jobs on the hardware

2021-06-21 Thread Boris Brezillon
Hello,

I'm resubmitting a patch submitted by Steven a long time ago. Not much
has changed except I realized I'd have to deal with soft-stops to
prevent this HW job queuing feature from killing innocent jobs when
a GPU hang happens. If we don't do that and a non-faulty job was in
progress when the GPU is reset, job headers might end up in an
inconsistent state, preventing them from being re-submitted after
the reset.

Also added an extra patch to use a threaded interrupt for job events
to avoid unneeded interrupt-context switches when short jobs are
queued.

Regards,

Boris

Boris Brezillon (1):
  drm/panfrost: Use a threaded IRQ for job interrupts

Steven Price (1):
  drm/panfrost: Queue jobs on the hardware

 drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 357 -
 2 files changed, 275 insertions(+), 84 deletions(-)

-- 
2.31.1



[PATCH v2 09/12] drm/panfrost: Don't reset the GPU on job faults unless we really have to

2021-06-21 Thread Boris Brezillon
If we can recover from a fault without a reset there's no reason to
issue one.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c |  9 ++
 drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_job.c| 35 ++
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index 2de011cee258..ac76e8646e97 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -383,6 +383,15 @@ int panfrost_exception_to_error(u32 exception_code)
return panfrost_exception_infos[exception_code].error;
 }
 
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code)
+{
+   /* Right now, none of the GPUs we support need a reset, but this
+* might change (e.g. Valhall GPUs require a reset when a BUS_FAULT occurs).
+*/
+   return false;
+}
+
 void panfrost_device_reset(struct panfrost_device *pfdev)
 {
panfrost_gpu_soft_reset(pfdev);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 498c7b5dccd0..95e6044008d2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -175,6 +175,8 @@ int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
 int panfrost_exception_to_error(u32 exception_code);
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code);
 
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index be5d3e4a1d0a..aedc604d331c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -493,27 +493,38 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
 
if (status & JOB_INT_MASK_ERR(j)) {
enum panfrost_queue_status old_status;
+   u32 js_status = job_read(pfdev, JS_STATUS(j));
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(job_read(pfdev, 
JS_STATUS(j))),
+   panfrost_exception_name(js_status),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
-   /*
-* When the queue is being restarted we don't report
-* faults directly to avoid races between the timeout
-* and reset handlers. panfrost_scheduler_start() will
-* call drm_sched_fault() after the queue has been
-* started if status == FAULT_PENDING.
+   /* If we need a reset, signal it to the reset handler,
+* otherwise, update the fence error field and signal
+* the job fence.
 */
-   old_status = atomic_cmpxchg(&pfdev->js->queue[j].status,
-   
PANFROST_QUEUE_STATUS_STARTING,
-   
PANFROST_QUEUE_STATUS_FAULT_PENDING);
-   if (old_status == PANFROST_QUEUE_STATUS_ACTIVE)
-   drm_sched_fault(&pfdev->js->queue[j].sched);
+   if (panfrost_exception_needs_reset(pfdev, js_status)) {
+   /*
+* When the queue is being restarted we don't 
report
+* faults directly to avoid races between the 
timeout
+* and reset handlers. 
panfrost_scheduler_start() will
+* call drm_sched_fault() after the queue has 
been
+* started if status == FAULT_PENDING.
+*/
+   old_status = 
atomic_cmpxchg(&pfdev->js->queue[j].status,
+   
PANFROST_QUEUE_STATUS_STARTING,
+   
PANFROST_QUEUE_STATUS_FAULT_PENDING);
+   if (old_status == PANFROST_QUEUE_STATUS_ACTIVE)
+   
drm_sched_fault(&pfdev->js->queue[j].sched);
+   } else {
+ 

[PATCH v2 12/12] drm/panfrost: Shorten the fence signalling section

2021-06-21 Thread Boris Brezillon
panfrost_reset() does not directly signal fences, but
panfrost_scheduler_start() does, when calling drm_sched_start().

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 74b63e1ee6d9..cf6abe0fdf47 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -414,6 +414,7 @@ static bool panfrost_scheduler_stop(struct 
panfrost_queue_state *queue,
 static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
 {
enum panfrost_queue_status old_status;
+   bool cookie;
 
mutex_lock(&queue->lock);
old_status = atomic_xchg(&queue->status,
@@ -423,7 +424,9 @@ static void panfrost_scheduler_start(struct 
panfrost_queue_state *queue)
/* Restore the original timeout before starting the scheduler. */
queue->sched.timeout = msecs_to_jiffies(JOB_TIMEOUT_MS);
drm_sched_resubmit_jobs(&queue->sched);
+   cookie = dma_fence_begin_signalling();
drm_sched_start(&queue->sched, true);
+   dma_fence_end_signalling(cookie);
old_status = atomic_xchg(&queue->status,
 PANFROST_QUEUE_STATUS_ACTIVE);
if (old_status == PANFROST_QUEUE_STATUS_FAULT_PENDING)
@@ -566,9 +569,7 @@ static void panfrost_reset(struct work_struct *work)
 reset.work);
unsigned long flags;
unsigned int i;
-   bool cookie;
 
-   cookie = dma_fence_begin_signalling();
for (i = 0; i < NUM_JOB_SLOTS; i++) {
/*
 * We want pending timeouts to be handled before we attempt
@@ -608,8 +609,6 @@ static void panfrost_reset(struct work_struct *work)
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
panfrost_scheduler_start(&pfdev->js->queue[i]);
-
-   dma_fence_end_signalling(cookie);
 }
 
 int panfrost_job_init(struct panfrost_device *pfdev)
-- 
2.31.1



[PATCH v2 10/12] drm/panfrost: Kill in-flight jobs on FD close

2021-06-21 Thread Boris Brezillon
If the process that submitted these jobs closes the FD before the jobs
are done, it probably means it doesn't care about the result.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 33 +
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index aedc604d331c..a51fa0a81367 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -494,14 +494,22 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
if (status & JOB_INT_MASK_ERR(j)) {
enum panfrost_queue_status old_status;
u32 js_status = job_read(pfdev, JS_STATUS(j));
+   int error = panfrost_exception_to_error(js_status);
+   const char *exception_name = 
panfrost_exception_name(js_status);
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
-   dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
-   j,
-   panfrost_exception_name(js_status),
-   job_read(pfdev, JS_HEAD_LO(j)),
-   job_read(pfdev, JS_TAIL_LO(j)));
+   if (!error) {
+   dev_dbg(pfdev->dev, "js interrupt, js=%d, 
status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   } else {
+   dev_err(pfdev->dev, "js fault, js=%d, 
status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   }
 
/* If we need a reset, signal it to the reset handler,
 * otherwise, update the fence error field and signal
@@ -688,10 +696,25 @@ int panfrost_job_open(struct panfrost_file_priv 
*panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+   struct panfrost_device *pfdev = panfrost_priv->pfdev;
+   unsigned long flags;
int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+
+   /* Kill in-flight jobs */
+   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   struct drm_sched_entity *entity = 
&panfrost_priv->sched_entity[i];
+   struct panfrost_job *job = pfdev->jobs[i];
+
+   if (!job || job->base.entity != entity)
+   continue;
+
+   job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
+   }
+   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)
-- 
2.31.1



[PATCH v2 02/12] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition

2021-06-21 Thread Boris Brezillon
Exception types will be defined as an enum in panfrost_drm.h so userspace
can use the same definitions if needed.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h 
b/drivers/gpu/drm/panfrost/panfrost_regs.h
index dc9df5457f1c..1940ff86e49a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -262,9 +262,6 @@
 #define JS_COMMAND_SOFT_STOP_1 0x06/* Execute SOFT_STOP if 
JOB_CHAIN_FLAG is 1 */
 #define JS_COMMAND_HARD_STOP_1 0x07/* Execute HARD_STOP if 
JOB_CHAIN_FLAG is 1 */
 
-#define JS_STATUS_EVENT_ACTIVE 0x08
-
-
 /* MMU regs */
 #define MMU_INT_RAWSTAT0x2000
 #define MMU_INT_CLEAR  0x2004
-- 
2.31.1



[PATCH v2 06/12] drm/panfrost: Expose a helper to trigger a GPU reset

2021-06-21 Thread Boris Brezillon
Expose a helper to trigger a GPU reset so reset operations can easily
be requested outside the job timeout handler.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h | 8 
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 +---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 2fe1550da7f8..1c6a3597eba0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -175,4 +175,12 @@ int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
 
+static inline void
+panfrost_device_schedule_reset(struct panfrost_device *pfdev)
+{
+   /* Schedule a reset if there's no reset in progress. */
+   if (!atomic_xchg(&pfdev->reset.pending, 1))
+   schedule_work(&pfdev->reset.work);
+}
+
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 1be80b3dd5d0..be5d3e4a1d0a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -458,9 +458,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct 
drm_sched_job
if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
return DRM_GPU_SCHED_STAT_NOMINAL;
 
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   panfrost_device_schedule_reset(pfdev);
 
return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-- 
2.31.1



[PATCH v2 11/12] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate

2021-06-21 Thread Boris Brezillon
If the fence creation fails, we can return the error pointer directly.
The core will update the fence error accordingly.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index a51fa0a81367..74b63e1ee6d9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -355,7 +355,7 @@ static struct dma_fence *panfrost_job_run(struct 
drm_sched_job *sched_job)
 
fence = panfrost_fence_create(pfdev, slot);
if (IS_ERR(fence))
-   return NULL;
+   return fence;
 
if (job->done_fence)
dma_fence_put(job->done_fence);
-- 
2.31.1



[PATCH v2 05/12] drm/panfrost: Disable the AS on unhandled page faults

2021-06-21 Thread Boris Brezillon
If we don't do that, we have to wait for the job timeout to expire
before the faulty job gets killed.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 2a9bf30edc9d..d5c624e776f1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -661,7 +661,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 
0xC0)
ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
 
-   if (ret)
+   if (ret) {
/* terminal fault, print info about the fault */
dev_err(pfdev->dev,
"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -679,6 +679,10 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
irq, void *data)
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
+   /* Disable the MMU to stop jobs on this AS immediately 
*/
+   panfrost_mmu_disable(pfdev, as);
+   }
+
status &= ~mask;
 
/* If we received new MMU interrupts, process them before 
returning. */
-- 
2.31.1



[PATCH v2 07/12] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck

2021-06-21 Thread Boris Brezillon
Things are unlikely to resolve until we reset the GPU. Let's not wait
for other faults/timeouts before triggering this reset.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d5c624e776f1..d20bcaecb78f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -36,8 +36,11 @@ static int wait_ready(struct panfrost_device *pfdev, u32 
as_nr)
ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
 
-   if (ret)
+   if (ret) {
+   /* The GPU hung, let's trigger a reset */
+   panfrost_device_schedule_reset(pfdev);
dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+   }
 
return ret;
 }
-- 
2.31.1



[PATCH v2 04/12] drm/panfrost: Expose exception types to userspace

2021-06-21 Thread Boris Brezillon
Job headers contain an exception type field which might be read and
converted to a human-readable string by tracing tools. Let's expose
the exception type as an enum so we share the same definition.

Signed-off-by: Boris Brezillon 
---
 include/uapi/drm/panfrost_drm.h | 65 +
 1 file changed, 65 insertions(+)

diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index 061e700dd06c..9a05d57d0118 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -224,6 +224,71 @@ struct drm_panfrost_madvise {
__u32 retained;   /* out, whether backing store still exists */
 };
 
+/* The exception types */
+
+enum drm_panfrost_exception_type {
+   DRM_PANFROST_EXCEPTION_OK = 0x00,
+   DRM_PANFROST_EXCEPTION_DONE = 0x01,
+   DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
+   DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
+   DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
+   DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
+   DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
+   DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
+   DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
+   DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
+   DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
+   DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
+   DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
+   DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
+   DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
+   DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
+   DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
+   DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
+   DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
+   DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
+   DRM_PANFROST_EXCEPTION_OOM = 0x60,
+   DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
+   DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
+   DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
+   DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
+   DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.31.1



[PATCH v2 08/12] drm/panfrost: Do the exception -> string translation using a table

2021-06-21 Thread Boris Brezillon
Do the exception -> string translation using a table so we can add extra
fields if we need to. While at it, add an error field to ease the
exception -> error conversion, which we'll need if we want to set the
fence error to something that reflects the exception code.

TODO: fix the error codes.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 134 +
 drivers/gpu/drm/panfrost/panfrost_device.h |   1 +
 2 files changed, 88 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index f7f5ca94f910..2de011cee258 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,55 +292,95 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(u32 exception_code)
-{
-   switch (exception_code) {
-   /* Non-Fault Status code */
-   case 0x00: return "NOT_STARTED/IDLE/OK";
-   case 0x01: return "DONE";
-   case 0x02: return "INTERRUPTED";
-   case 0x03: return "STOPPED";
-   case 0x04: return "TERMINATED";
-   case 0x08: return "ACTIVE";
-   /* Job exceptions */
-   case 0x40: return "JOB_CONFIG_FAULT";
-   case 0x41: return "JOB_POWER_FAULT";
-   case 0x42: return "JOB_READ_FAULT";
-   case 0x43: return "JOB_WRITE_FAULT";
-   case 0x44: return "JOB_AFFINITY_FAULT";
-   case 0x48: return "JOB_BUS_FAULT";
-   case 0x50: return "INSTR_INVALID_PC";
-   case 0x51: return "INSTR_INVALID_ENC";
-   case 0x52: return "INSTR_TYPE_MISMATCH";
-   case 0x53: return "INSTR_OPERAND_FAULT";
-   case 0x54: return "INSTR_TLS_FAULT";
-   case 0x55: return "INSTR_BARRIER_FAULT";
-   case 0x56: return "INSTR_ALIGN_FAULT";
-   case 0x58: return "DATA_INVALID_FAULT";
-   case 0x59: return "TILE_RANGE_FAULT";
-   case 0x5A: return "ADDR_RANGE_FAULT";
-   case 0x60: return "OUT_OF_MEMORY";
-   /* GPU exceptions */
-   case 0x80: return "DELAYED_BUS_FAULT";
-   case 0x88: return "SHAREABILITY_FAULT";
-   /* MMU exceptions */
-   case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
-   case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
-   case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
-   case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
-   case 0xC8: return "PERMISSION_FAULT";
-   case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
-   case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
-   case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
-   case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
-   case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
-   case 0xD8: return "ACCESS_FLAG";
-   case 0xD9 ... 0xDF: return "ACCESS_FLAG";
-   case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
-   case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
+#define PANFROST_EXCEPTION(id, err) \
+   [DRM_PANFROST_EXCEPTION_ ## id] = { \
+   .name = #id, \
+   .error = err, \
}
 
-   return "UNKNOWN";
+struct panfrost_exception_info {
+   const char *name;
+   int error;
+};
+
+static const struct panfrost_exception_info panfrost_exception_infos[] = {
+   PANFROST_EXCEPTION(OK, 0),
+   PANFROST_EXCEPTION(DONE, 0),
+   PANFROST_EXCEPTION(STOPPED, 0),
+   PANFROST_EXCEPTION(TERMINATED, 0),
+   PANFROST_EXCEPTION(KABOOM, 0),
+   PANFROST_EXCEPTION(EUREKA, 0),
+   PANFROST_EXCEPTION(ACTIVE, 0),
+   PANFROST_EXCEPTION(JOB_CONFIG_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_POWER_FAULT, -ECANCELED),
+   PANFROST_EXCEPTION(JOB_READ_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_WRITE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_AFFINITY_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_BUS_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(INSTR_INVALID_PC, -EINVAL),
+   PANFROST_EXCEPTION(INSTR_INVALID_ENC, -EINVAL),
+   PANFROST_EXCEPTION(INSTR_BARRIER_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(DATA_INVALID_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(TILE_RANGE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(ADDR_RANGE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(IMPRECISE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(OOM, -ENOMEM),
+   PANFROST_EXCEPTION(UNKNOWN, -EINVAL),
+   PANFROST_EXCEPTION(DELAYED_BUS_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(GPU_SHAREABILITY_FAULT, -ECANCELED),
+   PANFROST_EXCEPTION(SYS_SHAREABILITY_FAULT, -ECANCELE

[PATCH v2 03/12] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()

2021-06-21 Thread Boris Brezillon
Currently unused. We'll add it back if we need per-GPU definitions.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 2 +-
 drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index a2a09c51eed7..f7f5ca94f910 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,7 +292,7 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 
exception_code)
+const char *panfrost_exception_name(u32 exception_code)
 {
switch (exception_code) {
/* Non-Fault Status code */
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 8b2cdb8c701d..2fe1550da7f8 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -173,6 +173,6 @@ void panfrost_device_reset(struct panfrost_device *pfdev);
 int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 
exception_code);
+const char *panfrost_exception_name(u32 exception_code);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c 
b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 0e70e27fd8c3..26e4196b6c90 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -33,7 +33,7 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void 
*data)
address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
 
dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
-fault_status & 0xFF, panfrost_exception_name(pfdev, 
fault_status),
+fault_status & 0xFF, 
panfrost_exception_name(fault_status),
 address);
 
if (state & GPU_IRQ_MULTIPLE_FAULT)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 3757c6eb3023..1be80b3dd5d0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -500,7 +500,7 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(pfdev, job_read(pfdev, 
JS_STATUS(j))),
+   panfrost_exception_name(job_read(pfdev, 
JS_STATUS(j))),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 569509c2ba27..2a9bf30edc9d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -675,7 +675,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
"TODO",
fault_status,
(fault_status & (1 << 10) ? "DECODER FAULT" : 
"SLAVE FAULT"),
-   exception_type, panfrost_exception_name(pfdev, 
exception_type),
+   exception_type, 
panfrost_exception_name(exception_type),
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
-- 
2.31.1



[PATCH v2 01/12] drm/panfrost: Make sure MMU context lifetime is not bound to panfrost_priv

2021-06-21 Thread Boris Brezillon
Jobs can be in-flight when the file descriptor is closed (either because
the process did not terminate properly, or because it didn't wait for
all GPU jobs to be finished), and apparently panfrost_job_close() does
not cancel already running jobs. Let's refcount the MMU context object
so its lifetime is no longer bound to the FD lifetime and running jobs
can finish properly without generating spurious page faults.

Reported-by: Icecream95 
Fixes: 7282f7645d06 ("drm/panfrost: Implement per FD address spaces")
Cc: 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   8 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c|  50 ++-
 drivers/gpu/drm/panfrost/panfrost_gem.c|  20 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c|   4 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 160 ++---
 drivers/gpu/drm/panfrost/panfrost_mmu.h|   5 +-
 6 files changed, 136 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index f614e98771e4..8b2cdb8c701d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -121,8 +121,12 @@ struct panfrost_device {
 };
 
 struct panfrost_mmu {
+   struct panfrost_device *pfdev;
+   struct kref refcount;
struct io_pgtable_cfg pgtbl_cfg;
struct io_pgtable_ops *pgtbl_ops;
+   struct drm_mm mm;
+   spinlock_t mm_lock;
int as;
atomic_t as_count;
struct list_head list;
@@ -133,9 +137,7 @@ struct panfrost_file_priv {
 
struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
 
-   struct panfrost_mmu mmu;
-   struct drm_mm mm;
-   spinlock_t mm_lock;
+   struct panfrost_mmu *mmu;
 };
 
 static inline struct panfrost_device *to_panfrost_device(struct drm_device 
*ddev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 075ec0ef746c..945133db1857 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -417,7 +417,7 @@ static int panfrost_ioctl_madvise(struct drm_device *dev, 
void *data,
 * anyway, so let's not bother.
 */
if (!list_is_singular(&bo->mappings.list) ||
-   WARN_ON_ONCE(first->mmu != &priv->mmu)) {
+   WARN_ON_ONCE(first->mmu != priv->mmu)) {
ret = -EINVAL;
goto out_unlock_mappings;
}
@@ -449,32 +449,6 @@ int panfrost_unstable_ioctl_check(void)
return 0;
 }
 
-#define PFN_4G (SZ_4G >> PAGE_SHIFT)
-#define PFN_4G_MASK(PFN_4G - 1)
-#define PFN_16M(SZ_16M >> PAGE_SHIFT)
-
-static void panfrost_drm_mm_color_adjust(const struct drm_mm_node *node,
-unsigned long color,
-u64 *start, u64 *end)
-{
-   /* Executable buffers can't start or end on a 4GB boundary */
-   if (!(color & PANFROST_BO_NOEXEC)) {
-   u64 next_seg;
-
-   if ((*start & PFN_4G_MASK) == 0)
-   (*start)++;
-
-   if ((*end & PFN_4G_MASK) == 0)
-   (*end)--;
-
-   next_seg = ALIGN(*start, PFN_4G);
-   if (next_seg - *start <= PFN_16M)
-   *start = next_seg + 1;
-
-   *end = min(*end, ALIGN(*start, PFN_4G) - 1);
-   }
-}
-
 static int
 panfrost_open(struct drm_device *dev, struct drm_file *file)
 {
@@ -489,15 +463,11 @@ panfrost_open(struct drm_device *dev, struct drm_file 
*file)
panfrost_priv->pfdev = pfdev;
file->driver_priv = panfrost_priv;
 
-   spin_lock_init(&panfrost_priv->mm_lock);
-
-   /* 4G enough for now. can be 48-bit */
-   drm_mm_init(&panfrost_priv->mm, SZ_32M >> PAGE_SHIFT, (SZ_4G - SZ_32M) 
>> PAGE_SHIFT);
-   panfrost_priv->mm.color_adjust = panfrost_drm_mm_color_adjust;
-
-   ret = panfrost_mmu_pgtable_alloc(panfrost_priv);
-   if (ret)
-   goto err_pgtable;
+   panfrost_priv->mmu = panfrost_mmu_ctx_create(pfdev);
+   if (IS_ERR(panfrost_priv->mmu)) {
+   ret = PTR_ERR(panfrost_priv->mmu);
+   goto err_free;
+   }
 
ret = panfrost_job_open(panfrost_priv);
if (ret)
@@ -506,9 +476,8 @@ panfrost_open(struct drm_device *dev, struct drm_file *file)
return 0;
 
 err_job:
-   panfrost_mmu_pgtable_free(panfrost_priv);
-err_pgtable:
-   drm_mm_takedown(&panfrost_priv->mm);
+   panfrost_mmu_ctx_put(panfrost_priv->mmu);
+err_free:
kfree(panfrost_priv);
return ret;
 }
@@ -521,8 +490,7 @@ panfrost_postclose(struct drm_device *dev, struct drm_file 
*file)
panfrost_perfcnt_close(file)

[PATCH v2 00/12] drm/panfrost: Misc fixes/improvements

2021-06-21 Thread Boris Brezillon
Hello,

Sorry for the noise, but I forgot 2 fixes (one fixing the error
set to the sched fence when the driver fence allocation fails,
and the other one shrinking the dma signalling section to get
rid of spurious lockdep warnings).

The rest of the series hasn't changed.

The first patch was submitted a while ago but was lacking a way
to kill in-flight jobs when a context is closed, which is now addressed
in patch 10.

The rest of those patches are improving fault handling (with some code
refactoring in the middle).

"drm/panfrost: Do the exception -> string translation using a table"
still has a TODO. I basically mapped all exception types to
EINVAL since most faults are triggered by invalid jobs/shaders, but
there might be some exceptions that should be translated to something
else. Any feedback on that aspect is welcome.

Regards,

Boris

Changes in v2:
* Added patch 11 and 12

Boris Brezillon (12):
  drm/panfrost: Make sure MMU context lifetime is not bound to
panfrost_priv
  drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  drm/panfrost: Drop the pfdev argument passed to
panfrost_exception_name()
  drm/panfrost: Expose exception types to userspace
  drm/panfrost: Disable the AS on unhandled page faults
  drm/panfrost: Expose a helper to trigger a GPU reset
  drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  drm/panfrost: Do the exception -> string translation using a table
  drm/panfrost: Don't reset the GPU on job faults unless we really have
to
  drm/panfrost: Kill in-flight jobs on FD close
  drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  drm/panfrost: Shorten the fence signalling section

 drivers/gpu/drm/panfrost/panfrost_device.c | 143 +++--
 drivers/gpu/drm/panfrost/panfrost_device.h |  21 ++-
 drivers/gpu/drm/panfrost/panfrost_drv.c|  50 ++
 drivers/gpu/drm/panfrost/panfrost_gem.c|  20 ++-
 drivers/gpu/drm/panfrost/panfrost_gpu.c|   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c|  83 ++
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 173 ++---
 drivers/gpu/drm/panfrost/panfrost_mmu.h|   5 +-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |   3 -
 include/uapi/drm/panfrost_drm.h|  65 
 10 files changed, 375 insertions(+), 190 deletions(-)

-- 
2.31.1



[PATCH 03/10] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()

2021-06-21 Thread Boris Brezillon
Currently unused. We'll add it back if we need per-GPU definitions.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 2 +-
 drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c| 2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index a2a09c51eed7..f7f5ca94f910 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,7 +292,7 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 
exception_code)
+const char *panfrost_exception_name(u32 exception_code)
 {
switch (exception_code) {
/* Non-Fault Status code */
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 8b2cdb8c701d..2fe1550da7f8 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -173,6 +173,6 @@ void panfrost_device_reset(struct panfrost_device *pfdev);
 int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 
exception_code);
+const char *panfrost_exception_name(u32 exception_code);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c 
b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 0e70e27fd8c3..26e4196b6c90 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -33,7 +33,7 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void 
*data)
address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
 
dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
-fault_status & 0xFF, panfrost_exception_name(pfdev, 
fault_status),
+fault_status & 0xFF, 
panfrost_exception_name(fault_status),
 address);
 
if (state & GPU_IRQ_MULTIPLE_FAULT)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 3757c6eb3023..1be80b3dd5d0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -500,7 +500,7 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(pfdev, job_read(pfdev, 
JS_STATUS(j))),
+   panfrost_exception_name(job_read(pfdev, 
JS_STATUS(j))),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 569509c2ba27..2a9bf30edc9d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -675,7 +675,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
"TODO",
fault_status,
(fault_status & (1 << 10) ? "DECODER FAULT" : 
"SLAVE FAULT"),
-   exception_type, panfrost_exception_name(pfdev, 
exception_type),
+   exception_type, 
panfrost_exception_name(exception_type),
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
-- 
2.31.1



[PATCH 05/10] drm/panfrost: Disable the AS on unhandled page faults

2021-06-21 Thread Boris Brezillon
If we don't do that, we have to wait for the job timeout to expire
before the faulty job gets killed.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 2a9bf30edc9d..d5c624e776f1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -661,7 +661,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, 
void *data)
if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 
0xC0)
ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
 
-   if (ret)
+   if (ret) {
/* terminal fault, print info about the fault */
dev_err(pfdev->dev,
"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -679,6 +679,10 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int 
irq, void *data)
access_type, access_type_name(pfdev, 
fault_status),
source_id);
 
+   /* Disable the MMU to stop jobs on this AS immediately 
*/
+   panfrost_mmu_disable(pfdev, as);
+   }
+
status &= ~mask;
 
/* If we received new MMU interrupts, process them before 
returning. */
-- 
2.31.1



[PATCH 08/10] drm/panfrost: Do the exception -> string translation using a table

2021-06-21 Thread Boris Brezillon
Do the exception -> string translation using a table so we can add extra
fields if we need to. While at it, add an error field to ease the
exception -> error conversion, which we'll need if we want to set the
fence error to something that reflects the exception code.

TODO: fix the error codes.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 134 +
 drivers/gpu/drm/panfrost/panfrost_device.h |   1 +
 2 files changed, 88 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c 
b/drivers/gpu/drm/panfrost/panfrost_device.c
index f7f5ca94f910..2de011cee258 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,55 +292,95 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(u32 exception_code)
-{
-   switch (exception_code) {
-   /* Non-Fault Status code */
-   case 0x00: return "NOT_STARTED/IDLE/OK";
-   case 0x01: return "DONE";
-   case 0x02: return "INTERRUPTED";
-   case 0x03: return "STOPPED";
-   case 0x04: return "TERMINATED";
-   case 0x08: return "ACTIVE";
-   /* Job exceptions */
-   case 0x40: return "JOB_CONFIG_FAULT";
-   case 0x41: return "JOB_POWER_FAULT";
-   case 0x42: return "JOB_READ_FAULT";
-   case 0x43: return "JOB_WRITE_FAULT";
-   case 0x44: return "JOB_AFFINITY_FAULT";
-   case 0x48: return "JOB_BUS_FAULT";
-   case 0x50: return "INSTR_INVALID_PC";
-   case 0x51: return "INSTR_INVALID_ENC";
-   case 0x52: return "INSTR_TYPE_MISMATCH";
-   case 0x53: return "INSTR_OPERAND_FAULT";
-   case 0x54: return "INSTR_TLS_FAULT";
-   case 0x55: return "INSTR_BARRIER_FAULT";
-   case 0x56: return "INSTR_ALIGN_FAULT";
-   case 0x58: return "DATA_INVALID_FAULT";
-   case 0x59: return "TILE_RANGE_FAULT";
-   case 0x5A: return "ADDR_RANGE_FAULT";
-   case 0x60: return "OUT_OF_MEMORY";
-   /* GPU exceptions */
-   case 0x80: return "DELAYED_BUS_FAULT";
-   case 0x88: return "SHAREABILITY_FAULT";
-   /* MMU exceptions */
-   case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
-   case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
-   case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
-   case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
-   case 0xC8: return "PERMISSION_FAULT";
-   case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
-   case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
-   case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
-   case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
-   case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
-   case 0xD8: return "ACCESS_FLAG";
-   case 0xD9 ... 0xDF: return "ACCESS_FLAG";
-   case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
-   case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
+#define PANFROST_EXCEPTION(id, err) \
+   [DRM_PANFROST_EXCEPTION_ ## id] = { \
+   .name = #id, \
+   .error = err, \
}
 
-   return "UNKNOWN";
+struct panfrost_exception_info {
+   const char *name;
+   int error;
+};
+
+static const struct panfrost_exception_info panfrost_exception_infos[] = {
+   PANFROST_EXCEPTION(OK, 0),
+   PANFROST_EXCEPTION(DONE, 0),
+   PANFROST_EXCEPTION(STOPPED, 0),
+   PANFROST_EXCEPTION(TERMINATED, 0),
+   PANFROST_EXCEPTION(KABOOM, 0),
+   PANFROST_EXCEPTION(EUREKA, 0),
+   PANFROST_EXCEPTION(ACTIVE, 0),
+   PANFROST_EXCEPTION(JOB_CONFIG_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_POWER_FAULT, -ECANCELED),
+   PANFROST_EXCEPTION(JOB_READ_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_WRITE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_AFFINITY_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(JOB_BUS_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(INSTR_INVALID_PC, -EINVAL),
+   PANFROST_EXCEPTION(INSTR_INVALID_ENC, -EINVAL),
+   PANFROST_EXCEPTION(INSTR_BARRIER_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(DATA_INVALID_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(TILE_RANGE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(ADDR_RANGE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(IMPRECISE_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(OOM, -ENOMEM),
+   PANFROST_EXCEPTION(UNKNOWN, -EINVAL),
+   PANFROST_EXCEPTION(DELAYED_BUS_FAULT, -EINVAL),
+   PANFROST_EXCEPTION(GPU_SHAREABILITY_FAULT, -ECANCELED),
+   PANFROST_EXCEPTION(SYS_SHAREABILITY_FAULT, -ECANCELE

[PATCH 10/10] drm/panfrost: Kill in-flight jobs on FD close

2021-06-21 Thread Boris Brezillon
If the process that submitted these jobs decides to close the FD before
the jobs are done, it probably means it doesn't care about the result.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 33 +
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index aedc604d331c..a51fa0a81367 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -494,14 +494,22 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void 
*data)
if (status & JOB_INT_MASK_ERR(j)) {
enum panfrost_queue_status old_status;
u32 js_status = job_read(pfdev, JS_STATUS(j));
+   int error = panfrost_exception_to_error(js_status);
+   const char *exception_name = 
panfrost_exception_name(js_status);
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
-   dev_err(pfdev->dev, "js fault, js=%d, status=%s, 
head=0x%x, tail=0x%x",
-   j,
-   panfrost_exception_name(js_status),
-   job_read(pfdev, JS_HEAD_LO(j)),
-   job_read(pfdev, JS_TAIL_LO(j)));
+   if (!error) {
+   dev_dbg(pfdev->dev, "js interrupt, js=%d, 
status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   } else {
+   dev_err(pfdev->dev, "js fault, js=%d, 
status=%s, head=0x%x, tail=0x%x",
+   j, exception_name,
+   job_read(pfdev, JS_HEAD_LO(j)),
+   job_read(pfdev, JS_TAIL_LO(j)));
+   }
 
/* If we need a reset, signal it to the reset handler,
 * otherwise, update the fence error field and signal
@@ -688,10 +696,25 @@ int panfrost_job_open(struct panfrost_file_priv 
*panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+   struct panfrost_device *pfdev = panfrost_priv->pfdev;
+   unsigned long flags;
int i;
 
for (i = 0; i < NUM_JOB_SLOTS; i++)
drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+
+   /* Kill in-flight jobs */
+   spin_lock_irqsave(&pfdev->js->job_lock, flags);
+   for (i = 0; i < NUM_JOB_SLOTS; i++) {
+   struct drm_sched_entity *entity = 
&panfrost_priv->sched_entity[i];
+   struct panfrost_job *job = pfdev->jobs[i];
+
+   if (!job || job->base.entity != entity)
+   continue;
+
+   job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
+   }
+   spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)
-- 
2.31.1



[PATCH 04/10] drm/panfrost: Expose exception types to userspace

2021-06-21 Thread Boris Brezillon
Job headers contain an exception type field which might be read and
converted to a human-readable string by tracing tools. Let's expose
the exception type as an enum so we share the same definition.

Signed-off-by: Boris Brezillon 
---
 include/uapi/drm/panfrost_drm.h | 65 +
 1 file changed, 65 insertions(+)

diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index 061e700dd06c..9a05d57d0118 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -224,6 +224,71 @@ struct drm_panfrost_madvise {
__u32 retained;   /* out, whether backing store still exists */
 };
 
+/* The exception types */
+
+enum drm_panfrost_exception_type {
+   DRM_PANFROST_EXCEPTION_OK = 0x00,
+   DRM_PANFROST_EXCEPTION_DONE = 0x01,
+   DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
+   DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
+   DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
+   DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
+   DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
+   DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
+   DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
+   DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
+   DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
+   DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
+   DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
+   DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
+   DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
+   DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
+   DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
+   DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
+   DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
+   DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
+   DRM_PANFROST_EXCEPTION_OOM = 0x60,
+   DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
+   DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
+   DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
+   DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
+   DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
+   DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
+   DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
+   DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
+   DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
+   DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
+   DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.31.1



[PATCH 07/10] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck

2021-06-21 Thread Boris Brezillon
Things are unlikely to resolve until we reset the GPU. Let's not wait
for other faults/timeouts to happen before triggering this reset.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d5c624e776f1..d20bcaecb78f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -36,8 +36,11 @@ static int wait_ready(struct panfrost_device *pfdev, u32 as_nr)
ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
 
-   if (ret)
+   if (ret) {
+   /* The GPU hung, let's trigger a reset */
+   panfrost_device_schedule_reset(pfdev);
dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+   }
 
return ret;
 }
-- 
2.31.1
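The wait_ready() change above hinges on the poll-with-timeout pattern that readl_relaxed_poll_timeout_atomic() implements. A minimal user-space analogue (hypothetical names; a fake status register stands in for the real AS_STATUS read, and the iteration cap stands in for the timeout):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define AS_STATUS_AS_ACTIVE (1u << 0)

/* Fake register: the AS_ACTIVE bit clears on the third poll. */
static uint32_t fake_status = AS_STATUS_AS_ACTIVE;
static int polls;

static uint32_t read_status_reg(void)
{
	if (++polls >= 3)
		fake_status = 0;
	return fake_status;
}

/* Analogue of readl_relaxed_poll_timeout_atomic(): poll until the
 * AS_ACTIVE bit clears or we give up, in which case the caller would
 * schedule a GPU reset (panfrost_device_schedule_reset()). */
static int wait_as_idle(unsigned int max_polls)
{
	while (max_polls--) {
		if (!(read_status_reg() & AS_STATUS_AS_ACTIVE))
			return 0;
	}
	return -ETIMEDOUT;
}
```

The real macro additionally inserts a relaxed delay between polls and reads a wall-clock deadline instead of counting iterations.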



[PATCH 09/10] drm/panfrost: Don't reset the GPU on job faults unless we really have to

2021-06-21 Thread Boris Brezillon
If we can recover from a fault without a reset there's no reason to
issue one.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.c |  9 ++
 drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_job.c| 35 ++
 3 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index 2de011cee258..ac76e8646e97 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -383,6 +383,15 @@ int panfrost_exception_to_error(u32 exception_code)
return panfrost_exception_infos[exception_code].error;
 }
 
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code)
+{
+   /* Right now, none of the GPUs we support needs a reset, but this
+* might change (e.g. Valhall GPUs require a reset when a BUS_FAULT occurs).
+*/
+   return false;
+}
+
 void panfrost_device_reset(struct panfrost_device *pfdev)
 {
panfrost_gpu_soft_reset(pfdev);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 498c7b5dccd0..95e6044008d2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -175,6 +175,8 @@ int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
 int panfrost_exception_to_error(u32 exception_code);
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+   u32 exception_code);
 
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index be5d3e4a1d0a..aedc604d331c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -493,27 +493,38 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 
if (status & JOB_INT_MASK_ERR(j)) {
enum panfrost_queue_status old_status;
+   u32 js_status = job_read(pfdev, JS_STATUS(j));
 
job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
j,
-   panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
+   panfrost_exception_name(js_status),
job_read(pfdev, JS_HEAD_LO(j)),
job_read(pfdev, JS_TAIL_LO(j)));
 
-   /*
-* When the queue is being restarted we don't report
-* faults directly to avoid races between the timeout
-* and reset handlers. panfrost_scheduler_start() will
-* call drm_sched_fault() after the queue has been
-* started if status == FAULT_PENDING.
+   /* If we need a reset, signal it to the reset handler,
+* otherwise, update the fence error field and signal
+* the job fence.
 */
-   old_status = atomic_cmpxchg(&pfdev->js->queue[j].status,
-   PANFROST_QUEUE_STATUS_STARTING,
-   PANFROST_QUEUE_STATUS_FAULT_PENDING);
-   if (old_status == PANFROST_QUEUE_STATUS_ACTIVE)
-   drm_sched_fault(&pfdev->js->queue[j].sched);
+   if (panfrost_exception_needs_reset(pfdev, js_status)) {
+   /*
+* When the queue is being restarted we don't report
+* faults directly to avoid races between the timeout
+* and reset handlers. panfrost_scheduler_start() will
+* call drm_sched_fault() after the queue has been
+* started if status == FAULT_PENDING.
+*/
+   old_status = atomic_cmpxchg(&pfdev->js->queue[j].status,
+   PANFROST_QUEUE_STATUS_STARTING,
+   PANFROST_QUEUE_STATUS_FAULT_PENDING);
+   if (old_status == PANFROST_QUEUE_STATUS_ACTIVE)
+   drm_sched_fault(&pfdev->js->queue[j].sched);
+   } else {
+ 

[PATCH 06/10] drm/panfrost: Expose a helper to trigger a GPU reset

2021-06-21 Thread Boris Brezillon
Expose a helper to trigger a GPU reset so we can easily trigger reset
operations outside the job timeout handler.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h | 8 
 drivers/gpu/drm/panfrost/panfrost_job.c| 4 +---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 2fe1550da7f8..1c6a3597eba0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -175,4 +175,12 @@ int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
 
+static inline void
+panfrost_device_schedule_reset(struct panfrost_device *pfdev)
+{
+   /* Schedule a reset if there's no reset in progress. */
+   if (!atomic_xchg(&pfdev->reset.pending, 1))
+   schedule_work(&pfdev->reset.work);
+}
+
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 1be80b3dd5d0..be5d3e4a1d0a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -458,9 +458,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
return DRM_GPU_SCHED_STAT_NOMINAL;
 
-   /* Schedule a reset if there's no reset in progress. */
-   if (!atomic_xchg(&pfdev->reset.pending, 1))
-   schedule_work(&pfdev->reset.work);
+   panfrost_device_schedule_reset(pfdev);
 
return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-- 
2.31.1
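The atomic_xchg() in the helper above is what makes concurrent callers safe: only the caller that flips reset.pending from 0 to 1 actually queues the work. A toy sketch with C11 atomics (illustrative names; a counter stands in for schedule_work()):

```c
#include <stdatomic.h>

static atomic_int reset_pending;	/* stands in for pfdev->reset.pending */
static int resets_scheduled;		/* stands in for schedule_work() calls */

static void schedule_reset(void)
{
	/* atomic_exchange() returns the previous value, so only the
	 * first caller sees 0 and queues the reset work; later callers
	 * see 1 and return without scheduling a second reset. */
	if (!atomic_exchange(&reset_pending, 1))
		resets_scheduled++;
}
```

The reset handler is then expected to clear the pending flag once the reset is done, re-arming the helper.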



[PATCH 01/10] drm/panfrost: Make sure MMU context lifetime is not bound to panfrost_priv

2021-06-21 Thread Boris Brezillon
Jobs can be in-flight when the file descriptor is closed (either because
the process did not terminate properly, or because it didn't wait for
all GPU jobs to be finished), and apparently panfrost_job_close() does
not cancel already running jobs. Let's refcount the MMU context object
so its lifetime is no longer bound to the FD lifetime and running jobs
can finish properly without generating spurious page faults.

Reported-by: Icecream95 
Fixes: 7282f7645d06 ("drm/panfrost: Implement per FD address spaces")
Cc: 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   8 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c|  50 ++-
 drivers/gpu/drm/panfrost/panfrost_gem.c|  20 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c|   4 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 160 ++---
 drivers/gpu/drm/panfrost/panfrost_mmu.h|   5 +-
 6 files changed, 136 insertions(+), 111 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index f614e98771e4..8b2cdb8c701d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -121,8 +121,12 @@ struct panfrost_device {
 };
 
 struct panfrost_mmu {
+   struct panfrost_device *pfdev;
+   struct kref refcount;
struct io_pgtable_cfg pgtbl_cfg;
struct io_pgtable_ops *pgtbl_ops;
+   struct drm_mm mm;
+   spinlock_t mm_lock;
int as;
atomic_t as_count;
struct list_head list;
@@ -133,9 +137,7 @@ struct panfrost_file_priv {
 
struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
 
-   struct panfrost_mmu mmu;
-   struct drm_mm mm;
-   spinlock_t mm_lock;
+   struct panfrost_mmu *mmu;
 };
 
static inline struct panfrost_device *to_panfrost_device(struct drm_device *ddev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 075ec0ef746c..945133db1857 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -417,7 +417,7 @@ static int panfrost_ioctl_madvise(struct drm_device *dev, void *data,
 * anyway, so let's not bother.
 */
if (!list_is_singular(&bo->mappings.list) ||
-   WARN_ON_ONCE(first->mmu != &priv->mmu)) {
+   WARN_ON_ONCE(first->mmu != priv->mmu)) {
ret = -EINVAL;
goto out_unlock_mappings;
}
@@ -449,32 +449,6 @@ int panfrost_unstable_ioctl_check(void)
return 0;
 }
 
-#define PFN_4G (SZ_4G >> PAGE_SHIFT)
-#define PFN_4G_MASK(PFN_4G - 1)
-#define PFN_16M(SZ_16M >> PAGE_SHIFT)
-
-static void panfrost_drm_mm_color_adjust(const struct drm_mm_node *node,
-unsigned long color,
-u64 *start, u64 *end)
-{
-   /* Executable buffers can't start or end on a 4GB boundary */
-   if (!(color & PANFROST_BO_NOEXEC)) {
-   u64 next_seg;
-
-   if ((*start & PFN_4G_MASK) == 0)
-   (*start)++;
-
-   if ((*end & PFN_4G_MASK) == 0)
-   (*end)--;
-
-   next_seg = ALIGN(*start, PFN_4G);
-   if (next_seg - *start <= PFN_16M)
-   *start = next_seg + 1;
-
-   *end = min(*end, ALIGN(*start, PFN_4G) - 1);
-   }
-}
-
 static int
 panfrost_open(struct drm_device *dev, struct drm_file *file)
 {
@@ -489,15 +489,11 @@ panfrost_open(struct drm_device *dev, struct drm_file *file)
panfrost_priv->pfdev = pfdev;
file->driver_priv = panfrost_priv;
 
-   spin_lock_init(&panfrost_priv->mm_lock);
-
-   /* 4G enough for now. can be 48-bit */
-   drm_mm_init(&panfrost_priv->mm, SZ_32M >> PAGE_SHIFT, (SZ_4G - SZ_32M) >> PAGE_SHIFT);
-   panfrost_priv->mm.color_adjust = panfrost_drm_mm_color_adjust;
-
-   ret = panfrost_mmu_pgtable_alloc(panfrost_priv);
-   if (ret)
-   goto err_pgtable;
+   panfrost_priv->mmu = panfrost_mmu_ctx_create(pfdev);
+   if (IS_ERR(panfrost_priv->mmu)) {
+   ret = PTR_ERR(panfrost_priv->mmu);
+   goto err_free;
+   }
 
ret = panfrost_job_open(panfrost_priv);
if (ret)
@@ -506,9 +476,8 @@ panfrost_open(struct drm_device *dev, struct drm_file *file)
return 0;
 
 err_job:
-   panfrost_mmu_pgtable_free(panfrost_priv);
-err_pgtable:
-   drm_mm_takedown(&panfrost_priv->mm);
+   panfrost_mmu_ctx_put(panfrost_priv->mmu);
+err_free:
kfree(panfrost_priv);
return ret;
 }
@@ -521,8 +490,7 @@ panfrost_postclose(struct drm_device *dev, struct drm_file *file)
panfrost_perfcnt_close(file)
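The refcounted lifetime change above can be pictured with a toy refcount: the MMU context now outlives the FD as long as a job still holds a reference, and is only destroyed when the last reference is dropped (names are illustrative, not the driver's actual API):

```c
#include <stdlib.h>

/* Toy analogue of the refcounted panfrost_mmu context. */
struct mmu_ctx {
	int refcount;
	int *freed;	/* records destruction for the example */
};

static struct mmu_ctx *mmu_ctx_create(int *freed)
{
	struct mmu_ctx *ctx = calloc(1, sizeof(*ctx));

	ctx->refcount = 1;	/* reference held by the file descriptor */
	ctx->freed = freed;
	return ctx;
}

static struct mmu_ctx *mmu_ctx_get(struct mmu_ctx *ctx)
{
	ctx->refcount++;	/* each in-flight job takes a reference */
	return ctx;
}

static void mmu_ctx_put(struct mmu_ctx *ctx)
{
	if (--ctx->refcount == 0) {
		*ctx->freed = 1;	/* page tables torn down here */
		free(ctx);
	}
}
```

In the real driver the count is a struct kref and get/put happen when jobs are pushed and completed, so a job that outlives its FD keeps valid page tables and no spurious page faults are generated.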

[PATCH 02/10] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition

2021-06-21 Thread Boris Brezillon
Exception types will be defined as an enum in panfrost_drm.h so userspace
can use the same definitions if needed.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
index dc9df5457f1c..1940ff86e49a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -262,9 +262,6 @@
 #define JS_COMMAND_SOFT_STOP_1 0x06 /* Execute SOFT_STOP if JOB_CHAIN_FLAG is 1 */
 #define JS_COMMAND_HARD_STOP_1 0x07 /* Execute HARD_STOP if JOB_CHAIN_FLAG is 1 */
 
-#define JS_STATUS_EVENT_ACTIVE 0x08
-
-
 /* MMU regs */
 #define MMU_INT_RAWSTAT0x2000
 #define MMU_INT_CLEAR  0x2004
-- 
2.31.1



[PATCH 00/10] drm/panfrost: Misc fixes/improvements

2021-06-21 Thread Boris Brezillon
Hello,

Here is a collection of patches addressing some stability issues.

The first patch was submitted a while ago but was lacking a way
to kill in-flight jobs when a context is closed; this is now addressed
in patch 10.

The rest of those patches are improving fault handling (with some code
refactoring in the middle).

"drm/panfrost: Do the exception -> string translation using a table"
still has a TODO. I basically mapped all exception types to
EINVAL since most faults are triggered by invalid jobs/shaders, but
some exceptions might need to be translated to something else. Any
feedback on that aspect is welcome.
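For reference, a minimal sketch of the table approach described above, with unknown codes defaulting to EINVAL (hypothetical names and entries, not the series' actual code):

```c
#include <errno.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct exception_info {
	const char *name;
	int error;
};

/* Sparse table indexed by exception code; codes with no entry fall
 * back to EINVAL, matching the "map everything to EINVAL" starting
 * point mentioned in the cover letter. */
static const struct exception_info exception_infos[] = {
	[0x00] = { "OK", 0 },
	[0x40] = { "JOB_CONFIG_FAULT", -EINVAL },
	[0x60] = { "OOM", -ENOMEM },	/* one candidate for a non-EINVAL code */
};

static int exception_to_error(unsigned int code)
{
	if (code >= ARRAY_SIZE(exception_infos) || !exception_infos[code].name)
		return -EINVAL;
	return exception_infos[code].error;
}
```

The same table can back the exception-name lookup, so the string translation and the errno translation stay in sync.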

Regards,

Boris

Boris Brezillon (10):
  drm/panfrost: Make sure MMU context lifetime is not bound to
panfrost_priv
  drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  drm/panfrost: Drop the pfdev argument passed to
panfrost_exception_name()
  drm/panfrost: Expose exception types to userspace
  drm/panfrost: Disable the AS on unhandled page faults
  drm/panfrost: Expose a helper to trigger a GPU reset
  drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  drm/panfrost: Do the exception -> string translation using a table
  drm/panfrost: Don't reset the GPU on job faults unless we really have
to
  drm/panfrost: Kill in-flight jobs on FD close

 drivers/gpu/drm/panfrost/panfrost_device.c | 143 +++--
 drivers/gpu/drm/panfrost/panfrost_device.h |  21 ++-
 drivers/gpu/drm/panfrost/panfrost_drv.c|  50 ++
 drivers/gpu/drm/panfrost/panfrost_gem.c|  20 ++-
 drivers/gpu/drm/panfrost/panfrost_gpu.c|   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c|  74 ++---
 drivers/gpu/drm/panfrost/panfrost_mmu.c| 173 ++---
 drivers/gpu/drm/panfrost/panfrost_mmu.h|   5 +-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |   3 -
 include/uapi/drm/panfrost_drm.h|  65 
 10 files changed, 371 insertions(+), 185 deletions(-)

-- 
2.31.1



[PATCH] drm/panfrost: Fix the panfrost_mmu_map_fault_addr() error path

2021-05-21 Thread Boris Brezillon
Make sure all bo->base.pages entries are either NULL or pointing to a
valid page before calling drm_gem_shmem_put_pages().

Reported-by: Tomeu Vizoso 
Cc: 
Fixes: 187d2929206e ("drm/panfrost: Add support for GPU heap allocations")
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 569509c2ba27..d76dff201ea6 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -460,6 +460,7 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
if (IS_ERR(pages[i])) {
mutex_unlock(&bo->base.pages_lock);
ret = PTR_ERR(pages[i]);
+   pages[i] = NULL;
goto err_pages;
}
}
-- 
2.31.1



Re: [RFC PATCH 0/7] drm/panfrost: Add a new submit ioctl

2021-03-12 Thread Boris Brezillon
On Fri, 12 Mar 2021 19:25:13 +0100
Boris Brezillon  wrote:

> > So where does this leave us?  Well, it depends on your submit model
> > and exactly how you handle pipeline barriers that sync between
> > engines.  If you're taking option 3 above and doing two command
> > buffers for each VkCommandBuffer, then you probably want two
> > serialized timelines, one for each engine, and some mechanism to tell
> > the kernel driver "these two command buffers have to run in parallel"
> > so that your ping-pong works.  If you're doing 1 or 2 above, I think
> > you probably still want two simple syncobjs, one for each engine.  You
> > don't really have any need to go all that far back in history.  All
> > you really need to describe is "command buffer X depends on previous
> > compute work" or "command buffer X depends on previous binning work".  
> 
> Okay, so this will effectively force in-order execution. Let's take your
> previous example and add 2 more jobs at the end that have no deps on
> previous commands:
> 
> vkBeginRenderPass() /* Writes to ImageA */
> vkCmdDraw()
> vkCmdDraw()
> ...
> vkEndRenderPass()
> vkPipelineBarrier(imageA /* fragment -> compute */)
> vkCmdDispatch() /* reads imageA, writes BufferB */
> vkBeginRenderPass() /* Writes to ImageC */
> vkCmdBindVertexBuffers(bufferB)
> vkCmdDraw();
> ...
> vkEndRenderPass()
> vkBeginRenderPass() /* Writes to ImageD */
> vkCmdDraw()
> ...
> vkEndRenderPass()
> 
> A: Vertex for the first draw on the compute engine
> B: Vertex for the first draw on the compute engine
> C: Fragment for the first draw on the binning engine; depends on A
> D: Fragment for the second draw on the binning engine; depends on B
> E: Compute on the compute engine; depends on C and D
> F: Vertex for the third draw on the compute engine; depends on E
> G: Fragment for the third draw on the binning engine; depends on F
> H: Vertex for the fourth draw on the compute engine
> I: Fragment for the fourth draw on the binning engine
> 
> When we reach E, we might be waiting for D to finish before scheduling
> the job, and because of the implicit serialization we have on the
> compute queue (F implicitly depends on E, and H on F) we can't schedule
> H either, which could, in theory be started. I guess that's where the
> term submission order is a bit unclear to me. The action of starting a
> job sounds like execution order to me (the order you starts jobs
> determines the execution order since we only have one HW queue per job
> type). All implicit deps have been calculated when we queued the job to
> the SW queue, and I thought that would be enough to meet the submission
> order requirements, but I might be wrong.
> 
> The PoC I have was trying to get rid of this explicit serialization on
> the compute and fragment queues by having one syncobj timeline
> (queue()) and synchronization points (Sx).
> 
> S0: in-fences=, out-fences= #waitSemaphore 
> sync point
> A: in-fences=, out-fences=
> B: in-fences=, out-fences=
> C: in-fences=, out-fence= #implicit dep on A through 
> the tiler context
> D: in-fences=, out-fence= #implicit dep on B through 
> the tiler context
> E: in-fences=, out-fence= #implicit dep on D through 
> imageA
> F: in-fences=, out-fence= #implicit dep on E through 
> buffer B
> G: in-fences=, out-fence= #implicit dep on F through 
> the tiler context
> H: in-fences=, out-fence=
> I: in-fences=, out-fence= #implicit dep on H through 
> the tiler buffer
> S1: in-fences=, out-fences= 
> #signalSemaphore,fence sync point
> # QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last 
> point
> 
> With this solution H can be started before E if the compute slot
> is empty and E's implicit deps are not done. It's probably overkill,
> but I thought maximizing GPU utilization was important.

Nevermind, I forgot the drm scheduler was dequeuing jobs in order, so 2
syncobjs (one per queue type) is indeed the right approach.

