[PATCH v3 2/5] drm/panthor: Make sure the tiler initial/max chunks are consistent

2024-05-02 Thread Boris Brezillon
It doesn't make sense to have a maximum number of chunks smaller than
the initial number of chunks attached to the context.

Fix the uAPI header to reflect the new constraint, and mention the
undocumented "initial_chunk_count > 0" constraint while at it.

v3:
- Add R-b

v2:
- Fix the check

Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
Signed-off-by: Boris Brezillon 
Reviewed-by: Liviu Dudau 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 3 +++
 include/uapi/drm/panthor_drm.h | 8 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index c3c0ba744937..3be86ec383d6 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -281,6 +281,9 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
if (initial_chunk_count == 0)
return -EINVAL;
 
+   if (initial_chunk_count > max_chunks)
+   return -EINVAL;
+
if (hweight32(chunk_size) != 1 ||
chunk_size < SZ_256K || chunk_size > SZ_2M)
return -EINVAL;
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index dadb05ab1235..5db80a0682d5 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -895,13 +895,17 @@ struct drm_panthor_tiler_heap_create {
/** @vm_id: VM ID the tiler heap should be mapped to */
__u32 vm_id;
 
-   /** @initial_chunk_count: Initial number of chunks to allocate. */
+   /** @initial_chunk_count: Initial number of chunks to allocate. Must be at least one. */
__u32 initial_chunk_count;
 
/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
__u32 chunk_size;
 
-   /** @max_chunks: Maximum number of chunks that can be allocated. */
+   /**
+* @max_chunks: Maximum number of chunks that can be allocated.
+*
+* Must be at least @initial_chunk_count.
+*/
__u32 max_chunks;
 
/**
-- 
2.44.0



[PATCH v3 3/5] drm/panthor: Relax the constraints on the tiler chunk size

2024-05-02 Thread Boris Brezillon
The field used to store the chunk size is 12 bits wide, and the encoding
is chunk_size = chunk_header.chunk_size << 12, which gives us a
theoretical [4k:8M] range. This range is further limited by
implementation constraints, and all known implementations seem to
impose a [128k:8M] range, so do the same here.

We also relax the power-of-two constraint, which doesn't seem to
exist on v10. This will allow userspace to fine-tune initial/max
tiler memory on memory-constrained devices.

v3:
- Add R-bs
- Fix valid range in the kerneldoc

v2:
- Turn the power-of-two constraint into a page-aligned constraint to allow
  fine-tune of the initial/max heap memory size
- Fix the panthor_heap_create() kerneldoc

Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
Signed-off-by: Boris Brezillon 
Reviewed-by: Liviu Dudau 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 8 
 include/uapi/drm/panthor_drm.h | 6 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index 3be86ec383d6..683bb94761bc 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -253,8 +253,8 @@ int panthor_heap_destroy(struct panthor_heap_pool *pool, u32 handle)
  * @pool: Pool to instantiate the heap context from.
  * @initial_chunk_count: Number of chunk allocated at initialization time.
  * Must be at least 1.
- * @chunk_size: The size of each chunk. Must be a power of two between 256k
- * and 2M.
+ * @chunk_size: The size of each chunk. Must be page-aligned and lie in the
+ * [128k:2M] range.
  * @max_chunks: Maximum number of chunks that can be allocated.
  * @target_in_flight: Maximum number of in-flight render passes.
  * @heap_ctx_gpu_va: Pointer holding the GPU address of the allocated heap
@@ -284,8 +284,8 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
if (initial_chunk_count > max_chunks)
return -EINVAL;
 
-   if (hweight32(chunk_size) != 1 ||
-   chunk_size < SZ_256K || chunk_size > SZ_2M)
+   if (!IS_ALIGNED(chunk_size, PAGE_SIZE) ||
+   chunk_size < SZ_128K || chunk_size > SZ_8M)
return -EINVAL;
 
down_read(&pool->lock);
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 5db80a0682d5..b8220d2e698f 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -898,7 +898,11 @@ struct drm_panthor_tiler_heap_create {
/** @initial_chunk_count: Initial number of chunks to allocate. Must be at least one. */
__u32 initial_chunk_count;
 
-   /** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
+   /**
+* @chunk_size: Chunk size.
+*
+* Must be page-aligned and lie in the [128k:8M] range.
+*/
__u32 chunk_size;
 
/**
-- 
2.44.0



[PATCH v3 4/5] drm/panthor: Fix an off-by-one in the heap context retrieval logic

2024-05-02 Thread Boris Brezillon
The heap ID is used to index the heap context pool, and allocating
in the [1:MAX_HEAPS_PER_POOL] range leads to an off-by-one. This was
originally to avoid returning a zero heap handle, but given the handle
is formed with (vm_id << 16) | heap_id, with vm_id > 0, we already can't
end up with a valid heap handle that's zero.

v3:
- Allocate in the [0:MAX_HEAPS_PER_POOL-1] range

v2:
- New patch

Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
Reported-by: Eric Smith 
Signed-off-by: Boris Brezillon 
Tested-by: Eric Smith 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index 683bb94761bc..252332f5390f 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -323,7 +323,8 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
if (!pool->vm) {
ret = -EINVAL;
} else {
-   ret = xa_alloc(&pool->xa, &id, heap, XA_LIMIT(1, MAX_HEAPS_PER_POOL), GFP_KERNEL);
+   ret = xa_alloc(&pool->xa, &id, heap,
+  XA_LIMIT(0, MAX_HEAPS_PER_POOL - 1), GFP_KERNEL);
if (!ret) {
void *gpu_ctx = panthor_get_heap_ctx(pool, id);
 
-- 
2.44.0



[PATCH v3 5/5] drm/panthor: Document drm_panthor_tiler_heap_destroy::handle validity constraints

2024-05-02 Thread Boris Brezillon
Make sure the user is aware that drm_panthor_tiler_heap_destroy::handle
must be a handle previously returned by
DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE.

Signed-off-by: Boris Brezillon 
---
 include/uapi/drm/panthor_drm.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index b8220d2e698f..aaed8e12ad0b 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -939,7 +939,11 @@ struct drm_panthor_tiler_heap_create {
 * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
  */
 struct drm_panthor_tiler_heap_destroy {
-   /** @handle: Handle of the tiler heap to destroy */
+   /**
+* @handle: Handle of the tiler heap to destroy.
+*
+* Must be a valid heap handle returned by DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE.
+*/
__u32 handle;
 
/** @pad: Padding field, MBZ. */
-- 
2.44.0



[PATCH v2] drm/panthor: Make sure we handle 'unknown group state' case properly

2024-05-02 Thread Boris Brezillon
When we check for state values returned by the FW, we only cover part of
the 0:7 range. Make sure we catch FW inconsistencies by adding a default
to the switch statement, and flagging the group state as unknown in that
case.

When an unknown state is detected, we trigger a reset, and consider the
group as unusable after that point, to prevent the potential corruption
from creeping in other places if we continue executing stuff on this
context.

v2:
- Add Steve's R-b
- Fix commit message

Reported-by: Dan Carpenter 
Suggested-by: Steven Price 
Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
Reviewed-by: Liviu Dudau 
---
 drivers/gpu/drm/panthor/panthor_sched.c | 37 +++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index b3a51a6de523..1a14ee30f774 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -490,6 +490,18 @@ enum panthor_group_state {
 * Can no longer be scheduled. The only allowed action is a destruction.
 */
PANTHOR_CS_GROUP_TERMINATED,
+
+   /**
+* @PANTHOR_CS_GROUP_UNKNOWN_STATE: Group is in an unknown state.
+*
+* The FW returned an inconsistent state. The group is flagged unusable
+* and can no longer be scheduled. The only allowed action is a
+* destruction.
+*
+* When that happens, we also schedule a FW reset, to start from a fresh
+* state.
+*/
+   PANTHOR_CS_GROUP_UNKNOWN_STATE,
 };
 
 /**
@@ -1127,6 +1139,7 @@ csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
struct panthor_fw_csg_iface *csg_iface;
struct panthor_group *group;
enum panthor_group_state new_state, old_state;
+   u32 csg_state;
 
lockdep_assert_held(&ptdev->scheduler->lock);
 
@@ -1137,7 +1150,8 @@ csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
return;
 
old_state = group->state;
-   switch (csg_iface->output->ack & CSG_STATE_MASK) {
+   csg_state = csg_iface->output->ack & CSG_STATE_MASK;
+   switch (csg_state) {
case CSG_STATE_START:
case CSG_STATE_RESUME:
new_state = PANTHOR_CS_GROUP_ACTIVE;
@@ -1148,11 +1162,28 @@ csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
case CSG_STATE_SUSPEND:
new_state = PANTHOR_CS_GROUP_SUSPENDED;
break;
+   default:
+   /* The unknown state might be caused by a FW state corruption,
+* which means the group metadata can't be trusted anymore, and
+* the SUSPEND operation might propagate the corruption to the
+* suspend buffers. Flag the group state as unknown to make
+* sure it's unusable after that point.
+*/
+   drm_err(&ptdev->base, "Invalid state on CSG %d (state=%d)",
+   csg_id, csg_state);
+   new_state = PANTHOR_CS_GROUP_UNKNOWN_STATE;
+   break;
}
 
if (old_state == new_state)
return;
 
+   /* The unknown state might be caused by a FW issue, reset the FW to
+* take a fresh start.
+*/
+   if (new_state == PANTHOR_CS_GROUP_UNKNOWN_STATE)
+   panthor_device_schedule_reset(ptdev);
+
if (new_state == PANTHOR_CS_GROUP_SUSPENDED)
csg_slot_sync_queues_state_locked(ptdev, csg_id);
 
@@ -1783,6 +1814,7 @@ static bool
 group_can_run(struct panthor_group *group)
 {
return group->state != PANTHOR_CS_GROUP_TERMINATED &&
+  group->state != PANTHOR_CS_GROUP_UNKNOWN_STATE &&
   !group->destroyed && group->fatal_queues == 0 &&
   !group->timedout;
 }
@@ -2557,7 +2589,8 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
 
if (csg_slot->group) {
csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, i,
-   CSG_STATE_SUSPEND,
+   group_can_run(csg_slot->group) ?
+   CSG_STATE_SUSPEND : CSG_STATE_TERMINATE,
CSG_STATE_MASK);
}
}
-- 
2.44.0



Re: [PATCH] drm/panthor: Kill the faulty_slots variable in panthor_sched_suspend()

2024-05-02 Thread Boris Brezillon
On Thu, 25 Apr 2024 12:18:29 +0100
Steven Price  wrote:

> On 25/04/2024 11:39, Boris Brezillon wrote:
> > We can use upd_ctx.timedout_mask directly, and the faulty_slots update
> > in the flush_caches_failed situation is never used.
> > 
> > Suggested-by: Suggested-by: Steven Price   
> 
> I'm obviously too full of suggestions! ;)

Pushed to drm-misc-next-fixes, but I realize I forgot to drop the extra
Suggested-by. Oh well.

> 
> And you're doing a much better job of my todo list than I am!
> 
> > Signed-off-by: Boris Brezillon   
> 
> Reviewed-by: Steven Price 
> 
> > ---
> >  drivers/gpu/drm/panthor/panthor_sched.c | 10 +++---
> >  1 file changed, 3 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> > index fad4678ca4c8..fed28c16d5d1 100644
> > --- a/drivers/gpu/drm/panthor/panthor_sched.c
> > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> > @@ -2584,8 +2584,8 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> >  {
> > struct panthor_scheduler *sched = ptdev->scheduler;
> > struct panthor_csg_slots_upd_ctx upd_ctx;
> > -   u32 suspended_slots, faulty_slots;
> > struct panthor_group *group;
> > +   u32 suspended_slots;
> > u32 i;
> >  
> > mutex_lock(&sched->lock);
> > @@ -2605,10 +2605,9 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> >  
> > csgs_upd_ctx_apply_locked(ptdev, &upd_ctx);
> > suspended_slots &= ~upd_ctx.timedout_mask;
> > -   faulty_slots = upd_ctx.timedout_mask;
> >  
> > -   if (faulty_slots) {
> > -   u32 slot_mask = faulty_slots;
> > +   if (upd_ctx.timedout_mask) {
> > +   u32 slot_mask = upd_ctx.timedout_mask;
> >  
> > drm_err(&ptdev->base, "CSG suspend failed, escalating to termination");
> > csgs_upd_ctx_init(&upd_ctx);
> > @@ -2659,9 +2658,6 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
> >  
> > slot_mask &= ~BIT(csg_id);
> > }
> > -
> > -   if (flush_caches_failed)
> > -   faulty_slots |= suspended_slots;
> > }
> >  
> > for (i = 0; i < sched->csg_slot_count; i++) {  
> 



Re: [PATCH v2] drm/panthor: Make sure we handle 'unknown group state' case properly

2024-05-02 Thread Boris Brezillon
On Thu,  2 May 2024 17:52:48 +0200
Boris Brezillon  wrote:

> When we check for state values returned by the FW, we only cover part of
> the 0:7 range. Make sure we catch FW inconsistencies by adding a default
> to the switch statement, and flagging the group state as unknown in that
> case.
> 
> When an unknown state is detected, we trigger a reset, and consider the
> group as unusable after that point, to prevent the potential corruption
> from creeping in other places if we continue executing stuff on this
> context.
> 
> v2:
> - Add Steve's R-b
> - Fix commit message
> 
> Reported-by: Dan Carpenter 
> Suggested-by: Steven Price 
> Signed-off-by: Boris Brezillon 
> Reviewed-by: Steven Price 
> Reviewed-by: Liviu Dudau 

Queued to drm-misc-next-fixes.

> ---
>  drivers/gpu/drm/panthor/panthor_sched.c | 37 +++--
>  1 file changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
> index b3a51a6de523..1a14ee30f774 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -490,6 +490,18 @@ enum panthor_group_state {
>* Can no longer be scheduled. The only allowed action is a destruction.
>*/
>   PANTHOR_CS_GROUP_TERMINATED,
> +
> + /**
> +  * @PANTHOR_CS_GROUP_UNKNOWN_STATE: Group is in an unknown state.
> +  *
> +  * The FW returned an inconsistent state. The group is flagged unusable
> +  * and can no longer be scheduled. The only allowed action is a
> +  * destruction.
> +  *
> +  * When that happens, we also schedule a FW reset, to start from a fresh
> +  * state.
> +  */
> + PANTHOR_CS_GROUP_UNKNOWN_STATE,
>  };
>  
>  /**
> @@ -1127,6 +1139,7 @@ csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
>   struct panthor_fw_csg_iface *csg_iface;
>   struct panthor_group *group;
>   enum panthor_group_state new_state, old_state;
> + u32 csg_state;
>  
>   lockdep_assert_held(&ptdev->scheduler->lock);
>  
> @@ -1137,7 +1150,8 @@ csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
>   return;
>  
>   old_state = group->state;
> - switch (csg_iface->output->ack & CSG_STATE_MASK) {
> + csg_state = csg_iface->output->ack & CSG_STATE_MASK;
> + switch (csg_state) {
>   case CSG_STATE_START:
>   case CSG_STATE_RESUME:
>   new_state = PANTHOR_CS_GROUP_ACTIVE;
> @@ -1148,11 +1162,28 @@ csg_slot_sync_state_locked(struct panthor_device *ptdev, u32 csg_id)
>   case CSG_STATE_SUSPEND:
>   new_state = PANTHOR_CS_GROUP_SUSPENDED;
>   break;
> + default:
> + /* The unknown state might be caused by a FW state corruption,
> +  * which means the group metadata can't be trusted anymore, and
> +  * the SUSPEND operation might propagate the corruption to the
> +  * suspend buffers. Flag the group state as unknown to make
> +  * sure it's unusable after that point.
> +  */
> + drm_err(&ptdev->base, "Invalid state on CSG %d (state=%d)",
> + csg_id, csg_state);
> + new_state = PANTHOR_CS_GROUP_UNKNOWN_STATE;
> + break;
>   }
>  
>   if (old_state == new_state)
>   return;
>  
> + /* The unknown state might be caused by a FW issue, reset the FW to
> +  * take a fresh start.
> +  */
> + if (new_state == PANTHOR_CS_GROUP_UNKNOWN_STATE)
> + panthor_device_schedule_reset(ptdev);
> +
>   if (new_state == PANTHOR_CS_GROUP_SUSPENDED)
>   csg_slot_sync_queues_state_locked(ptdev, csg_id);
>  
> @@ -1783,6 +1814,7 @@ static bool
>  group_can_run(struct panthor_group *group)
>  {
>   return group->state != PANTHOR_CS_GROUP_TERMINATED &&
> +group->state != PANTHOR_CS_GROUP_UNKNOWN_STATE &&
>  !group->destroyed && group->fatal_queues == 0 &&
>  !group->timedout;
>  }
> @@ -2557,7 +2589,8 @@ void panthor_sched_suspend(struct panthor_device *ptdev)
>  
>   if (csg_slot->group) {
>   csgs_upd_ctx_queue_reqs(ptdev, &upd_ctx, i,
> - CSG_STATE_SUSPEND,
> + group_can_run(csg_slot->group) ?
> + CSG_STATE_SUSPEND : CSG_STATE_TERMINATE,
>   CSG_STATE_MASK);
>   }
>   }



Re: [PATCH v3 4/5] drm/panthor: Fix an off-by-one in the heap context retrieval logic

2024-05-02 Thread Boris Brezillon
On Thu, 2 May 2024 16:52:24 +0100
Steven Price  wrote:

> On 02/05/2024 16:40, Boris Brezillon wrote:
> > The heap ID is used to index the heap context pool, and allocating
> > in the [1:MAX_HEAPS_PER_POOL] range leads to an off-by-one. This was
> > originally to avoid returning a zero heap handle, but given the handle
> > is formed with (vm_id << 16) | heap_id, with vm_id > 0, we already can't
> > end up with a valid heap handle that's zero.
> > 
> > v3:
> > - Allocate in the [0:MAX_HEAPS_PER_POOL-1] range
> > 
> > v2:
> > - New patch
> > 
> > Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
> > Reported-by: Eric Smith 
> > Signed-off-by: Boris Brezillon 
> > Tested-by: Eric Smith   
> 
> Don't we also need to change the xa_init_flags() in
> panthor_heap_pool_create()?

Uh, we should, indeed.

> 
> Steve
> 
> > ---
> >  drivers/gpu/drm/panthor/panthor_heap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
> > index 683bb94761bc..252332f5390f 100644
> > --- a/drivers/gpu/drm/panthor/panthor_heap.c
> > +++ b/drivers/gpu/drm/panthor/panthor_heap.c
> > @@ -323,7 +323,8 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
> > if (!pool->vm) {
> > ret = -EINVAL;
> > } else {
> > -   ret = xa_alloc(&pool->xa, &id, heap, XA_LIMIT(1, MAX_HEAPS_PER_POOL), GFP_KERNEL);
> > +   ret = xa_alloc(&pool->xa, &id, heap,
> > +  XA_LIMIT(0, MAX_HEAPS_PER_POOL - 1), GFP_KERNEL);
> > if (!ret) {
> > void *gpu_ctx = panthor_get_heap_ctx(pool, id);
> >
> 



[PATCH v4 0/5] drm/panthor: Collection of tiler heap related fixes

2024-05-02 Thread Boris Brezillon
This is a collection of tiler heap fixes for bugs/oddities found while
looking at incremental rendering.

Ideally, we want to land those before 6.10 is released, so we don't need
to increment the driver version to reflect the ABI changes.

Changelog detailed in each commit.

Regards,

Boris

Antonino Maniscalco (1):
  drm/panthor: Fix tiler OOM handling to allow incremental rendering

Boris Brezillon (4):
  drm/panthor: Make sure the tiler initial/max chunks are consistent
  drm/panthor: Relax the constraints on the tiler chunk size
  drm/panthor: Fix an off-by-one in the heap context retrieval logic
  drm/panthor: Document drm_panthor_tiler_heap_destroy::handle validity
constraints

 drivers/gpu/drm/panthor/panthor_heap.c  | 28 -
 drivers/gpu/drm/panthor/panthor_sched.c |  7 ++-
 include/uapi/drm/panthor_drm.h  | 20 ++
 3 files changed, 40 insertions(+), 15 deletions(-)

-- 
2.44.0



[PATCH v4 2/5] drm/panthor: Make sure the tiler initial/max chunks are consistent

2024-05-02 Thread Boris Brezillon
It doesn't make sense to have a maximum number of chunks smaller than
the initial number of chunks attached to the context.

Fix the uAPI header to reflect the new constraint, and mention the
undocumented "initial_chunk_count > 0" constraint while at it.

v3:
- Add R-b

v2:
- Fix the check

Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
Signed-off-by: Boris Brezillon 
Reviewed-by: Liviu Dudau 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 3 +++
 include/uapi/drm/panthor_drm.h | 8 ++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index c3c0ba744937..3be86ec383d6 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -281,6 +281,9 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
if (initial_chunk_count == 0)
return -EINVAL;
 
+   if (initial_chunk_count > max_chunks)
+   return -EINVAL;
+
if (hweight32(chunk_size) != 1 ||
chunk_size < SZ_256K || chunk_size > SZ_2M)
return -EINVAL;
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index dadb05ab1235..5db80a0682d5 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -895,13 +895,17 @@ struct drm_panthor_tiler_heap_create {
/** @vm_id: VM ID the tiler heap should be mapped to */
__u32 vm_id;
 
-   /** @initial_chunk_count: Initial number of chunks to allocate. */
+   /** @initial_chunk_count: Initial number of chunks to allocate. Must be at least one. */
__u32 initial_chunk_count;
 
/** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
__u32 chunk_size;
 
-   /** @max_chunks: Maximum number of chunks that can be allocated. */
+   /**
+* @max_chunks: Maximum number of chunks that can be allocated.
+*
+* Must be at least @initial_chunk_count.
+*/
__u32 max_chunks;
 
/**
-- 
2.44.0



[PATCH v4 3/5] drm/panthor: Relax the constraints on the tiler chunk size

2024-05-02 Thread Boris Brezillon
The field used to store the chunk size is 12 bits wide, and the encoding
is chunk_size = chunk_header.chunk_size << 12, which gives us a
theoretical [4k:8M] range. This range is further limited by
implementation constraints, and all known implementations seem to
impose a [128k:8M] range, so do the same here.

We also relax the power-of-two constraint, which doesn't seem to
exist on v10. This will allow userspace to fine-tune initial/max
tiler memory on memory-constrained devices.

v4:
- Actually fix the range in the kerneldoc

v3:
- Add R-bs
- Fix valid range in the kerneldoc

v2:
- Turn the power-of-two constraint into a page-aligned constraint to allow
  fine-tune of the initial/max heap memory size
- Fix the panthor_heap_create() kerneldoc

Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
Signed-off-by: Boris Brezillon 
Reviewed-by: Liviu Dudau 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 8 
 include/uapi/drm/panthor_drm.h | 6 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index 3be86ec383d6..b0fc5b9ee847 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -253,8 +253,8 @@ int panthor_heap_destroy(struct panthor_heap_pool *pool, u32 handle)
  * @pool: Pool to instantiate the heap context from.
  * @initial_chunk_count: Number of chunk allocated at initialization time.
  * Must be at least 1.
- * @chunk_size: The size of each chunk. Must be a power of two between 256k
- * and 2M.
+ * @chunk_size: The size of each chunk. Must be page-aligned and lie in the
+ * [128k:8M] range.
  * @max_chunks: Maximum number of chunks that can be allocated.
  * @target_in_flight: Maximum number of in-flight render passes.
  * @heap_ctx_gpu_va: Pointer holding the GPU address of the allocated heap
@@ -284,8 +284,8 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
if (initial_chunk_count > max_chunks)
return -EINVAL;
 
-   if (hweight32(chunk_size) != 1 ||
-   chunk_size < SZ_256K || chunk_size > SZ_2M)
+   if (!IS_ALIGNED(chunk_size, PAGE_SIZE) ||
+   chunk_size < SZ_128K || chunk_size > SZ_8M)
return -EINVAL;
 
down_read(&pool->lock);
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index 5db80a0682d5..b8220d2e698f 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -898,7 +898,11 @@ struct drm_panthor_tiler_heap_create {
/** @initial_chunk_count: Initial number of chunks to allocate. Must be at least one. */
__u32 initial_chunk_count;
 
-   /** @chunk_size: Chunk size. Must be a power of two at least 256KB large. */
+   /**
+* @chunk_size: Chunk size.
+*
+* Must be page-aligned and lie in the [128k:8M] range.
+*/
__u32 chunk_size;
 
/**
-- 
2.44.0



[PATCH v4 4/5] drm/panthor: Fix an off-by-one in the heap context retrieval logic

2024-05-02 Thread Boris Brezillon
The heap ID is used to index the heap context pool, and allocating
in the [1:MAX_HEAPS_PER_POOL] range leads to an off-by-one. This was
originally to avoid returning a zero heap handle, but given the handle
is formed with (vm_id << 16) | heap_id, with vm_id > 0, we already can't
end up with a valid heap handle that's zero.

v4:
- s/XA_FLAGS_ALLOC1/XA_FLAGS_ALLOC/

v3:
- Allocate in the [0:MAX_HEAPS_PER_POOL-1] range

v2:
- New patch

Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
Reported-by: Eric Smith 
Signed-off-by: Boris Brezillon 
Tested-by: Eric Smith 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index b0fc5b9ee847..95a1c6c9f35e 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -323,7 +323,8 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
if (!pool->vm) {
ret = -EINVAL;
} else {
-   ret = xa_alloc(&pool->xa, &id, heap, XA_LIMIT(1, MAX_HEAPS_PER_POOL), GFP_KERNEL);
+   ret = xa_alloc(&pool->xa, &id, heap,
+  XA_LIMIT(0, MAX_HEAPS_PER_POOL - 1), GFP_KERNEL);
if (!ret) {
void *gpu_ctx = panthor_get_heap_ctx(pool, id);
 
@@ -543,7 +544,7 @@ panthor_heap_pool_create(struct panthor_device *ptdev, struct panthor_vm *vm)
pool->vm = vm;
pool->ptdev = ptdev;
init_rwsem(&pool->lock);
-   xa_init_flags(&pool->xa, XA_FLAGS_ALLOC1);
+   xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
kref_init(&pool->refcount);
 
pool->gpu_contexts = panthor_kernel_bo_create(ptdev, vm, bosize,
-- 
2.44.0



[PATCH v4 5/5] drm/panthor: Document drm_panthor_tiler_heap_destroy::handle validity constraints

2024-05-02 Thread Boris Brezillon
Make sure the user is aware that drm_panthor_tiler_heap_destroy::handle
must be a handle previously returned by
DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE.

v4:
- Add Steve's R-b

v3:
- New patch

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 include/uapi/drm/panthor_drm.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
index b8220d2e698f..aaed8e12ad0b 100644
--- a/include/uapi/drm/panthor_drm.h
+++ b/include/uapi/drm/panthor_drm.h
@@ -939,7 +939,11 @@ struct drm_panthor_tiler_heap_create {
 * struct drm_panthor_tiler_heap_destroy - Arguments passed to DRM_IOCTL_PANTHOR_TILER_HEAP_DESTROY
  */
 struct drm_panthor_tiler_heap_destroy {
-   /** @handle: Handle of the tiler heap to destroy */
+   /**
+* @handle: Handle of the tiler heap to destroy.
+*
+* Must be a valid heap handle returned by DRM_IOCTL_PANTHOR_TILER_HEAP_CREATE.
+*/
__u32 handle;
 
/** @pad: Padding field, MBZ. */
-- 
2.44.0



[PATCH v4 1/5] drm/panthor: Fix tiler OOM handling to allow incremental rendering

2024-05-02 Thread Boris Brezillon
From: Antonino Maniscalco 

If the kernel couldn't allocate memory because we reached the maximum
number of chunks but no render passes are in flight
(panthor_heap_grow() returning -ENOMEM), we should defer the OOM
handling to the FW by returning a NULL chunk. The FW will then call
the tiler OOM exception handler, which is supposed to implement
incremental rendering (execute an intermediate fragment job to flush
the pending primitives, release the tiler memory that was used to
store those primitives, and start over from where it stopped).

Instead of checking for both -ENOMEM and -EBUSY, make panthor_heap_grow()
return -ENOMEM no matter the reason for the allocation failure; the FW
doesn't care anyway.

v3:
- Add R-bs

v2:
- Make panthor_heap_grow() return -ENOMEM for all kind of allocation
  failures
- Document the panthor_heap_grow() semantics

Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Antonino Maniscalco 
Signed-off-by: Boris Brezillon 
Reviewed-by: Liviu Dudau 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panthor/panthor_heap.c  | 12 
 drivers/gpu/drm/panthor/panthor_sched.c |  7 ++-
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index 143fa35f2e74..c3c0ba744937 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -410,6 +410,13 @@ int panthor_heap_return_chunk(struct panthor_heap_pool *pool,
  * @renderpasses_in_flight: Number of render passes currently in-flight.
 * @pending_frag_count: Number of fragment jobs waiting for execution/completion.
  * @new_chunk_gpu_va: Pointer used to return the chunk VA.
+ *
+ * Return:
+ * - 0 if a new heap was allocated
+ * - -ENOMEM if the tiler context reached the maximum number of chunks
+ *   or if too many render passes are in-flight
+ *   or if the allocation failed
+ * - -EINVAL if any of the arguments passed to panthor_heap_grow() is invalid
  */
 int panthor_heap_grow(struct panthor_heap_pool *pool,
  u64 heap_gpu_va,
@@ -439,10 +446,7 @@ int panthor_heap_grow(struct panthor_heap_pool *pool,
 * handler provided by the userspace driver, if any).
 */
if (renderpasses_in_flight > heap->target_in_flight ||
-   (pending_frag_count > 0 && heap->chunk_count >= heap->max_chunks)) {
-   ret = -EBUSY;
-   goto out_unlock;
-   } else if (heap->chunk_count >= heap->max_chunks) {
+   heap->chunk_count >= heap->max_chunks) {
ret = -ENOMEM;
goto out_unlock;
}
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 7f16a4a14e9a..c126251c5ba7 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -1385,7 +1385,12 @@ static int group_process_tiler_oom(struct panthor_group *group, u32 cs_id)
pending_frag_count, &new_chunk_va);
}
 
-   if (ret && ret != -EBUSY) {
+   /* If the heap context doesn't have memory for us, we want to let the
+* FW try to reclaim memory by waiting for fragment jobs to land or by
+* executing the tiler OOM exception handler, which is supposed to
+* implement incremental rendering.
+*/
+   if (ret && ret != -ENOMEM) {
drm_warn(&ptdev->base, "Failed to extend the tiler heap\n");
group->fatal_queues |= BIT(cs_id);
sched_queue_delayed_work(sched, tick, 0);
-- 
2.44.0



Re: [PATCH v3 3/5] drm/panthor: Relax the constraints on the tiler chunk size

2024-05-02 Thread Boris Brezillon
On Thu, 2 May 2024 16:47:56 +0100
Steven Price  wrote:

> On 02/05/2024 16:40, Boris Brezillon wrote:
> > The field used to store the chunk size is 12 bits wide, and the encoding
> > is chunk_size = chunk_header.chunk_size << 12, which gives us a
> > theoretical [4k:8M] range. This range is further limited by
> > implementation constraints, and all known implementations seem to
> > impose a [128k:8M] range, so do the same here.
> > 
> > We also relax the power-of-two constraint, which doesn't seem to
> > exist on v10. This will allow userspace to fine-tune initial/max
> > tiler memory on memory-constrained devices.
> > 
> > v3:
> > - Add R-bs
> > - Fix valid range in the kerneldoc  
> 
> Sadly the fixed range didn't make it to this posting... ;)

My bad, I was checking the uAPI header and thought I had already fixed
it the other day. Should be good in v4.

> 
> Steve
> 
> > 
> > v2:
> > - Turn the power-of-two constraint into a page-aligned constraint to allow
> >   fine-tune of the initial/max heap memory size
> > - Fix the panthor_heap_create() kerneldoc
> > 
> > Fixes: 9cca48fa4f89 ("drm/panthor: Add the heap logical block")
> > Signed-off-by: Boris Brezillon 
> > Reviewed-by: Liviu Dudau 
> > Reviewed-by: Steven Price 
> > ---
> >  drivers/gpu/drm/panthor/panthor_heap.c | 8 
> >  include/uapi/drm/panthor_drm.h | 6 +-
> >  2 files changed, 9 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
> > index 3be86ec383d6..683bb94761bc 100644
> > --- a/drivers/gpu/drm/panthor/panthor_heap.c
> > +++ b/drivers/gpu/drm/panthor/panthor_heap.c
> > @@ -253,8 +253,8 @@ int panthor_heap_destroy(struct panthor_heap_pool 
> > *pool, u32 handle)
> >   * @pool: Pool to instantiate the heap context from.
> >   * @initial_chunk_count: Number of chunk allocated at initialization time.
> >   * Must be at least 1.
> > - * @chunk_size: The size of each chunk. Must be a power of two between 256k
> > - * and 2M.
> > + * @chunk_size: The size of each chunk. Must be page-aligned and lie in the
> > + * [128k:2M] range.
> >   * @max_chunks: Maximum number of chunks that can be allocated.
> >   * @target_in_flight: Maximum number of in-flight render passes.
> >   * @heap_ctx_gpu_va: Pointer holding the GPU address of the allocated heap
> > @@ -284,8 +284,8 @@ int panthor_heap_create(struct panthor_heap_pool *pool,
> > if (initial_chunk_count > max_chunks)
> > return -EINVAL;
> >  
> > -   if (hweight32(chunk_size) != 1 ||
> > -   chunk_size < SZ_256K || chunk_size > SZ_2M)
> > +   if (!IS_ALIGNED(chunk_size, PAGE_SIZE) ||
> > +   chunk_size < SZ_128K || chunk_size > SZ_8M)
> > return -EINVAL;
> >  
> > down_read(&pool->lock);
> > diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> > index 5db80a0682d5..b8220d2e698f 100644
> > --- a/include/uapi/drm/panthor_drm.h
> > +++ b/include/uapi/drm/panthor_drm.h
> > @@ -898,7 +898,11 @@ struct drm_panthor_tiler_heap_create {
> > /** @initial_chunk_count: Initial number of chunks to allocate. Must be 
> > at least one. */
> > __u32 initial_chunk_count;
> >  
> > -   /** @chunk_size: Chunk size. Must be a power of two at least 256KB 
> > large. */
> > +   /**
> > +* @chunk_size: Chunk size.
> > +*
> > +* Must be page-aligned and lie in the [128k:8M] range.
> > +*/
> > __u32 chunk_size;
> >  
> > /**  
> 
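The 12-bit encoding described in the quoted commit message can be illustrated with a standalone sketch (the helper names and the 4k page size are assumptions; only the shift-by-12 encoding and the [128k:8M] accepted range come from the thread):

```c
#include <stdint.h>

/* Hypothetical helpers illustrating the chunk-size encoding discussed
 * above: the chunk header field is 12 bits wide and the effective size
 * is chunk_size = field << 12. The validity check mirrors the
 * constraints added by the patch: page-aligned (assuming 4k pages) and
 * within [128k:8M]. */

#define CHUNK_SHIFT	12
#define SZ_128K		(128u * 1024)
#define SZ_8M		(8u * 1024 * 1024)
#define PAGE_SIZE_4K	4096u

static inline uint32_t chunk_size_to_field(uint32_t chunk_size)
{
	return chunk_size >> CHUNK_SHIFT;
}

static inline uint32_t chunk_field_to_size(uint32_t field)
{
	return field << CHUNK_SHIFT;
}

static inline int chunk_size_valid(uint32_t chunk_size)
{
	return (chunk_size % PAGE_SIZE_4K) == 0 &&
	       chunk_size >= SZ_128K && chunk_size <= SZ_8M;
}
```

For instance, an 8M chunk encodes to field value 2048, which fits comfortably in the 12-bit field.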



[PATCH 1/4] drm/panthor: Force an immediate reset on unrecoverable faults

2024-05-02 Thread Boris Brezillon
If the FW reports an unrecoverable fault, we need to reset the GPU
before we can start re-using it again.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_device.c |  1 +
 drivers/gpu/drm/panthor/panthor_device.h |  1 +
 drivers/gpu/drm/panthor/panthor_sched.c  | 11 ++-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 75276cbeba20..4c5b54e7abb7 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -293,6 +293,7 @@ static const struct panthor_exception_info panthor_exception_infos[] = {
PANTHOR_EXCEPTION(ACTIVE),
PANTHOR_EXCEPTION(CS_RES_TERM),
PANTHOR_EXCEPTION(CS_CONFIG_FAULT),
+   PANTHOR_EXCEPTION(CS_UNRECOVERABLE),
PANTHOR_EXCEPTION(CS_ENDPOINT_FAULT),
PANTHOR_EXCEPTION(CS_BUS_FAULT),
PANTHOR_EXCEPTION(CS_INSTR_INVALID),
diff --git a/drivers/gpu/drm/panthor/panthor_device.h b/drivers/gpu/drm/panthor/panthor_device.h
index 2fdd671b38fd..e388c0472ba7 100644
--- a/drivers/gpu/drm/panthor/panthor_device.h
+++ b/drivers/gpu/drm/panthor/panthor_device.h
@@ -216,6 +216,7 @@ enum drm_panthor_exception_type {
DRM_PANTHOR_EXCEPTION_CS_RES_TERM = 0x0f,
DRM_PANTHOR_EXCEPTION_MAX_NON_FAULT = 0x3f,
DRM_PANTHOR_EXCEPTION_CS_CONFIG_FAULT = 0x40,
+   DRM_PANTHOR_EXCEPTION_CS_UNRECOVERABLE = 0x41,
DRM_PANTHOR_EXCEPTION_CS_ENDPOINT_FAULT = 0x44,
DRM_PANTHOR_EXCEPTION_CS_BUS_FAULT = 0x48,
DRM_PANTHOR_EXCEPTION_CS_INSTR_INVALID = 0x49,
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 7f16a4a14e9a..1d2708c3ab0a 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -1281,7 +1281,16 @@ cs_slot_process_fatal_event_locked(struct panthor_device *ptdev,
if (group)
group->fatal_queues |= BIT(cs_id);
 
-   sched_queue_delayed_work(sched, tick, 0);
+   if (CS_EXCEPTION_TYPE(fatal) == DRM_PANTHOR_EXCEPTION_CS_UNRECOVERABLE) {
+   /* If this exception is unrecoverable, queue a reset, and make
+* sure we stop scheduling groups until the reset has happened.
+*/
+   panthor_device_schedule_reset(ptdev);
+   cancel_delayed_work(&sched->tick_work);
+   } else {
+   sched_queue_delayed_work(sched, tick, 0);
+   }
+
drm_warn(&ptdev->base,
 "CSG slot %d CS slot: %d\n"
 "CS_FATAL.EXCEPTION_TYPE: 0x%x (%s)\n"
-- 
2.44.0



[PATCH 3/4] drm/panthor: Reset the FW VM to NULL on unplug

2024-05-02 Thread Boris Brezillon
This way we get NULL derefs instead of use-after-free if the FW VM is
referenced after the device has been unplugged.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_fw.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index b41685304a83..93165961a6b5 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -1141,6 +1141,7 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
 * state to keep the active_refcnt balanced.
 */
panthor_vm_put(ptdev->fw->vm);
+   ptdev->fw->vm = NULL;
 
panthor_gpu_power_off(ptdev, L2, ptdev->gpu_info.l2_present, 2);
 }
-- 
2.44.0
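The one-liner above is an instance of a generic defensive pattern that can be sketched outside the driver: drop the last reference, then poison the cached pointer so late misuse faults on NULL instead of touching freed memory. A minimal illustration (all names are hypothetical, with a plain refcount standing in for panthor_vm_put()):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch of the "reset to NULL on unplug" pattern: after
 * the last reference is dropped, clear the cached pointer so a buggy
 * late access becomes a deterministic NULL deref rather than a silent
 * use-after-free. */

struct vm {
	int refcount;
};

static struct vm *vm_create(void)
{
	struct vm *vm = calloc(1, sizeof(*vm));

	if (vm)
		vm->refcount = 1;
	return vm;
}

static void vm_put(struct vm *vm)
{
	if (--vm->refcount == 0)
		free(vm);
}

struct fw {
	struct vm *vm;
};

static void fw_unplug(struct fw *fw)
{
	vm_put(fw->vm);
	fw->vm = NULL; /* late users now trip on NULL, not freed memory */
}
```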



[PATCH 0/4] drm/panthor: More reset fixes

2024-05-02 Thread Boris Brezillon
Hello,

This is a collection of fixes for bugs found while chasing an
unrecoverable fault leading to a device unplug (because of some
other bugs that were introduced in my local dev branch).

The first patch makes sure we immediately reset the GPU on an
unrecoverable fault, and the following patches fix various
NULL/invalid pointer derefs caused by use-after-free situations
following a device unplug.

Regards,

Boris

Boris Brezillon (4):
  drm/panthor: Force an immediate reset on unrecoverable faults
  drm/panthor: Keep a ref to the VM at the panthor_kernel_bo level
  drm/panthor: Reset the FW VM to NULL on unplug
  drm/panthor: Call panthor_sched_post_reset() even if the reset failed

 drivers/gpu/drm/panthor/panthor_device.c |  8 ++---
 drivers/gpu/drm/panthor/panthor_device.h |  1 +
 drivers/gpu/drm/panthor/panthor_fw.c |  5 +--
 drivers/gpu/drm/panthor/panthor_gem.c|  8 +++--
 drivers/gpu/drm/panthor/panthor_gem.h|  8 +++--
 drivers/gpu/drm/panthor/panthor_heap.c   |  8 ++---
 drivers/gpu/drm/panthor/panthor_sched.c  | 40 +---
 drivers/gpu/drm/panthor/panthor_sched.h  |  2 +-
 8 files changed, 51 insertions(+), 29 deletions(-)

-- 
2.44.0



[PATCH 2/4] drm/panthor: Keep a ref to the VM at the panthor_kernel_bo level

2024-05-02 Thread Boris Brezillon
Avoids use-after-free situations when panthor_fw_unplug() is called
and the kernel BO was mapped to the FW VM.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_fw.c|  4 ++--
 drivers/gpu/drm/panthor/panthor_gem.c   |  8 +---
 drivers/gpu/drm/panthor/panthor_gem.h   |  8 ++--
 drivers/gpu/drm/panthor/panthor_heap.c  |  8 
 drivers/gpu/drm/panthor/panthor_sched.c | 11 +--
 5 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
index 181395e2859a..b41685304a83 100644
--- a/drivers/gpu/drm/panthor/panthor_fw.c
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -453,7 +453,7 @@ panthor_fw_alloc_queue_iface_mem(struct panthor_device 
*ptdev,
 
ret = panthor_kernel_bo_vmap(mem);
if (ret) {
-   panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), mem);
+   panthor_kernel_bo_destroy(mem);
return ERR_PTR(ret);
}
 
@@ -1133,7 +1133,7 @@ void panthor_fw_unplug(struct panthor_device *ptdev)
panthor_fw_stop(ptdev);
 
list_for_each_entry(section, &ptdev->fw->sections, node)
-   panthor_kernel_bo_destroy(panthor_fw_vm(ptdev), section->mem);
+   panthor_kernel_bo_destroy(section->mem);
 
/* We intentionally don't call panthor_vm_idle() and let
 * panthor_mmu_unplug() release the AS we acquired with
diff --git a/drivers/gpu/drm/panthor/panthor_gem.c b/drivers/gpu/drm/panthor/panthor_gem.c
index d6483266d0c2..38f560864879 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.c
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -26,18 +26,18 @@ static void panthor_gem_free_object(struct drm_gem_object 
*obj)
 
 /**
  * panthor_kernel_bo_destroy() - Destroy a kernel buffer object
- * @vm: The VM this BO was mapped to.
  * @bo: Kernel buffer object to destroy. If NULL or an ERR_PTR(), the 
destruction
  * is skipped.
  */
-void panthor_kernel_bo_destroy(struct panthor_vm *vm,
-  struct panthor_kernel_bo *bo)
+void panthor_kernel_bo_destroy(struct panthor_kernel_bo *bo)
 {
+   struct panthor_vm *vm;
int ret;
 
if (IS_ERR_OR_NULL(bo))
return;
 
+   vm = bo->vm;
panthor_kernel_bo_vunmap(bo);
 
if (drm_WARN_ON(bo->obj->dev,
@@ -53,6 +53,7 @@ void panthor_kernel_bo_destroy(struct panthor_vm *vm,
drm_gem_object_put(bo->obj);
 
 out_free_bo:
+   panthor_vm_put(vm);
kfree(bo);
 }
 
@@ -106,6 +107,7 @@ panthor_kernel_bo_create(struct panthor_device *ptdev, 
struct panthor_vm *vm,
if (ret)
goto err_free_va;
 
+   kbo->vm = panthor_vm_get(vm);
bo->exclusive_vm_root_gem = panthor_vm_root_gem(vm);
drm_gem_object_get(bo->exclusive_vm_root_gem);
bo->base.base.resv = bo->exclusive_vm_root_gem->resv;
diff --git a/drivers/gpu/drm/panthor/panthor_gem.h b/drivers/gpu/drm/panthor/panthor_gem.h
index 3bccba394d00..e43021cf6d45 100644
--- a/drivers/gpu/drm/panthor/panthor_gem.h
+++ b/drivers/gpu/drm/panthor/panthor_gem.h
@@ -61,6 +61,11 @@ struct panthor_kernel_bo {
 */
struct drm_gem_object *obj;
 
+   /**
+* @vm: VM this private buffer is attached to.
+*/
+   struct panthor_vm *vm;
+
/**
 * @va_node: VA space allocated to this GEM.
 */
@@ -136,7 +141,6 @@ panthor_kernel_bo_create(struct panthor_device *ptdev, 
struct panthor_vm *vm,
 size_t size, u32 bo_flags, u32 vm_map_flags,
 u64 gpu_va);
 
-void panthor_kernel_bo_destroy(struct panthor_vm *vm,
-  struct panthor_kernel_bo *bo);
+void panthor_kernel_bo_destroy(struct panthor_kernel_bo *bo);
 
 #endif /* __PANTHOR_GEM_H__ */
diff --git a/drivers/gpu/drm/panthor/panthor_heap.c b/drivers/gpu/drm/panthor/panthor_heap.c
index 143fa35f2e74..65921296a18c 100644
--- a/drivers/gpu/drm/panthor/panthor_heap.c
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -127,7 +127,7 @@ static void panthor_free_heap_chunk(struct panthor_vm *vm,
heap->chunk_count--;
mutex_unlock(&heap->lock);
 
-   panthor_kernel_bo_destroy(vm, chunk->bo);
+   panthor_kernel_bo_destroy(chunk->bo);
kfree(chunk);
 }
 
@@ -183,7 +183,7 @@ static int panthor_alloc_heap_chunk(struct panthor_device 
*ptdev,
return 0;
 
 err_destroy_bo:
-   panthor_kernel_bo_destroy(vm, chunk->bo);
+   panthor_kernel_bo_destroy(chunk->bo);
 
 err_free_chunk:
kfree(chunk);
@@ -391,7 +391,7 @@ int panthor_heap_return_chunk(struct panthor_heap_pool 
*pool,
mutex_unlock(&heap->lock);
 
if (removed) {
-   panthor_kernel_bo_destroy(pool->vm, chunk->bo);
+   panthor_kernel_bo_destroy(chunk->bo);
kfree(chunk);
ret = 0;
} 
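The lifetime change in this patch boils down to: take a reference on the VM when the kernel BO is created, stash it in the BO, and drop it at destroy time, so destruction no longer depends on the caller passing a (possibly stale) VM pointer. A hedged, self-contained sketch of the same ownership pattern (all names are made up):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical sketch of the ownership change in patch 2/4: the BO
 * holds its own reference on the VM from create to destroy, so
 * destruction no longer needs the caller to still have a valid VM
 * pointer at hand. */

struct vm {
	int refcount;
};

static struct vm *vm_get(struct vm *vm)
{
	vm->refcount++;
	return vm;
}

static void vm_put(struct vm *vm)
{
	if (--vm->refcount == 0)
		free(vm);
}

struct kernel_bo {
	struct vm *vm;
};

static struct kernel_bo *kernel_bo_create(struct vm *vm)
{
	struct kernel_bo *bo = calloc(1, sizeof(*bo));

	if (bo)
		bo->vm = vm_get(vm); /* pin the VM for the BO's lifetime */
	return bo;
}

static void kernel_bo_destroy(struct kernel_bo *bo)
{
	if (!bo)
		return;
	vm_put(bo->vm); /* release the reference taken at creation */
	free(bo);
}
```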

[PATCH 4/4] drm/panthor: Call panthor_sched_post_reset() even if the reset failed

2024-05-02 Thread Boris Brezillon
We need to undo what was done in panthor_sched_pre_reset() even if the
reset failed. When that happens, we simply flag all previously running
groups as terminated to unblock things.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_device.c |  7 +--
 drivers/gpu/drm/panthor/panthor_sched.c  | 19 ++-
 drivers/gpu/drm/panthor/panthor_sched.h  |  2 +-
 3 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_device.c b/drivers/gpu/drm/panthor/panthor_device.c
index 4c5b54e7abb7..4082c8f2951d 100644
--- a/drivers/gpu/drm/panthor/panthor_device.c
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -129,13 +129,8 @@ static void panthor_device_reset_work(struct work_struct *work)
panthor_gpu_l2_power_on(ptdev);
panthor_mmu_post_reset(ptdev);
ret = panthor_fw_post_reset(ptdev);
-   if (ret)
-   goto out_dev_exit;
-
atomic_set(&ptdev->reset.pending, 0);
-   panthor_sched_post_reset(ptdev);
-
-out_dev_exit:
+   panthor_sched_post_reset(ptdev, ret != 0);
drm_dev_exit(cookie);
 
if (ret) {
diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 6ea094b00cf9..fc43ff62c77d 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -2728,15 +2728,22 @@ void panthor_sched_pre_reset(struct panthor_device *ptdev)
mutex_unlock(&sched->reset.lock);
 }
 
-void panthor_sched_post_reset(struct panthor_device *ptdev)
+void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed)
 {
struct panthor_scheduler *sched = ptdev->scheduler;
struct panthor_group *group, *group_tmp;
 
mutex_lock(&sched->reset.lock);
 
-   list_for_each_entry_safe(group, group_tmp, &sched->reset.stopped_groups, run_node)
+   list_for_each_entry_safe(group, group_tmp, &sched->reset.stopped_groups, run_node) {
+   /* Consider all previously running groups as terminated if the
+* reset failed.
+*/
+   if (reset_failed)
+   group->state = PANTHOR_CS_GROUP_TERMINATED;
+
panthor_group_start(group);
+   }
 
/* We're done resetting the GPU, clear the reset.in_progress bit so we can
 * kick the scheduler.
@@ -2744,9 +2751,11 @@ void panthor_sched_post_reset(struct panthor_device *ptdev)
atomic_set(&sched->reset.in_progress, false);
mutex_unlock(&sched->reset.lock);
 
-   sched_queue_delayed_work(sched, tick, 0);
-
-   sched_queue_work(sched, sync_upd);
+   /* No need to queue a tick and update syncs if the reset failed. */
+   if (!reset_failed) {
+   sched_queue_delayed_work(sched, tick, 0);
+   sched_queue_work(sched, sync_upd);
+   }
 }
 
 static void group_sync_upd_work(struct work_struct *work)
diff --git a/drivers/gpu/drm/panthor/panthor_sched.h b/drivers/gpu/drm/panthor/panthor_sched.h
index 66438b1f331f..3a30d2328b30 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.h
+++ b/drivers/gpu/drm/panthor/panthor_sched.h
@@ -40,7 +40,7 @@ void panthor_group_pool_destroy(struct panthor_file *pfile);
 int panthor_sched_init(struct panthor_device *ptdev);
 void panthor_sched_unplug(struct panthor_device *ptdev);
 void panthor_sched_pre_reset(struct panthor_device *ptdev);
-void panthor_sched_post_reset(struct panthor_device *ptdev);
+void panthor_sched_post_reset(struct panthor_device *ptdev, bool reset_failed);
 void panthor_sched_suspend(struct panthor_device *ptdev);
 void panthor_sched_resume(struct panthor_device *ptdev);
 
-- 
2.44.0



Re: [PATCH] drm/panthor: Fix the FW reset logic

2024-05-02 Thread Boris Brezillon
On Tue, 30 Apr 2024 13:37:27 +0200
Boris Brezillon  wrote:

> In the post_reset function, if the fast reset didn't succeed, we
> are not clearing the fast_reset flag, which prevents firmware
> sections from being reloaded. While at it, use panthor_fw_stop()
> instead of manually writing DISABLE to the MCU_CONTROL register.
> 
> Fixes: 2718d91816ee ("drm/panthor: Add the FW logical block")
> Signed-off-by: Boris Brezillon 

Queued to drm-misc-next-fixes.

> ---
>  drivers/gpu/drm/panthor/panthor_fw.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c b/drivers/gpu/drm/panthor/panthor_fw.c
> index 181395e2859a..fedf9627453f 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -1083,10 +1083,11 @@ int panthor_fw_post_reset(struct panthor_device 
> *ptdev)
>   if (!ret)
>   goto out;
>  
> - /* Force a disable, so we get a fresh boot on the next
> -  * panthor_fw_start() call.
> + /* Forcibly reset the MCU and force a slow reset, so we get a
> +  * fresh boot on the next panthor_fw_start() call.
>*/
> - gpu_write(ptdev, MCU_CONTROL, MCU_CONTROL_DISABLE);
> + panthor_fw_stop(ptdev);
> + ptdev->fw->fast_reset = false;
> drm_err(&ptdev->base, "FW fast reset failed, trying a slow reset");
>   }
>  



Re: [PATCH 0/4] drm/panthor: More reset fixes

2024-05-13 Thread Boris Brezillon
On Thu,  2 May 2024 20:38:08 +0200
Boris Brezillon  wrote:

> Hello,
> 
> This is a collection of fixes for bugs found while chasing an
> unrecoverable fault leading to a device unplug (because of some
> other bugs that were introduced in my local dev branch).
> 
> The first patch makes sure we immediately reset the GPU on an
> unrecoverable fault, and following patches are fixing various
> NULL/invalid pointer derefs caused by use-after-free situations
> following a device unplug.
> 
> Regards,
> 
> Boris
> 
> Boris Brezillon (4):
>   drm/panthor: Force an immediate reset on unrecoverable faults
>   drm/panthor: Keep a ref to the VM at the panthor_kernel_bo level
>   drm/panthor: Reset the FW VM to NULL on unplug
>   drm/panthor: Call panthor_sched_post_reset() even if the reset failed

Queued to drm-misc-next-fixes.

> 
>  drivers/gpu/drm/panthor/panthor_device.c |  8 ++---
>  drivers/gpu/drm/panthor/panthor_device.h |  1 +
>  drivers/gpu/drm/panthor/panthor_fw.c |  5 +--
>  drivers/gpu/drm/panthor/panthor_gem.c|  8 +++--
>  drivers/gpu/drm/panthor/panthor_gem.h|  8 +++--
>  drivers/gpu/drm/panthor/panthor_heap.c   |  8 ++---
>  drivers/gpu/drm/panthor/panthor_sched.c  | 40 +---
>  drivers/gpu/drm/panthor/panthor_sched.h  |  2 +-
>  8 files changed, 51 insertions(+), 29 deletions(-)
> 



Re: [PATCH v4 0/5] drm/panthor: Collection of tiler heap related fixes

2024-05-13 Thread Boris Brezillon
On Thu,  2 May 2024 18:51:53 +0200
Boris Brezillon  wrote:

> This is a collection of tiler heap fixes for bugs/oddities found while
> looking at incremental rendering.
> 
> Ideally, we want to land those before 6.10 is released, so we don't need
> to increment the driver version to reflect the ABI changes.
> 
> Changelog detailed in each commit.
> 
> Regards,
> 
> Boris
> 
> Antonino Maniscalco (1):
>   drm/panthor: Fix tiler OOM handling to allow incremental rendering
> 
> Boris Brezillon (4):
>   drm/panthor: Make sure the tiler initial/max chunks are consistent
>   drm/panthor: Relax the constraints on the tiler chunk size
>   drm/panthor: Fix an off-by-one in the heap context retrieval logic
>   drm/panthor: Document drm_panthor_tiler_heap_destroy::handle validity
> constraints

Queued to drm-misc-next-fixes.

> 
>  drivers/gpu/drm/panthor/panthor_heap.c  | 28 -
>  drivers/gpu/drm/panthor/panthor_sched.c |  7 ++-
>  include/uapi/drm/panthor_drm.h  | 20 ++
>  3 files changed, 40 insertions(+), 15 deletions(-)
> 



Re: [PATCH v3 0/2] drm: Fix dma_resv deadlock at drm object pin time

2024-05-21 Thread Boris Brezillon
On Fri, 17 May 2024 19:16:21 +0100
Adrián Larumbe  wrote:

> Hi Boris and Thomas,
> 
> On 02.05.2024 14:18, Thomas Zimmermann wrote:
> > Hi
> > 
> > Am 02.05.24 um 14:00 schrieb Boris Brezillon:  
> > > On Thu, 2 May 2024 13:59:41 +0200
> > > Boris Brezillon  wrote:
> > >   
> > > > Hi Thomas,
> > > > 
> > > > On Thu, 2 May 2024 13:51:16 +0200
> > > > Thomas Zimmermann  wrote:
> > > >   
> > > > > Hi,
> > > > > 
> > > > > ignoring my r-b on patch 1, I'd like to rethink the current patches in
> > > > > general.
> > > > > 
> > > > > I think drm_gem_shmem_pin() should become the locked version of 
> > > > > _pin(),
> > > > > so that drm_gem_shmem_object_pin() can call it directly. The existing
> > > > > _pin_unlocked() would not be needed any longer. Same for the _unpin()
> > > > > functions. This change would also fix the consistency with the 
> > > > > semantics
> > > > > of the shmem _vmap() functions, which never take reservation locks.
> > > > > 
> > > > > There are only two external callers of drm_gem_shmem_pin(): the test
> > > > > case and panthor. These assume that drm_gem_shmem_pin() acquires the
> > > > > reservation lock. The test case should likely call drm_gem_pin()
> > > > > instead. That would acquire the reservation lock and the test would
> > > > > validate that shmem's pin helper integrates well into the overall GEM
> > > > > framework. The way panthor uses drm_gem_shmem_pin() looks wrong to me.
> > > > > For now, it could receive a wrapper that takes the lock and that's 
> > > > > it.  
> > > > I do agree that the current inconsistencies in the naming is
> > > > troublesome (sometimes _unlocked, sometimes _locked, with the version
> > > > without any suffix meaning either _locked or _unlocked depending on
> > > > what the suffixed version does), and that's the very reason I asked
> > > > Dmitry to address that in his shrinker series [1]. So, ideally I'd
> > > > prefer if patches from Dmitry's series were applied instead of
> > > > trying to fix that here (IIRC, we had an ack from Maxime).  
> > > With the link this time :-).
> > > 
> > > [1]https://lore.kernel.org/lkml/20240105184624.508603-1-dmitry.osipe...@collabora.com/T/
> > >   
> > 
> > Thanks. I remember these patches. Somehow I thought they would have been
> > merged already. I wasn't super happy about the naming changes in patch 5,
> > because the names of the GEM object callbacks do no longer correspond with
> > their implementations. But anyway.
> > 
> > If we go that direction, we should here simply push drm_gem_shmem_pin() and
> > drm_gem_shmem_unpin() into panthor and update the shmem tests with
> > drm_gem_pin(). Panfrost and lima would call drm_gem_shmem_pin_locked(). IMHO
> > we should not promote the use of drm_gem_shmem_object_*() functions, as they
> > are meant to be callbacks for struct drm_gem_object_funcs. (Auto-generating
> > them would be nice.)  
> 
> I'll be doing this in the next patch series iteration, casting the pin 
> function's
> drm object parameter to an shmem object.
> 
> Also for the sake of leaving things in a consistent state, and against Boris' 
> advice,
> I think I'll leave the drm WARN statement inside drm_gem_shmem_pin_locked.

Sure, that's fine.


Re: [PATCH v4 1/3] drm/panfrost: Fix dma_resv deadlock at drm object pin time

2024-05-23 Thread Boris Brezillon
On Thu, 23 May 2024 12:32:17 +0100
Adrián Larumbe  wrote:

> When Panfrost must pin an object for which a dma-buf attachment is being
> prepared on behalf of another driver, the core drm gem object pinning
> code already takes a lock on the object's dma reservation.
> 
> However, Panfrost GEM object's pinning callback would eventually try taking
> the lock on the same dma reservation when delegating pinning of the object
> onto the shmem subsystem, which led to a deadlock.
> 
> This can be shown by enabling CONFIG_DEBUG_WW_MUTEX_SLOWPATH, which throws
> the following recursive locking situation:
> 
> weston/3440 is trying to acquire lock:
> 00e235a0 (reservation_ww_class_mutex){+.+.}-{3:3}, at: 
> drm_gem_shmem_pin+0x34/0xb8 [drm_shmem_helper]
> but task is already holding lock:
> 00e235a0 (reservation_ww_class_mutex){+.+.}-{3:3}, at: 
> drm_gem_pin+0x2c/0x80 [drm]
> 
> Fix it by replacing drm_gem_shmem_pin with its locked version, as the lock
> had already been taken by drm_gem_pin().
> 
> Cc: Thomas Zimmermann 
> Cc: Dmitry Osipenko 
> Cc: Boris Brezillon 
> Cc: Steven Price 
> Fixes: a78027847226 ("drm/gem: Acquire reservation lock in 
> drm_gem_{pin/unpin}()")
> Signed-off-by: Adrián Larumbe 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panfrost/panfrost_gem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c b/drivers/gpu/drm/panfrost/panfrost_gem.c
> index d47b40b82b0b..8e0ff3efede7 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_gem.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
> @@ -192,7 +192,7 @@ static int panfrost_gem_pin(struct drm_gem_object *obj)
>   if (bo->is_heap)
>   return -EINVAL;
>  
> - return drm_gem_shmem_pin(&bo->base);
> + return drm_gem_shmem_pin_locked(&bo->base);
>  }
>  
>  static enum drm_gem_object_status panfrost_gem_status(struct drm_gem_object 
> *obj)
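The deadlock and its fix follow a common locked/unlocked API convention, which can be sketched with a toy non-recursive lock (all names are hypothetical; this is not the actual drm code):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of the locked/unlocked convention behind this
 * fix: the core acquires the (non-recursive) reservation lock once in
 * gem_pin(), so any helper reached from the driver's pin callback must
 * use the *_locked variant instead of re-acquiring the same lock. The
 * toy lock aborts on recursive acquisition, mimicking what the ww_mutex
 * debug machinery reports. */

static int resv_held;

static void resv_lock(void)
{
	if (resv_held) {
		fprintf(stderr, "recursive locking detected\n");
		abort(); /* the original bug: pin callback relocked resv */
	}
	resv_held = 1;
}

static void resv_unlock(void)
{
	resv_held = 0;
}

static int pin_count;

/* Caller must hold the reservation lock. */
static int shmem_pin_locked(void)
{
	return ++pin_count;
}

/* Driver pin callback: runs with the lock held, so it must call the
 * _locked variant (the fix), not a variant that takes the lock again. */
static int driver_pin_cb(void)
{
	return shmem_pin_locked();
}

static int gem_pin(void)
{
	int ret;

	resv_lock();
	ret = driver_pin_cb();
	resv_unlock();
	return ret;
}
```

Had driver_pin_cb() called a relocking variant, the sketch would abort on the first gem_pin(), which is the recursive-locking situation the commit message describes.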



Re: [PATCH v4 2/3] drm/lima: Fix dma_resv deadlock at drm object pin time

2024-05-23 Thread Boris Brezillon
On Thu, 23 May 2024 12:32:18 +0100
Adrián Larumbe  wrote:

> Commit a78027847226 ("drm/gem: Acquire reservation lock in
> drm_gem_{pin/unpin}()") moved locking the DRM object's dma reservation to
> drm_gem_pin(), but Lima's pin callback kept calling drm_gem_shmem_pin,
> which also tries to lock the same dma_resv, leading to a double lock
> situation.
> 
> As was already done for Panfrost in the previous commit, fix it by
> replacing drm_gem_shmem_pin() with its locked variant.
> 
> Cc: Thomas Zimmermann 
> Cc: Dmitry Osipenko 
> Cc: Boris Brezillon 
> Cc: Steven Price 
> Fixes: a78027847226 ("drm/gem: Acquire reservation lock in 
> drm_gem_{pin/unpin}()")
> Signed-off-by: Adrián Larumbe 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/lima/lima_gem.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
> index 7ea244d876ca..9bb997dbb4b9 100644
> --- a/drivers/gpu/drm/lima/lima_gem.c
> +++ b/drivers/gpu/drm/lima/lima_gem.c
> @@ -185,7 +185,7 @@ static int lima_gem_pin(struct drm_gem_object *obj)
>   if (bo->heap_size)
>   return -EINVAL;
>  
> - return drm_gem_shmem_pin(&bo->base);
> + return drm_gem_shmem_pin_locked(&bo->base);
>  }
>  
>  static int lima_gem_vmap(struct drm_gem_object *obj, struct iosys_map *map)



Re: [PATCH v4 3/3] drm/gem-shmem: Add import attachment warning to locked pin function

2024-05-23 Thread Boris Brezillon
On Thu, 23 May 2024 12:32:19 +0100
Adrián Larumbe  wrote:

> Commit ec144244a43f ("drm/gem-shmem: Acquire reservation lock in GEM
> pin/unpin callbacks") moved locking DRM object's dma reservation to
> drm_gem_shmem_object_pin, and made drm_gem_shmem_pin_locked public, so we
> need to make sure the non-NULL check warning is also added to the latter.
> 
> Cc: Thomas Zimmermann 
> Cc: Dmitry Osipenko 
> Cc: Boris Brezillon 
> Fixes: a78027847226 ("drm/gem: Acquire reservation lock in 
> drm_gem_{pin/unpin}()")
> Signed-off-by: Adrián Larumbe 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 13bcdbfd..ad5d9f704e15 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -233,6 +233,8 @@ int drm_gem_shmem_pin_locked(struct drm_gem_shmem_object 
> *shmem)
>  
>   dma_resv_assert_held(shmem->base.resv);
>  
> + drm_WARN_ON(shmem->base.dev, shmem->base.import_attach);
> +
>   ret = drm_gem_shmem_get_pages(shmem);
>  
>   return ret;



Re: [PATCH v4 0/3] drm: Fix dma_resv deadlock at drm object pin time

2024-05-29 Thread Boris Brezillon
On Thu, 23 May 2024 12:32:16 +0100
Adrián Larumbe  wrote:

> This is v4 of 
> https://lore.kernel.org/lkml/20240521181817.097af...@collabora.com/T/
> 
> The goal of this patch series is fixing a deadlock upon locking the dma 
> reservation
> of a DRM gem object when pinning it, at a prime import operation.
> 
> Changelog:
> v3:
>  - Split driver fixes into separate commits for Panfrost and Lima
>  - Make drivers call drm_gem_shmem_pin_locked instead of 
> drm_gem_shmem_object_pin
>  - Improved commit message for first patch to explain why dma resv locking in 
> the 
>  pin callback is no longer necessary.
> v2:
>  - Removed comment explaining reason why an already-locked
> pin function replaced the locked variant inside Panfrost's
> object pin callback.
>  - Moved already-assigned attachment warning into generic
> already-locked gem object pin function
> 
> 
> Adrián Larumbe (3):
>   drm/panfrost: Fix dma_resv deadlock at drm object pin time
>   drm/lima: Fix dma_resv deadlock at drm object pin time
>   drm/gem-shmem: Add import attachment warning to locked pin function

Queued to drm-misc-fixes.

Thanks!

Boris


Re: [PATCH v6 10/14] drm/panthor: Add the scheduler logical block

2024-09-04 Thread Boris Brezillon
On Tue, 3 Sep 2024 21:43:48 +0200
Simona Vetter  wrote:

> On Thu, Feb 29, 2024 at 05:22:24PM +0100, Boris Brezillon wrote:
> > - Add our job fence as DMA_RESV_USAGE_WRITE to all external objects
> >   (was previously DMA_RESV_USAGE_BOOKKEEP). I don't get why, given
> >   we're supposed to be fully-explicit, but other drivers do that, so
> >   there must be a good reason  
> 
> Just spotted this: They're wrong, or their userspace is broken and
> doesn't use the dma_buf fence import/export ioctl in all the right places.
> For gl this simplifies things (but setting write fences when you're only
> reading is still bad, and setting fences on buffers you don't even touch
> is worse), for vulkan this is just bad.

For the record, I remember pointing that out in some drm_sched
discussion, and being told that this was done on purpose :-/.

> 
> I think you want a context creation flag for userspace that's not broken,
> which goes back to USAGE_BOOKKEEP for everything.

Honestly, given the only user (the gallium driver) is already designed
to do the explicit <-> implicit dance, and the fact that the driver just got
merged in the last release, I'd rather go for a silent USAGE_WRITE ->
USAGE_BOOKKEEP if things keep working with that.


Re: [PATCH v5 1/4] drm/panthor: introduce job cycle and timestamp accounting

2024-09-04 Thread Boris Brezillon
e->profiling_info.profiling_seqno++;
> + if (queue->profiling_info.profiling_seqno == 
> queue->profiling_info.slot_count)
> + queue->profiling_info.profiling_seqno = 0;
> +
> + job->ringbuf.start = queue->iface.input->insert;
> +
> + get_job_cs_params(job, &cs_params);
> + prepare_job_instrs(&cs_params, &instrs);
> + copy_instrs_to_ringbuf(queue, job, &instrs);
> +
> + job->ringbuf.end = job->ringbuf.start + (instrs.count * sizeof(u64));
>  
>   panthor_job_get(&job->base);
>   spin_lock(&queue->fence_ctx.lock);
>   list_add_tail(&job->node, &queue->fence_ctx.in_flight_jobs);
>   spin_unlock(&queue->fence_ctx.lock);
>  
> - job->ringbuf.start = queue->iface.input->insert;
> - job->ringbuf.end = job->ringbuf.start + sizeof(call_instrs);
> -
>   /* Make sure the ring buffer is updated before the INSERT
>* register.
>*/
> @@ -3003,6 +3187,24 @@ static const struct drm_sched_backend_ops 
> panthor_queue_sched_ops = {
>   .free_job = queue_free_job,
>  };
>  
> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
> +u32 cs_ringbuf_size)
> +{
> + u32 min_profiled_job_instrs = U32_MAX;
> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
> +
> + for (u32 i = 0; i < last_flag; i++) {
> + if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
> + min_profiled_job_instrs =
> + min(min_profiled_job_instrs, 
> calc_job_credits(BIT(i)));
> + }

Okay, I think this loop deserves an explanation. The goal is to
calculate the minimal size of a profiled job so we can deduce the
maximum number of profiling slots that will be used simultaneously. We
ignore PANTHOR_DEVICE_PROFILING_DISABLED, because those jobs won't use
a profiling slot in the first place.

> +
> + drm_WARN_ON(&ptdev->base,
> + !IS_ALIGNED(min_profiled_job_instrs, 
> NUM_INSTRS_PER_CACHE_LINE));

We can probably drop this WARN_ON(), since it's already supposed to be
checked in calc_job_credits().

> +
> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * 
> sizeof(u64));
> +}
> +
>  static struct panthor_queue *
>  group_create_queue(struct panthor_group *group,
>  const struct drm_panthor_queue_create *args)
> @@ -3056,9 +3258,38 @@ group_create_queue(struct panthor_group *group,
>   goto err_free_queue;
>   }
>  
> + queue->profiling_info.slot_count =
> + calc_profiling_ringbuf_num_slots(group->ptdev, 
> args->ringbuf_size);
> +
> + queue->profiling_info.slots =
> + panthor_kernel_bo_create(group->ptdev, group->vm,
> +  queue->profiling_info.slot_count *
> +      sizeof(struct 
> panthor_job_profiling_data),
> +  DRM_PANTHOR_BO_NO_MMAP,
> +  DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
> +  DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
> +  PANTHOR_VM_KERNEL_AUTO_VA);
> +
> + if (IS_ERR(queue->profiling_info.slots)) {
> + ret = PTR_ERR(queue->profiling_info.slots);
> + goto err_free_queue;
> + }
> +
> + ret = panthor_kernel_bo_vmap(queue->profiling_info.slots);
> + if (ret)
> + goto err_free_queue;
> +
> + memset(queue->profiling_info.slots->kmap, 0,
> +queue->profiling_info.slot_count * sizeof(struct 
> panthor_job_profiling_data));

I don't think we need to memset() the profiling buffer.

> +
> + /*
> +  * Credit limit argument tells us the total number of instructions
> +  * across all CS slots in the ringbuffer, with some jobs requiring
> +  * twice as many as others, depending on their profiling status.
> +  */
>   ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>group->ptdev->scheduler->wq, 1,
> -  args->ringbuf_size / (NUM_INSTRS_PER_SLOT * 
> sizeof(u64)),
> +  args->ringbuf_size / sizeof(u64),
>0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>group->ptdev->reset.wq,
>NULL, "panthor-queue", group->ptdev->base.dev);
> @@ -3354,6 +3585,7 @@ panthor_job_create(struct panthor_file *pfile,
>  {
>   struct panthor_group_pool *gpool = pfile->groups;
>   struct panthor_job *job;
> + u32 credits;
>   int ret;
>  
>   if (qsubmit->pad)
> @@ -3407,9 +3639,16 @@ panthor_job_create(struct panthor_file *pfile,
>   }
>   }
>  
> + job->profile_mask = pfile->ptdev->profile_mask;
> + credits = calc_job_credits(job->profile_mask);
> + if (credits == 0) {
> + ret = -EINVAL;
> + goto err_put_job;
> + }
> +
>   ret = drm_sched_job_init(&job->base,
>&job->group->queues[job->queue_idx]->entity,
> -  1, job->group);
> +  credits, job->group);
>   if (ret)
>   goto err_put_job;
>  

Just added a bunch of minor comments (mostly cosmetic changes), but the
implementation looks good to me.

Reviewed-by: Boris Brezillon 


Re: [PATCH v5 2/4] drm/panthor: add DRM fdinfo support

2024-09-04 Thread Boris Brezillon
On Tue,  3 Sep 2024 21:25:36 +0100
Adrián Larumbe  wrote:

> Drawing from the FW-calculated values in the previous commit, we can
> increase the numbers for an open file by collecting them from finished jobs
> when updating their group synchronisation objects.
> 
> Display of fdinfo key-value pairs is governed by a flag that is by default
> disabled in the present commit, and supporting manual toggle of it will be
> the matter of a later commit.
> 
> Signed-off-by: Adrián Larumbe 
> ---
>  drivers/gpu/drm/panthor/panthor_devfreq.c | 18 -
>  drivers/gpu/drm/panthor/panthor_device.h  | 14 +++
>  drivers/gpu/drm/panthor/panthor_drv.c | 35 ++
>  drivers/gpu/drm/panthor/panthor_sched.c   | 45 +++
>  4 files changed, 111 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c 
> b/drivers/gpu/drm/panthor/panthor_devfreq.c
> index c6d3c327cc24..9d0f891b9b53 100644
> --- a/drivers/gpu/drm/panthor/panthor_devfreq.c
> +++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
> @@ -62,14 +62,20 @@ static void panthor_devfreq_update_utilization(struct 
> panthor_devfreq *pdevfreq)
>  static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
> u32 flags)
>  {
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
>   struct dev_pm_opp *opp;
> + int err;
>  
>   opp = devfreq_recommended_opp(dev, freq, flags);
>   if (IS_ERR(opp))
>   return PTR_ERR(opp);
>   dev_pm_opp_put(opp);
>  
> - return dev_pm_opp_set_rate(dev, *freq);
> + err = dev_pm_opp_set_rate(dev, *freq);
> + if (!err)
> + ptdev->current_frequency = *freq;
> +
> + return err;
>  }
>  
>  static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
> @@ -130,6 +136,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>   struct panthor_devfreq *pdevfreq;
>   struct dev_pm_opp *opp;
>   unsigned long cur_freq;
> + unsigned long freq = ULONG_MAX;
>   int ret;
>  
>   pdevfreq = drmm_kzalloc(&ptdev->base, sizeof(*ptdev->devfreq), 
> GFP_KERNEL);
> @@ -161,6 +168,7 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>   return PTR_ERR(opp);
>  
>   panthor_devfreq_profile.initial_freq = cur_freq;
> + ptdev->current_frequency = cur_freq;
>  
>   /* Regulator coupling only takes care of synchronizing/balancing voltage
>* updates, but the coupled regulator needs to be enabled manually.
> @@ -204,6 +212,14 @@ int panthor_devfreq_init(struct panthor_device *ptdev)
>  
>   dev_pm_opp_put(opp);
>  
> + /* Find the fastest defined rate  */
> + opp = dev_pm_opp_find_freq_floor(dev, &freq);
> + if (IS_ERR(opp))
> + return PTR_ERR(opp);
> + ptdev->fast_rate = freq;
> +
> + dev_pm_opp_put(opp);
> +
>   /*
>* Setup default thresholds for the simple_ondemand governor.
>* The values are chosen based on experiments.
> diff --git a/drivers/gpu/drm/panthor/panthor_device.h 
> b/drivers/gpu/drm/panthor/panthor_device.h
> index a48e30d0af30..0e68f5a70d20 100644
> --- a/drivers/gpu/drm/panthor/panthor_device.h
> +++ b/drivers/gpu/drm/panthor/panthor_device.h
> @@ -184,6 +184,17 @@ struct panthor_device {
>  
>   /** @profile_mask: User-set profiling flags for job accounting. */
>   u32 profile_mask;
> +
> + /** @current_frequency: Device clock frequency at present. Set by DVFS*/
> + unsigned long current_frequency;
> +
> + /** @fast_rate: Maximum device clock frequency. Set by DVFS */
> + unsigned long fast_rate;
> +};

Can we move the current_frequency/fast_rate retrieval to a separate
patch?

> +
> +struct panthor_gpu_usage {
> + u64 time;
> + u64 cycles;
>  };
>  
>  /**
> @@ -198,6 +209,9 @@ struct panthor_file {
>  
>   /** @groups: Scheduling group pool attached to this file. */
>   struct panthor_group_pool *groups;
> +
> + /** @stats: cycle and timestamp measures for job execution. */
> + struct panthor_gpu_usage stats;
>  };
>  
>  int panthor_device_init(struct panthor_device *ptdev);
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index b5e7b919f241..e18838754963 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -3,12 +3,17 @@
>  /* Copyright 2019 Linaro, Ltd., Rob Herring  */
>  /* Copyright 2019 Collabora ltd. */
>  
> +#ifdef CONFIG_ARM_ARCH_TIMER
> +#include 
> +#endif
> +
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -1351,6 +1356,34 @@ static int panthor_mmap(struct file *filp, struct 
> vm_area_struct *vma)
>   return ret;
>  }
>  
> +static void panthor_gpu_show_fdinfo(struct panthor_device *ptdev,
> + struct panthor_file *pfile,
> + struct drm_printer 

Re: [PATCH v5 3/4] drm/panthor: enable fdinfo for memory stats

2024-09-04 Thread Boris Brezillon
On Tue,  3 Sep 2024 21:25:37 +0100
Adrián Larumbe  wrote:

> Implement drm object's status callback.
> 
> Also, we consider a PRIME imported BO to be resident if its matching
> dma_buf has an open attachment, which means its backing storage had already
> been allocated.
> 
> Signed-off-by: Adrián Larumbe 
> Reviewed-by: Liviu Dudau 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panthor/panthor_gem.c | 12 
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_gem.c 
> b/drivers/gpu/drm/panthor/panthor_gem.c
> index 38f560864879..c60b599665d8 100644
> --- a/drivers/gpu/drm/panthor/panthor_gem.c
> +++ b/drivers/gpu/drm/panthor/panthor_gem.c
> @@ -145,6 +145,17 @@ panthor_gem_prime_export(struct drm_gem_object *obj, int 
> flags)
>   return drm_gem_prime_export(obj, flags);
>  }
>  
> +static enum drm_gem_object_status panthor_gem_status(struct drm_gem_object 
> *obj)
> +{
> + struct panthor_gem_object *bo = to_panthor_bo(obj);
> + enum drm_gem_object_status res = 0;
> +
> + if (bo->base.base.import_attach || bo->base.pages)
> + res |= DRM_GEM_OBJECT_RESIDENT;
> +
> + return res;
> +}
> +
>  static const struct drm_gem_object_funcs panthor_gem_funcs = {
>   .free = panthor_gem_free_object,
>   .print_info = drm_gem_shmem_object_print_info,
> @@ -154,6 +165,7 @@ static const struct drm_gem_object_funcs 
> panthor_gem_funcs = {
>   .vmap = drm_gem_shmem_object_vmap,
>   .vunmap = drm_gem_shmem_object_vunmap,
>   .mmap = panthor_gem_mmap,
> + .status = panthor_gem_status,
>   .export = panthor_gem_prime_export,
>   .vm_ops = &drm_gem_shmem_vm_ops,
>  };



Re: [PATCH v5 4/4] drm/panthor: add sysfs knob for enabling job profiling

2024-09-04 Thread Boris Brezillon
On Tue,  3 Sep 2024 21:25:38 +0100
Adrián Larumbe  wrote:

> This commit introduces a DRM device sysfs attribute that lets UM control
> the job accounting status in the device. The knob variable had been brought
> in as part of a previous commit, but now we're able to fix it manually.
> 
> As sysfs files are part of a driver's uAPI, describe its legitimate input
> values and output format in a documentation file.
> 
> Signed-off-by: Adrián Larumbe 
> ---
>  Documentation/gpu/panthor.rst | 46 +++
>  drivers/gpu/drm/panthor/panthor_drv.c | 39 +++
>  2 files changed, 85 insertions(+)
>  create mode 100644 Documentation/gpu/panthor.rst
> 
> diff --git a/Documentation/gpu/panthor.rst b/Documentation/gpu/panthor.rst
> new file mode 100644
> index ..cbf5c4429a2d
> --- /dev/null
> +++ b/Documentation/gpu/panthor.rst
> @@ -0,0 +1,46 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +=
> + drm/Panthor CSF driver
> +=
> +
> +.. _panfrost-usage-stats:
> +
> +Panthor DRM client usage stats implementation
> +==
> +
> +The drm/Panthor driver implements the DRM client usage stats specification as
> +documented in :ref:`drm-client-usage-stats`.
> +
> +Example of the output showing the implemented key value pairs and entirety of
> +the currently possible format options:
> +
> +::
> + pos:0
> + flags:  0242
> + mnt_id: 29
> + ino:491
> + drm-driver: panthor
> + drm-client-id:  10
> + drm-engine-panthor: 10952750 ns
> + drm-cycles-panthor: 94439687187
> + drm-maxfreq-panthor:10 Hz
> + drm-curfreq-panthor:10 Hz
> + drm-total-memory:   16480 KiB
> + drm-shared-memory:  0
> + drm-active-memory:  16200 KiB
> + drm-resident-memory:16480 KiB
> + drm-purgeable-memory:   0
> +
> +Possible `drm-engine-` key names are: `panthor`.
> +`drm-curfreq-` values convey the current operating frequency for that engine.
> +
> +Users must bear in mind that engine and cycle sampling are disabled by 
> default,
> +because of power saving concerns. `fdinfo` users and benchmark applications 
> which
> +query the fdinfo file must make sure to toggle the job profiling status of 
> the
> +driver by writing into the appropriate sysfs node::
> +
> +echo N > /sys/bus/platform/drivers/panthor/[a-f0-9]*.gpu/profiling
> +
> +Where `N` is a bit mask where cycle and timestamp sampling are respectively
> +enabled by the first and second bits.

This should probably be documented in
Documentation/ABI/testing/sysfs-driver-panthor too.

> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index e18838754963..26475db96c41 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1450,6 +1450,44 @@ static void panthor_remove(struct platform_device 
> *pdev)
>   panthor_device_unplug(ptdev);
>  }
>  
> +static ssize_t profiling_show(struct device *dev,
> +   struct device_attribute *attr,
> +   char *buf)
> +{
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> +
> + return sysfs_emit(buf, "%d\n", ptdev->profile_mask);
> +}
> +
> +static ssize_t profiling_store(struct device *dev,
> +struct device_attribute *attr,
> +const char *buf, size_t len)
> +{
> + struct panthor_device *ptdev = dev_get_drvdata(dev);
> + u32 value;
> + int err;
> +
> + err = kstrtou32(buf, 0, &value);
> + if (err)
> + return err;
> +
> + if ((value & ~PANTHOR_DEVICE_PROFILING_ALL) != 0)
> + return -EINVAL;
> +
> + ptdev->profile_mask = value;
> +
> + return len;
> +}
> +
> +static DEVICE_ATTR_RW(profiling);
> +
> +static struct attribute *panthor_attrs[] = {
> + &dev_attr_profiling.attr,
> + NULL,
> +};
> +
> +ATTRIBUTE_GROUPS(panthor);
> +
>  static const struct of_device_id dt_match[] = {
>   { .compatible = "rockchip,rk3588-mali" },
>   { .compatible = "arm,mali-valhall-csf" },
> @@ -1469,6 +1507,7 @@ static struct platform_driver panthor_driver = {
>   .name = "panthor",
>   .pm = pm_ptr(&panthor_pm_ops),
>   .of_match_table = dt_match,
> + .dev_groups = panthor_groups,
>   },
>  };
>  



Re: [PATCH v4] drm/panthor: Add DEV_QUERY_TIMESTAMP_INFO dev query

2024-09-04 Thread Boris Brezillon
On Fri, 30 Aug 2024 10:03:50 +0200
Mary Guillemard  wrote:

> Expose timestamp information supported by the GPU with a new device
> query.
> 
> Mali uses an external timer as GPU system time. On ARM, this is wired to
> the generic arch timer so we wire cntfrq_el0 as device frequency.
> 
> This new uAPI will be used in Mesa to implement timestamp queries and
> VK_KHR_calibrated_timestamps.
> 
> Since this extends the uAPI and because userland needs a way to advertise
> those features conditionally, this also bumps the driver minor version.
> 
> v2:
> - Rewrote to use GPU timestamp register
> - Added timestamp_offset to drm_panthor_timestamp_info
> - Add missing include for arch_timer_get_cntfrq
> - Rework commit message
> 
> v3:
> - Add panthor_gpu_read_64bit_counter
> - Change panthor_gpu_read_timestamp to use
>   panthor_gpu_read_64bit_counter
> 
> v4:
> - Fix multiple typos in uAPI documentation
> - Mention behavior when the timestamp frequency is unknown
> - Use u64 instead of unsigned long long
>   for panthor_gpu_read_timestamp
> - Apply r-b from Mihail
> 
> Signed-off-by: Mary Guillemard 
> Reviewed-by: Mihail Atanassov 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panthor/panthor_drv.c | 43 +++-
>  drivers/gpu/drm/panthor/panthor_gpu.c | 47 +++
>  drivers/gpu/drm/panthor/panthor_gpu.h |  4 +++
>  include/uapi/drm/panthor_drm.h| 22 +
>  4 files changed, 115 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index b5e7b919f241..444e3bb1cfb5 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -3,6 +3,10 @@
>  /* Copyright 2019 Linaro, Ltd., Rob Herring  */
>  /* Copyright 2019 Collabora ltd. */
>  
> +#ifdef CONFIG_ARM_ARCH_TIMER
> +#include 
> +#endif
> +
>  #include 
>  #include 
>  #include 
> @@ -164,6 +168,7 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array 
> *in, u32 min_stride,
>   _Generic(_obj_name, \
>PANTHOR_UOBJ_DECL(struct drm_panthor_gpu_info, tiler_present), 
> \
>PANTHOR_UOBJ_DECL(struct drm_panthor_csif_info, pad), \
> +  PANTHOR_UOBJ_DECL(struct drm_panthor_timestamp_info, 
> current_timestamp), \
>PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), 
> \
>PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
>PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, 
> ringbuf_size), \
> @@ -750,10 +755,33 @@ static void panthor_submit_ctx_cleanup(struct 
> panthor_submit_ctx *ctx,
>   kvfree(ctx->jobs);
>  }
>  
> +static int panthor_query_timestamp_info(struct panthor_device *ptdev,
> + struct drm_panthor_timestamp_info *arg)
> +{
> + int ret;
> +
> + ret = pm_runtime_resume_and_get(ptdev->base.dev);
> + if (ret)
> + return ret;
> +
> +#ifdef CONFIG_ARM_ARCH_TIMER
> + arg->timestamp_frequency = arch_timer_get_cntfrq();
> +#else
> + arg->timestamp_frequency = 0;
> +#endif
> + arg->current_timestamp = panthor_gpu_read_timestamp(ptdev);
> + arg->timestamp_offset = panthor_gpu_read_timestamp_offset(ptdev);
> +
> + pm_runtime_put(ptdev->base.dev);
> + return 0;
> +}
> +
>  static int panthor_ioctl_dev_query(struct drm_device *ddev, void *data, 
> struct drm_file *file)
>  {
>   struct panthor_device *ptdev = container_of(ddev, struct 
> panthor_device, base);
>   struct drm_panthor_dev_query *args = data;
> + struct drm_panthor_timestamp_info timestamp_info;
> + int ret;
>  
>   if (!args->pointer) {
>   switch (args->type) {
> @@ -765,6 +793,10 @@ static int panthor_ioctl_dev_query(struct drm_device 
> *ddev, void *data, struct d
>   args->size = sizeof(ptdev->csif_info);
>   return 0;
>  
> + case DRM_PANTHOR_DEV_QUERY_TIMESTAMP_INFO:
> + args->size = sizeof(timestamp_info);
> + return 0;
> +
>   default:
>   return -EINVAL;
>   }
> @@ -777,6 +809,14 @@ static int panthor_ioctl_dev_query(struct drm_device 
> *ddev, void *data, struct d
>   case DRM_PANTHOR_DEV_QUERY_CSIF_INFO:
>   return PANTHOR_UOBJ_SET(args->pointer, args->size, 
> ptdev->csif_info);
>  
> + case DRM_PANTHOR_DEV_QUERY_TIMESTAMP_INFO:
> + ret = panthor_query_timestamp_info(ptdev, &timestamp_info);

Re: [RFC PATCH] drm/sched: Fix a UAF on drm_sched_fence::sched

2024-09-04 Thread Boris Brezillon
On Wed, 4 Sep 2024 11:46:54 +0200
Simona Vetter  wrote:

> On Wed, Sep 04, 2024 at 09:40:36AM +0200, Christian König wrote:
> > Am 03.09.24 um 10:13 schrieb Simona Vetter:  
> > > [SNIP]  
> > > > > So I think the issue is much, much bigger, and there's more. And the
> > > > > issue is I think a fundamental design issue of dma_fence itself, not
> > > > > individual users.  
> > > > IIRC both Alex and me pointed out this issue on the very first dma_fence
> > > > code and nobody really cared.  
> > > I guess way back then we didn't really sort out any of the hotunplug
> > > issues, and there wasn't any fw ctx schedulers at least on our horizons
> > > yet. Thin excuse, I know ...  
> > 
> > Well it's just when you have a bee string and a broken leg, what do you
> > attend first? :)  
> 
> Yeah ...
> 
> > > > >I think at the core it's two constraints:
> > > > > 
> > > > > - dma_fence can stick around practically forever in varios container
> > > > > objects. We only garbage collect when someone looks, and not even 
> > > > > then
> > > > > consistently.
> > > > > 
> > > > > - fences are meant to be cheap, so they do not have the big refcount 
> > > > > going
> > > > > on like other shared objects like dma_buf
> > > > > 
> > > > > Specifically there's also no refcounting on the module itself with 
> > > > > the  
> > > > > ->owner and try_module_get stuff. So even if we fix all these issues 
> > > > > on  
> > > > > the data structure lifetime side of things, you might still oops 
> > > > > calling
> > > > > into dma_fence->ops->release.
> > > > > 
> > > > > Oops.  
> > > > Yes, exactly that. I'm a bit surprised that you realize that only now :)
> > > > 
> > > > We have the issue for at least 10 years or so and it pops up every now 
> > > > and
> > > > then on my desk because people complain that unloading amdgpu crashes.  
> > > Yeah I knew about the issue. The new idea that popped into my mind is that
> > > I think we cannot plug this properly unless we do it in dma_fence.c for
> > > everyone, and essentially reshape the lifetime rules for that from yolo
> > > to something actually well-defined.
> > > 
> > > Kinda similar work to how dma_resv locking rules and fence book-keeping
> > > were unified to something that actually works across drivers ...  
> > 
> > Well sounds like I've just got more items on my TODO list.
> > 
> > I have patches waiting to be send out going into this direction anyway, will
> > try to get them out by the end of the week and then we can discuss what's
> > still missing.  
> 
> Quick addition, another motivator from the panthor userspace submit
> discussion: If the preempt ctx fence concept spreads, that's another
> non-drm_sched fence that drivers will need and are pretty much guaranteed
> to get wrong.
> 
> Also maybe Boris volunteers to help out with some of the work here?

Sure, I can review/test what Christian comes up with, since he already
seems to have a draft for the new implementation.


Re: [RFC PATCH] drm/sched: Fix a UAF on drm_sched_fence::sched

2024-09-04 Thread Boris Brezillon
On Wed, 4 Sep 2024 12:03:24 +0200
Simona Vetter  wrote:

> On Wed, Sep 04, 2024 at 11:46:54AM +0200, Simona Vetter wrote:
> > On Wed, Sep 04, 2024 at 09:40:36AM +0200, Christian König wrote:  
> > > Am 03.09.24 um 10:13 schrieb Simona Vetter:  
> > > > [SNIP]  
> > > > > > So I think the issue is much, much bigger, and there's more. And the
> > > > > > issue is I think a fundamental design issue of dma_fence itself, not
> > > > > > individual users.  
> > > > > IIRC both Alex and me pointed out this issue on the very first 
> > > > > dma_fence
> > > > > code and nobody really cared.  
> > > > I guess way back then we didn't really sort out any of the hotunplug
> > > > issues, and there wasn't any fw ctx schedulers at least on our horizons
> > > > yet. Thin excuse, I know ...  
> > > 
> > > Well it's just when you have a bee string and a broken leg, what do you
> > > attend first? :)  
> > 
> > Yeah ...
> >   
> > > > > >I think at the core it's two constraints:
> > > > > > 
> > > > > > - dma_fence can stick around practically forever in varios container
> > > > > > objects. We only garbage collect when someone looks, and not 
> > > > > > even then
> > > > > > consistently.
> > > > > > 
> > > > > > - fences are meant to be cheap, so they do not have the big 
> > > > > > refcount going
> > > > > > on like other shared objects like dma_buf
> > > > > > 
> > > > > > Specifically there's also no refcounting on the module itself with 
> > > > > > the  
> > > > > > ->owner and try_module_get stuff. So even if we fix all these 
> > > > > > issues on  
> > > > > > the data structure lifetime side of things, you might still oops 
> > > > > > calling
> > > > > > into dma_fence->ops->release.
> > > > > > 
> > > > > > Oops.  
> > > > > Yes, exactly that. I'm a bit surprised that you realize that only now 
> > > > > :)
> > > > > 
> > > > > We have the issue for at least 10 years or so and it pops up every 
> > > > > now and
> > > > > then on my desk because people complain that unloading amdgpu 
> > > > > crashes.  
> > > > Yeah I knew about the issue. The new idea that popped into my mind is 
> > > > that
> > > > I think we cannot plug this properly unless we do it in dma_fence.c for
> > > > everyone, and essentially reshape the lifetime rules for that from yolo
> > > > to something actually well-defined.
> > > > 
> > > > Kinda similar work to how dma_resv locking rules and fence book-keeping
> > > > were unified to something that actually works across drivers ...  
> > > 
> > > Well sounds like I've just got more items on my TODO list.
> > > 
> > > I have patches waiting to be send out going into this direction anyway, 
> > > will
> > > try to get them out by the end of the week and then we can discuss what's
> > > still missing.  
> > 
> > Quick addition, another motivator from the panthor userspace submit
> > discussion: If the preempt ctx fence concept spreads, that's another
> > non-drm_sched fence that drivers will need and are pretty much guaranteed
> > to get wrong.
> > 
> > Also maybe Boris volunteers to help out with some of the work here? Or
> > perhaps some of the nova folks, it seems to be even more a pain for rust
> > drivers ...  
> 
> I forgot to add: I think it'd be really good to record the rough consensus
> on the problem and the long term solution we're aiming for an a kerneldoc
> or TODO patch. I think recording those design goals helped us a _lot_ in
> making the dma_resv_usage/lock and dma_buf api cleanups and cross-driver
> consistent semantics happen. Maybe as a WARNING/TODO block in the
> dma_fence_ops kerneldoc?
> 
> Boris, can you volunteer perhaps?

Sure, I won't be able to do that this week though.


Re: [PATCH] drm/panthor: Restrict high priorities on group_create

2024-09-04 Thread Boris Brezillon
On Tue,  3 Sep 2024 16:49:55 +0200
Mary Guillemard  wrote:

> We were allowing any users to create a high priority group without any
> permission checks. As a result, this was allowing possible denial of
> service.
> 
> We now only allow the DRM master or users with the CAP_SYS_NICE
> capability to set higher priorities than PANTHOR_GROUP_PRIORITY_MEDIUM.
> 
> As the sole user of that uAPI lives in Mesa and hardcode a value of
> MEDIUM [1], this should be safe to do.
> 
> Additionally, as those checks are performed at the ioctl level,
> panthor_group_create now only check for priority level validity.
> 
> [1]https://gitlab.freedesktop.org/mesa/mesa/-/blob/f390835074bdf162a63deb0311d1a6de527f9f89/src/gallium/drivers/panfrost/pan_csf.c#L1038
> 
> Signed-off-by: Mary Guillemard 
> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> Cc: sta...@vger.kernel.org

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panthor/panthor_drv.c   | 23 +++
>  drivers/gpu/drm/panthor/panthor_sched.c |  2 +-
>  include/uapi/drm/panthor_drm.h  |  6 +-
>  3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index b5e7b919f241..34182f67136c 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -996,6 +997,24 @@ static int panthor_ioctl_group_destroy(struct drm_device 
> *ddev, void *data,
>   return panthor_group_destroy(pfile, args->group_handle);
>  }
>  
> +static int group_priority_permit(struct drm_file *file,
> +  u8 priority)
> +{
> + /* Ensure that priority is valid */
> + if (priority > PANTHOR_GROUP_PRIORITY_HIGH)
> + return -EINVAL;
> +
> + /* Medium priority and below are always allowed */
> + if (priority <= PANTHOR_GROUP_PRIORITY_MEDIUM)
> + return 0;
> +
> + /* Higher priorities require CAP_SYS_NICE or DRM_MASTER */
> + if (capable(CAP_SYS_NICE) || drm_is_current_master(file))
> + return 0;
> +
> + return -EACCES;
> +}
> +
>  static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
> struct drm_file *file)
>  {
> @@ -1011,6 +1030,10 @@ static int panthor_ioctl_group_create(struct 
> drm_device *ddev, void *data,
>   if (ret)
>   return ret;
>  
> + ret = group_priority_permit(file, args->priority);
> + if (ret)
> + return ret;
> +
>   ret = panthor_group_create(pfile, args, queue_args);
>   if (ret >= 0) {
>   args->group_handle = ret;
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
> b/drivers/gpu/drm/panthor/panthor_sched.c
> index c426a392b081..91a31b70c037 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -3092,7 +3092,7 @@ int panthor_group_create(struct panthor_file *pfile,
>   if (group_args->pad)
>   return -EINVAL;
>  
> - if (group_args->priority > PANTHOR_CSG_PRIORITY_HIGH)
> + if (group_args->priority >= PANTHOR_CSG_PRIORITY_COUNT)
>   return -EINVAL;
>  
>   if ((group_args->compute_core_mask & ~ptdev->gpu_info.shader_present) ||
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 926b1deb1116..e23a7f9b0eac 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -692,7 +692,11 @@ enum drm_panthor_group_priority {
>   /** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
>   PANTHOR_GROUP_PRIORITY_MEDIUM,
>  
> - /** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
> + /**
> +  * @PANTHOR_GROUP_PRIORITY_HIGH: High priority group.
> +  *
> +  * Requires CAP_SYS_NICE or DRM_MASTER.
> +  */
>   PANTHOR_GROUP_PRIORITY_HIGH,
>  };
>  
> 
> base-commit: a15710027afb40c7c1e352902fa5b8c949f021de



Re: [RFC PATCH 00/10] drm/panthor: Add user submission

2024-09-04 Thread Boris Brezillon
On Wed, 4 Sep 2024 10:31:36 +0100
Steven Price  wrote:

> On 04/09/2024 08:49, Christian König wrote:
> > Am 03.09.24 um 23:11 schrieb Simona Vetter:  
> >> On Tue, Sep 03, 2024 at 03:46:43PM +0200, Christian König wrote:  
> >>> Hi Steven,
> >>>
> >>> Am 29.08.24 um 15:37 schrieb Steven Price:  
>  Hi Christian,
> 
>  Mihail should be able to give more definitive answers, but I think I
>  can
>  answer your questions.
> 
>  On 29/08/2024 10:40, Christian König wrote:  
> > Am 28.08.24 um 19:25 schrieb Mihail Atanassov:  
> >> Hello all,
> >>
> >> This series implements a mechanism to expose Mali CSF GPUs' queue
> >> ringbuffers directly to userspace, along with paraphernalia to allow
> >> userspace to control job synchronisation between the CPU and GPU.
> >>
> >> The goal of these changes is to allow userspace to control work
> >> submission to the FW/HW directly without kernel intervention in the
> >> common case, thereby reducing context switching overhead. It also
> >> allows
> >> for greater flexibility in the way work is enqueued in the ringbufs.
> >> For example, the current kernel submit path only supports indirect
> >> calls, which is inefficient for small command buffers. Userspace can
> >> also skip unnecessary sync operations.  
> > Question is how do you guarantee forward progress for fence signaling?  
>  A timeout. Although looking at it I think it's probably set too high
>  currently:
>   
> > +#define JOB_TIMEOUT_MS    5000  
>  But basically the XGS queue is a DRM scheduler just like a normal GPU
>  queue and the jobs have a timeout. If the timeout is hit then any
>  fences
>  will be signalled (with an error).  
> >>> Mhm, that is unfortunately exactly what I feared.
> >>>  
> > E.g. when are fences created and published? How do they signal?
> >
> > How are dependencies handled? How can the kernel suspend an userspace
> > queue?  
>  The actual userspace queue can be suspended. This is actually a
>  combination of firmware and kernel driver, and this functionality is
>  already present without the user submission. The firmware will
>  multiplex
>  multiple 'groups' onto the hardware, and if there are too many for the
>  firmware then the kernel multiplexes the extra groups onto the ones the
>  firmware supports.  
> >>> How do you guarantee forward progress and that resuming of suspended
> >>> queues
> >>> doesn't end up in a circle dependency?  
> 
> I'm not entirely sure what you mean by "guarantee" here - the kernel by
> itself only guarantees forward progress by the means of timeouts. User
> space can 'easily' shoot itself in the foot by using a XGS queue to
> block waiting on a GPU event which will never happen.
> 
> However dependencies between applications (and/or other device drivers)
> will only occur via dma fences and an unsignalled fence will only be
> returned when there is a path forward to signal it. So it shouldn't be
> possible to create a dependency loop between contexts (or command stream
> groups to use the Mali jargon).
> 
> Because the groups can't have dependency cycles it should be possible to
> suspend/resume them without deadlocks.
> 
>  I haven't studied Mihail's series in detail yet, but if I understand
>  correctly, the XGS queues are handled separately and are not suspended
>  when the hardware queues are suspended. I guess this might be an area
>  for improvement and might explain the currently very high timeout (to
>  deal with the case where the actual GPU work has been suspended).
>   
> > How does memory management work in this case?  
>  I'm not entirely sure what you mean here. If you are referring to the
>  potential memory issues with signalling path then this should be
>  handled
>  by the timeout - although I haven't studied the code to check for
>  bugs here.  
> >>> You might have misunderstood my question (and I might misunderstand the
> >>> code), but on first glance it strongly sounds like the current
> >>> approach will
> >>> be NAKed.
> >>>  
>  The actual new XGS queues don't allocate/free memory during the queue
>  execution - so it's just the memory usage related to fences (and the
>  other work which could be blocked on the fence).  
> >>> But the kernel and the hardware could suspend the queues, right?
> >>>  
>  In terms of memory management for the GPU work itself, this is handled
>  the same as before. The VM_BIND mechanism allows dependencies to be
>  created between syncobjs and VM operations, with XGS these can then be
>  tied to GPU HW events.  
> >>> I don't know the details, but that again strongly sounds like that won't
> >>> work.
> >>>
> >>> What you need is to somehow guarantee that work doesn't run into memory
> >>> management deadlocks which are resolved by timeouts.
> 

Re: [RFC PATCH 00/10] drm/panthor: Add user submission

2024-09-04 Thread Boris Brezillon
+ Adrian, who has been looking at the shrinker stuff for Panthor

On Wed, 4 Sep 2024 13:46:12 +0100
Steven Price  wrote:

> On 04/09/2024 12:34, Christian König wrote:
> > Hi Boris,
> > 
> > Am 04.09.24 um 13:23 schrieb Boris Brezillon:  
> >>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> >>>>>> 
> >>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which 
> >>>>> means
> >>>>> memory management easy mode.
> >>>> Ok, that at least makes things work for the moment.
> >>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>
> >>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>> to ensure that memory is preallocated when the memory operations are
> >>> enqueued. So any memory required should have been allocated before any
> >>> sync object is returned. We're aware of the issue with memory
> >>> allocations on the signalling path and trying to ensure that we don't
> >>> have that.
> >>>
> >>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>> memory with our design.  
> >> That's actually what we were planning to do: the panthor shrinker was
> >> about to rely on fences attached to GEM objects to know if it can
> >> reclaim the memory. This design relies on each job attaching its fence
> >> to the GEM mapped to the VM at the time the job is submitted, such that
> >> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >> is done.  
> 
> How progressed is this shrinker?

We don't have code yet. All we know is that we want to re-use Dmitry's
generic GEM-SHMEM shrinker implementation [1], and adjust it to match
the VM model, which means not tracking things at the BO granularity,
but at the VM granularity. Actually it has to be a hybrid model, where
shared BOs (those imported/exported) are tracked individually, while
all private BOs are checked simultaneously (since they all share the VM
resv object).

> It would be good to have an RFC so that
> we can look to see how user submission could fit in with it.

Unfortunately, we don't have that yet :-(. All we have is a rough idea
of how things will work, which is basically how TTM reclaim works, but
adapted to GEM.

> 
> > Yeah and exactly that doesn't work any more when you are using user
> > queues, because the kernel has no opportunity to attach a fence for each
> > submission.  
> 
> User submission requires a cooperating user space[1]. So obviously user
> space would need to ensure any BOs that it expects will be accessed to
> be in some way pinned. Since the expectation of user space submission is
> that we're reducing kernel involvement, I'd also expect these to be
> fairly long-term pins.
> 
> [1] Obviously with a timer to kill things from a malicious user space.
> 
> The (closed) 'kbase' driver has a shrinker but is only used on a subset
> of memory and it's up to user space to ensure that it keeps the relevant
> parts pinned (or more specifically not marking them to be discarded if
> there's memory pressure). Not that I think we should be taking its
> model as a reference here.
> 
> >>> Memory which user space thinks the GPU might
> >>> need should be pinned before the GPU work is submitted. APIs which
> >>> require any form of 'paging in' of data would need to be implemented by
> >>> the GPU work completing and being resubmitted by user space after the
> >>> memory changes (i.e. there could be a DMA fence pending on the GPU work). 
> >>>  
> >> Hard pinning memory could work (ioctl() around gem_pin/unpin()), but
> >> that means we can't really transparently swap out GPU memory, or we
> >> have to constantly pin/unpin around each job, which means even more
> >> ioctl()s than we have now. Another option would be to add the XGS fence
> >> to the BOs attached to the VM, assuming it's created before the job
> >> submission itself, but you're no longer reducing the number of user <->
> >> kernel round trips if you do that, because you now have to create an
> >> XSG job for each submission, so you basically get back to one ioctl()
> >> per submission.  
> 
> As you say the granularity of pinning has to be fairly coarse for user
> space su

Re: [RFC PATCH 00/10] drm/panthor: Add user submission

2024-09-04 Thread Boris Brezillon
On Wed, 4 Sep 2024 14:35:12 +0100
Steven Price  wrote:

> On 04/09/2024 14:20, Boris Brezillon wrote:
> > + Adrian, who has been looking at the shrinker stuff for Panthor
> > 
> > On Wed, 4 Sep 2024 13:46:12 +0100
> > Steven Price  wrote:
> >   
> >> On 04/09/2024 12:34, Christian König wrote:  
> >>> Hi Boris,
> >>>
> >>> Am 04.09.24 um 13:23 schrieb Boris Brezillon:
> >>>>>>>> Please read up here on why that stuff isn't allowed:
> >>>>>>>> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html#indefinite-dma-fences
> >>>>>>>>   
> >>>>>>> panthor doesn't yet have a shrinker, so all memory is pinned, which 
> >>>>>>> means
> >>>>>>> memory management easy mode.  
> >>>>>> Ok, that at least makes things work for the moment.  
> >>>>> Ah, perhaps this should have been spelt out more clearly ;)
> >>>>>
> >>>>> The VM_BIND mechanism that's already in place jumps through some hoops
> >>>>> to ensure that memory is preallocated when the memory operations are
> >>>>> enqueued. So any memory required should have been allocated before any
> >>>>> sync object is returned. We're aware of the issue with memory
> >>>>> allocations on the signalling path and trying to ensure that we don't
> >>>>> have that.
> >>>>>
> >>>>> I'm hoping that we don't need a shrinker which deals with (active) GPU
> >>>>> memory with our design.
> >>>> That's actually what we were planning to do: the panthor shrinker was
> >>>> about to rely on fences attached to GEM objects to know if it can
> >>>> reclaim the memory. This design relies on each job attaching its fence
> >>>> to the GEM mapped to the VM at the time the job is submitted, such that
> >>>> memory that's in-use or about-to-be-used doesn't vanish before the GPU
> >>>> is done.
> >>
> >> How progressed is this shrinker?  
> > 
> > We don't have code yet. All we know is that we want to re-use Dmitry's
> > generic GEM-SHMEM shrinker implementation [1], and adjust it to match
> > the VM model, which means not tracking things at the BO granularity,
> > but at the VM granularity. Actually it has to be a hybrid model, where
> > shared BOs (those imported/exported) are tracked individually, while
> > all private BOs are checked simultaneously (since they all share the VM
> > resv object).
> >   
> >> It would be good to have an RFC so that
> >> we can look to see how user submission could fit in with it.  
> > 
> > Unfortunately, we don't have that yet :-(. All we have is a rough idea
> > of how things will work, which is basically how TTM reclaim works, but
> > adapted to GEM.  
> 
> Fair enough, thanks for the description. We might need to coordinate to
> get this looked at sooner if it's going to be blocking user submission.
> 
> >>  
> >>> Yeah and exactly that doesn't work any more when you are using user
> >>> queues, because the kernel has no opportunity to attach a fence for each
> >>> submission.
> >>
> >> User submission requires a cooperating user space[1]. So obviously user
> >> space would need to ensure any BOs that it expects will be accessed to
> >> be in some way pinned. Since the expectation of user space submission is
> >> that we're reducing kernel involvement, I'd also expect these to be
> >> fairly long-term pins.
> >>
> >> [1] Obviously with a timer to kill things from a malicious user space.
> >>
> >> The (closed) 'kbase' driver has a shrinker but is only used on a subset
> >> of memory and it's up to user space to ensure that it keeps the relevant
> >> parts pinned (or more specifically not marking them to be discarded if
> >> there's memory pressure). Not that I think we should be taking its
> >> model as a reference here.
> >>  
> >>>>> Memory which user space thinks the GPU might
> >>>>> need should be pinned before the GPU work is submitted. APIs which
> >>>>> require any form of 'paging in' of data would need to be implemented by
> >>>>> the GPU work completing and being resubmitted by user space afte

[PATCH] drm/panthor: Don't add write fences to the shared BOs

2024-09-05 Thread Boris Brezillon
The only user (the mesa gallium driver) is already assuming explicit
synchronization and doing the export/import dance on shared BOs. The
only reason we were registering ourselves as writers on external BOs
is because Xe, which was the reference back when we developed Panthor,
was doing so. Turns out Xe was wrong, and we really want BOOKKEEP usage
on all registered fences, so userspace can explicitly upgrade those to
read/write when needed.

Fixes: 4bdca1150792 ("drm/panthor: Add the driver frontend block")
Cc: Matthew Brost 
Cc: Simona Vetter 
Cc: 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_sched.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
b/drivers/gpu/drm/panthor/panthor_sched.c
index 9a0ff48f7061..41260cf4beb8 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -3423,13 +3423,8 @@ void panthor_job_update_resvs(struct drm_exec *exec, 
struct drm_sched_job *sched
 {
struct panthor_job *job = container_of(sched_job, struct panthor_job, 
base);
 
-   /* Still not sure why we want USAGE_WRITE for external objects, since I
-* was assuming this would be handled through explicit syncs being 
imported
-* to external BOs with DMA_BUF_IOCTL_IMPORT_SYNC_FILE, but other 
drivers
-* seem to pass DMA_RESV_USAGE_WRITE, so there must be a good reason.
-*/
panthor_vm_update_resvs(job->group->vm, exec, 
&sched_job->s_fence->finished,
-   DMA_RESV_USAGE_BOOKKEEP, DMA_RESV_USAGE_WRITE);
+   DMA_RESV_USAGE_BOOKKEEP, 
DMA_RESV_USAGE_BOOKKEEP);
 }
 
 void panthor_sched_unplug(struct panthor_device *ptdev)
-- 
2.46.0



[PATCH] drm/panthor: Don't declare a queue blocked if deferred operations are pending

2024-09-05 Thread Boris Brezillon
If deferred operations are pending, we want to wait for those to
land before declaring the queue blocked on a SYNC_WAIT. We need
this to deal with the case where the sync object is signalled through
a deferred SYNC_{ADD,SET} from the same queue. If we don't do that
and the group gets scheduled out before the deferred SYNC_{SET,ADD}
is executed, we'll end up with a timeout, because no external
SYNC_{SET,ADD} will make the scheduler reconsider the group for
execution.

Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Cc: 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_sched.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
b/drivers/gpu/drm/panthor/panthor_sched.c
index 41260cf4beb8..201d5e7a921e 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -1103,7 +1103,13 @@ cs_slot_sync_queue_state_locked(struct panthor_device 
*ptdev, u32 csg_id, u32 cs
list_move_tail(&group->wait_node,
   
&group->ptdev->scheduler->groups.waiting);
}
-   group->blocked_queues |= BIT(cs_id);
+
+   /* The queue is only blocked if there's no deferred operation
+* pending, which can be checked through the scoreboard status.
+*/
+   if (!cs_iface->output->status_scoreboards)
+   group->blocked_queues |= BIT(cs_id);
+
queue->syncwait.gpu_va = cs_iface->output->status_wait_sync_ptr;
queue->syncwait.ref = cs_iface->output->status_wait_sync_value;
status_wait_cond = cs_iface->output->status_wait & 
CS_STATUS_WAIT_SYNC_COND_MASK;
-- 
2.46.0



Re: [PATCH] drm/panthor: Restrict high priorities on group_create

2024-09-05 Thread Boris Brezillon
On Tue,  3 Sep 2024 16:49:55 +0200
Mary Guillemard  wrote:

> We were allowing any users to create a high priority group without any
> permission checks. As a result, this was allowing possible denial of
> service.
> 
> We now only allow the DRM master or users with the CAP_SYS_NICE
> capability to set higher priorities than PANTHOR_GROUP_PRIORITY_MEDIUM.
> 
> As the sole user of that uAPI lives in Mesa and hardcodes a value of
> MEDIUM [1], this should be safe to do.
> 
> Additionally, as those checks are performed at the ioctl level,
> panthor_group_create now only checks for priority level validity.
> 
> [1]https://gitlab.freedesktop.org/mesa/mesa/-/blob/f390835074bdf162a63deb0311d1a6de527f9f89/src/gallium/drivers/panfrost/pan_csf.c#L1038
> 
> Signed-off-by: Mary Guillemard 
> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
> Cc: sta...@vger.kernel.org

Queued to drm-misc-fixes.

Thanks!

> ---
>  drivers/gpu/drm/panthor/panthor_drv.c   | 23 +++
>  drivers/gpu/drm/panthor/panthor_sched.c |  2 +-
>  include/uapi/drm/panthor_drm.h  |  6 +-
>  3 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index b5e7b919f241..34182f67136c 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -996,6 +997,24 @@ static int panthor_ioctl_group_destroy(struct drm_device 
> *ddev, void *data,
>   return panthor_group_destroy(pfile, args->group_handle);
>  }
>  
> +static int group_priority_permit(struct drm_file *file,
> +  u8 priority)
> +{
> + /* Ensure that priority is valid */
> + if (priority > PANTHOR_GROUP_PRIORITY_HIGH)
> + return -EINVAL;
> +
> + /* Medium priority and below are always allowed */
> + if (priority <= PANTHOR_GROUP_PRIORITY_MEDIUM)
> + return 0;
> +
> + /* Higher priorities require CAP_SYS_NICE or DRM_MASTER */
> + if (capable(CAP_SYS_NICE) || drm_is_current_master(file))
> + return 0;
> +
> + return -EACCES;
> +}
> +
>  static int panthor_ioctl_group_create(struct drm_device *ddev, void *data,
> struct drm_file *file)
>  {
> @@ -1011,6 +1030,10 @@ static int panthor_ioctl_group_create(struct 
> drm_device *ddev, void *data,
>   if (ret)
>   return ret;
>  
> + ret = group_priority_permit(file, args->priority);
> + if (ret)
> + return ret;
> +
>   ret = panthor_group_create(pfile, args, queue_args);
>   if (ret >= 0) {
>   args->group_handle = ret;
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
> b/drivers/gpu/drm/panthor/panthor_sched.c
> index c426a392b081..91a31b70c037 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -3092,7 +3092,7 @@ int panthor_group_create(struct panthor_file *pfile,
>   if (group_args->pad)
>   return -EINVAL;
>  
> - if (group_args->priority > PANTHOR_CSG_PRIORITY_HIGH)
> + if (group_args->priority >= PANTHOR_CSG_PRIORITY_COUNT)
>   return -EINVAL;
>  
>   if ((group_args->compute_core_mask & ~ptdev->gpu_info.shader_present) ||
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 926b1deb1116..e23a7f9b0eac 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -692,7 +692,11 @@ enum drm_panthor_group_priority {
>   /** @PANTHOR_GROUP_PRIORITY_MEDIUM: Medium priority group. */
>   PANTHOR_GROUP_PRIORITY_MEDIUM,
>  
> - /** @PANTHOR_GROUP_PRIORITY_HIGH: High priority group. */
> + /**
> +  * @PANTHOR_GROUP_PRIORITY_HIGH: High priority group.
> +  *
> +  * Requires CAP_SYS_NICE or DRM_MASTER.
> +  */
>   PANTHOR_GROUP_PRIORITY_HIGH,
>  };
>  
> 
> base-commit: a15710027afb40c7c1e352902fa5b8c949f021de



Re: [PATCH v4] drm/panthor: Add DEV_QUERY_TIMESTAMP_INFO dev query

2024-09-05 Thread Boris Brezillon
On Wed, 4 Sep 2024 12:19:11 +0200
Boris Brezillon  wrote:

> On Fri, 30 Aug 2024 10:03:50 +0200
> Mary Guillemard  wrote:
> 
> > Expose timestamp information supported by the GPU with a new device
> > query.
> > 
> > Mali uses an external timer as GPU system time. On ARM, this is wired to
> > the generic arch timer so we wire cntfrq_el0 as device frequency.
> > 
> > This new uAPI will be used in Mesa to implement timestamp queries and
> > VK_KHR_calibrated_timestamps.
> > 
> > Since this extends the uAPI and because userland needs a way to advertise
> > those features conditionally, this also bumps the driver minor version.
> > 
> > v2:
> > - Rewrote to use GPU timestamp register
> > - Added timestamp_offset to drm_panthor_timestamp_info
> > - Add missing include for arch_timer_get_cntfrq
> > - Rework commit message
> > 
> > v3:
> > - Add panthor_gpu_read_64bit_counter
> > - Change panthor_gpu_read_timestamp to use
> >   panthor_gpu_read_64bit_counter
> > 
> > v4:
> > - Fix multiple typos in uAPI documentation
> > - Mention behavior when the timestamp frequency is unknown
> > - Use u64 instead of unsigned long long
> >   for panthor_gpu_read_timestamp
> > - Apply r-b from Mihail
> > 
> > Signed-off-by: Mary Guillemard 
> > Reviewed-by: Mihail Atanassov   
> 
> Reviewed-by: Boris Brezillon 

Queued to drm-misc-next.


Re: [PATCH v2] drm/panthor: flush FW AS caches in slow reset path

2024-09-05 Thread Boris Brezillon
On Mon, 2 Sep 2024 16:11:51 +0100
Steven Price  wrote:

> On 02/09/2024 14:02, Adrián Larumbe wrote:
> > In the off-chance that waiting for the firmware to signal its booted status
> > timed out in the fast reset path, one must flush the cache lines for the
> > entire FW VM address space before reloading the regions, otherwise stale
> > values eventually lead to a scheduler job timeout.
> > 
> > Fixes: 647810ec2476 ("drm/panthor: Add the MMU/VM logical block")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Adrián Larumbe 
> > Acked-by: Liviu Dudau   
> 
> Reviewed-by: Steven Price 

Pushed to drm-misc-fixes.


Re: [PATCH] drm/panthor: Display FW version information

2024-09-05 Thread Boris Brezillon
On Thu,  5 Sep 2024 16:51:44 +0100
Steven Price  wrote:

> The firmware binary has a git SHA embedded into it which can be used to
> identify which firmware binary is being loaded. Output this as a
> drm_info() so that it's obvious from a dmesg log which firmware binary
> is being used.
> 
> Signed-off-by: Steven Price 

Just one formatting issue mentioned below, looks good otherwise.

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panthor/panthor_fw.c | 55 
>  1 file changed, 55 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_fw.c 
> b/drivers/gpu/drm/panthor/panthor_fw.c
> index 857f3f11258a..ef007287575c 100644
> --- a/drivers/gpu/drm/panthor/panthor_fw.c
> +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> @@ -78,6 +78,12 @@ enum panthor_fw_binary_entry_type {
>  
>   /** @CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA: Timeline metadata 
> interface. */
>   CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA = 4,
> +
> + /**
> +  * @CSF_FW_BINARY_ENTRY_TYPE_BUILD_INFO_METADATA: Metadata about how
> +  * the FW binary was built.
> +  */
> + CSF_FW_BINARY_ENTRY_TYPE_BUILD_INFO_METADATA = 6
>  };
>  
>  #define CSF_FW_BINARY_ENTRY_TYPE(ehdr)   
> ((ehdr) & 0xff)
> @@ -132,6 +138,13 @@ struct panthor_fw_binary_section_entry_hdr {
>   } data;
>  };
>  
> +struct panthor_fw_build_info_hdr {
> + /** @meta_start: Offset of the build info data in the FW binary */
> + u32 meta_start;
> + /** @meta_size: Size of the build info data in the FW binary */
> + u32 meta_size;
> +};
> +
>  /**
>   * struct panthor_fw_binary_iter - Firmware binary iterator
>   *
> @@ -628,6 +641,46 @@ static int panthor_fw_load_section_entry(struct 
> panthor_device *ptdev,
>   return 0;
>  }
>  
> +static int panthor_fw_read_build_info(struct panthor_device *ptdev,
> +   const struct firmware *fw,
> +   struct panthor_fw_binary_iter *iter,
> +   u32 ehdr)
> +{
> + struct panthor_fw_build_info_hdr hdr;
> + char header[9];
> + const char git_sha_header[sizeof(header)] = "git_sha: ";
> + int ret;
> +
> + ret = panthor_fw_binary_iter_read(ptdev, iter, &hdr, sizeof(hdr));
> + if (ret)
> + return ret;
> +
> + if (hdr.meta_start > fw->size ||
> + hdr.meta_start + hdr.meta_size > fw->size) {
> + drm_err(&ptdev->base, "Firmware build info corrupt\n");
> + /* We don't need the build info, so continue */
> + return 0;
> + }
> +
> + if (memcmp(git_sha_header, fw->data + hdr.meta_start,
> + sizeof(git_sha_header))) {

Indentation seems broken here:

if (memcmp(git_sha_header, fw->data + hdr.meta_start,
   sizeof(git_sha_header))) {

> + /* Not the expected header, this isn't metadata we understand */
> + return 0;
> + }
> +
> + /* Check that the git SHA is NULL terminated as expected */
> + if (fw->data[hdr.meta_start + hdr.meta_size - 1] != '\0') {
> + drm_warn(&ptdev->base, "Firmware's git sha is not NULL 
> terminated\n");
> + /* Don't treat as fatal */
> + return 0;
> + }
> +
> + drm_info(&ptdev->base, "Firmware git sha: %s\n",
> +  fw->data + hdr.meta_start + sizeof(git_sha_header));

Maybe we should also change the "FW vX.Y.Z" message into "FW interface
vX.Y.Z" to clarify things.

> +
> + return 0;
> +}
> +
>  static void
>  panthor_reload_fw_sections(struct panthor_device *ptdev, bool full_reload)
>  {
> @@ -672,6 +725,8 @@ static int panthor_fw_load_entry(struct panthor_device 
> *ptdev,
>   switch (CSF_FW_BINARY_ENTRY_TYPE(ehdr)) {
>   case CSF_FW_BINARY_ENTRY_TYPE_IFACE:
>   return panthor_fw_load_section_entry(ptdev, fw, &eiter, ehdr);
> + case CSF_FW_BINARY_ENTRY_TYPE_BUILD_INFO_METADATA:
> + return panthor_fw_read_build_info(ptdev, fw, &eiter, ehdr);
>  
>   /* FIXME: handle those entry types? */
>   case CSF_FW_BINARY_ENTRY_TYPE_CONFIG:



Re: [PATCH v2 2/2] drm/panthor: Add DEV_QUERY_GROUP_PRIORITIES_INFO dev query

2024-09-05 Thread Boris Brezillon
On Thu,  5 Sep 2024 19:32:23 +0200
Mary Guillemard  wrote:

> Expose allowed group priorities with a new device query.
> 
> This new uAPI will be used in Mesa to properly report what priorities a
> user can use for EGL_IMG_context_priority.
> 
> Since this extends the uAPI and because userland needs a way to
> advertise priorities accordingly, this also bumps the driver minor
> version.
> 
> v2:
> - Remove drm_panthor_group_allow_priority_flags definition
> - Document that allowed_mask is a bitmask of drm_panthor_group_priority
> 
> Signed-off-by: Mary Guillemard 
> ---
>  drivers/gpu/drm/panthor/panthor_drv.c | 61 ++-
>  include/uapi/drm/panthor_drm.h| 22 ++
>  2 files changed, 64 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index 7b1db2adcb4c..f85aa2d99f09 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -170,6 +170,7 @@ panthor_get_uobj_array(const struct drm_panthor_obj_array 
> *in, u32 min_stride,
>PANTHOR_UOBJ_DECL(struct drm_panthor_gpu_info, tiler_present), 
> \
>PANTHOR_UOBJ_DECL(struct drm_panthor_csif_info, pad), \
>PANTHOR_UOBJ_DECL(struct drm_panthor_timestamp_info, 
> current_timestamp), \
> +  PANTHOR_UOBJ_DECL(struct drm_panthor_group_priorities_info, 
> pad), \
>PANTHOR_UOBJ_DECL(struct drm_panthor_sync_op, timeline_value), 
> \
>PANTHOR_UOBJ_DECL(struct drm_panthor_queue_submit, syncs), \
>PANTHOR_UOBJ_DECL(struct drm_panthor_queue_create, 
> ringbuf_size), \
> @@ -777,11 +778,41 @@ static int panthor_query_timestamp_info(struct 
> panthor_device *ptdev,
>   return 0;
>  }
>  
> +static int group_priority_permit(struct drm_file *file,
> +  u8 priority)
> +{
> + /* Ensure that priority is valid */
> + if (priority > PANTHOR_GROUP_PRIORITY_REALTIME)
> + return -EINVAL;
> +
> + /* Medium priority and below are always allowed */
> + if (priority <= PANTHOR_GROUP_PRIORITY_MEDIUM)
> + return 0;
> +
> + /* Higher priorities require CAP_SYS_NICE or DRM_MASTER */
> + if (capable(CAP_SYS_NICE) || drm_is_current_master(file))
> + return 0;
> +
> + return -EACCES;
> +}
> +
> +static void panthor_query_group_priorities_info(struct drm_file *file,
> + struct 
> drm_panthor_group_priorities_info *arg)
> +{
> + int prio;
> +
> + for (prio = PANTHOR_GROUP_PRIORITY_REALTIME; prio >= 0; prio--) {
> +     if (!group_priority_permit(file, prio))
> + arg->allowed_mask |= 1 << prio;

nit: we have a BIT() macro for that ;-). Other than that, it looks good
to me.

Reviewed-by: Boris Brezillon 


Re: [PATCH v2 1/2] drm/panthor: Add PANTHOR_GROUP_PRIORITY_REALTIME group priority

2024-09-05 Thread Boris Brezillon
On Thu,  5 Sep 2024 19:32:22 +0200
Mary Guillemard  wrote:

> This adds a new value to drm_panthor_group_priority exposing the
> realtime priority to userspace.
> 
> This is required to implement NV_context_priority_realtime in Mesa.
> 
> v2:
> - Add Steven Price r-b
> 
> Signed-off-by: Mary Guillemard 
> Reviewed-by: Steven Price 

Reviewed-by: Boris Brezillon 

> ---
>  drivers/gpu/drm/panthor/panthor_drv.c   | 2 +-
>  drivers/gpu/drm/panthor/panthor_sched.c | 2 --
>  include/uapi/drm/panthor_drm.h  | 7 +++
>  3 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
> b/drivers/gpu/drm/panthor/panthor_drv.c
> index 0caf9e9a8c45..7b1db2adcb4c 100644
> --- a/drivers/gpu/drm/panthor/panthor_drv.c
> +++ b/drivers/gpu/drm/panthor/panthor_drv.c
> @@ -1041,7 +1041,7 @@ static int group_priority_permit(struct drm_file *file,
>u8 priority)
>  {
>   /* Ensure that priority is valid */
> - if (priority > PANTHOR_GROUP_PRIORITY_HIGH)
> + if (priority > PANTHOR_GROUP_PRIORITY_REALTIME)
>   return -EINVAL;
>  
>   /* Medium priority and below are always allowed */
> diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
> b/drivers/gpu/drm/panthor/panthor_sched.c
> index 91a31b70c037..86908ada7335 100644
> --- a/drivers/gpu/drm/panthor/panthor_sched.c
> +++ b/drivers/gpu/drm/panthor/panthor_sched.c
> @@ -137,8 +137,6 @@ enum panthor_csg_priority {
>* non-real-time groups. When such a group becomes executable,
>* it will evict the group with the lowest non-rt priority if
>* there's no free group slot available.
> -  *
> -  * Currently not exposed to userspace.
>*/
>   PANTHOR_CSG_PRIORITY_RT,
>  
> diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
> index 1fd8473548ac..011a555e4674 100644
> --- a/include/uapi/drm/panthor_drm.h
> +++ b/include/uapi/drm/panthor_drm.h
> @@ -720,6 +720,13 @@ enum drm_panthor_group_priority {
>* Requires CAP_SYS_NICE or DRM_MASTER.
>*/
>   PANTHOR_GROUP_PRIORITY_HIGH,
> +
> + /**
> +  * @PANTHOR_GROUP_PRIORITY_REALTIME: Realtime priority group.
> +  *
> +  * Requires CAP_SYS_NICE or DRM_MASTER.
> +  */
> + PANTHOR_GROUP_PRIORITY_REALTIME,
>  };
>  
>  /**



Re: [PATCH v2] drm/panthor: Display FW version information

2024-09-09 Thread Boris Brezillon
Hi Thomas,

On Mon, 9 Sep 2024 16:14:32 +0200
Thomas Zimmermann  wrote:

> Hi
> 
> Am 06.09.24 um 11:40 schrieb Steven Price:
> > The version number output when loading the firmware is actually the
> > interface version not the version of the firmware itself. Update the
> > message to make this clearer.
> >
> > However, the firmware binary has a git SHA embedded into it which can be
> > used to identify which firmware binary is being loaded. So output this
> > as a drm_info() so that it's obvious from a dmesg log which firmware
> > binary is being used.
> >
> > Reviewed-by: Boris Brezillon 
> > Reviewed-by: Liviu Dudau 
> > Signed-off-by: Steven Price 
> > ---
> > v2:
> >   * Fix indentation
> >   * Also update the FW interface message to include "using interface" to
> > make it clear it's not the FW version
> >   * Add Reviewed-bys
> >
> >   drivers/gpu/drm/panthor/panthor_fw.c | 57 +++-
> >   1 file changed, 56 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/panthor/panthor_fw.c 
> > b/drivers/gpu/drm/panthor/panthor_fw.c
> > index 857f3f11258a..aea5dd9a4969 100644
> > --- a/drivers/gpu/drm/panthor/panthor_fw.c
> > +++ b/drivers/gpu/drm/panthor/panthor_fw.c
> > @@ -78,6 +78,12 @@ enum panthor_fw_binary_entry_type {
> >   
> > /** @CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA: Timeline metadata 
> > interface. */
> > CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA = 4,
> > +
> > +   /**
> > +* @CSF_FW_BINARY_ENTRY_TYPE_BUILD_INFO_METADATA: Metadata about how
> > +* the FW binary was built.
> > +*/
> > +   CSF_FW_BINARY_ENTRY_TYPE_BUILD_INFO_METADATA = 6
> >   };
> >   
> >   #define CSF_FW_BINARY_ENTRY_TYPE(ehdr)
> > ((ehdr) & 0xff)
> > @@ -132,6 +138,13 @@ struct panthor_fw_binary_section_entry_hdr {
> > } data;
> >   };
> >   
> > +struct panthor_fw_build_info_hdr {
> > +   /** @meta_start: Offset of the build info data in the FW binary */
> > +   u32 meta_start;
> > +   /** @meta_size: Size of the build info data in the FW binary */
> > +   u32 meta_size;
> > +};
> > +
> >   /**
> >* struct panthor_fw_binary_iter - Firmware binary iterator
> >*
> > @@ -628,6 +641,46 @@ static int panthor_fw_load_section_entry(struct 
> > panthor_device *ptdev,
> > return 0;
> >   }
> >   
> > +static int panthor_fw_read_build_info(struct panthor_device *ptdev,
> > + const struct firmware *fw,
> > + struct panthor_fw_binary_iter *iter,
> > + u32 ehdr)
> > +{
> > +   struct panthor_fw_build_info_hdr hdr;
> > +   char header[9];
> > +   const char git_sha_header[sizeof(header)] = "git_sha: ";
> > +   int ret;
> > +
> > +   ret = panthor_fw_binary_iter_read(ptdev, iter, &hdr, sizeof(hdr));
> > +   if (ret)
> > +   return ret;
> > +
> > +   if (hdr.meta_start > fw->size ||
> > +   hdr.meta_start + hdr.meta_size > fw->size) {
> > +   drm_err(&ptdev->base, "Firmware build info corrupt\n");
> > +   /* We don't need the build info, so continue */
> > +   return 0;
> > +   }
> > +
> > +   if (memcmp(git_sha_header, fw->data + hdr.meta_start,
> > +  sizeof(git_sha_header))) {
> > +   /* Not the expected header, this isn't metadata we understand */
> > +   return 0;
> > +   }
> > +
> > +   /* Check that the git SHA is NULL terminated as expected */
> > +   if (fw->data[hdr.meta_start + hdr.meta_size - 1] != '\0') {
> > +   drm_warn(&ptdev->base, "Firmware's git sha is not NULL 
> > terminated\n");
> > +   /* Don't treat as fatal */
> > +   return 0;
> > +   }
> > +
> > +   drm_info(&ptdev->base, "Firmware git sha: %s\n",
> > +fw->data + hdr.meta_start + sizeof(git_sha_header));  
> 
> Please consider making this debugging-only information. Printing takes 
> time and the information is not essential unless for debugging.

Sounds like someone working on boot time optimization :-). More
seriously, I don't mind downgrading those to debug messages, as long as
we have the same information exposed through sysfs or DEV_QUERY, but
I'd prefer doing that in a follow-up patch that takes care of all
drm_info()s we have in panthor rather than addressing the two messages
we're modifying in this patch.

Regards,

Boris


Re: [RFC PATCH] drm/pancsf: Add a new driver for Mali CSF-based GPUs

2023-02-03 Thread Boris Brezillon
Hi Steven,

On Fri, 3 Feb 2023 15:41:38 +
Steven Price  wrote:

> Hi Boris,
> 
> Thanks for the post - it's great to see the progress!

Thanks for chiming in!

> 
> On 01/02/2023 08:48, Boris Brezillon wrote:
> > Mali v10 (second Valhall iteration) and later GPUs replaced the Job
> > Manager block by a command stream based interface called CSF (for
> > Command Stream Frontend). This interface is not only turning the job
> > chain based submission model into a command stream based one, but also
> > introducing FW-assisted scheduling of command stream queues. This is a
> > fundamental shift both in how userspace is supposed to submit jobs and
> > in how the driver is architected. We initially tried to retrofit the
> > CSF model into panfrost, but this ended up introducing unneeded
> > complexity to the existing driver, which we all know is a potential
> > source of regression.  
> 
> While I agree there's some big differences which effectively mandate
> splitting the driver I do think there are some parts which make a lot of
> sense to share.
> 
> For example pancsf_regs.h and panfrost_regs.h are really quite similar
> and I think could easily be combined.

For registers, I'm not so sure. I mean, yes, most of them are identical,
but some disappeared, while others were kept but with a different
layout (see GPU_CMD), or bits re-purposed for different meaning
(MMU_INT_STAT where BUS_FAULT just became OPERATION_COMPLETE). This
makes the whole thing very confusing, so I'd rather keep those definitions
separate for my own sanity (spent a bit of time trying to understand
why my GPU command was doing nothing or why I was receiving BUS_FAULT
interrupts, before realizing I was referring to the old layout) :-).

> The clock/regulator code is pretty
> much a direct copy/paste (just adding support for more clocks), etc.

Clock and regulators, maybe, but there's not much to be shared here. I
mean, Linux already provides quite a few helpers making the
clk/regulator retrieval/enabling/disabling pretty straightforward. Not
to mention that, by keeping them separate, we don't really need to deal
with old Mali HW quirks, and we can focus on new HW bugs instead :-).

> 
> What would be ideal is factoring out 'generic' parts from panfrost and
> then being able to use them from pancsf.

I've been refactoring this pancsf driver a few times already, and I must
admit I'd prefer to keep things separate, at least for the initial
submission. If we see things that should be shared, then we can do that
in a follow-up series, but I think it's a bit premature to do it now.

> 
> I had a go at starting that:
> 
> https://gitlab.arm.com/linux-arm/linux-sp/-/tree/pancsf-refactor
> 
> (lightly tested for Panfrost, only build tested for pancsf).

Thanks, I'll have a look.

> 
> That saves around 200 lines overall and avoids needing to maintain two
> lots of clock/regulator code. There's definite scope for sharing (most)
> register definitions between the drivers and quite possibly some of the
> MMU/memory code (although there's diminishing returns there).

Yeah, actually the MMU code is likely to diverge even more if we want
to support VM_BIND (which requires pre-allocating pages for the page-table
updates, so the map/unmap operations can't fail in the run_job path),
so I'm not sure it's a good idea to share that bit, at least not until
we have a clearer idea of how we want things done.

> 
> > So here comes a brand new driver for CSF-based hardware. This is a
> > preliminary version and some important features are missing (like devfreq,
> > PM support and a memory shrinker implementation, to name a few). The goal of
> > this RFC is to gather some preliminary feedback on both the uAPI and some
> > basic building blocks, like the MMU/VM code, the tiler heap allocation
> > logic...  
> 
> At the moment I don't have any CSF hardware available, so this review is
> a pure code review.

That's still very useful.

> I'll try to organise some hardware and do some
> testing, but it's probably going to take a while to arrive and get setup.

I'm actively working on the driver, and fixing things as I go, so let
me know when you're ready to test and I'll point you to the latest
version.

> > +#define DRM_PANCSF_SYNC_OP_MIN_SIZE24
> > +#define DRM_PANCSF_QUEUE_SUBMIT_MIN_SIZE   32
> > +#define DRM_PANCSF_QUEUE_CREATE_MIN_SIZE   8
> > +#define DRM_PANCSF_VM_BIND_OP_MIN_SIZE 48  
> 
> I'm not sure why these are #defines rather than using sizeof()?

Yeah, I don't really like that. Those min sizes are here to deal with
potential new versions of the various objects passed as arrays
ref
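For context, the MIN_SIZE pattern being discussed is the usual way of making uAPI structs passed as arrays extensible: userspace passes the element stride it was built against, and the kernel accepts any stride at least as large as the first published version of the struct, zero-filling the fields it doesn't receive. A hypothetical userspace sketch of that mechanism (struct layout, names and sizes are invented for illustration):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical current kernel-side view of a versioned uAPI element. */
struct sync_op {
    uint32_t handle;
    uint32_t flags;
    uint64_t point;
    uint64_t added_in_v2;     /* field appended in a later version */
};

/* Size of the v1 struct, frozen forever -- this is what a
 * DRM_PANCSF_SYNC_OP_MIN_SIZE-style #define pins down. */
#define SYNC_OP_MIN_SIZE 16u

/* Copy one element of stride `ustride` from a (simulated) user array,
 * zero-extending fields the caller's version didn't know about. */
static int get_sync_op(struct sync_op *dst, const void *usr, size_t ustride)
{
    if (ustride < SYNC_OP_MIN_SIZE)
        return -1;                          /* would be -EINVAL in the kernel */
    memset(dst, 0, sizeof(*dst));
    memcpy(dst, usr, ustride < sizeof(*dst) ? ustride : sizeof(*dst));
    return 0;
}
```

A real kernel implementation would also reject non-zero trailing bytes it doesn't understand (the `copy_struct_from_user()` behavior); this sketch only shows the forward-compatible copy.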

Re: [PATCH drm-next v2 00/16] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

2023-03-09 Thread Boris Brezillon
Hi Danilo,

On Fri, 17 Feb 2023 14:44:06 +0100
Danilo Krummrich  wrote:

> Changes in V2:
> ==
>   Nouveau:
> - Reworked the Nouveau VM_BIND UAPI to avoid memory allocations in fence
>   signalling critical sections. Updates to the VA space are split up in 
> three
>   separate stages, where only the 2. stage executes in a fence signalling
>   critical section:
> 
> 1. update the VA space, allocate new structures and page tables

Sorry for the silly question, but I didn't find where the page tables
pre-allocation happens. Mind pointing it to me? It's also unclear when
this step happens. Is this at bind-job submission time, when the job is
not necessarily ready to run, potentially waiting for other deps to be
signaled. Or is it done when all deps are met, as an extra step before
jumping to step 2. If that's the former, then I don't see how the VA
space update can happen, since the bind-job might depend on other
bind-jobs modifying the same portion of the VA space (unbind ops might
lead to intermediate page table levels disappearing while we were
waiting for deps). If it's the latter, I wonder why this is not
considered as an allocation in the fence signaling path (for the
bind-job out-fence to be signaled, you need these allocations to
succeed, unless failing to allocate page-tables is considered like a HW
misbehavior and the fence is signaled with an error in that case).

Note that I'm not familiar at all with Nouveau or TTM, and it might
be something that's solved by another component, or I'm just
misunderstanding how the whole thing is supposed to work. This being
said, I'd really like to implement a VM_BIND-like uAPI in pancsf using
the gpuva_manager infra you're proposing here, so please bear with me
:-).

> 2. (un-)map the requested memory bindings
> 3. free structures and page tables
> 
> - Separated generic job scheduler code from specific job implementations.
> - Separated the EXEC and VM_BIND implementation of the UAPI.
> - Reworked the locking parts of the nvkm/vmm RAW interface, such that
>   (un-)map operations can be executed in fence signalling critical 
> sections.
> 

Regards,

Boris



Re: [PATCH drm-next v2 00/16] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

2023-03-09 Thread Boris Brezillon
On Thu, 9 Mar 2023 10:12:43 +0100
Boris Brezillon  wrote:

> Hi Danilo,
> 
> On Fri, 17 Feb 2023 14:44:06 +0100
> Danilo Krummrich  wrote:
> 
> > Changes in V2:
> > ==
> >   Nouveau:
> > - Reworked the Nouveau VM_BIND UAPI to avoid memory allocations in fence
> >   signalling critical sections. Updates to the VA space are split up in 
> > three
> >   separate stages, where only the 2. stage executes in a fence 
> > signalling
> >   critical section:
> > 
> > 1. update the VA space, allocate new structures and page tables  
> 
> Sorry for the silly question, but I didn't find where the page tables
> pre-allocation happens. Mind pointing it to me? It's also unclear when
> this step happens. Is this at bind-job submission time, when the job is
> not necessarily ready to run, potentially waiting for other deps to be
> signaled. Or is it done when all deps are met, as an extra step before
> jumping to step 2. If that's the former, then I don't see how the VA
> space update can happen, since the bind-job might depend on other
> bind-jobs modifying the same portion of the VA space (unbind ops might
> lead to intermediate page table levels disappearing while we were
> waiting for deps). If it's the latter, I wonder why this is not
> considered as an allocation in the fence signaling path (for the
> bind-job out-fence to be signaled, you need these allocations to
> succeed, unless failing to allocate page-tables is considered like a HW
> misbehavior and the fence is signaled with an error in that case).

Ok, so I just noticed you only have one bind queue per drm_file
(cli->sched_entity), and jobs are executed in-order on a given queue,
so I guess that allows you to modify the VA space at submit time
without risking any modifications to the VA space coming from other
bind-queues targeting the same VM. And, if I'm correct, synchronous
bind/unbind ops take the same path, so no risk for those to modify the
VA space either (just wonder if it's a good thing to have sync
bind/unbind operations waiting on async ones, but that's a different
topic).

> 
> Note that I'm not familiar at all with Nouveau or TTM, and it might
> be something that's solved by another component, or I'm just
> misunderstanding how the whole thing is supposed to work. This being
> said, I'd really like to implement a VM_BIND-like uAPI in pancsf using
> the gpuva_manager infra you're proposing here, so please bear with me
> :-).
> 
> > 2. (un-)map the requested memory bindings
> > 3. free structures and page tables
> > 
> > - Separated generic job scheduler code from specific job 
> > implementations.
> > - Separated the EXEC and VM_BIND implementation of the UAPI.
> > - Reworked the locking parts of the nvkm/vmm RAW interface, such that
> >   (un-)map operations can be executed in fence signalling critical 
> > sections.
> >   
> 
> Regards,
> 
> Boris
> 



Re: [PATCH drm-next v2 00/16] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

2023-03-10 Thread Boris Brezillon
Hi Danilo,

On Fri, 10 Mar 2023 17:45:58 +0100
Danilo Krummrich  wrote:

> Hi Boris,
> 
> On 3/9/23 10:48, Boris Brezillon wrote:
> > On Thu, 9 Mar 2023 10:12:43 +0100
> > Boris Brezillon  wrote:
> >   
> >> Hi Danilo,
> >>
> >> On Fri, 17 Feb 2023 14:44:06 +0100
> >> Danilo Krummrich  wrote:
> >>  
> >>> Changes in V2:
> >>> ==
> >>>Nouveau:
> >>>  - Reworked the Nouveau VM_BIND UAPI to avoid memory allocations in 
> >>> fence
> >>>signalling critical sections. Updates to the VA space are split up 
> >>> in three
> >>>separate stages, where only the 2. stage executes in a fence 
> >>> signalling
> >>>critical section:
> >>>
> >>>  1. update the VA space, allocate new structures and page tables  
> >>
> >> Sorry for the silly question, but I didn't find where the page tables
> >> pre-allocation happens. Mind pointing it to me? It's also unclear when
> >> this step happens. Is this at bind-job submission time, when the job is
> >> not necessarily ready to run, potentially waiting for other deps to be
> >> signaled. Or is it done when all deps are met, as an extra step before
> >> jumping to step 2. If that's the former, then I don't see how the VA
> >> space update can happen, since the bind-job might depend on other
> >> bind-jobs modifying the same portion of the VA space (unbind ops might
> >> lead to intermediate page table levels disappearing while we were
> >> waiting for deps). If it's the latter, I wonder why this is not
> >> considered as an allocation in the fence signaling path (for the
> >> bind-job out-fence to be signaled, you need these allocations to
> >> succeed, unless failing to allocate page-tables is considered like a HW
> >> misbehavior and the fence is signaled with an error in that case).  
> > 
> > Ok, so I just noticed you only have one bind queue per drm_file
> > (cli->sched_entity), and jobs are executed in-order on a given queue,
> > so I guess that allows you to modify the VA space at submit time
> > without risking any modifications to the VA space coming from other
> > bind-queues targeting the same VM. And, if I'm correct, synchronous
> > bind/unbind ops take the same path, so no risk for those to modify the
> > VA space either (just wonder if it's a good thing to have sync
> > bind/unbind operations waiting on async ones, but that's a different
> > topic).  
> 
> Yes, that's all correct.
> 
> The page table allocation happens through nouveau_uvmm_vmm_get() which 
> either allocates the corresponding page tables or increases the 
> reference count, in case they already exist, accordingly.
> The call goes all the way through nvif into the nvkm layer (not the 
> easiest to follow the call chain) and ends up in nvkm_vmm_ptes_get().
> 
> There are multiple reasons for updating the VA space at submit time in 
> Nouveau.
> 
> 1) Subsequent EXEC ioctl() calls would need to wait for the bind jobs 
> they depend on within the ioctl() rather than in the scheduler queue, 
> because at the point of time where the ioctl() happens the VA space 
> wouldn't be up-to-date.

Hm, actually that's what explicit sync is all about, isn't it? If you
have async binding ops, you should retrieve the bind-op out-fences and
pass them back as in-fences to the EXEC call, so you're sure all the
memory mappings you depend on are active when you execute those GPU
jobs. And if you're using sync binds, the changes are guaranteed to be
applied before the ioctl() returns. Am I missing something?
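To make the explicit-sync contract above concrete, here is a toy dependency model (purely illustrative, no DRM API involved): the bind job exposes an out-fence, and an EXEC job that needs the resulting mapping lists that fence as an in-fence, so it cannot run before the bind has completed.

```c
#include <stdbool.h>
#include <stddef.h>

struct fence { bool signaled; };

struct job {
    struct fence *in;    /* dependency; NULL if none */
    struct fence out;    /* signaled on completion */
    bool ran;
};

/* Run the job only if its dependency (if any) has signaled. */
static bool try_run(struct job *job)
{
    if (job->in && !job->in->signaled)
        return false;            /* dependency not met yet */
    job->ran = true;
    job->out.signaled = true;    /* publish completion */
    return true;
}
```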

> 
> 2) Let's assume a new mapping is requested and within its range other 
> mappings already exist. Let's also assume that those existing mappings 
> aren't contiguous, such that there are gaps between them. In such a case 
> I need to allocate page tables only for the gaps between the existing 
> mappings, or alternatively, allocate them for the whole range of the new 
> mapping, but free / decrease the reference count of the page tables for 
> the ranges of the previously existing mappings afterwards.
> In the first case I need to know the gaps to allocate page tables for 
> when submitting the job, which means the VA space must be up-to-date. In 
> the latter one I must save the ranges of the previously existing 
> mappings somewhere in order to clean them up, hence I need to allocate 
> memory to store this information. Since I can't allocate this memory in 
> the

Re: [PATCH RFC 0/4] drm/panfrost: Expose memory usage stats through fdinfo

2023-01-16 Thread Boris Brezillon
Hi Steven,

On Mon, 16 Jan 2023 10:30:21 +
Steven Price  wrote:

> On 04/01/2023 13:03, Boris Brezillon wrote:
> > Hello,
> > 
> > Here's an attempt at exposing some memory usage stats through fdinfo,
> > which recently proved useful in debugging a memory leak. Not entirely
> > sure the name I chose are accurate, so feel free to propose
> > alternatives, and let me know if you see any other mem-related stuff
> > that would be interesting to expose.  
> 
> Sorry it's taken me a while to look at this - I'm still working through
> the holiday backlog.
> 
> The names look reasonable to me, and I gave this a quick spin and it
> seemed to work (the numbers reported looks reasonable). As Daniel
> suggested it would be good if some of the boiler plate fdinfo code could
> be moved to generic code (although to be fair there's not much here).
> 
> Of course what we're missing is the 'engine' usage information for
> gputop - it's been on my todo list for a while, but I'm more than happy
> for you to do it for me ;) It's somewhat more tricky because of the
> whole 'queuing' on slots mechanism that Mali has. But we obviously
> shouldn't block this memory implementation on that, it can be added
> afterwards.

Yeah, we've been discussing this drm-engine-xxx feature with Chris, and
I was telling him there's no easy way to get accurate numbers when
_NEXT queuing is involved. It all depends on whether we're able to
process the first job DONE interrupt before the second one kicks in, and
even then, we can't tell for sure for how long the second job has been
running when we get to process the first job interrupt. Inserting
WRITE_JOB(CYCLE_COUNT) before a job chain is doable, but inserting it
after isn't, and I'm not sure we want to add such tricks to the kernel
driver anyway. Don't know if you have any better ideas. If not, I guess
we can live with this inaccuracy and still expose drm-engine-xxx...

Regards,

Boris


Re: [PATCH drm-next 13/14] drm/nouveau: implement new VM_BIND UAPI

2023-01-20 Thread Boris Brezillon
On Thu, 19 Jan 2023 04:58:48 +
Matthew Brost  wrote:

> > For the ops structures the drm_gpuva_manager allocates for reporting the
> > split/merge steps back to the driver I have ideas to entirely avoid
> > allocations, which also is a good thing in respect of Christians feedback
> > regarding the huge amount of mapping requests some applications seem to
> > generate.
> >  
> 
> It should be fine to have allocations to report the split/merge step as
> this step should be before a dma-fence is published, but yea if possible
> to avoid extra allocs as that is always better.
> 
> Also BTW, great work on drm_gpuva_manager too. We will almost likely
> pick this up in Xe rather than open coding all of this as we currently
> do. We should probably start the port to this soon so we can contribute
> to the implementation and get both of our drivers upstream sooner.

Also quite interested in using this drm_gpuva_manager for pancsf, since
I've been open-coding something similar. Didn't have the
gpuva_region concept to make sure VA mapping/unmapping requests
don't go outside a pre-reserved region, but it seems to automate some
of the stuff I've been doing quite nicely.


Re: [PATCH drm-next 13/14] drm/nouveau: implement new VM_BIND UAPI

2023-01-20 Thread Boris Brezillon
On Thu, 19 Jan 2023 16:38:06 +
Matthew Brost  wrote:

> On Thu, Jan 19, 2023 at 04:36:43PM +0100, Danilo Krummrich wrote:
> > On 1/19/23 05:58, Matthew Brost wrote:  
> > > On Thu, Jan 19, 2023 at 04:44:23AM +0100, Danilo Krummrich wrote:  
> > > > On 1/18/23 21:37, Thomas Hellström (Intel) wrote:  
> > > > > 
> > > > > On 1/18/23 07:12, Danilo Krummrich wrote:  
> > > > > > This commit provides the implementation for the new uapi motivated 
> > > > > > by the
> > > > > > Vulkan API. It allows user mode drivers (UMDs) to:
> > > > > > 
> > > > > > 1) Initialize a GPU virtual address (VA) space via the new
> > > > > >      DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the 
> > > > > > portion of VA
> > > > > >      space managed by the kernel and userspace, respectively.
> > > > > > 
> > > > > > 2) Allocate and free a VA space region as well as bind and unbind 
> > > > > > memory
> > > > > >      to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND 
> > > > > > ioctl.
> > > > > >      UMDs can request the named operations to be processed either
> > > > > >      synchronously or asynchronously. It supports DRM syncobjs
> > > > > >      (incl. timelines) as synchronization mechanism. The management 
> > > > > > of the
> > > > > >      GPU VA mappings is implemented with the DRM GPU VA manager.
> > > > > > 
> > > > > > 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC ioctl. 
> > > > > > The
> > > > > >      execution happens asynchronously. It supports DRM syncobj 
> > > > > > (incl.
> > > > > >      timelines) as synchronization mechanism. DRM GEM object 
> > > > > > locking is
> > > > > >      handled with drm_exec.
> > > > > > 
> > > > > > Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, use the 
> > > > > > DRM
> > > > > > GPU scheduler for the asynchronous paths.
> > > > > > 
> > > > > > Signed-off-by: Danilo Krummrich 
> > > > > > ---
> > > > > >    Documentation/gpu/driver-uapi.rst   |   3 +
> > > > > >    drivers/gpu/drm/nouveau/Kbuild  |   2 +
> > > > > >    drivers/gpu/drm/nouveau/Kconfig |   2 +
> > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.c |  16 +
> > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.h |   1 +
> > > > > >    drivers/gpu/drm/nouveau/nouveau_drm.c   |  23 +-
> > > > > >    drivers/gpu/drm/nouveau/nouveau_drv.h   |   9 +-
> > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.c  | 310 ++
> > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.h  |  55 ++
> > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.c | 780 
> > > > > > 
> > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.h |  98 +++
> > > > > >    11 files changed, 1295 insertions(+), 4 deletions(-)
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h  
> > > > > ...  
> > > > > > 
> > > > > > +static struct dma_fence *
> > > > > > +nouveau_bind_job_run(struct nouveau_job *job)
> > > > > > +{
> > > > > > +    struct nouveau_bind_job *bind_job = to_nouveau_bind_job(job);
> > > > > > +    struct nouveau_uvmm *uvmm = nouveau_cli_uvmm(job->cli);
> > > > > > +    struct bind_job_op *op;
> > > > > > +    int ret = 0;
> > > > > > +  
> > > > > 
> > > > > I was looking at how nouveau does the async binding compared to how xe
> > > > > does it.
> > > > > It looks to me that this function being a scheduler run_job callback 
> > > > > is
> > > > > the main part of the VM_BIND dma-fence signalling critical section for
> > > > > the job's done_fence and if so, needs to be annotated as such?  
> > > > 
> > > > Yes, that's the case.
> > > >   
> > > > > 
> > > > > For example nouveau_uvma_region_new allocates memory, which is not
> > > > > allowed if in a dma_fence signalling critical section and the locking
> > > > > also looks suspicious?  
> > > > 
> > > > Thanks for pointing this out, I missed that somehow.
> > > > 
> > > > I will change it to pre-allocate new regions, mappings and page tables
> > > > within the job's submit() function.
> > > >   
> > > 
> > > Yea that what we basically do in Xe, in the IOCTL step allocate all the
> > > backing store for new page tables, populate new page tables (these are
> > > not yet visible in the page table structure), and in last step which is
> > > executed after all the dependencies are satified program all the leaf
> > > entires making the new binding visible.
> > > 
> > > We screwed this up by deferring most of the IOCTL to a worker but
> > > will fix this one way or another soon - get rid of the worker or
> > > introduce a type of sync that is signaled after the worker + publish the
> > > dma-fence in the worker. I'd like to close on this one soon.  
> > > > For the ops structures the drm_gpuva_manager allocates for reporting the
> > > > split/merge steps bac

Re: [PATCH drm-next 13/14] drm/nouveau: implement new VM_BIND UAPI

2023-01-23 Thread Boris Brezillon
On Sun, 22 Jan 2023 17:48:37 +
Matthew Brost  wrote:

> On Fri, Jan 20, 2023 at 11:22:45AM +0100, Boris Brezillon wrote:
> > On Thu, 19 Jan 2023 16:38:06 +
> > Matthew Brost  wrote:
> >   
> > > On Thu, Jan 19, 2023 at 04:36:43PM +0100, Danilo Krummrich wrote:  
> > > > On 1/19/23 05:58, Matthew Brost wrote:
> > > > > On Thu, Jan 19, 2023 at 04:44:23AM +0100, Danilo Krummrich wrote:
> > > > > > On 1/18/23 21:37, Thomas Hellström (Intel) wrote:
> > > > > > > 
> > > > > > > On 1/18/23 07:12, Danilo Krummrich wrote:
> > > > > > > > This commit provides the implementation for the new uapi 
> > > > > > > > motivated by the
> > > > > > > > Vulkan API. It allows user mode drivers (UMDs) to:
> > > > > > > > 
> > > > > > > > 1) Initialize a GPU virtual address (VA) space via the new
> > > > > > > >      DRM_IOCTL_NOUVEAU_VM_INIT ioctl for UMDs to specify the 
> > > > > > > > portion of VA
> > > > > > > >      space managed by the kernel and userspace, respectively.
> > > > > > > > 
> > > > > > > > 2) Allocate and free a VA space region as well as bind and 
> > > > > > > > unbind memory
> > > > > > > >      to the GPUs VA space via the new DRM_IOCTL_NOUVEAU_VM_BIND 
> > > > > > > > ioctl.
> > > > > > > >      UMDs can request the named operations to be processed 
> > > > > > > > either
> > > > > > > >      synchronously or asynchronously. It supports DRM syncobjs
> > > > > > > >      (incl. timelines) as synchronization mechanism. The 
> > > > > > > > management of the
> > > > > > > >      GPU VA mappings is implemented with the DRM GPU VA manager.
> > > > > > > > 
> > > > > > > > 3) Execute push buffers with the new DRM_IOCTL_NOUVEAU_EXEC 
> > > > > > > > ioctl. The
> > > > > > > >      execution happens asynchronously. It supports DRM syncobj 
> > > > > > > > (incl.
> > > > > > > >      timelines) as synchronization mechanism. DRM GEM object 
> > > > > > > > locking is
> > > > > > > >      handled with drm_exec.
> > > > > > > > 
> > > > > > > > Both, DRM_IOCTL_NOUVEAU_VM_BIND and DRM_IOCTL_NOUVEAU_EXEC, use 
> > > > > > > > the DRM
> > > > > > > > GPU scheduler for the asynchronous paths.
> > > > > > > > 
> > > > > > > > Signed-off-by: Danilo Krummrich 
> > > > > > > > ---
> > > > > > > >    Documentation/gpu/driver-uapi.rst   |   3 +
> > > > > > > >    drivers/gpu/drm/nouveau/Kbuild  |   2 +
> > > > > > > >    drivers/gpu/drm/nouveau/Kconfig |   2 +
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.c |  16 +
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_abi16.h |   1 +
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_drm.c   |  23 +-
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_drv.h   |   9 +-
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.c  | 310 ++
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_exec.h  |  55 ++
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.c | 780 
> > > > > > > > 
> > > > > > > >    drivers/gpu/drm/nouveau/nouveau_sched.h |  98 +++
> > > > > > > >    11 files changed, 1295 insertions(+), 4 deletions(-)
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.c
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_exec.h
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.c
> > > > > > > >    create mode 100644 drivers/gpu/drm/nouveau/nouveau_sched.h   
> > > > > > > >  
> > > > > > > ...
> > > > > > > > 
> > > > > > > > +static struct dma_fence *
> > > > > > > > +nouveau_bind_job_r

Re: [PATCH 1/1] drm/shmem: Dual licence the files as GPL-2 and MIT

2022-11-14 Thread Boris Brezillon
On Sat, 12 Nov 2022 19:42:10 +
Robert Swindells  wrote:

> Contributors to these files are:
> 
> Noralf Trønnes 
> Liu Zixian 
> Dave Airlie 
> Thomas Zimmermann 
> Lucas De Marchi 
> Gerd Hoffmann 
> Rob Herring 
> Jakub Kicinski 
> Marcel Ziswiler 
> Stephen Rothwell 
> Daniel Vetter 
> Cai Huoqing 
> Neil Roberts 
> Marek Szyprowski 
> Emil Velikov 
> Sam Ravnborg 
> Boris Brezillon 

Acked-by: Boris Brezillon 

> Dan Carpenter 
> 
> Signed-off-by: Robert Swindells 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 2 +-
>  include/drm/drm_gem_shmem_helper.h | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 35138f8a375c..f1a68a71f876 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -1,4 +1,4 @@
> -// SPDX-License-Identifier: GPL-2.0
> +// SPDX-License-Identifier: GPL-2.0 or MIT
>  /*
>   * Copyright 2018 Noralf Trønnes
>   */
> diff --git a/include/drm/drm_gem_shmem_helper.h 
> b/include/drm/drm_gem_shmem_helper.h
> index a2201b2488c5..56ac32947d1c 100644
> --- a/include/drm/drm_gem_shmem_helper.h
> +++ b/include/drm/drm_gem_shmem_helper.h
> @@ -1,4 +1,4 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> +/* SPDX-License-Identifier: GPL-2.0 or MIT */
>  
>  #ifndef __DRM_GEM_SHMEM_HELPER_H__
>  #define __DRM_GEM_SHMEM_HELPER_H__



Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2022-12-30 Thread Boris Brezillon
Hello Matthew,

On Thu, 22 Dec 2022 14:21:11 -0800
Matthew Brost  wrote:

> In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> seems a bit odd but let us explain the reasoning below.
> 
> 1. In XE the submission order from multiple drm_sched_entity is not
> guaranteed to match completion order, even if targeting the same hardware
> engine. This is because in XE we have a firmware scheduler, the GuC,
> which is allowed to reorder, timeslice, and preempt submissions. If using a
> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> apart as the TDR expects submission order == completion order. Using a
> dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.

Oh, that's interesting. I've been trying to solve the same sort of
issues to support Arm's new Mali GPU which is relying on a FW-assisted
scheduling scheme (you give the FW N streams to execute, and it does
the scheduling between those N command streams, the kernel driver
does timeslice scheduling to update the command streams passed to the
FW). I must admit I gave up on using drm_sched at some point, mostly
because the integration with drm_sched was painful, but also because I
felt trying to bend drm_sched to make it interact with a
timeslice-oriented scheduling model wasn't really future proof. Giving
drm_sched_entity exclusive access to a drm_gpu_scheduler might help
with a few things (didn't think it through yet), but I feel it's
coming short on other aspects we have to deal with on Arm GPUs. Here
are a few things I noted while working on the drm_sched-based PoC:

- The complexity to suspend/resume streams and recover from failures
  remains quite important (because everything is still very asynchronous
  under the hood). Sure, you don't have to do this fancy
  timeslice-based scheduling, but that's still a lot of code, and
  AFAICT, it didn't integrate well with drm_sched TDR (my previous
  attempt at reconciling them has been unsuccessful, but maybe your
  patches would help there)
- You lose one of the nice thing that's brought by timeslice-based
  scheduling: a tiny bit of fairness. That is, if one stream is queuing
  a compute job that's monopolizing the GPU core, you know the kernel
  part of the scheduler will eventually evict it and let other streams
  with same or higher priority run, even before the job timeout
  kicks in.
- Stream slots exposed by the Arm FW are not exactly HW queues that run
  things concurrently. The FW can decide to let only the stream with the
  highest priority get access to the various HW resources (GPU cores,
  tiler, ...), and let other streams starve. That means you might get
  spurious timeouts on some jobs/sched-entities while they didn't even
  get a chance to run.

So overall, and given I'm no longer the only one having to deal with a
FW scheduler that's designed with timeslice scheduling in mind, I'm
wondering if it's not time to design a common timeslice-based scheduler
instead of trying to bend drivers to use the model enforced by
drm_sched. But that's just my 2 cents, of course.
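The fairness point above can be illustrated with a toy rotation (not how any real driver does it, just the principle): with more runnable streams than FW slots, rotating slot ownership each scheduler tick guarantees every stream gets GPU time eventually, even if one stream would otherwise monopolize a slot while resident.

```c
#include <stddef.h>

/* Toy timeslice model: NSTREAMS runnable streams competing for NSLOTS
 * firmware slots, with slot ownership rotated round-robin each tick. */
#define NSTREAMS 3
#define NSLOTS   2

/* Simulate `ticks` scheduler ticks; runtime[i] counts how many ticks
 * stream i spent resident in a slot. */
static void simulate(int ticks, int runtime[NSTREAMS])
{
    for (int i = 0; i < NSTREAMS; i++)
        runtime[i] = 0;
    for (int t = 0; t < ticks; t++)
        for (int s = 0; s < NSLOTS; s++)
            runtime[(t + s) % NSTREAMS]++;   /* round-robin rotation */
}
```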

Regards,

Boris


Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2022-12-30 Thread Boris Brezillon
On Fri, 30 Dec 2022 11:20:42 +0100
Boris Brezillon  wrote:

> Hello Matthew,
> 
> On Thu, 22 Dec 2022 14:21:11 -0800
> Matthew Brost  wrote:
> 
> > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > seems a bit odd but let us explain the reasoning below.
> > 
> > 1. In XE the submission order from multiple drm_sched_entity is not
> > guaranteed to match completion order, even if targeting the same hardware
> > engine. This is because in XE we have a firmware scheduler, the GuC,
> > which is allowed to reorder, timeslice, and preempt submissions. If using a
> > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > apart as the TDR expects submission order == completion order. Using a
> > dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.  
> 
> Oh, that's interesting. I've been trying to solve the same sort of
> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> scheduling scheme (you give the FW N streams to execute, and it does
> the scheduling between those N command streams, the kernel driver
> does timeslice scheduling to update the command streams passed to the
> FW). I must admit I gave up on using drm_sched at some point, mostly
> because the integration with drm_sched was painful, but also because I
> felt trying to bend drm_sched to make it interact with a
timeslice-oriented scheduling model wasn't really future proof. Giving
drm_sched_entity exclusive access to a drm_gpu_scheduler might
help for a few things (didn't think it through yet), but I feel it's
> coming short on other aspects we have to deal with on Arm GPUs.

Ok, so I just had a quick look at the Xe driver and how it
instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
have a better understanding of how you get away with using drm_sched
while still controlling how scheduling is really done. Here
drm_gpu_scheduler is just a dummy abstraction that lets you use the
drm_sched job queuing/dep/tracking mechanism. The whole run-queue
selection is dumb because there's only one entity ever bound to the
scheduler (the one that's part of the xe_guc_engine object which also
contains the drm_gpu_scheduler instance). I guess the main issue we'd
have on Arm is the fact that the stream doesn't necessarily get
scheduled when ->run_job() is called, it can be placed in the runnable
queue and be picked later by the kernel-side scheduler when a FW slot
gets released. That can probably be sorted out by manually disabling the
job timer and re-enabling it when the stream gets picked by the
scheduler. But my main concern remains, we're basically abusing
drm_sched here.

For the Arm driver, that means turning the following sequence

1. wait for job deps
2. queue job to ringbuf and push the stream to the runnable
   queue (if it wasn't queued already). Wakeup the timeslice scheduler
   to re-evaluate (if the stream is not on a FW slot already)
3. stream gets picked by the timeslice scheduler and sent to the FW for
   execution

into

1. queue job to entity which takes care of waiting for job deps for
   us
2. schedule a drm_sched_main iteration
3. the only available entity is picked, and the first job from this
   entity is dequeued. ->run_job() is called: the job is queued to the
   ringbuf and the stream is pushed to the runnable queue (if it wasn't
   queued already). Wakeup the timeslice scheduler to re-evaluate (if
   the stream is not on a FW slot already)
4. stream gets picked by the timeslice scheduler and sent to the FW for
   execution

That's one extra step we don't really need. To sum up, yes, all the
job/entity tracking might be interesting to share/re-use, but I wonder
if we couldn't have that without pulling out the scheduling part of
drm_sched, or maybe I'm missing something, and there's something in
drm_gpu_scheduler you really need.


Re: [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-01 Thread Boris Brezillon
On Fri, 30 Dec 2022 12:55:08 +0100
Boris Brezillon  wrote:

> On Fri, 30 Dec 2022 11:20:42 +0100
> Boris Brezillon  wrote:
> 
> > Hello Matthew,
> > 
> > On Thu, 22 Dec 2022 14:21:11 -0800
> > Matthew Brost  wrote:
> >   
> > > In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > seems a bit odd but let us explain the reasoning below.
> > > 
> > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > guaranteed to be the same completion even if targeting the same hardware
> > > engine. This is because in XE we have a firmware scheduler, the GuC,
> > > which allowed to reorder, timeslice, and preempt submissions. If a using
> > > shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> > > apart as the TDR expects submission order == completion order. Using a
> > > dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.
> > 
> > Oh, that's interesting. I've been trying to solve the same sort of
> > issues to support Arm's new Mali GPU which is relying on a FW-assisted
> > scheduling scheme (you give the FW N streams to execute, and it does
> > the scheduling between those N command streams, the kernel driver
> > does timeslice scheduling to update the command streams passed to the
> > FW). I must admit I gave up on using drm_sched at some point, mostly
> > because the integration with drm_sched was painful, but also because I
> > felt trying to bend drm_sched to make it interact with a
> > timeslice-oriented scheduling model wasn't really future proof. Giving
> > drm_sched_entity exlusive access to a drm_gpu_scheduler probably might
> > help for a few things (didn't think it through yet), but I feel it's
> > coming short on other aspects we have to deal with on Arm GPUs.  
> 
> Ok, so I just had a quick look at the Xe driver and how it
> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> have a better understanding of how you get away with using drm_sched
> while still controlling how scheduling is really done. Here
> drm_gpu_scheduler is just a dummy abstract that let's you use the
> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> selection is dumb because there's only one entity ever bound to the
> scheduler (the one that's part of the xe_guc_engine object which also
> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> have on Arm is the fact that the stream doesn't necessarily get
> scheduled when ->run_job() is called, it can be placed in the runnable
> queue and be picked later by the kernel-side scheduler when a FW slot
> gets released. That can probably be sorted out by manually disabling the
> job timer and re-enabling it when the stream gets picked by the
> scheduler. But my main concern remains, we're basically abusing
> drm_sched here.
> 
> For the Arm driver, that means turning the following sequence
> 
> 1. wait for job deps
> 2. queue job to ringbuf and push the stream to the runnable
>queue (if it wasn't queued already). Wakeup the timeslice scheduler
>to re-evaluate (if the stream is not on a FW slot already)
> 3. stream gets picked by the timeslice scheduler and sent to the FW for
>execution
> 
> into
> 
> 1. queue job to entity which takes care of waiting for job deps for
>us
> 2. schedule a drm_sched_main iteration
> 3. the only available entity is picked, and the first job from this
>entity is dequeued. ->run_job() is called: the job is queued to the
>ringbuf and the stream is pushed to the runnable queue (if it wasn't
>queued already). Wakeup the timeslice scheduler to re-evaluate (if
>the stream is not on a FW slot already)
> 4. stream gets picked by the timeslice scheduler and sent to the FW for
>execution
> 
> That's one extra step we don't really need. To sum-up, yes, all the
> job/entity tracking might be interesting to share/re-use, but I wonder
> if we couldn't have that without pulling out the scheduling part of
> drm_sched, or maybe I'm missing something, and there's something in
> drm_gpu_scheduler you really need.

On second thought, that's probably an acceptable overhead (not even
sure the extra step I was mentioning exists in practice, because dep
fence signaled state is checked as part of the drm_sched_main
iteration, so that's basically replacing the worker I schedule to
check job deps), and I like the idea of being able to re-use drm_sched
dep-tracking without resorting to invasive changes to the existing
logic, so I'll probably give it a try.


Re: [RFC PATCH 00/20] Initial Xe driver submission

2023-01-03 Thread Boris Brezillon
Hi,

On Mon, 02 Jan 2023 13:42:46 +0200
Jani Nikula  wrote:

> On Mon, 02 Jan 2023, Thomas Zimmermann  wrote:
> > Hi
> >
> > Am 22.12.22 um 23:21 schrieb Matthew Brost:  
> >> Hello,
> >> 
> >> This is a submission for Xe, a new driver for Intel GPUs that supports both
> >> integrated and discrete platforms starting with Tiger Lake (first platform 
> >> with
> >> Intel Xe Architecture). The intention of this new driver is to have a 
> >> fresh base
> >> to work from that is unencumbered by older platforms, whilst also taking 
> >> the
> >> opportunity to rearchitect our driver to increase sharing across the drm
> >> subsystem, both leveraging and allowing us to contribute more towards other
> >> shared components like TTM and drm/scheduler. The memory model is based on 
> >> VM
> >> bind which is similar to the i915 implementation. Likewise the execbuf
> >> implementation for Xe is very similar to execbuf3 in the i915 [1].  
> >
> > After Xe has stabilized, will i915 loose the ability to drive this 
> > hardware (and possibly other)?  I'm specfically thinking of the i915 
> > code that requires TTM. Keeping that dependecy within Xe only might 
> > benefit DRM as a whole.  
> 
> There's going to be a number of platforms supported by both drivers, and
> from purely a i915 standpoint dropping any currently supported platforms
> or that dependency from i915 would be a regression.
> 
> >> 
> >> The code is at a stage where it is already functional and has experimental
> >> support for multiple platforms starting from Tiger Lake, with initial 
> >> support
> >> implemented in Mesa (for Iris and Anv, our OpenGL and Vulkan drivers), as 
> >> well
> >> as in NEO (for OpenCL and Level0). A Mesa MR has been posted [2] and NEO
> >> implementation will be released publicly early next year. We also have a 
> >> suite
> >> of IGTs for XE that will appear on the IGT list shortly.
> >> 
> >> It has been built with the assumption of supporting multiple architectures 
> >> from
> >> the get-go, right now with tests running both on X86 and ARM hosts. And we
> >> intend to continue working on it and improving on it as part of the kernel
> >> community upstream.
> >> 
> >> The new Xe driver leverages a lot from i915 and work on i915 continues as 
> >> we
> >> ready Xe for production throughout 2023.
> >> 
> >> As for display, the intent is to share the display code with the i915 
> >> driver so
> >> that there is maximum reuse there. Currently this is being done by 
> >> compiling the
> >> display code twice, but alternatives to that are under consideration and 
> >> we want
> >> to have more discussion on what the best final solution will look like 
> >> over the
> >> next few months. Right now, work is ongoing in refactoring the display 
> >> codebase
> >> to remove as much as possible any unnecessary dependencies on i915 
> >> specific data
> >> structures there..  
> >
> > Could both drivers reside in a common parent directory and share 
> > something like a DRM Intel helper module with the common code? This 
> > would fit well with the common design of DRM helpers.  
> 
> I think it's too early to tell.
> 
> For one thing, setting that up would be a lot of up front infrastructure
> work. I'm not sure how to even pull that off when Xe is still
> out-of-tree and i915 development plunges on upstream as ever.
> 
> For another, realistically, the overlap between supported platforms is
> going to end at some point, and eventually new platforms are only going
> to be supported with Xe. That's going to open up new possibilities for
> refactoring also the display code. I think it would be premature to lock
> in to a common directory structure or a common helper module at this
> point.
> 
> I'm not saying no to the idea, and we've contemplated it before, but I
> think there are still too many moving parts to decide to go that way.

FWIW, I actually have the same dilemma with the driver for new Mali GPUs
I'm working on. I initially started making it a sub-driver of the
existing panfrost driver (some HW blocks are similar, like the
IOMMU and a few other things, and some SW abstracts can be shared here
and there, like the GEM allocator logic). But I'm now considering
forking the driver (after Alyssa planted the seed :-)), not only
because I want to start from a clean sheet on the uAPI front
(wouldn't be an issue in your case, because you're talking about
sharing helpers, not the driver frontend), but also because any refactor
to panfrost is a potential source of regression for existing users. So,
I tend to agree with Jani here, trying to share code before things have
settled down is likely to cause pain to both Xe and i915
users+developers.

Best Regards,

Boris


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-03 Thread Boris Brezillon
Hi,

On Tue, 3 Jan 2023 13:02:15 +
Tvrtko Ursulin  wrote:

> On 02/01/2023 07:30, Boris Brezillon wrote:
> > On Fri, 30 Dec 2022 12:55:08 +0100
> > Boris Brezillon  wrote:
> >   
> >> On Fri, 30 Dec 2022 11:20:42 +0100
> >> Boris Brezillon  wrote:
> >>  
> >>> Hello Matthew,
> >>>
> >>> On Thu, 22 Dec 2022 14:21:11 -0800
> >>> Matthew Brost  wrote:
> >>>  
> >>>> In XE, the new Intel GPU driver, a choice has made to have a 1 to 1
> >>>> mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> >>>> seems a bit odd but let us explain the reasoning below.
> >>>>
> >>>> 1. In XE the submission order from multiple drm_sched_entity is not
> >>>> guaranteed to be the same completion even if targeting the same hardware
> >>>> engine. This is because in XE we have a firmware scheduler, the GuC,
> >>>> which allowed to reorder, timeslice, and preempt submissions. If a using
> >>>> shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls
> >>>> apart as the TDR expects submission order == completion order. Using a
> >>>> dedicated drm_gpu_scheduler per drm_sched_entity solve this problem.  
> >>>
> >>> Oh, that's interesting. I've been trying to solve the same sort of
> >>> issues to support Arm's new Mali GPU which is relying on a FW-assisted
> >>> scheduling scheme (you give the FW N streams to execute, and it does
> >>> the scheduling between those N command streams, the kernel driver
> >>> does timeslice scheduling to update the command streams passed to the
> >>> FW). I must admit I gave up on using drm_sched at some point, mostly
> >>> because the integration with drm_sched was painful, but also because I
> >>> felt trying to bend drm_sched to make it interact with a
> >>> timeslice-oriented scheduling model wasn't really future proof. Giving
> >>> drm_sched_entity exlusive access to a drm_gpu_scheduler probably might
> >>> help for a few things (didn't think it through yet), but I feel it's
> >>> coming short on other aspects we have to deal with on Arm GPUs.  
> >>
> >> Ok, so I just had a quick look at the Xe driver and how it
> >> instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> >> have a better understanding of how you get away with using drm_sched
> >> while still controlling how scheduling is really done. Here
> >> drm_gpu_scheduler is just a dummy abstract that let's you use the
> >> drm_sched job queuing/dep/tracking mechanism. The whole run-queue
> >> selection is dumb because there's only one entity ever bound to the
> >> scheduler (the one that's part of the xe_guc_engine object which also
> >> contains the drm_gpu_scheduler instance). I guess the main issue we'd
> >> have on Arm is the fact that the stream doesn't necessarily get
> >> scheduled when ->run_job() is called, it can be placed in the runnable
> >> queue and be picked later by the kernel-side scheduler when a FW slot
> >> gets released. That can probably be sorted out by manually disabling the
> >> job timer and re-enabling it when the stream gets picked by the
> >> scheduler. But my main concern remains, we're basically abusing
> >> drm_sched here.
> >>
> >> For the Arm driver, that means turning the following sequence
> >>
> >> 1. wait for job deps
> >> 2. queue job to ringbuf and push the stream to the runnable
> >> queue (if it wasn't queued already). Wakeup the timeslice scheduler
> >> to re-evaluate (if the stream is not on a FW slot already)
> >> 3. stream gets picked by the timeslice scheduler and sent to the FW for
> >> execution
> >>
> >> into
> >>
> >> 1. queue job to entity which takes care of waiting for job deps for
> >> us
> >> 2. schedule a drm_sched_main iteration
> >> 3. the only available entity is picked, and the first job from this
> >> entity is dequeued. ->run_job() is called: the job is queued to the
> >> ringbuf and the stream is pushed to the runnable queue (if it wasn't
> >> queued already). Wakeup the timeslice scheduler to re-evaluate (if
> >> the stream is not on a FW slot already)
> >> 4. stream gets picked by the timeslice scheduler and sent to the FW for
> >> execution

[PATCH RFC 0/4] drm/panfrost: Expose memory usage stats through fdinfo

2023-01-04 Thread Boris Brezillon
Hello,

Here's an attempt at exposing some memory usage stats through fdinfo,
which recently proved useful in debugging a memory leak. Not entirely
sure the names I chose are accurate, so feel free to propose
alternatives, and let me know if you see any other mem-related stuff
that would be interesting to expose.

Regards,

Boris

Boris Brezillon (4):
  drm/panfrost: Provide a dummy show_fdinfo() implementation
  drm/panfrost: Track BO resident size
  drm/panfrost: Add a helper to retrieve MMU context stats
  drm/panfrost: Expose some memory related stats through fdinfo

 drivers/gpu/drm/panfrost/panfrost_drv.c   | 24 -
 drivers/gpu/drm/panfrost/panfrost_gem.h   |  7 +
 .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |  1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c   | 27 +++
 drivers/gpu/drm/panfrost/panfrost_mmu.h   | 10 +++
 5 files changed, 68 insertions(+), 1 deletion(-)

-- 
2.38.1



[PATCH RFC 3/4] drm/panfrost: Add a helper to retrieve MMU context stats

2023-01-04 Thread Boris Brezillon
For now we only gather a few memory usage stats that we'll expose
through fdinfo, but this can be extended if needed.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 25 +
 drivers/gpu/drm/panfrost/panfrost_mmu.h | 10 ++
 2 files changed, 35 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 454799d5a0ef..80c6e0e17195 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -435,6 +435,31 @@ addr_to_mapping(struct panfrost_device *pfdev, int as, u64 addr)
return mapping;
 }
 
+void panfrost_mmu_get_stats(struct panfrost_mmu *mmu,
+   struct panfrost_mmu_stats *stats)
+{
+   struct drm_mm_node *node;
+
+   memset(stats, 0, sizeof(*stats));
+
+   spin_lock(&mmu->mm_lock);
+   drm_mm_for_each_node(node, &mmu->mm) {
+   struct panfrost_gem_mapping *mapping;
+   struct panfrost_gem_object *bo;
+
+   mapping = container_of(node, struct panfrost_gem_mapping, mmnode);
+   bo = mapping->obj;
+
+   stats->all += bo->base.base.size;
+   stats->resident += bo->resident_size;
+   if (bo->base.madv > 0)
+   stats->purgeable += bo->resident_size;
+   if (bo->base.base.dma_buf)
+   stats->shared += bo->base.base.size;
+   }
+   spin_unlock(&mmu->mm_lock);
+}
+
 #define NUM_FAULT_PAGES (SZ_2M / PAGE_SIZE)
 
 static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.h b/drivers/gpu/drm/panfrost/panfrost_mmu.h
index cc2a0d307feb..bbffd39deaf3 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.h
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.h
@@ -8,6 +8,13 @@ struct panfrost_gem_mapping;
 struct panfrost_file_priv;
 struct panfrost_mmu;
 
+struct panfrost_mmu_stats {
+   u64 all;
+   u64 resident;
+   u64 purgeable;
+   u64 shared;
+};
+
 int panfrost_mmu_map(struct panfrost_gem_mapping *mapping);
 void panfrost_mmu_unmap(struct panfrost_gem_mapping *mapping);
 
@@ -22,4 +29,7 @@ struct panfrost_mmu *panfrost_mmu_ctx_get(struct panfrost_mmu *mmu);
 void panfrost_mmu_ctx_put(struct panfrost_mmu *mmu);
 struct panfrost_mmu *panfrost_mmu_ctx_create(struct panfrost_device *pfdev);
 
+void panfrost_mmu_get_stats(struct panfrost_mmu *mmu,
+   struct panfrost_mmu_stats *stats);
+
 #endif
-- 
2.38.1



[PATCH RFC 4/4] drm/panfrost: Expose some memory related stats through fdinfo

2023-01-04 Thread Boris Brezillon
drm-memory-all: memory held by this context. Note that not all of this
memory is necessarily resident: heap BO size is counted even though only
part of the memory reserved for those BOs might be allocated.

drm-memory-resident: resident memory size. For normal BOs it's the same
as drm-memory-all, but for heap BOs, only the memory actually allocated
is counted.

drm-memory-purgeable: amount of memory that can be reclaimed by the
system (madvise(DONT_NEED)).

drm-memory-shared: amount of memory shared through dma-buf.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 6ee43559fc14..05d5d480df2a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -519,9 +519,16 @@ static void panfrost_show_fdinfo(struct seq_file *m, struct file *f)
 {
struct drm_file *file = f->private_data;
struct panfrost_file_priv *panfrost_priv = file->driver_priv;
+   struct panfrost_mmu_stats mmu_stats;
+
+   panfrost_mmu_get_stats(panfrost_priv->mmu, &mmu_stats);
 
seq_printf(m, "drm-driver:\t%s\n", file->minor->dev->driver->name);
seq_printf(m, "drm-client-id:\t%llu\n", 
panfrost_priv->sched_entity[0].fence_context);
+   seq_printf(m, "drm-memory-all:\t%llu KiB\n", mmu_stats.all >> 10);
+   seq_printf(m, "drm-memory-resident:\t%llu KiB\n", mmu_stats.resident >> 
10);
+   seq_printf(m, "drm-memory-purgeable:\t%llu KiB\n", mmu_stats.purgeable 
>> 10);
+   seq_printf(m, "drm-memory-shared:\t%llu KiB\n", mmu_stats.shared >> 10);
 }
 
 static const struct file_operations panfrost_drm_driver_fops = {
-- 
2.38.1



[PATCH RFC 2/4] drm/panfrost: Track BO resident size

2023-01-04 Thread Boris Brezillon
Heap BOs use an on-demand allocation scheme, meaning that the resident
size can differ from the BO size. Track the resident size so we can
expose per-FD memory usage more accurately.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_gem.h  | 7 +++
 drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c | 1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c  | 2 ++
 3 files changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h b/drivers/gpu/drm/panfrost/panfrost_gem.h
index 8088d5fd8480..58f5d091c983 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -36,6 +36,13 @@ struct panfrost_gem_object {
 */
atomic_t gpu_usecount;
 
+   /* Actual memory used by the BO. Should be zero before pages are
+* pinned, then the size of the BO, unless it's a heap BO. In
+* this case the resident size is updated when the fault handler
+* allocates memory.
+*/
+   size_t resident_size;
+
bool noexec :1;
bool is_heap:1;
 };
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
index bf0170782f25..efbc8dec4a9f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c
@@ -54,6 +54,7 @@ static bool panfrost_gem_purge(struct drm_gem_object *obj)
panfrost_gem_teardown_mappings_locked(bo);
drm_gem_shmem_purge_locked(&bo->base);
ret = true;
+   bo->resident_size = 0;
 
mutex_unlock(&shmem->pages_lock);
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 4e83a1891f3e..454799d5a0ef 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -340,6 +340,7 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
mmu_map_sg(pfdev, mapping->mmu, mapping->mmnode.start << PAGE_SHIFT,
   prot, sgt);
mapping->active = true;
+   bo->resident_size = bo->base.base.size;
 
return 0;
 }
@@ -508,6 +509,7 @@ static int panfrost_mmu_map_fault_addr(struct panfrost_device *pfdev, int as,
}
}
 
+   bo->resident_size += SZ_2M;
mutex_unlock(&bo->base.pages_lock);
 
sgt = &bo->sgts[page_offset / (SZ_2M / PAGE_SIZE)];
-- 
2.38.1



[PATCH RFC 1/4] drm/panfrost: Provide a dummy show_fdinfo() implementation

2023-01-04 Thread Boris Brezillon
Provide a dummy show_fdinfo() implementation exposing drm-driver and
drm-client-id. More stats will be added soon.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 2fa5afe21288..6ee43559fc14 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -515,7 +515,22 @@ static const struct drm_ioctl_desc panfrost_drm_driver_ioctls[] = {
PANFROST_IOCTL(MADVISE, madvise,DRM_RENDER_ALLOW),
 };
 
-DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
+static void panfrost_show_fdinfo(struct seq_file *m, struct file *f)
+{
+   struct drm_file *file = f->private_data;
+   struct panfrost_file_priv *panfrost_priv = file->driver_priv;
+
+   seq_printf(m, "drm-driver:\t%s\n", file->minor->dev->driver->name);
+   seq_printf(m, "drm-client-id:\t%llu\n", 
panfrost_priv->sched_entity[0].fence_context);
+}
+
+static const struct file_operations panfrost_drm_driver_fops = {
+   .owner = THIS_MODULE,
+   DRM_GEM_FOPS,
+#ifdef CONFIG_PROC_FS
+   .show_fdinfo = panfrost_show_fdinfo,
+#endif
+};
 
 /*
  * Panfrost driver version:
-- 
2.38.1



Re: [PATCH RFC 1/4] drm/panfrost: Provide a dummy show_fdinfo() implementation

2023-01-09 Thread Boris Brezillon
Hi Daniel,

On Thu, 5 Jan 2023 16:31:49 +0100
Daniel Vetter  wrote:

> On Wed, Jan 04, 2023 at 02:03:05PM +0100, Boris Brezillon wrote:
> > Provide a dummy show_fdinfo() implementation exposing drm-driver and
> > drm-client-id. More stats will be added soon.
> > 
> > Signed-off-by: Boris Brezillon 
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_drv.c | 17 -
> >  1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
> > b/drivers/gpu/drm/panfrost/panfrost_drv.c
> > index 2fa5afe21288..6ee43559fc14 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> > @@ -515,7 +515,22 @@ static const struct drm_ioctl_desc 
> > panfrost_drm_driver_ioctls[] = {
> > PANFROST_IOCTL(MADVISE, madvise,DRM_RENDER_ALLOW),
> >  };
> >  
> > -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
> > +static void panfrost_show_fdinfo(struct seq_file *m, struct file *f)
> > +{
> > +   struct drm_file *file = f->private_data;
> > +   struct panfrost_file_priv *panfrost_priv = file->driver_priv;
> > +
> > +   seq_printf(m, "drm-driver:\t%s\n", file->minor->dev->driver->name);
> > +   seq_printf(m, "drm-client-id:\t%llu\n", 
> > panfrost_priv->sched_entity[0].fence_context);  
> 
> I think at this point we really need to not just have a document that says
> what this should look like, but drm infrastructure with shared code.
> Drivers all inventing their fdinfo really doesn't seem like a great idea
> to me.

Okay. I'm just curious how far you want to go with this common
infrastructure? Are we talking about having a generic helper printing
the pretty generic drm-{driver,client-id} props and letting the driver
prints its driver specific properties, or do you also want to
standardize/automate printing of some drm-memory/drm-engine props too?

Regards,

Boris


Re: [PATCH RFC 1/4] drm/panfrost: Provide a dummy show_fdinfo() implementation

2023-01-09 Thread Boris Brezillon
On Mon, 9 Jan 2023 11:17:49 +0100
Daniel Vetter  wrote:

> On Mon, 9 Jan 2023 at 09:34, Boris Brezillon
>  wrote:
> >
> > Hi Daniel,
> >
> > On Thu, 5 Jan 2023 16:31:49 +0100
> > Daniel Vetter  wrote:
> >  
> > > On Wed, Jan 04, 2023 at 02:03:05PM +0100, Boris Brezillon wrote:  
> > > > Provide a dummy show_fdinfo() implementation exposing drm-driver and
> > > > drm-client-id. More stats will be added soon.
> > > >
> > > > Signed-off-by: Boris Brezillon 
> > > > ---
> > > >  drivers/gpu/drm/panfrost/panfrost_drv.c | 17 -
> > > >  1 file changed, 16 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
> > > > b/drivers/gpu/drm/panfrost/panfrost_drv.c
> > > > index 2fa5afe21288..6ee43559fc14 100644
> > > > --- a/drivers/gpu/drm/panfrost/panfrost_drv.c
> > > > +++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
> > > > @@ -515,7 +515,22 @@ static const struct drm_ioctl_desc 
> > > > panfrost_drm_driver_ioctls[] = {
> > > > PANFROST_IOCTL(MADVISE, madvise,DRM_RENDER_ALLOW),
> > > >  };
> > > >
> > > > -DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
> > > > +static void panfrost_show_fdinfo(struct seq_file *m, struct file *f)
> > > > +{
> > > > +   struct drm_file *file = f->private_data;
> > > > +   struct panfrost_file_priv *panfrost_priv = file->driver_priv;
> > > > +
> > > > +   seq_printf(m, "drm-driver:\t%s\n", file->minor->dev->driver->name);
> > > > +   seq_printf(m, "drm-client-id:\t%llu\n", 
> > > > panfrost_priv->sched_entity[0].fence_context);  
> > >
> > > I think at this point we really need to not just have a document that says
> > > what this should look like, but drm infrastructure with shared code.
> > > Drivers all inventing their fdinfo really doesn't seem like a great idea
> > > to me.  
> >
> > Okay. I'm just curious how far you want to go with this common
> > infrastructure? Are we talking about having a generic helper printing
> > the pretty generic drm-{driver,client-id} props and letting the driver
> > prints its driver specific properties, or do you also want to
> > standardize/automate printing of some drm-memory/drm-engine props too?  
> 
> I think we should standardized what's used by multiple drivers at
> least. It might be a bit tough for the memory/engine props, because
> there's really not much standard stuff there yet (e.g. for memory I'm
> still hoping for cgroups work, for engines we should probably base
> this on drm_sched_entity and maybe untie that somewhat from sched
> itself for i915-sched and fw sched and whatever there is).

Good, didn't want to be drawn into endless discussions about what should
be standardized and what shouldn't anyway. So I'll start with
drm-{driver,client-id}. For the client-id, we'll probably need
some sort of unique id stored at the drm_file level (ida-based?),
unless you want to leave that to drivers too.


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-09 Thread Boris Brezillon
Hi Jason,

On Mon, 9 Jan 2023 09:45:09 -0600
Jason Ekstrand  wrote:

> On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> wrote:
> 
> > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > Boris Brezillon  wrote:
> > >  
> > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > Boris Brezillon  wrote:
> > > >  
> > > > > Hello Matthew,
> > > > >
> > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > Matthew Brost  wrote:
> > > > >  
> > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1
> > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At first this
> > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > >
> > > > > > 1. In XE the submission order from multiple drm_sched_entity is not
> > > > > > guaranteed to be the completion order even if targeting the same
> > > > > > hardware engine. This is because in XE we have a firmware scheduler,
> > > > > > the GuC, which is allowed to reorder, timeslice, and preempt
> > > > > > submissions. If using a shared drm_gpu_scheduler across multiple
> > > > > > drm_sched_entity, the TDR falls apart as the TDR expects submission
> > > > > > order == completion order. Using a dedicated drm_gpu_scheduler per
> > > > > > drm_sched_entity solves this problem.
> > > > >
> > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > issues to support Arm's new Mali GPU, which is relying on a FW-assisted
> > > > > scheduling scheme (you give the FW N streams to execute, and it does
> > > > > the scheduling between those N command streams; the kernel driver
> > > > > does timeslice scheduling to update the command streams passed to the
> > > > > FW). I must admit I gave up on using drm_sched at some point, mostly
> > > > > because the integration with drm_sched was painful, but also because I
> > > > > felt trying to bend drm_sched to make it interact with a
> > > > > timeslice-oriented scheduling model wasn't really future proof. Giving
> > > > > drm_sched_entity exclusive access to a drm_gpu_scheduler would probably
> > > > > help for a few things (didn't think it through yet), but I feel it's
> > > > > coming short on other aspects we have to deal with on Arm GPUs.
> > > >
> > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I think I
> > > > have a better understanding of how you get away with using drm_sched
> > > > while still controlling how scheduling is really done. Here
> > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue  
> >
> > You nailed it here, we use the DRM scheduler for queuing jobs,
> > dependency tracking and releasing jobs to be scheduled when dependencies
> > are met, and lastly a tracking mechanism of inflights jobs that need to
> > be cleaned up if an error occurs. It doesn't actually do any scheduling
> > aside from the most basic level of not overflowing the submission ring
> > buffer. In this sense, a 1 to 1 relationship between entity and
> > scheduler fits quite well.
> >  
> 
> Yeah, I think there's an annoying difference between what AMD/NVIDIA/Intel
> want here and what you need for Arm thanks to the number of FW queues
> available. I don't remember the exact number of GuC queues but it's at
> least 1k. This puts it in an entirely different class from what you have on
> Mali. Roughly, there's about three categories here:
> 
>  1. Hardware where the kernel is placing jobs on actual HW rings. This is
> old Mali, Intel Haswell and earlier, and probably a bunch of others.
> (Intel BDW+ with execlists is a weird case that doesn't fit in this
> categorization.)
> 
>  2. Hardware (or firmware) with a very limited number of queues where
> you're going to have to juggle in the kernel in order to run desktop Linux.
> 
>  3. Firmware scheduling with a high queue count. In this case, you don't
> want the kern

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-10 Thread Boris Brezillon
Hi Daniel,

On Mon, 9 Jan 2023 21:40:21 +0100
Daniel Vetter  wrote:

> On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > Hi Jason,
> > 
> > On Mon, 9 Jan 2023 09:45:09 -0600
> > Jason Ekstrand  wrote:
> >   
> > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > wrote:
> > >   
> > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:    
> > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > Boris Brezillon  wrote:
> > > > >
> > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > Boris Brezillon  wrote:
> > > > > >
> > > > > > > Hello Matthew,
> > > > > > >
> > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > Matthew Brost  wrote:
> > > > > > >
> > > > > > > > In XE, the new Intel GPU driver, a choice has made to have a 1 
> > > > > > > > to 1
> > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. At 
> > > > > > > > first
> > > > this
> > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > >
> > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is 
> > > > > > > > not
> > > > > > > > guaranteed to be the same completion even if targeting the same 
> > > > > > > >
> > > > hardware
> > > > > > > > engine. This is because in XE we have a firmware scheduler, the 
> > > > > > > >
> > > > GuC,
> > > > > > > > which allowed to reorder, timeslice, and preempt submissions. 
> > > > > > > > If a
> > > > using
> > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, the 
> > > > > > > > TDR
> > > > falls
> > > > > > > > apart as the TDR expects submission order == completion order.  
> > > > > > > >   
> > > > Using a
> > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this
> > > > problem.
> > > > > > >
> > > > > > > Oh, that's interesting. I've been trying to solve the same sort of
> > > > > > > issues to support Arm's new Mali GPU which is relying on a
> > > > FW-assisted
> > > > > > > scheduling scheme (you give the FW N streams to execute, and it 
> > > > > > > does
> > > > > > > the scheduling between those N command streams, the kernel driver
> > > > > > > does timeslice scheduling to update the command streams passed to 
> > > > > > > the
> > > > > > > FW). I must admit I gave up on using drm_sched at some point, 
> > > > > > > mostly
> > > > > > > because the integration with drm_sched was painful, but also 
> > > > > > > because
> > > > I
> > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > timeslice-oriented scheduling model wasn't really future proof.   
> > > > > > >  
> > > > Giving
> > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler probably  
> > > > > > >   
> > > > might
> > > > > > > help for a few things (didn't think it through yet), but I feel 
> > > > > > > it's
> > > > > > > coming short on other aspects we have to deal with on Arm GPUs.   
> > > > > > >  
> > > > > >
> > > > > > Ok, so I just had a quick look at the Xe driver and how it
> > > > > > instantiates the drm_sched_entity and drm_gpu_scheduler, and I 
> > > > > > think I
> > > > > > have a better understanding of how you get away with using drm_sched
> > > > > > while still controlling how scheduling is really done. Here
> > > > > > drm_gpu_scheduler is just a dummy abstract that let's you use the
> > > > > > drm_sched job queuing/dep/tracking mechanism. The whole run-queue   
> &

Re: [RFC PATCH 00/20] Initial Xe driver submission

2023-01-10 Thread Boris Brezillon
+Frank, who's also working on the pvr uAPI.

Hi,

On Thu, 22 Dec 2022 14:21:07 -0800
Matthew Brost  wrote:

> The code has been organized such that we have all patches that touch areas
> outside of drm/xe first for review, and then the actual new driver in a 
> separate
> commit. The code which is outside of drm/xe is included in this RFC while
> drm/xe is not due to the size of the commit. The drm/xe code is available
> in
> a public repo listed below.
> 
> Xe driver commit:
> https://cgit.freedesktop.org/drm/drm-xe/commit/?h=drm-xe-next&id=9cb016ebbb6a275f57b1cb512b95d5a842391ad7
> 
> Xe kernel repo:
> https://cgit.freedesktop.org/drm/drm-xe/

Sorry to hijack this thread, again, but I'm currently working on the
pancsf uAPI, and I was wondering how DRM maintainers/developers felt
about the new direction taken by the Xe driver on some aspects of their
uAPI (to decide if I should copy these patterns or go the old way):

- plan for ioctl extensions through '__u64 extensions;' fields (the
  vulkan way, basically)
- turning GETPARAM into a DEV_QUERY which can return more than a 64-bit
  integer at a time
- having ioctls taking sub-operations instead of one ioctl per
  operation (I'm referring to VM_BIND here, which handles map, unmap,
  restart, ... through a single entry point)

Regards,

Boris
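
The first point above, the '__u64 extensions;' pattern, can be sketched as follows. This is a hedged userspace illustration, not Xe's actual uAPI: struct and function names are invented, and a real kernel would copy_from_user() each link and dispatch on its type instead of dereferencing user pointers directly:

```c
#include <assert.h>
#include <stdint.h>

/* Every extension struct starts with this header, Vulkan pNext style. */
struct ext_header {
	uint64_t next;	/* user pointer to the next extension, 0 terminates */
	uint32_t type;	/* discriminates which extension struct follows */
	uint32_t pad;
};

/* Walk a chain and count extensions of a given type. A kernel would
 * instead copy each header in and dispatch on e->type. */
static int count_ext(uint64_t chain, uint32_t type)
{
	int n = 0;

	for (const struct ext_header *e = (const void *)(uintptr_t)chain;
	     e; e = (const void *)(uintptr_t)e->next) {
		if (e->type == type)
			n++;
	}
	return n;
}
```

The appeal of the pattern is that new ioctl features become new chained structs rather than new ioctl numbers or repurposed padding fields.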

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
Hi Daniel,

On Wed, 11 Jan 2023 22:47:02 +0100
Daniel Vetter  wrote:

> On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
>  wrote:
> >
> > Hi Daniel,
> >
> > On Mon, 9 Jan 2023 21:40:21 +0100
> > Daniel Vetter  wrote:
> >  
> > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:  
> > > > Hi Jason,
> > > >
> > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > Jason Ekstrand  wrote:
> > > >  
> > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > wrote:
> > > > >  
> > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon wrote:  
> > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > Boris Brezillon  wrote:
> > > > > > >  
> > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > Boris Brezillon  wrote:
> > > > > > > >  
> > > > > > > > > Hello Matthew,
> > > > > > > > >
> > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > Matthew Brost  wrote:
> > > > > > > > >  
> > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to have 
> > > > > > > > > > a 1 to 1
> > > > > > > > > > mapping between a drm_gpu_scheduler and drm_sched_entity. 
> > > > > > > > > > At first  
> > > > > > this  
> > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > >
> > > > > > > > > > 1. In XE the submission order from multiple 
> > > > > > > > > > drm_sched_entity is not
> > > > > > > > > > guaranteed to be the same completion even if targeting the 
> > > > > > > > > > same  
> > > > > > hardware  
> > > > > > > > > > engine. This is because in XE we have a firmware scheduler, 
> > > > > > > > > > the  
> > > > > > GuC,  
> > > > > > > > > > which allowed to reorder, timeslice, and preempt 
> > > > > > > > > > submissions. If a  
> > > > > > using  
> > > > > > > > > > shared drm_gpu_scheduler across multiple drm_sched_entity, 
> > > > > > > > > > the TDR  
> > > > > > falls  
> > > > > > > > > > apart as the TDR expects submission order == completion 
> > > > > > > > > > order.  
> > > > > > Using a  
> > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve this 
> > > > > > > > > >  
> > > > > > problem.  
> > > > > > > > >
> > > > > > > > > Oh, that's interesting. I've been trying to solve the same 
> > > > > > > > > sort of
> > > > > > > > > issues to support Arm's new Mali GPU which is relying on a  
> > > > > > FW-assisted  
> > > > > > > > > scheduling scheme (you give the FW N streams to execute, and 
> > > > > > > > > it does
> > > > > > > > > the scheduling between those N command streams, the kernel 
> > > > > > > > > driver
> > > > > > > > > does timeslice scheduling to update the command streams 
> > > > > > > > > passed to the
> > > > > > > > > FW). I must admit I gave up on using drm_sched at some point, 
> > > > > > > > > mostly
> > > > > > > > > because the integration with drm_sched was painful, but also 
> > > > > > > > > because  
> > > > > > I  
> > > > > > > > > felt trying to bend drm_sched to make it interact with a
> > > > > > > > > timeslice-oriented scheduling model wasn't really future 
> > > > > > > > > proof.  
> > > > > > Giving  
> > > > > > > > > drm_sched_entity exlusive access to a drm_gpu_scheduler 
> > > > > > > > > probably  
> >

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 10:32:18 +0100
Daniel Vetter  wrote:

> On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > Hi Daniel,
> > 
> > On Wed, 11 Jan 2023 22:47:02 +0100
> > Daniel Vetter  wrote:
> >   
> > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > >  wrote:  
> > > >
> > > > Hi Daniel,
> > > >
> > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > Daniel Vetter  wrote:
> > > >
> > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > > > > Hi Jason,
> > > > > >
> > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > Jason Ekstrand  wrote:
> > > > > >
> > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > 
> > > > > > > wrote:
> > > > > > >    
> > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > wrote:
> > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello Matthew,
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to 
> > > > > > > > > > > > have a 1 to 1
> > > > > > > > > > > > mapping between a drm_gpu_scheduler and 
> > > > > > > > > > > > drm_sched_entity. At first
> > > > > > > > this
> > > > > > > > > > > > seems a bit odd but let us explain the reasoning below.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. In XE the submission order from multiple 
> > > > > > > > > > > > drm_sched_entity is not
> > > > > > > > > > > > guaranteed to be the same completion even if targeting 
> > > > > > > > > > > > the same
> > > > > > > > hardware
> > > > > > > > > > > > engine. This is because in XE we have a firmware 
> > > > > > > > > > > > scheduler, the
> > > > > > > > GuC,
> > > > > > > > > > > > which allowed to reorder, timeslice, and preempt 
> > > > > > > > > > > > submissions. If a
> > > > > > > > using
> > > > > > > > > > > > shared drm_gpu_scheduler across multiple 
> > > > > > > > > > > > drm_sched_entity, the TDR
> > > > > > > > falls
> > > > > > > > > > > > apart as the TDR expects submission order == completion 
> > > > > > > > > > > > order.
> > > > > > > > Using a
> > > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity solve 
> > > > > > > > > > > > this
> > > > > > > > problem.
> > > > > > > > > > >
> > > > > > > > > > > Oh, that's interesting. I've been trying to solve the 
> > > > > > > > > > > same sort of
> > > > > > > > > > > issues to support Arm's new Mali GPU which is relying on 
> > > > > > > > > > > a
> > > > > > > > FW-assisted
> > > > > > > > > > > scheduling scheme (you give the FW N streams to execute, 
> > > > > > > > > > > and it does
> > > > > > > > > > > the scheduling between those N command streams, the 
> > > > > >

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 11:11:03 +0100
Boris Brezillon  wrote:

> On Thu, 12 Jan 2023 10:32:18 +0100
> Daniel Vetter  wrote:
> 
> > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:  
> > > Hi Daniel,
> > > 
> > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > Daniel Vetter  wrote:
> > > 
> > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > >  wrote:
> > > > >
> > > > > Hi Daniel,
> > > > >
> > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > Daniel Vetter  wrote:
> > > > >  
> > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon wrote:
> > > > > >   
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > Jason Ekstrand  wrote:
> > > > > > >  
> > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > > 
> > > > > > > > wrote:
> > > > > > > >  
> > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > > wrote:  
> > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > >  
> > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > >  
> > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > > >  
> > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has made to 
> > > > > > > > > > > > > have a 1 to 1
> > > > > > > > > > > > > mapping between a drm_gpu_scheduler and 
> > > > > > > > > > > > > drm_sched_entity. At first  
> > > > > > > > > this  
> > > > > > > > > > > > > seems a bit odd but let us explain the reasoning 
> > > > > > > > > > > > > below.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. In XE the submission order from multiple 
> > > > > > > > > > > > > drm_sched_entity is not
> > > > > > > > > > > > > guaranteed to be the same completion even if 
> > > > > > > > > > > > > targeting the same  
> > > > > > > > > hardware  
> > > > > > > > > > > > > engine. This is because in XE we have a firmware 
> > > > > > > > > > > > > scheduler, the  
> > > > > > > > > GuC,  
> > > > > > > > > > > > > which allowed to reorder, timeslice, and preempt 
> > > > > > > > > > > > > submissions. If a  
> > > > > > > > > using  
> > > > > > > > > > > > > shared drm_gpu_scheduler across multiple 
> > > > > > > > > > > > > drm_sched_entity, the TDR  
> > > > > > > > > falls  
> > > > > > > > > > > > > apart as the TDR expects submission order == 
> > > > > > > > > > > > > completion order.  
> > > > > > > > > Using a  
> > > > > > > > > > > > > dedicated drm_gpu_scheduler per drm_sched_entity 
> > > > > > > > > > > > > solve this  
> > > > > > > > > problem.  
> > > > > > > > > > > >
> > > > > > > > > > > > Oh, that's interesting. I've been trying to solve the 
> > > > > > > > > > > > same sort of
> >


Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 11:42:57 +0100
Daniel Vetter  wrote:

> On Thu, Jan 12, 2023 at 11:25:53AM +0100, Boris Brezillon wrote:
> > On Thu, 12 Jan 2023 11:11:03 +0100
> > Boris Brezillon  wrote:
> >   
> > > On Thu, 12 Jan 2023 10:32:18 +0100
> > > Daniel Vetter  wrote:
> > >   
> > > > On Thu, Jan 12, 2023 at 10:10:53AM +0100, Boris Brezillon wrote:
> > > > > Hi Daniel,
> > > > > 
> > > > > On Wed, 11 Jan 2023 22:47:02 +0100
> > > > > Daniel Vetter  wrote:
> > > > >   
> > > > > > On Tue, 10 Jan 2023 at 09:46, Boris Brezillon
> > > > > >  wrote:  
> > > > > > >
> > > > > > > Hi Daniel,
> > > > > > >
> > > > > > > On Mon, 9 Jan 2023 21:40:21 +0100
> > > > > > > Daniel Vetter  wrote:
> > > > > > >
> > > > > > > > On Mon, Jan 09, 2023 at 06:17:48PM +0100, Boris Brezillon 
> > > > > > > > wrote:
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > On Mon, 9 Jan 2023 09:45:09 -0600
> > > > > > > > > Jason Ekstrand  wrote:
> > > > > > > > >
> > > > > > > > > > On Thu, Jan 5, 2023 at 1:40 PM Matthew Brost 
> > > > > > > > > > 
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > On Mon, Jan 02, 2023 at 08:30:19AM +0100, Boris Brezillon 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Fri, 30 Dec 2022 12:55:08 +0100
> > > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, 30 Dec 2022 11:20:42 +0100
> > > > > > > > > > > > > Boris Brezillon  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hello Matthew,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, 22 Dec 2022 14:21:11 -0800
> > > > > > > > > > > > > > Matthew Brost  wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In XE, the new Intel GPU driver, a choice has 
> > > > > > > > > > > > > > > made to have a 1 to 1
> > > > > > > > > > > > > > > mapping between a drm_gpu_scheduler and 
> > > > > > > > > > > > > > > drm_sched_entity. At first
> > > > > > > > > > > this
> > > > > > > > > > > > > > > seems a bit odd but let us explain the reasoning 
> > > > > > > > > > > > > > > below.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. In XE the submission order from multiple 
> > > > > > > > > > > > > > > drm_sched_entity is not
> > > > > > > > > > > > > > > guaranteed to be the same completion even if 
> > > > > > > > > > > > > > > targeting the same
> > > > > > > > > > > hardware
> > > > > > > > > > > > > > > engine. This is because in XE we have a firmware 
> > > > > > > > > > > > > > > scheduler, the
> > > > > > > > > > > GuC,
> > > > > > > > > > > > > > > which allowed to reorder, timeslice, and preempt 
> > > > > > > > > > > > > > > submissions. If a
> > > > > > > > > > > using
> > > > > > > > > > > > > > > shared drm_gpu_scheduler across multiple 
> > > > > > > > > > > > > > > drm_sched_entity, the TDR
> > &g

Re: [Intel-gfx] [RFC PATCH 04/20] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-01-12 Thread Boris Brezillon
On Thu, 12 Jan 2023 16:38:18 +0100
Daniel Vetter  wrote:

> > >
> > > Also if you do the allocation in ->prepare_job with dma_fence and not
> > > run_job, then I think can sort out fairness issues (if they do pop up) in
> > > the drm/sched code instead of having to think about this in each driver.  
> >
> > By allocation, you mean assigning a FW slot ID? If we do this allocation
> > in ->prepare_job(), couldn't we mess up ordering? Like,
> > lower-prio/later-queuing entity being scheduled before its pairs,
> > because there's no guarantee on the job completion order (and thus the
> > queue idleness order). I mean, completion order depends on the kind of
> > job being executed by the queues, the time the FW actually lets the
> > queue execute things and probably other factors. You can use metrics
> > like the position in the LRU list + the amount of jobs currently
> > queued to a group to guess which one will be idle first, but that's
> > just a guess. And I'm not sure I see what doing this slot selection in  
> > ->prepare_job() would bring us compared to doing it in ->run_job(),  
> > where we can just pick the least recently used slot.  
> 
> In ->prepare_job you can let the scheduler code do the stalling (and
> ensure fairness), in ->run_job it's your job.

Yeah returning a fence in ->prepare_job() to wait for a FW slot to
become idle sounds good. This fence would be signaled when one of the
slots becomes idle. But I'm wondering why we'd want to select the slot
so early. Can't we just do the selection in ->run_job()? After all, if
the fence has been signaled, that means we'll find at least one slot
that's ready when we hit ->run_job(), and we can select it at that
point.
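
The "reserve *a* slot in ->prepare_job(), pick the concrete slot in ->run_job()" idea can be modeled with two counters. This is a hedged userspace sketch, not drm_sched code: the dma_fence machinery is reduced to a boolean "must wait" result, and the function names are invented:

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SLOTS 4

static int free_slots = NUM_SLOTS;
static int pending_waiters;	/* jobs whose prepare_job fence is unsignaled */

/* ->prepare_job() side: returns true if the job must wait, i.e. the
 * returned fence would be unsignaled. Note we reserve *a* slot, not a
 * specific one; the concrete slot is only chosen in ->run_job(). */
static bool prepare_job_needs_wait(void)
{
	if (free_slots > 0) {
		free_slots--;
		return false;
	}
	pending_waiters++;
	return true;
}

/* A slot went idle: hand it to one waiter (signal one fence) or return
 * it to the free pool. This is what keeps signaled fences <= slots. */
static void slot_idle(void)
{
	if (pending_waiters > 0)
		pending_waiters--;
	else
		free_slots++;
}
```

The invariant being maintained is exactly the one described above: at most NUM_SLOTS prepare_job fences are ever signaled-but-unconsumed, so ->run_job() is guaranteed to find an idle slot.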

> The current RFC doesn't
> really bother much with getting this very right, but if the scheduler
> code tries to make sure it pushes higher-prio stuff in first before
> others, you should get the right outcome.

Okay, so I'm confused again. We said we had a 1:1
drm_gpu_scheduler:drm_sched_entity mapping, meaning that entities are
isolated from each other. I can see how I could place the dma_fence
returned by ->prepare_job() in a driver-specific per-priority list, so
the driver can pick the highest-prio/first-inserted entry and signal the
associated fence when a slot becomes idle. But I have a hard time
seeing how common code could do that if it doesn't see the other
entities. Right now, drm_gpu_scheduler only selects the best entity
among the registered ones, and there's only one entity per
drm_gpu_scheduler in this case.

> 
> The more important functional issue is that you must only allocate the
> fw slot after all dependencies have signalled.

Sure, but it doesn't have to be a specific FW slot, it can be any FW
slot, as long as we don't signal more fences than we have slots
available, right?

> Otherwise you might get
> a nice deadlock, where job A is waiting for the fw slot of B to become
> free, and B is waiting for A to finish.

Got that part, and that's ensured by the fact we wait for all
regular deps before returning the FW-slot-available dma_fence in
->prepare_job(). This exact same fence will be signaled when a slot
becomes idle.

> 
> > > Few fw sched slots essentially just make fw scheduling unfairness more
> > > prominent than with others, but I don't think it's fundamentally something
> > > else really.
> > >
> > > If every ctx does that and the lru isn't too busted, they should then form
> > > a nice orderly queue and cycle through the fw scheduler, while still being
> > > able to get some work done. It's essentially the exact same thing that
> > > happens with ttm vram eviction, when you have a total working set where
> > > each process fits in vram individually, but in total they're too big and
> > > you need to cycle things through.  
> >
> > I see.
> >  
> > >  
> > > > > I'll need to make sure this still works with the concept of group 
> > > > > (it's
> > > > > not a single queue we schedule, it's a group of queues, meaning that 
> > > > > we
> > > > > have N fences to watch to determine if the slot is busy or not, but
> > > > > that should be okay).  
> > > >
> > > > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > > > not entirely fair, it does take the slot priority (which has to be
> > > > unique across all currently assigned slots) into account when
> > > > scheduling groups. So, ideally, we'd want to rotate group priorities
> > > > when they share the same drm_sched_priority (probably based on the
> > > > position in the LRU).  
> > >
> > > Hm that will make things a bit more fun I guess, especially with your
> > > constraint to not update this too often. How strict is that priority
> > > difference? If it's a lot, we might need to treat this more like execlist
> > > and less like a real fw scheduler ...  
> >
> > Strict as in, if two groups with same priority try to request an
> > overlapping set of resources (cores or tilers), it can deadlock, so
> > pretty strict I would say :-).  
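
The ->run_job() side of the slot-selection discussion above — pick the least recently used of the currently idle slots — could look roughly like this. Again a userspace sketch with invented names; a real driver would derive the idle mask from its per-queue completion tracking:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 4

static uint64_t last_used[NUM_SLOTS];
static uint64_t clock_ticks;

/* Pick the least recently used idle slot. idle_mask has one bit set per
 * idle slot; at least one bit is guaranteed set because the prepare_job
 * fence only signals once a slot has gone idle. */
static int pick_lru_slot(uint32_t idle_mask)
{
	int best = -1;

	for (int i = 0; i < NUM_SLOTS; i++) {
		if (!(idle_mask & (1u << i)))
			continue;
		if (best < 0 || last_used[i] < last_used[best])
			best = i;
	}
	last_used[best] = ++clock_ticks;
	return best;
}
```

The unique-slot-priority constraint mentioned above would layer on top of this: the rotation would reassign priorities based on each group's position in this same LRU ordering.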

[RFC] Discussing the pancsf scheduler implementation

2023-01-12 Thread Boris Brezillon
Hello,

Moving the discussion that started here [1] to a separate thread to stop
polluting the Xe RFC.

> > > Also if you do the allocation in ->prepare_job with dma_fence and not
> > > run_job, then I think can sort out fairness issues (if they do pop up) in
> > > the drm/sched code instead of having to think about this in each driver.  
> > >   
> >
> > By allocation, you mean assigning a FW slot ID? If we do this allocation
> > in ->prepare_job(), couldn't we mess up ordering? Like,
> > lower-prio/later-queuing entity being scheduled before its pairs,
> > because there's no guarantee on the job completion order (and thus the
> > queue idleness order). I mean, completion order depends on the kind of
> > job being executed by the queues, the time the FW actually lets the
> > queue execute things and probably other factors. You can use metrics
> > like the position in the LRU list + the amount of jobs currently
> > queued to a group to guess which one will be idle first, but that's
> > just a guess. And I'm not sure I see what doing this slot selection in
> > ->prepare_job() would bring us compared to doing it in ->run_job(),
> > where we can just pick the least recently used slot.
> 
> In ->prepare_job you can let the scheduler code do the stalling (and
> ensure fairness), in ->run_job it's your job.  

Yeah returning a fence in ->prepare_job() to wait for a FW slot to
become idle sounds good. This fence would be signaled when one of the
slots becomes idle. But I'm wondering why we'd want to select the slot
so early. Can't we just do the selection in ->run_job()? After all, if
the fence has been signaled, that means we'll find at least one slot
that's ready when we hit ->run_job(), and we can select it at that
point.

> The current RFC doesn't
> really bother much with getting this very right, but if the scheduler
> code tries to make sure it pushes higher-prio stuff in first before
> others, you should get the right outcome.  

Okay, so I'm confused again. We said we had a 1:1
drm_gpu_scheduler:drm_sched_entity mapping, meaning that entities are
isolated from each other. I can see how I could place the dma_fence
returned by ->prepare_job() in a driver-specific per-priority list, so
the driver can pick the highest-prio/first-inserted entry and signal the
associated fence when a slot becomes idle. But I have a hard time
seeing how common code could do that if it doesn't see the other
entities. Right now, drm_gpu_scheduler only selects the best entity
among the registered ones, and there's only one entity per
drm_gpu_scheduler in this case.

> 
> The more important functional issue is that you must only allocate the
> fw slot after all dependencies have signalled.  

Sure, but it doesn't have to be a specific FW slot, it can be any FW
slot, as long as we don't signal more fences than we have slots
available, right?

> Otherwise you might get
> a nice deadlock, where job A is waiting for the fw slot of B to become
> free, and B is waiting for A to finish.  

Got that part, and that's ensured by the fact we wait for all
regular deps before returning the FW-slot-available dma_fence in
->prepare_job(). This exact same fence will be signaled when a slot  
becomes idle.

>   
> > > Few fw sched slots essentially just make fw scheduling unfairness more
> > > prominent than with others, but I don't think it's fundamentally something
> > > else really.
> > >
> > > If every ctx does that and the lru isn't too busted, they should then form
> > > a nice orderly queue and cycle through the fw scheduler, while still being
> > > able to get some work done. It's essentially the exact same thing that
> > > happens with ttm vram eviction, when you have a total working set where
> > > each process fits in vram individually, but in total they're too big and
> > > you need to cycle things through.
> >
> > I see.
> >
> > >
> > > > > I'll need to make sure this still works with the concept of group 
> > > > > (it's
> > > > > not a single queue we schedule, it's a group of queues, meaning that 
> > > > > we
> > > > > have N fences to watch to determine if the slot is busy or not, but
> > > > > that should be okay).
> > > >
> > > > Oh, there's one other thing I forgot to mention: the FW scheduler is
> > > > not entirely fair, it does take the slot priority (which has to be
> > > > unique across all currently assigned slots) into account when
> > > > scheduling groups. So, ideally, we'd want to rotate group priorities
> > > > when they share the same drm_sched_priority (probably based on the
> > > > position in the LRU).
> > >
> > > Hm that will make things a bit more fun I guess, especially with your
> > > constraint to not update this too often. How strict is that priority
> > > difference? If it's a lot, we might need to treat this more like execlist
> > > and less like a real fw scheduler ...
> >
> > Strict as in, if two groups with same priority try to request an
> > overlapping set of resour

Re: [PATCH v3 0/2] drm/panfrost: Add MT8188 support

2024-06-19 Thread Boris Brezillon
On Tue, 11 Jun 2024 10:56:00 +0200
AngeloGioacchino Del Regno 
wrote:

> Changes in v3:
>  - Added comment stating that MT8188 uses same supplies as MT8183
>as requested by Steven
> 
> Changes in v2:
>  - Fixed bindings to restrict number of power domains for MT8188's
>GPU to three like MT8183(b).
> 
> This series adds support for MT8188's Mali-G57 MC3.
> 
> AngeloGioacchino Del Regno (2):
>   dt-bindings: gpu: mali-bifrost: Add compatible for MT8188 SoC
>   drm/panfrost: Add support for Mali on the MT8188 SoC

Queued to drm-misc-next.

Thanks,

Boris

> 
>  .../devicetree/bindings/gpu/arm,mali-bifrost.yaml  |  5 -
>  drivers/gpu/drm/panfrost/panfrost_drv.c| 10 ++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 



Re: [PATCH] MAINTAINERS: update Microchip's Atmel-HLCDC driver maintainers

2024-06-20 Thread Boris Brezillon
On Thu, 20 Jun 2024 15:28:56 +0530
Manikandan Muralidharan  wrote:

> Drop Sam Ravnborg and Boris Brezillon as they are no longer interested in
> maintaining the drivers. Add myself and Dharma Balasubiramani as the
> Maintainer and co-maintainer for Microchip's Atmel-HLCDC driver.
> Thanks for their work.
> 
> Signed-off-by: Manikandan Muralidharan 

Acked-by: Boris Brezillon 

> ---
>  MAINTAINERS | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index d1566c647a50..8f2a40285544 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7290,8 +7290,8 @@ F:  drivers/gpu/drm/ci/xfails/meson*
>  F:   drivers/gpu/drm/meson/
>  
>  DRM DRIVERS FOR ATMEL HLCDC
> -M:   Sam Ravnborg 
> -M:   Boris Brezillon 
> +M:   Manikandan Muralidharan 
> +M:   Dharma Balasubiramani 
>  L:   dri-devel@lists.freedesktop.org
>  S:   Supported
>  T:   git https://gitlab.freedesktop.org/drm/misc/kernel.git



Re: [PATCH v4 5/7] drm/panfrost: Add a new ioctl to submit batches

2021-07-26 Thread Boris Brezillon
On Thu, 8 Jul 2021 14:10:45 +0200
Christian König  wrote:

> >> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> >> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> >> @@ -254,6 +254,9 @@ static int panfrost_acquire_object_fences(struct 
> >> panfrost_job *job)
> >>return ret;
> >>}
> >>   
> >> +  if (job->bo_flags[i] & PANFROST_BO_REF_NO_IMPLICIT_DEP)
> >> +  continue;  
> > This breaks dma_resv rules. I'll send out patch set fixing this pattern in
> > other drivers, I'll ping you on that for what you need to change. Should
> > go out today or so.

I guess you're talking about [1]. TBH, I don't quite see the point of
exposing a 'no-implicit' flag if we end up forcing this implicit dep
anyway, but I'm probably missing something.

> 
> I'm really wondering if the behavior that the exclusive fences replaces 
> all the shared fences was such a good idea.

Is that what's done in [1], or are you talking about a different
patchset/approach?

> 
> It just allows drivers to mess up things in a way which can be easily 
> used to compromise the system.

I must admit I'm a bit lost, so I'm tempted to drop that flag for now
:-).

[1] https://patchwork.freedesktop.org/patch/443711/?series=92334&rev=3


Re: [PATCH v4 00/18] drm/sched dependency tracking and dma-resv fixes

2021-07-27 Thread Boris Brezillon
On Mon, 12 Jul 2021 19:53:34 +0200
Daniel Vetter  wrote:

> Hi all,
> 
> Quick new version since the previous one was a bit too broken:
> - dropped the bug-on patch to avoid breaking amdgpu's gpu reset failure
>   games
> - another attempt at splitting job_init/arm, hopefully we're getting
>   there.
> 
> Note that Christian has brought up a bikeshed on the new functions to add
> dependencies to drm_sched_jobs. I'm happy to repaint, if there's some kind
> of consensus on what it should be.
> 
> Testing and review very much welcome, as usual.
> 
> Cheers, Daniel
> 
> Daniel Vetter (18):
>   drm/sched: Split drm_sched_job_init
>   drm/sched: Barriers are needed for entity->last_scheduled
>   drm/sched: Add dependency tracking
>   drm/sched: drop entity parameter from drm_sched_push_job
>   drm/sched: improve docs around drm_sched_entity

Patches 1, 3, 4 and 5 are

Reviewed-by: Boris Brezillon 

>   drm/panfrost: use scheduler dependency tracking
>   drm/lima: use scheduler dependency tracking
>   drm/v3d: Move drm_sched_job_init to v3d_job_init
>   drm/v3d: Use scheduler dependency handling
>   drm/etnaviv: Use scheduler dependency handling
>   drm/gem: Delete gem array fencing helpers
>   drm/sched: Don't store self-dependencies
>   drm/sched: Check locking in drm_sched_job_await_implicit
>   drm/msm: Don't break exclusive fence ordering
>   drm/etnaviv: Don't break exclusive fence ordering
>   drm/i915: delete exclude argument from i915_sw_fence_await_reservation
>   drm/i915: Don't break exclusive fence ordering
>   dma-resv: Give the docs a do-over
> 
>  Documentation/gpu/drm-mm.rst  |   3 +
>  drivers/dma-buf/dma-resv.c|  24 ++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c|   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c   |   4 +-
>  drivers/gpu/drm/drm_gem.c |  96 -
>  drivers/gpu/drm/etnaviv/etnaviv_gem.h |   5 +-
>  drivers/gpu/drm/etnaviv/etnaviv_gem_submit.c  |  64 +++---
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  65 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.h   |   3 +-
>  drivers/gpu/drm/i915/display/intel_display.c  |   4 +-
>  drivers/gpu/drm/i915/gem/i915_gem_clflush.c   |   2 +-
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c|   8 +-
>  drivers/gpu/drm/i915/i915_sw_fence.c  |   6 +-
>  drivers/gpu/drm/i915/i915_sw_fence.h  |   1 -
>  drivers/gpu/drm/lima/lima_gem.c   |   7 +-
>  drivers/gpu/drm/lima/lima_sched.c |  28 +--
>  drivers/gpu/drm/lima/lima_sched.h |   6 +-
>  drivers/gpu/drm/msm/msm_gem_submit.c  |   3 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |  16 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |  39 +---
>  drivers/gpu/drm/panfrost/panfrost_job.h   |   5 +-
>  drivers/gpu/drm/scheduler/sched_entity.c  | 140 +++--
>  drivers/gpu/drm/scheduler/sched_fence.c   |  19 +-
>  drivers/gpu/drm/scheduler/sched_main.c| 181 +++--
>  drivers/gpu/drm/v3d/v3d_drv.h |   6 +-
>  drivers/gpu/drm/v3d/v3d_gem.c | 115 +--
>  drivers/gpu/drm/v3d/v3d_sched.c   |  44 +
>  include/drm/drm_gem.h |   5 -
>  include/drm/gpu_scheduler.h   | 186 ++
>  include/linux/dma-buf.h   |   7 +
>  include/linux/dma-resv.h  | 104 +-
>  31 files changed, 672 insertions(+), 528 deletions(-)
> 



[PATCH] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-09-30 Thread Boris Brezillon
So we can create GPU mappings without R/W permissions. Particularly
useful to debug corruptions caused by out-of-bound writes.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 14 --
 drivers/gpu/drm/panfrost/panfrost_gem.c |  2 ++
 drivers/gpu/drm/panfrost/panfrost_gem.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  8 +++-
 include/uapi/drm/panfrost_drm.h |  2 ++
 5 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 82ad9a67f251..40e4a4db3ab1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -75,6 +75,10 @@ static int panfrost_ioctl_get_param(struct drm_device *ddev, 
void *data, struct
return 0;
 }
 
+#define PANFROST_BO_FLAGS \
+   (PANFROST_BO_NOEXEC | PANFROST_BO_HEAP | \
+PANFROST_BO_NOREAD | PANFROST_BO_NOWRITE)
+
 static int panfrost_ioctl_create_bo(struct drm_device *dev, void *data,
struct drm_file *file)
 {
@@ -84,7 +88,7 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, 
void *data,
struct panfrost_gem_mapping *mapping;
 
if (!args->size || args->pad ||
-   (args->flags & ~(PANFROST_BO_NOEXEC | PANFROST_BO_HEAP)))
+   (args->flags & ~PANFROST_BO_FLAGS))
return -EINVAL;
 
/* Heaps should never be executable */
@@ -92,6 +96,11 @@ static int panfrost_ioctl_create_bo(struct drm_device *dev, 
void *data,
!(args->flags & PANFROST_BO_NOEXEC))
return -EINVAL;
 
+   /* Executable implies readable */
+   if ((args->flags & PANFROST_BO_NOREAD) &&
+   !(args->flags & PANFROST_BO_NOEXEC))
+   return -EINVAL;
+
bo = panfrost_gem_create_with_handle(file, dev, args->size, args->flags,
 &args->handle);
if (IS_ERR(bo))
@@ -520,6 +529,7 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
  * - 1.0 - initial interface
  * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO
  * - 1.2 - adds AFBC_FEATURES query
+ * - 1.3 - adds PANFROST_BO_NO{READ,WRITE} flags
  */
 static const struct drm_driver panfrost_drm_driver = {
.driver_features= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
@@ -532,7 +542,7 @@ static const struct drm_driver panfrost_drm_driver = {
.desc   = "panfrost DRM",
.date   = "20180908",
.major  = 1,
-   .minor  = 2,
+   .minor  = 3,
 
.gem_create_object  = panfrost_gem_create_object,
.prime_handle_to_fd = drm_gem_prime_handle_to_fd,
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.c 
b/drivers/gpu/drm/panfrost/panfrost_gem.c
index 23377481f4e3..d6c1bb1445f2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.c
@@ -251,6 +251,8 @@ panfrost_gem_create_with_handle(struct drm_file *file_priv,
 
bo = to_panfrost_bo(&shmem->base);
bo->noexec = !!(flags & PANFROST_BO_NOEXEC);
+   bo->noread = !!(flags & PANFROST_BO_NOREAD);
+   bo->nowrite = !!(flags & PANFROST_BO_NOWRITE);
bo->is_heap = !!(flags & PANFROST_BO_HEAP);
 
/*
diff --git a/drivers/gpu/drm/panfrost/panfrost_gem.h 
b/drivers/gpu/drm/panfrost/panfrost_gem.h
index 8088d5fd8480..6246b5fef446 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gem.h
+++ b/drivers/gpu/drm/panfrost/panfrost_gem.h
@@ -37,6 +37,8 @@ struct panfrost_gem_object {
atomic_t gpu_usecount;
 
bool noexec :1;
+   bool noread :1;
+   bool nowrite:1;
bool is_heap:1;
 };
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index f51d3f791a17..6a5c9d94d6f2 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -307,7 +307,7 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
struct drm_gem_object *obj = &bo->base.base;
struct panfrost_device *pfdev = to_panfrost_device(obj->dev);
struct sg_table *sgt;
-   int prot = IOMMU_READ | IOMMU_WRITE;
+   int prot = 0;
 
if (WARN_ON(mapping->active))
return 0;
@@ -315,6 +315,12 @@ int panfrost_mmu_map(struct panfrost_gem_mapping *mapping)
if (bo->noexec)
prot |= IOMMU_NOEXEC;
 
+   if (!bo->nowrite)
+   prot |= IOMMU_WRITE;
+
+   if (!bo->noread)
+   prot |= IOMMU_READ;
+
sgt = drm_gem_shmem_get_pages_sgt(obj);
if (WARN_ON(IS_ERR(sgt)))
return PTR_ERR(sgt);
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
i

[PATCH v5 1/8] drm/panfrost: Pass a job to panfrost_{acquire, attach}_object_fences()

2021-09-30 Thread Boris Brezillon
So we don't have to change the prototype if we extend the function.

v3:
* Fix subject

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 908d79520853..ed8d1588b1de 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -240,15 +240,13 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
spin_unlock(&pfdev->js->job_lock);
 }
 
-static int panfrost_acquire_object_fences(struct drm_gem_object **bos,
- int bo_count,
- struct drm_sched_job *job)
+static int panfrost_acquire_object_fences(struct panfrost_job *job)
 {
int i, ret;
 
-   for (i = 0; i < bo_count; i++) {
+   for (i = 0; i < job->bo_count; i++) {
/* panfrost always uses write mode in its current uapi */
-   ret = drm_sched_job_add_implicit_dependencies(job, bos[i],
+   ret = drm_sched_job_add_implicit_dependencies(&job->base, 
job->bos[i],
  true);
if (ret)
return ret;
@@ -257,14 +255,12 @@ static int panfrost_acquire_object_fences(struct 
drm_gem_object **bos,
return 0;
 }
 
-static void panfrost_attach_object_fences(struct drm_gem_object **bos,
- int bo_count,
- struct dma_fence *fence)
+static void panfrost_attach_object_fences(struct panfrost_job *job)
 {
int i;
 
-   for (i = 0; i < bo_count; i++)
-   dma_resv_add_excl_fence(bos[i]->resv, fence);
+   for (i = 0; i < job->bo_count; i++)
+   dma_resv_add_excl_fence(job->bos[i]->resv, 
job->render_done_fence);
 }
 
 int panfrost_job_push(struct panfrost_job *job)
@@ -283,8 +279,7 @@ int panfrost_job_push(struct panfrost_job *job)
 
job->render_done_fence = dma_fence_get(&job->base.s_fence->finished);
 
-   ret = panfrost_acquire_object_fences(job->bos, job->bo_count,
-&job->base);
+   ret = panfrost_acquire_object_fences(job);
if (ret) {
mutex_unlock(&pfdev->sched_lock);
goto unlock;
@@ -296,8 +291,7 @@ int panfrost_job_push(struct panfrost_job *job)
 
mutex_unlock(&pfdev->sched_lock);
 
-   panfrost_attach_object_fences(job->bos, job->bo_count,
- job->render_done_fence);
+   panfrost_attach_object_fences(job);
 
 unlock:
drm_gem_unlock_reservations(job->bos, job->bo_count, &acquire_ctx);
-- 
2.31.1



[PATCH v5 0/8] drm/panfrost: Add a new submit ioctl

2021-09-30 Thread Boris Brezillon
Hello,

I finally got to resubmitting a new version of this series. I think
I fixed all the issues reported by Steve and Daniel. Still no support
for {IN,OUT}_FENCE_FD, but that can be added later if we need it.

For those who didn't follow the previous iterations, this is an
attempt at providing a new submit ioctl that's more Vulkan-friendly
than the existing one. This ioctl

1/ allows passing several out syncobjs so we can easily update
   several fence/semaphore in a single ioctl() call
2/ allows passing several jobs so we don't have to have one ioctl
   per job-chain recorded in the command buffer
3/ supports disabling implicit dependencies as well as 
   non-exclusive access to BOs, thus removing unnecessary
   synchronization

I've also been looking at adding {IN,OUT}_FENCE_FD support (allowing
one to pass at most one sync_file object in input and/or creating a
sync_file FD embedding the render out fence), but it's not entirely
clear to me when that's useful. Indeed, we can already do the
sync_file <-> syncobj conversion using the
SYNCOBJ_{FD_TO_HANDLE,HANDLE_TO_FD} ioctls if we have to.
Note that, unlike Turnip, PanVk is using syncobjs to implement
vkQueueWaitIdle(), so the syncobj -> sync_file conversion doesn't
have to happen for each submission, but maybe there's a good reason
to use sync_files for that too. Any feedback on that aspect would
be useful I guess.

Any feedback on this new ioctl is welcome, in particular, do you
think other things are missing/would be nice to have for Vulkan?

Regards,

Boris

P.S.: basic igt tests for these new ioctls re available there [1]

[1] https://gitlab.freedesktop.org/bbrezillon/igt-gpu-tools/-/tree/panfrost-batch-submit

Boris Brezillon (8):
  drm/panfrost: Pass a job to panfrost_{acquire,attach}_object_fences()
  drm/panfrost: Move the mappings collection out of
panfrost_lookup_bos()
  drm/panfrost: Add BO access flags to relax dependencies between jobs
  drm/panfrost: Add the ability to create submit queues
  drm/panfrost: Add a new ioctl to submit batches
  drm/panfrost: Support synchronization jobs
  drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature
  drm/panfrost: Bump minor version to reflect the feature additions

 drivers/gpu/drm/panfrost/Makefile |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.h|   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c   | 637 +-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  93 ++-
 drivers/gpu/drm/panfrost/panfrost_job.h   |   8 +-
 .../gpu/drm/panfrost/panfrost_submitqueue.c   | 132 
 .../gpu/drm/panfrost/panfrost_submitqueue.h   |  26 +
 include/uapi/drm/panfrost_drm.h   | 119 
 8 files changed, 796 insertions(+), 224 deletions(-)
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_submitqueue.c
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_submitqueue.h

-- 
2.31.1



[PATCH v5 8/8] drm/panfrost: Bump minor version to reflect the feature additions

2021-09-30 Thread Boris Brezillon
We now have a new ioctl that allows submitting multiple jobs at once
(among other things) and we support timelined syncobjs. Bump the
minor version number to reflect those changes.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 9f983b763372..21871810df77 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -815,6 +815,9 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
  * - 1.1 - adds HEAP and NOEXEC flags for CREATE_BO
  * - 1.2 - adds AFBC_FEATURES query
  * - 1.3 - adds PANFROST_BO_NO{READ,WRITE} flags
+ * - 1.4 - adds the BATCH_SUBMIT, CREATE_SUBMITQUEUE, DESTROY_SUBMITQUEUE
+ *ioctls, adds support for DEP_ONLY jobs and advertises the
+ *SYNCOBJ_TIMELINE feature
  */
 static const struct drm_driver panfrost_drm_driver = {
.driver_features= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
-- 
2.31.1



[PATCH v5 4/8] drm/panfrost: Add the ability to create submit queues

2021-09-30 Thread Boris Brezillon
Needed to keep VkQueues isolated from each other.

v4:
* Make panfrost_ioctl_create_submitqueue() return the queue ID
  instead of a queue object

v3:
* Limit the number of submitqueue per context to 16
* Fix a deadlock

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/Makefile |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.h|   2 +-
 drivers/gpu/drm/panfrost/panfrost_drv.c   |  71 --
 drivers/gpu/drm/panfrost/panfrost_job.c   |  44 ++
 drivers/gpu/drm/panfrost/panfrost_job.h   |   7 +-
 .../gpu/drm/panfrost/panfrost_submitqueue.c   | 132 ++
 .../gpu/drm/panfrost/panfrost_submitqueue.h   |  26 
 include/uapi/drm/panfrost_drm.h   |  17 +++
 8 files changed, 260 insertions(+), 42 deletions(-)
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_submitqueue.c
 create mode 100644 drivers/gpu/drm/panfrost/panfrost_submitqueue.h

diff --git a/drivers/gpu/drm/panfrost/Makefile 
b/drivers/gpu/drm/panfrost/Makefile
index b71935862417..e99192b66ec9 100644
--- a/drivers/gpu/drm/panfrost/Makefile
+++ b/drivers/gpu/drm/panfrost/Makefile
@@ -9,6 +9,7 @@ panfrost-y := \
panfrost_gpu.o \
panfrost_job.o \
panfrost_mmu.o \
-   panfrost_perfcnt.o
+   panfrost_perfcnt.o \
+   panfrost_submitqueue.o
 
 obj-$(CONFIG_DRM_PANFROST) += panfrost.o
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h 
b/drivers/gpu/drm/panfrost/panfrost_device.h
index 8b25278f34c8..51c0ba4e50f5 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -137,7 +137,7 @@ struct panfrost_mmu {
 struct panfrost_file_priv {
struct panfrost_device *pfdev;
 
-   struct drm_sched_entity sched_entity[NUM_JOB_SLOTS];
+   struct idr queues;
 
struct panfrost_mmu *mmu;
 };
diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index a386c66f349c..f8f430f68090 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -19,6 +19,7 @@
 #include "panfrost_job.h"
 #include "panfrost_gpu.h"
 #include "panfrost_perfcnt.h"
+#include "panfrost_submitqueue.h"
 
 static bool unstable_ioctls;
 module_param_unsafe(unstable_ioctls, bool, 0600);
@@ -259,6 +260,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, 
void *data,
struct panfrost_device *pfdev = dev->dev_private;
struct drm_panfrost_submit *args = data;
struct drm_syncobj *sync_out = NULL;
+   struct panfrost_submitqueue *queue;
struct panfrost_job *job;
int ret = 0, slot;
 
@@ -268,10 +270,16 @@ static int panfrost_ioctl_submit(struct drm_device *dev, 
void *data,
if (args->requirements && args->requirements != PANFROST_JD_REQ_FS)
return -EINVAL;
 
+   queue = panfrost_submitqueue_get(file->driver_priv, 0);
+   if (IS_ERR(queue))
+   return PTR_ERR(queue);
+
if (args->out_sync > 0) {
sync_out = drm_syncobj_find(file, args->out_sync);
-   if (!sync_out)
-   return -ENODEV;
+   if (!sync_out) {
+   ret = -ENODEV;
+   goto fail_put_queue;
+   }
}
 
job = kzalloc(sizeof(*job), GFP_KERNEL);
@@ -291,7 +299,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, 
void *data,
slot = panfrost_job_get_slot(job);
 
ret = drm_sched_job_init(&job->base,
-&job->file_priv->sched_entity[slot],
+&queue->sched_entity[slot],
 NULL);
if (ret)
goto out_put_job;
@@ -304,7 +312,7 @@ static int panfrost_ioctl_submit(struct drm_device *dev, 
void *data,
if (ret)
goto out_cleanup_job;
 
-   ret = panfrost_job_push(job);
+   ret = panfrost_job_push(queue, job);
if (ret)
goto out_cleanup_job;
 
@@ -320,6 +328,8 @@ static int panfrost_ioctl_submit(struct drm_device *dev, 
void *data,
 out_put_syncout:
if (sync_out)
drm_syncobj_put(sync_out);
+fail_put_queue:
+   panfrost_submitqueue_put(queue);
 
return ret;
 }
@@ -469,6 +479,36 @@ static int panfrost_ioctl_madvise(struct drm_device *dev, 
void *data,
return ret;
 }
 
+static int
+panfrost_ioctl_create_submitqueue(struct drm_device *dev, void *data,
+ struct drm_file *file_priv)
+{
+   struct panfrost_file_priv *priv = file_priv->driver_priv;
+   struct drm_panfrost_create_submitqueue *args = data;
+   int ret;
+
+   ret = panfrost_submitqueue_create(priv, args->priority, args->flags);
+   if (ret < 0)
+   return ret;
+
+   args->id = re

[PATCH v5 6/8] drm/panfrost: Support synchronization jobs

2021-09-30 Thread Boris Brezillon
Sometimes, all the user wants to do is add a synchronization point.
Userspace can already do that by submitting a NULL job, but this implies
submitting something to the GPU when we could simply skip the job and
signal the done fence directly.

v5:
* New patch

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 9 +++--
 drivers/gpu/drm/panfrost/panfrost_job.c | 6 ++
 include/uapi/drm/panfrost_drm.h | 7 +++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 30dc158d56e6..89a0c484310c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -542,7 +542,9 @@ static const struct panfrost_submit_ioctl_version_info 
submit_versions[] = {
[1] = { 48, 8, 16 },
 };
 
-#define PANFROST_JD_ALLOWED_REQS PANFROST_JD_REQ_FS
+#define PANFROST_JD_ALLOWED_REQS \
+   (PANFROST_JD_REQ_FS | \
+PANFROST_JD_REQ_DEP_ONLY)
 
 static int
 panfrost_submit_job(struct drm_device *dev, struct drm_file *file_priv,
@@ -559,7 +561,10 @@ panfrost_submit_job(struct drm_device *dev, struct 
drm_file *file_priv,
if (args->requirements & ~PANFROST_JD_ALLOWED_REQS)
return -EINVAL;
 
-   if (!args->head)
+   /* If this is a dependency-only job, the job chain head should be NULL,
+* otherwise it should be non-NULL.
+*/
+   if ((args->head != 0) != !(args->requirements & 
PANFROST_JD_REQ_DEP_ONLY))
return -EINVAL;
 
bo_stride = submit_versions[version].bo_ref_stride;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
b/drivers/gpu/drm/panfrost/panfrost_job.c
index 0367cee8f6df..6d8706d4a096 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -192,6 +192,12 @@ static void panfrost_job_hw_submit(struct panfrost_job 
*job, int js)
u64 jc_head = job->jc;
int ret;
 
+   if (job->requirements & PANFROST_JD_REQ_DEP_ONLY) {
+   /* Nothing to execute, signal the fence directly. */
+   dma_fence_signal_locked(job->done_fence);
+   return;
+   }
+
panfrost_devfreq_record_busy(&pfdev->pfdevfreq);
 
ret = pm_runtime_get_sync(pfdev->dev);
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index 5e3f8a344f41..b9df066970f6 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -46,6 +46,13 @@ extern "C" {
 #define DRM_IOCTL_PANFROST_PERFCNT_DUMP
DRM_IOW(DRM_COMMAND_BASE + DRM_PANFROST_PERFCNT_DUMP, struct 
drm_panfrost_perfcnt_dump)
 
 #define PANFROST_JD_REQ_FS (1 << 0)
+
+/*
+ * Dependency only job. The job chain head should be set to 0 when this flag
+ * is set.
+ */
+#define PANFROST_JD_REQ_DEP_ONLY (1 << 1)
+
 /**
  * struct drm_panfrost_submit - ioctl argument for submitting commands to the 
3D
  * engine.
-- 
2.31.1



[PATCH v5 7/8] drm/panfrost: Advertise the SYNCOBJ_TIMELINE feature

2021-09-30 Thread Boris Brezillon
Now that we have a new SUBMIT ioctl dealing with timelined syncobjs, we
can advertise the feature.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 89a0c484310c..9f983b763372 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -817,7 +817,8 @@ DEFINE_DRM_GEM_FOPS(panfrost_drm_driver_fops);
  * - 1.3 - adds PANFROST_BO_NO{READ,WRITE} flags
  */
 static const struct drm_driver panfrost_drm_driver = {
-   .driver_features= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ,
+   .driver_features= DRIVER_RENDER | DRIVER_GEM | DRIVER_SYNCOBJ |
+ DRIVER_SYNCOBJ_TIMELINE,
.open   = panfrost_open,
.postclose  = panfrost_postclose,
.ioctls = panfrost_drm_driver_ioctls,
-- 
2.31.1



[PATCH v5 2/8] drm/panfrost: Move the mappings collection out of panfrost_lookup_bos()

2021-09-30 Thread Boris Brezillon
So we can re-use it from elsewhere.

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 53 ++---
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index 40e4a4db3ab1..b131da3c9399 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -118,6 +118,34 @@ static int panfrost_ioctl_create_bo(struct drm_device 
*dev, void *data,
return 0;
 }
 
+static int
+panfrost_get_job_mappings(struct drm_file *file_priv, struct panfrost_job *job)
+{
+   struct panfrost_file_priv *priv = file_priv->driver_priv;
+   unsigned int i;
+
+   job->mappings = kvmalloc_array(job->bo_count,
+  sizeof(*job->mappings),
+  GFP_KERNEL | __GFP_ZERO);
+   if (!job->mappings)
+   return -ENOMEM;
+
+   for (i = 0; i < job->bo_count; i++) {
+   struct panfrost_gem_mapping *mapping;
+   struct panfrost_gem_object *bo;
+
+   bo = to_panfrost_bo(job->bos[i]);
+   mapping = panfrost_gem_mapping_get(bo, priv);
+   if (!mapping)
+   return -EINVAL;
+
+   atomic_inc(&bo->gpu_usecount);
+   job->mappings[i] = mapping;
+   }
+
+   return 0;
+}
+
 /**
  * panfrost_lookup_bos() - Sets up job->bo[] with the GEM objects
  * referenced by the job.
@@ -137,9 +165,6 @@ panfrost_lookup_bos(struct drm_device *dev,
  struct drm_panfrost_submit *args,
  struct panfrost_job *job)
 {
-   struct panfrost_file_priv *priv = file_priv->driver_priv;
-   struct panfrost_gem_object *bo;
-   unsigned int i;
int ret;
 
job->bo_count = args->bo_handle_count;
@@ -153,27 +178,7 @@ panfrost_lookup_bos(struct drm_device *dev,
if (ret)
return ret;
 
-   job->mappings = kvmalloc_array(job->bo_count,
-  sizeof(struct panfrost_gem_mapping *),
-  GFP_KERNEL | __GFP_ZERO);
-   if (!job->mappings)
-   return -ENOMEM;
-
-   for (i = 0; i < job->bo_count; i++) {
-   struct panfrost_gem_mapping *mapping;
-
-   bo = to_panfrost_bo(job->bos[i]);
-   mapping = panfrost_gem_mapping_get(bo, priv);
-   if (!mapping) {
-   ret = -EINVAL;
-   break;
-   }
-
-   atomic_inc(&bo->gpu_usecount);
-   job->mappings[i] = mapping;
-   }
-
-   return ret;
+   return panfrost_get_job_mappings(file_priv, job);
 }
 
 /**
-- 
2.31.1



[PATCH v5 5/8] drm/panfrost: Add a new ioctl to submit batches

2021-09-30 Thread Boris Brezillon
This should help limit the number of ioctls when submitting multiple
jobs. The new ioctl also supports syncobj timelines and BO access flags.

v5:
* Fix typos
* Add BUILD_BUG_ON() checks to make sure SUBMIT_BATCH_VERSION and
  descriptor sizes are synced
* Simplify error handling in panfrost_ioctl_batch_submit()
* Don't disable implicit fences on exclusive references

v4:
* Implement panfrost_ioctl_submit() as a wrapper around
  panfrost_submit_job()
* Replace stride fields by a version field which is mapped to
  a  tuple internally

v3:
* Re-use panfrost_get_job_bos() and panfrost_get_job_in_syncs() in the
  old submit path

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 584 
 drivers/gpu/drm/panfrost/panfrost_job.c |   4 +-
 include/uapi/drm/panfrost_drm.h |  92 
 3 files changed, 492 insertions(+), 188 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c 
b/drivers/gpu/drm/panfrost/panfrost_drv.c
index f8f430f68090..30dc158d56e6 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -147,193 +147,6 @@ panfrost_get_job_mappings(struct drm_file *file_priv, 
struct panfrost_job *job)
return 0;
 }
 
-/**
- * panfrost_lookup_bos() - Sets up job->bo[] with the GEM objects
- * referenced by the job.
- * @dev: DRM device
- * @file_priv: DRM file for this fd
- * @args: IOCTL args
- * @job: job being set up
- *
- * Resolve handles from userspace to BOs and attach them to job.
- *
- * Note that this function doesn't need to unreference the BOs on
- * failure, because that will happen at panfrost_job_cleanup() time.
- */
-static int
-panfrost_lookup_bos(struct drm_device *dev,
- struct drm_file *file_priv,
- struct drm_panfrost_submit *args,
- struct panfrost_job *job)
-{
-   unsigned int i;
-   int ret;
-
-   job->bo_count = args->bo_handle_count;
-
-   if (!job->bo_count)
-   return 0;
-
-   job->bo_flags = kvmalloc_array(job->bo_count,
-  sizeof(*job->bo_flags),
-  GFP_KERNEL | __GFP_ZERO);
-   if (!job->bo_flags)
-   return -ENOMEM;
-
-   for (i = 0; i < job->bo_count; i++)
-   job->bo_flags[i] = PANFROST_BO_REF_EXCLUSIVE;
-
-   ret = drm_gem_objects_lookup(file_priv,
-(void __user *)(uintptr_t)args->bo_handles,
-job->bo_count, &job->bos);
-   if (ret)
-   return ret;
-
-   return panfrost_get_job_mappings(file_priv, job);
-}
-
-/**
- * panfrost_copy_in_sync() - Sets up job->deps with the sync objects
- * referenced by the job.
- * @dev: DRM device
- * @file_priv: DRM file for this fd
- * @args: IOCTL args
- * @job: job being set up
- *
- * Resolve syncobjs from userspace to fences and attach them to job.
- *
- * Note that this function doesn't need to unreference the fences on
- * failure, because that will happen at panfrost_job_cleanup() time.
- */
-static int
-panfrost_copy_in_sync(struct drm_device *dev,
- struct drm_file *file_priv,
- struct drm_panfrost_submit *args,
- struct panfrost_job *job)
-{
-   u32 *handles;
-   int ret = 0;
-   int i, in_fence_count;
-
-   in_fence_count = args->in_sync_count;
-
-   if (!in_fence_count)
-   return 0;
-
-   handles = kvmalloc_array(in_fence_count, sizeof(u32), GFP_KERNEL);
-   if (!handles) {
-   ret = -ENOMEM;
-   DRM_DEBUG("Failed to allocate incoming syncobj handles\n");
-   goto fail;
-   }
-
-   if (copy_from_user(handles,
-  (void __user *)(uintptr_t)args->in_syncs,
-  in_fence_count * sizeof(u32))) {
-   ret = -EFAULT;
-   DRM_DEBUG("Failed to copy in syncobj handles\n");
-   goto fail;
-   }
-
-   for (i = 0; i < in_fence_count; i++) {
-   struct dma_fence *fence;
-
-   ret = drm_syncobj_find_fence(file_priv, handles[i], 0, 0,
-&fence);
-   if (ret)
-   goto fail;
-
-   ret = drm_sched_job_add_dependency(&job->base, fence);
-
-   if (ret)
-   goto fail;
-   }
-
-fail:
-   kvfree(handles);
-   return ret;
-}
-
-static int panfrost_ioctl_submit(struct drm_device *dev, void *data,
-   struct drm_file *file)
-{
-   struct panfrost_device *pfdev = dev->dev_private;
-   struct drm_panfrost_submit *args = data;
-   struct drm_syncobj *sync_out = NULL;
-   struct panfrost_submitqueue *queue;
-   struct panfrost_job *job;
-   int ret =

[PATCH v5 3/8] drm/panfrost: Add BO access flags to relax dependencies between jobs

2021-09-30 Thread Boris Brezillon
Jobs reading from the same BO should not be serialized. Add access
flags so we can relax the implicit dependencies in that case. We force
exclusive access for now to keep the behavior unchanged, but a new
SUBMIT ioctl taking explicit access flags will be introduced.
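As a rough sketch of how userspace might eventually populate per-BO access flags once that explicit-flags SUBMIT ioctl lands (the `bo_ref` struct layout here is hypothetical; only PANFROST_BO_REF_EXCLUSIVE comes from this patch):

```c
#include <assert.h>
#include <stdint.h>

/* From this patch: exclusive (AKA write) access to the BO. */
#define PANFROST_BO_REF_EXCLUSIVE 0x1

/*
 * Hypothetical per-BO reference descriptor; the real layout will be
 * defined by the future SUBMIT ioctl rework.
 */
struct bo_ref {
	uint32_t handle;
	uint32_t flags;
};

/*
 * Mark only the BOs the job actually writes as exclusive, so jobs
 * that merely read the same BO are no longer serialized against
 * each other.
 */
static void build_bo_refs(struct bo_ref *refs, const uint32_t *handles,
			  const int *written, unsigned int count)
{
	for (unsigned int i = 0; i < count; i++) {
		refs[i].handle = handles[i];
		refs[i].flags = written[i] ? PANFROST_BO_REF_EXCLUSIVE : 0;
	}
}
```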

Signed-off-by: Boris Brezillon 
Reviewed-by: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_drv.c | 10 ++
 drivers/gpu/drm/panfrost/panfrost_job.c | 21 ++---
 drivers/gpu/drm/panfrost/panfrost_job.h |  1 +
 include/uapi/drm/panfrost_drm.h |  3 +++
 4 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_drv.c b/drivers/gpu/drm/panfrost/panfrost_drv.c
index b131da3c9399..a386c66f349c 100644
--- a/drivers/gpu/drm/panfrost/panfrost_drv.c
+++ b/drivers/gpu/drm/panfrost/panfrost_drv.c
@@ -165,6 +165,7 @@ panfrost_lookup_bos(struct drm_device *dev,
  struct drm_panfrost_submit *args,
  struct panfrost_job *job)
 {
+   unsigned int i;
int ret;
 
job->bo_count = args->bo_handle_count;
@@ -172,6 +173,15 @@ panfrost_lookup_bos(struct drm_device *dev,
if (!job->bo_count)
return 0;
 
+   job->bo_flags = kvmalloc_array(job->bo_count,
+  sizeof(*job->bo_flags),
+  GFP_KERNEL | __GFP_ZERO);
+   if (!job->bo_flags)
+   return -ENOMEM;
+
+   for (i = 0; i < job->bo_count; i++)
+   job->bo_flags[i] = PANFROST_BO_REF_EXCLUSIVE;
+
ret = drm_gem_objects_lookup(file_priv,
 (void __user *)(uintptr_t)args->bo_handles,
 job->bo_count, &job->bos);
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index ed8d1588b1de..1a9085d8dcf1 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -245,9 +245,17 @@ static int panfrost_acquire_object_fences(struct panfrost_job *job)
int i, ret;
 
for (i = 0; i < job->bo_count; i++) {
+   bool exclusive = job->bo_flags[i] & PANFROST_BO_REF_EXCLUSIVE;
+
+   if (!exclusive) {
+   ret = dma_resv_reserve_shared(job->bos[i]->resv, 1);
+   if (ret)
+   return ret;
+   }
+
/* panfrost always uses write mode in its current uapi */
ret = drm_sched_job_add_implicit_dependencies(&job->base, job->bos[i],
- true);
+ exclusive);
if (ret)
return ret;
}
@@ -259,8 +267,14 @@ static void panfrost_attach_object_fences(struct panfrost_job *job)
 {
int i;
 
-   for (i = 0; i < job->bo_count; i++)
-   dma_resv_add_excl_fence(job->bos[i]->resv, job->render_done_fence);
+   for (i = 0; i < job->bo_count; i++) {
+   struct dma_resv *robj = job->bos[i]->resv;
+
+   if (job->bo_flags[i] & PANFROST_BO_REF_EXCLUSIVE)
+   dma_resv_add_excl_fence(robj, job->render_done_fence);
+   else
+   dma_resv_add_shared_fence(robj, job->render_done_fence);
+   }
 }
 
 int panfrost_job_push(struct panfrost_job *job)
@@ -326,6 +340,7 @@ static void panfrost_job_cleanup(struct kref *ref)
kvfree(job->bos);
}
 
+   kvfree(job->bo_flags);
kfree(job);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.h b/drivers/gpu/drm/panfrost/panfrost_job.h
index 77e6d0e6f612..96d755f12cf7 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.h
+++ b/drivers/gpu/drm/panfrost/panfrost_job.h
@@ -28,6 +28,7 @@ struct panfrost_job {
 
struct panfrost_gem_mapping **mappings;
struct drm_gem_object **bos;
+   u32 *bo_flags;
u32 bo_count;
 
/* Fence to be signaled by drm-sched once its done with the job */
diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index a2de81225125..c8fdf45b1573 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -226,6 +226,9 @@ struct drm_panfrost_madvise {
__u32 retained;   /* out, whether backing store still exists */
 };
 
+/* Exclusive (AKA write) access to the BO */
+#define PANFROST_BO_REF_EXCLUSIVE  0x1
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.31.1



Re: [PATCH] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-09-30 Thread Boris Brezillon
On Thu, 30 Sep 2021 15:13:29 -0400
Alyssa Rosenzweig  wrote:

> > +   /* Executable implies readable */
> > +   if ((args->flags & PANFROST_BO_NOREAD) &&
> > +   !(args->flags & PANFROST_BO_NOEXEC))
> > +   return -EINVAL;  
> 
> Generally, executable also implies not-writeable. Should we check that?

We were allowing it until now, so doing that would break backward
compatibility, unfortunately. Steve also mentioned that the DDK might use
shaders modifying other shaders here [1], it clearly doesn't happen in
panfrost, but I think I'd prefer to keep the existing behavior by
default, just to be safe. I'll send a patch setting the RO flag on
all executable BOs in mesa/panfrost.

[1]https://oftc.irclog.whitequark.org/panfrost/2021-09-02
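For illustration, a user-space-compilable sketch of the check under discussion. The NOREAD/NOWRITE flag values are placeholders (they are not taken from the patch); only the "NOREAD requires NOEXEC" rule is from the quoted code, and the stricter "executable implies not-writeable" rule is shown commented out since the thread decided against enforcing it:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PANFROST_BO_NOEXEC  (1 << 0)  /* existing uAPI flag */
#define PANFROST_BO_NOREAD  (1 << 2)  /* placeholder value */
#define PANFROST_BO_NOWRITE (1 << 3)  /* placeholder value */

static int validate_bo_flags(uint32_t flags)
{
	/* Executable implies readable: NOREAD only makes sense with NOEXEC. */
	if ((flags & PANFROST_BO_NOREAD) && !(flags & PANFROST_BO_NOEXEC))
		return -EINVAL;

	/*
	 * The stricter rule ("executable implies not-writeable") was
	 * considered but rejected to preserve backward compatibility:
	 *
	 * if (!(flags & PANFROST_BO_NOEXEC) && !(flags & PANFROST_BO_NOWRITE))
	 *         return -EINVAL;
	 */
	return 0;
}
```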


Re: [PATCH] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-09-30 Thread Boris Brezillon
On Thu, 30 Sep 2021 20:47:23 +0200
Boris Brezillon  wrote:

> So we can create GPU mappings without R/W permissions. Particularly
> useful to debug corruptions caused by out-of-bound writes.

Oops, I forgot to add the PANFROST_BO_PRIVATE flag suggested by Robin
here [1]. I'll send a v2.

[1]https://oftc.irclog.whitequark.org/panfrost/2021-09-02


Re: [PATCH] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-09-30 Thread Boris Brezillon
On Thu, 30 Sep 2021 18:12:11 -0400
Alyssa Rosenzweig  wrote:

> > > > +   /* Executable implies readable */
> > > > +   if ((args->flags & PANFROST_BO_NOREAD) &&
> > > > +   !(args->flags & PANFROST_BO_NOEXEC))
> > > > +   return -EINVAL;
> > > 
> > > Generally, executable also implies not-writeable. Should we check that?  
> > 
> > We were allowing it until now, so doing that would break the backward
> > compat, unfortunately.  
> 
> Not a problem if you only enforce this starting with the appropriate
> UABI version, but...

I still don't see how that solves the situation, since old userspace
doesn't know about the new UABI, and
there's no version field on the CREATE_BO ioctl() to let the kernel
know about the UABI used by this userspace program. I mean, we could
add one, or add a new PANFROST_BO_EXTENDED_FLAGS flag to enforce this
'noexec implies nowrite' behavior, but is it really simpler than
explicitly passing the NOWRITE flag when NOEXEC is passed?

> 
> > Steve also mentioned that the DDK might use shaders modifying other
> > shaders here [1]  
> 
> What? I believe it, but what?
> 
> For the case of pilot shaders, that shouldn't require self-modifying
> code. As I understand, the DDK binds the push uniform (FAU / RMU) buffer
> as global shader memory (SSBO) and uses regular STORE instructions on
> it. That requires writability on that BO but that should be fine.

Okay.



Re: [PATCH] drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags

2021-10-01 Thread Boris Brezillon
Hi Robin,

On Thu, 30 Sep 2021 21:44:24 +0200
Boris Brezillon  wrote:

> On Thu, 30 Sep 2021 20:47:23 +0200
> Boris Brezillon  wrote:
> 
> > So we can create GPU mappings without R/W permissions. Particularly
> > useful to debug corruptions caused by out-of-bound writes.  
> 
> Oops, I forgot to add the PANFROST_BO_PRIVATE flag suggested by Robin
> here [1]. I'll send a v2.

When you're talking about a PANFROST_BO_GPU_PRIVATE flag (or
PANFROST_BO_NO_CPU_ACCESS), you mean something that can set
ARM_LPAE_PTE_SH_IS instead of the unconditional ARM_LPAE_PTE_SH_OS we
have right now [1], right? In this case, how would you pass this info
to the iommu? Looks like we have an IOMMU_CACHE, but I don't think
it reflects what we're trying to do. IOMMU_PRIV is about privileged
mappings, so definitely not what we want. Should we add a new
IOMMU_NO_{EXTERNAL,HOST,CPU}_ACCESS flag for that?

Regards,

Boris

[1]https://elixir.bootlin.com/linux/v5.15-rc3/source/drivers/iommu/io-pgtable-arm.c#L453


[PATCH v2 1/5] [RFC]iommu: Add a IOMMU_DEVONLY protection flag

2021-10-01 Thread Boris Brezillon
The IOMMU_DEVONLY flag allows the caller to flag mappings backed by
device-private buffers. That means other devices or CPUs are not
expected to access the physical memory region pointed by the mapping,
and the MMU driver can safely restrict the shareability domain to the
device itself.

Will be used by the ARM MMU driver to flag Mali mappings accessed only
by the GPU as Inner-shareable.
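A minimal sketch of the decision this enables in the page-table code, extracted into plain C. The flag values match the diff below and include/linux/iommu.h; `pick_shareability()` is a stand-in for what the ARM_MALI_LPAE handling in patch 2/5 (not shown here) actually does:

```c
#include <assert.h>

/* Standard IOMMU protection flags (from include/linux/iommu.h). */
#define IOMMU_READ    (1 << 0)
#define IOMMU_WRITE   (1 << 1)
#define IOMMU_DEVONLY (1 << 6)  /* added by this patch */

enum shareability { SH_OUTER, SH_INNER };

/*
 * Device-only mappings are never touched by the CPU or other devices,
 * so the MMU driver can use the narrower inner-shareable domain
 * (ARM_LPAE_PTE_SH_IS); everything else stays outer-shareable.
 */
static enum shareability pick_shareability(int prot)
{
	return (prot & IOMMU_DEVONLY) ? SH_INNER : SH_OUTER;
}
```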

Signed-off-by: Boris Brezillon 
---
 include/linux/iommu.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d2f3435e7d17..db14781b522f 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -31,6 +31,13 @@
  * if the IOMMU page table format is equivalent.
  */
 #define IOMMU_PRIV (1 << 5)
+/*
+ * Mapping is only accessed by the device behind the iommu. That means other
+ * devices or CPUs are not expected to access this physical memory region,
+ * and the MMU driver can safely restrict the shareability domain to the
+ * device itself.
+ */
+#define IOMMU_DEVONLY  (1 << 6)
 
 struct iommu_ops;
 struct iommu_group;
-- 
2.31.1



[PATCH v2 0/5] drm/panfrost: Add extra GPU-usage flags

2021-10-01 Thread Boris Brezillon
Hello,

This is a follow-up of [1], which was adding the read/write
restrictions on GPU buffers. Robin and Steven suggested that I add a
flag to restrict the shareability domain on GPU-private buffers, so
here it is.

As you can see, the first patch is flagged RFC, since I'm not sure
adding a new IOMMU_ flag is the right solution, but IOMMU_CACHE
doesn't feel like a good fit either. Please let me know if you have
better ideas.

Regards,

Boris

[1]https://patchwork.kernel.org/project/dri-devel/patch/20210930184723.1482426-1-boris.brezil...@collabora.com/

Boris Brezillon (5):
  [RFC]iommu: Add a IOMMU_DEVONLY protection flag
  [RFC]iommu/io-pgtable-arm: Take the DEVONLY flag into account on
ARM_MALI_LPAE
  drm/panfrost: Add PANFROST_BO_NO{READ,WRITE} flags
  drm/panfrost: Add a PANFROST_BO_GPUONLY flag
  drm/panfrost: Bump the driver version to 1.3

 drivers/gpu/drm/panfrost/panfrost_drv.c | 15 +--
 drivers/gpu/drm/panfrost/panfrost_gem.c |  3 +++
 drivers/gpu/drm/panfrost/panfrost_gem.h |  3 +++
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 11 ++-
 drivers/iommu/io-pgtable-arm.c  | 25 +
 include/linux/iommu.h   |  7 +++
 include/uapi/drm/panfrost_drm.h |  3 +++
 7 files changed, 56 insertions(+), 11 deletions(-)

-- 
2.31.1


