Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-17 Thread Boris Brezillon
+Matthew who's been using drm_exec in Xe if I'm correct.

Hello Christian,

On Wed, 14 Jun 2023 15:02:52 +0200
Boris Brezillon  wrote:

> On Wed, 14 Jun 2023 14:30:53 +0200
> Christian König  wrote:
> 
> > On 14.06.23 at 14:23, Boris Brezillon wrote:  
> > > Hi Christian,
> > >
> > > On Thu,  4 May 2023 13:51:47 +0200
> > > "Christian König"  wrote:
> > >
> > >> This adds the infrastructure for an execution context for GEM buffers
> > >> which is similar to the existing TTMs execbuf util and intended to 
> > >> replace
> > >> it in the long term.
> > >>
> > >> The basic functionality is that we abstract the necessary loop to lock
> > >> many different GEM buffers with automated deadlock and duplicate 
> > >> handling.
> > > As many other drivers do already, we are considering using drm_exec()
> > > for our resv locking in the PowerVR driver, so we might have more
> > > questions/comments in the coming days/weeks, but I already have a
> > > couple right now (see below).
> > >
> > >> v3: drop duplicate tracking, radeon is really the only one needing that  
> > >>   
> > > I think we'd actually be interested in duplicate tracking. Is there any
> > > way we can make it an optional feature through some extra helpers/flags?
> > > Doesn't have to be done in this patch series, I'm just wondering if this
> > > is something we can share as well.
> > 
> > You can still capture the -EALREADY error and act appropriately in your 
> > driver.
> > 
> > For radeon it just means ignoring the error code and going ahead, but 
> > that behavior doesn't seem to be desired in most cases.
> > 
> > Initially I thought we needed to separately track how many and how often
> > BOs are duplicated, but there is simply no use for this.
> >   
> > >
> > > [...]
> > >
> > >> +/**
> > >> + * DOC: Overview
> > >> + *
> > >> + * This component mainly abstracts the retry loop necessary for locking
> > >> + * multiple GEM objects while preparing hardware operations (e.g. 
> > >> command
> > >> + * submissions, page table updates etc..).
> > >> + *
> > >> + * If a contention is detected while locking a GEM object the cleanup 
> > >> procedure
> > >> + * unlocks all previously locked GEM objects and locks the contended 
> > >> one first
> > >> + * before locking any further objects.
> > >> + *
> > >> + * After an object is locked fences slots can optionally be reserved on 
> > >> the
> > >> + * dma_resv object inside the GEM object.
> > >> + *
> > >> + * A typical usage pattern should look like this::
> > >> + *
> > >> + *  struct drm_gem_object *obj;
> > >> + *  struct drm_exec exec;
> > >> + *  unsigned long index;
> > >> + *  int ret;
> > >> + *
> > >> + *  drm_exec_init(&exec, true);
> > >> + *  drm_exec_while_not_all_locked(&exec) {
> > >> + *  ret = drm_exec_prepare_obj(&exec, boA, 1);
> > >> + *  drm_exec_continue_on_contention(&exec);
> > >> + *  if (ret)
> > >> + *  goto error;
> > >> + *
> > > Have you considered defining a drm_exec_try_prepare_obj_or_retry()
> > > combining drm_exec_prepare_obj() and drm_exec_continue_on_contention()?
> > >
> > > #define drm_exec_try_prepare_obj_or_retry(exec, obj, num_fences) \
> > >  ({ \
> > >  int __ret = drm_exec_prepare_obj(exec, bo, num_fences); \
> > >  if (unlikely(drm_exec_is_contended(exec))) \
> > >  continue; \
> > >  __ret; \
> > >  })
> > >
> > > This way the following pattern
> > >
> > >   ret = drm_exec_prepare_obj(&exec, boA, 1);
> > >   drm_exec_continue_on_contention(&exec);
> > >   if (ret)
> > >   goto error;
> > >
> > > can be turned into something more conventional:
> > >
> > >   ret = drm_exec_try_prepare_obj_or_retry(&exec, boA, 1);
> > >   if (ret)
> > >   goto error;
> > 
> > Yeah, I was considering that as well. But then abandoned it as too
> > complicated.
> > 
> > I really need to find some time to work on that anyway.

[PATCH v6] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-19 Thread Boris Brezillon
drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
from the dependency array that was waited upon before
drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
so we're basically waiting for all dependencies except one.

In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.

Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().

We also make sure we wait on drm_sched_fence::finished if we were
originally asked to wait on drm_sched_fence::scheduled. In that case,
we assume the intent was to delegate the wait to the firmware/GPU or
rely on the pipelining done at the entity/scheduler level, but when
killing jobs, we really want to wait for completion not just scheduling.

v6:
- Back to v4 implementation
- Add Christian's R-b

v5:
- Flag deps on which we should only wait for the scheduled event
  at insertion time

v4:
- Fix commit message
- Fix a use-after-free bug

v3:
- Always wait for drm_sched_fence::finished fences in
  drm_sched_entity_kill_jobs_cb() when we see a sched_fence

v2:
- Don't evict deps in drm_sched_job_dependency()

Signed-off-by: Boris Brezillon 
Suggested-by: "Christian König" 
Reviewed-by: "Christian König" 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 
---
 drivers/gpu/drm/scheduler/sched_entity.c | 41 +++-
 1 file changed, 33 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 68e807ae136a..ec41d82d0141 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -176,16 +176,32 @@ static void drm_sched_entity_kill_jobs_cb(struct 
dma_fence *f,
 {
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 finish_cb);
-   int r;
+   unsigned long index;
 
dma_fence_put(f);
 
/* Wait for all dependencies to avoid data corruptions */
-   while (!xa_empty(&job->dependencies)) {
-   f = xa_erase(&job->dependencies, job->last_dependency++);
-   r = dma_fence_add_callback(f, &job->finish_cb,
-  drm_sched_entity_kill_jobs_cb);
-   if (!r)
+   xa_for_each(&job->dependencies, index, f) {
+   struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
+
+   if (s_fence && f == &s_fence->scheduled) {
+   /* The dependencies array had a reference on the 
scheduled
+* fence, and the finished fence refcount might have
+* dropped to zero. Use dma_fence_get_rcu() so we get
+* a NULL fence in that case.
+*/
+   f = dma_fence_get_rcu(&s_fence->finished);
+
+   /* Now that we have a reference on the finished fence,
+* we can release the reference the dependencies array
+* had on the scheduled fence.
+*/
+   dma_fence_put(&s_fence->scheduled);
+   }
+
+   xa_erase(&job->dependencies, index);
+   if (f && !dma_fence_add_callback(f, &job->finish_cb,
+drm_sched_entity_kill_jobs_cb))
return;
 
dma_fence_put(f);
@@ -415,8 +431,17 @@ static struct dma_fence *
 drm_sched_job_dependency(struct drm_sched_job *job,
 struct drm_sched_entity *entity)
 {
-   if (!xa_empty(&job->dependencies))
-   return xa_erase(&job->dependencies, job->last_dependency++);
+   struct dma_fence *f;
+
+   /* We keep the fence around, so we can iterate over all dependencies
+* in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
+* before killing the job.
+*/
+   f = xa_load(&job->dependencies, job->last_dependency);
+   if (f) {
+   job->last_dependency++;
+   return dma_fence_get(f);
+   }
 
if (job->sched->ops->prepare_job)
return job->sched->ops->prepare_job(job, entity);
-- 
2.40.1



Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Boris Brezillon
Hello Thomas,

On Mon, 19 Jun 2023 10:59:16 +0200
Thomas Hellström (Intel)  wrote:

>    
> > +/**
> > + * DOC: Overview
> > + *
> > + * This component mainly abstracts the retry loop necessary for locking
> > + * multiple GEM objects while preparing hardware operations (e.g. 
> > command
> > + * submissions, page table updates etc..).
> > + *
> > + * If a contention is detected while locking a GEM object the cleanup 
> > procedure
> > + * unlocks all previously locked GEM objects and locks the contended 
> > one first
> > + * before locking any further objects.
> > + *
> > + * After an object is locked fences slots can optionally be reserved 
> > on the
> > + * dma_resv object inside the GEM object.
> > + *
> > + * A typical usage pattern should look like this::
> > + *
> > + * struct drm_gem_object *obj;
> > + * struct drm_exec exec;
> > + * unsigned long index;
> > + * int ret;
> > + *
> > + * drm_exec_init(&exec, true);
> > + * drm_exec_while_not_all_locked(&exec) {
> > + * ret = drm_exec_prepare_obj(&exec, boA, 1);
> > + * drm_exec_continue_on_contention(&exec);
> > + * if (ret)
> > + * goto error;
> > + *  
>  Have you considered defining a drm_exec_try_prepare_obj_or_retry()
>  combining drm_exec_prepare_obj() and drm_exec_continue_on_contention()?
> 
>  #define drm_exec_try_prepare_obj_or_retry(exec, obj, num_fences) \
>    ({ \
>    int __ret = drm_exec_prepare_obj(exec, bo, 
>  num_fences); \
>    if (unlikely(drm_exec_is_contended(exec))) \
>    continue; \
>    __ret; \
>    })
> 
>  This way the following pattern
> 
>   ret = drm_exec_prepare_obj(&exec, boA, 1);
>   drm_exec_continue_on_contention(&exec);
>   if (ret)
>   goto error;
> 
>  can be turned into something more conventional:
> 
>   ret = drm_exec_try_prepare_obj_or_retry(&exec, boA, 1);
>   if (ret)
>   goto error;  
> > >>> Yeah, I was considering that as well. But then abandoned it as too
> > >>> complicated.
> >>>
> >>> I really need to find some time to work on that anyway.  
> > I've been playing with drm_exec for a couple weeks now, and I wanted
> > to share something I hacked to try and make the API simpler and
> > more robust against misuse (see the below diff, which is a slightly
> > adjusted version of your work).  
> 
> It would be good if we could have someone taking charge of this series 
> and address all review comments, I see some of my comments getting lost, 
> we have multiple submitters and I can't find a dri-devel patchwork entry 
> for this.

My bad, I wasn't intending to submit a new version. I just added a
diff to show what I had in mind. This being said, it'd be great if we
could make some progress on this series, because we have quite a few
drivers depending on it now.

> 
> >
> > In this version, the user is no longer in control of the retry
> > loop. Instead, it provides an expression (a call to a
> > sub-function) to be re-evaluated each time a contention is
> > detected. IMHO, this makes the 'prepare-objs' functions easier to
> > apprehend, and avoids any mistake like calling
> > drm_exec_continue_on_contention() in an inner loop, or breaking
> > out of the drm_exec_while_all_locked() loop unintentionally.  
> 
> In i915 we've had a very similar helper to this, and while I agree this
> newer version would probably help make code cleaner, OTOH there are also
> some places where the short drm_exec_while_all_locked()-like blocks
> don't really motivate a separate function. Porting i915 to the current
> version will take some work. For the xe driver both versions would work
> fine.

Note that the drm_exec_until_all_locked() helper I introduced is taking
an expression, so in theory, you don't have to define a separate
function.

drm_exec_until_all_locked(&exec, {
/* inlined-code */
int ret;

ret = blabla()
if (ret)
goto error;

...

error:
/* return value. */
ret;
});

This being said, as soon as you have several failure paths,
it makes things a lot easier/controllable if you make it a function,
and I honestly don't think the readability would suffer from having a
function defined just above the user. My main concern with the original
approach was the risk of calling continue/break_if_contended() in the
wrong place, and also the fact you can't really externalize things to
a function if you're looking for a cleaner split. At least with
drm_exec_until_all_locked() you can do both.
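
To make the function-based flavour concrete, here is a rough sketch
(myctx_prepare_objs(), struct myctx and the error label are made-up
names, and it assumes the expression-based drm_exec_until_all_locked()
variant from my diff):

static int myctx_prepare_objs(struct drm_exec *exec, struct myctx *ctx)
{
        int ret;

        ret = drm_exec_prepare_obj(exec, ctx->boA, 1);
        if (ret)
                return ret;

        return drm_exec_prepare_obj(exec, ctx->boB, 1);
}

        ...
        ret = drm_exec_until_all_locked(&exec,
                                        myctx_prepare_objs(&exec, ctx));
        if (ret)
                goto err_cleanup;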

Regards,

Boris

Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Boris Brezillon
On Mon, 19 Jun 2023 11:20:06 +0200
Christian König  wrote:

> Hi guys,
> 
> On 19.06.23 at 10:59, Thomas Hellström (Intel) wrote:
> > [SNIP]  
> 
>  I really need to find some time to work on that anyway.  
> >> I've been playing with drm_exec for a couple weeks now, and I wanted
> >> to share something I hacked to try and make the API simpler and
> >> more robust against misuse (see the below diff, which is a slightly
> >> adjusted version of your work).  
> >
> > It would be good if we could have someone taking charge of this series 
> > and address all review comments, I see some of my comments getting 
> > lost, we have multiple submitters and I can't find a dri-devel 
> > patchwork entry for this. Anyway some comments below.  
> 
> I can try to find some time for the series this week (As long as nobody 
> comes along and has any burning roof).

That's great news!

> 
> >  
> >>
> >> In this version, the user is no longer in control of the retry
> >> loop. Instead, it provides an expression (a call to a
> >> sub-function) to be re-evaluated each time a contention is
> >> detected. IMHO, this makes the 'prepare-objs' functions easier to
> >> apprehend, and avoids any mistake like calling
> >> drm_exec_continue_on_contention() in an inner loop, or breaking
> >> out of the drm_exec_while_all_locked() loop unintentionally.  
> >
> > In i915 we've had a very similar helper to this, and while I agree this
> > newer version would probably help make code cleaner, OTOH there are also
> > some places where the short drm_exec_while_all_locked()-like blocks
> > don't really motivate a separate function. Porting i915 to the current
> > version will take some work. For the xe driver both versions would work
> > fine.
> 
> Yeah, this is actually what my first version of this looked like. But I 
> abandoned that approach because we have a lot of cases where we just 
> quickly want to lock a few GEM objects and don't want the extra overhead 
> of putting all the state into some bag to forward it to a function.

If you're talking about verbosity, it might be the case, though I guess
it's mostly a matter of taste (I do like it when things are well isolated).
As for runtime overhead, I'd expect the compiler to inline the function
anyway, so it's unlikely to change anything.

> >> +/* Track the locked object in the array */
> >> +static int drm_exec_obj_locked(struct drm_exec *exec,
> >> +   struct drm_gem_object *obj)
> >> +{
> >> +    if (unlikely(exec->num_objects == exec->max_objects)) {
> >> +    size_t size = exec->max_objects * sizeof(void *);
> >> +    void *tmp;
> >> +
> >> +    tmp = kvrealloc(exec->objects, size, size + PAGE_SIZE,
> >> +    GFP_KERNEL);
> >> +    if (!tmp)
> >> +    return -ENOMEM;  
> >
> > Sometimes you need to just temporarily lock an object and then unlock 
> > it again if it goes out of scope before reaching the end of 
> > _until_all_locked(). In that case you might need to remove a lock from 
> > the array. I *think* for all use-cases in i915 it would suffice to 
> > take a snapshot of num_objects, and unlock everything above that, 
> > having exec->objects behave like a stack, but was ever a list 
> > considered instead of a realloced array?  
> 
> Yes, the problem is that linked lists really suck regarding their cache 
> line locality. That's why I came up with this approach here.

Hm, maybe I'm missing something, but if you place the list_head obj you
use to stack the locked objects close enough to the resv pointer, and
aligned on cache line, it shouldn't really be a problem, given you have
to dereference the GEM object to retrieve its resv anyway.
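
To illustrate what I mean (hypothetical layout, not what struct
drm_gem_object actually looks like today):

struct drm_gem_object {
        /* ... other fields ... */

        /* Both fields are touched when locking the object, so keeping
         * them adjacent and cache-line aligned means walking the
         * locked-objects list and grabbing the resv doesn't cost an
         * extra cache miss.
         */
        struct dma_resv *resv ____cacheline_aligned;
        struct list_head exec_node;
};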


Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Boris Brezillon
On Mon, 19 Jun 2023 12:44:06 +0200
Christian König  wrote:

> On 19.06.23 at 12:12, Boris Brezillon wrote:
> > [SNIP]
> > Note that the drm_exec_until_all_locked() helper I introduced is taking
> > an expression, so in theory, you don't have to define a separate
> > function.
> >
> > drm_exec_until_all_locked(&exec, {
> > /* inlined-code */
> > int ret;
> >
> > ret = blabla()
> > if (ret)
> > goto error;
> >
> > ...
> >
> > error:
> > /* return value. */
> > ret;
> > });
> >
> > This being said, as soon as you have several failure paths,
> > it makes things a lot easier/controllable if you make it a function,
> > and I honestly don't think the readability would suffer from having a
> > function defined just above the user. My main concern with the original
> > approach was the risk of calling continue/break_if_contended() in the
> > wrong place, and also the fact you can't really externalize things to
> > a function if you're looking for a cleaner split. At least with
> > drm_exec_until_all_locked() you can do both.  
> 
> Yeah, but that means that you can't use return inside your code block
> and instead have to define an error label for handling "normal"
> contention, which is what I'm trying to avoid here.
> 
> How about:
> 
> #define drm_exec_until_all_locked(exec)    \
>      __drm_exec_retry: if (drm_exec_cleanup(exec))
> 
> 
> #define drm_exec_retry_on_contention(exec)  \
>      if (unlikely(drm_exec_is_contended(exec)))  \
>      goto __drm_exec_retry
> 
> 
> And then use it like:
> 
> drm_exec_until_all_locked(exec)
> {
>      ret = drm_exec_prepare_obj(exec, obj);
>      drm_exec_retry_on_contention(exec);
> }

That would work, and I was about to suggest extending my proposal with
a drm_exec_retry_on_contention() to support both use cases. The only
downside is the fact you might be able to break out of a loop that has
local variables, which will leak stack space.

> 
> The only problem I can see with this is that the __drm_exec_retry label 
> would be function local.

You can use local labels [1] to make it local to a block (see my
version, just need to rename the retry label into __drm_exec_retry). I
checked, and this is used elsewhere in the kernel (like in
linux/wait.h, which is a core feature), so it should be safe to use.
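
For reference, here's a trivial made-up example of a block-local label
(do_retryable_thing() is just a placeholder):

        {
                __label__ retry;        /* label only visible in this block */
                int tries = 0;

        retry:
                if (do_retryable_thing() && tries++ < 3)
                        goto retry;
        }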

[1]https://gcc.gnu.org/onlinedocs/gcc/Local-Labels.html


Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Boris Brezillon
On Mon, 19 Jun 2023 13:05:02 +0200
Boris Brezillon  wrote:

> On Mon, 19 Jun 2023 12:44:06 +0200
> Christian König  wrote:
> 
> > On 19.06.23 at 12:12, Boris Brezillon wrote:  
> > > [SNIP]
> > > Note that the drm_exec_until_all_locked() helper I introduced is taking
> > > an expression, so in theory, you don't have to define a separate
> > > function.
> > >
> > >   drm_exec_until_all_locked(&exec, {
> > >   /* inlined-code */
> > >   int ret;
> > >
> > >   ret = blabla()
> > >   if (ret)
> > >   goto error;
> > >
> > >   ...
> > >
> > > error:
> > >   /* return value. */
> > >   ret;
> > >   });
> > >
> > > This being said, as soon as you have several failure paths,
> > > it makes things a lot easier/controllable if you make it a function,
> > > and I honestly don't think the readability would suffer from having a
> > > function defined just above the user. My main concern with the original
> > > approach was the risk of calling continue/break_if_contended() in the
> > > wrong place, and also the fact you can't really externalize things to
> > > a function if you're looking for a cleaner split. At least with
> > > drm_exec_until_all_locked() you can do both.
> > 
> > Yeah, but that means that you can't use return inside your code block
> > and instead have to define an error label for handling "normal"
> > contention, which is what I'm trying to avoid here.
> > 
> > How about:
> > 
> > #define drm_exec_until_all_locked(exec)    \
> >      __drm_exec_retry: if (drm_exec_cleanup(exec))
> > 
> > 
> > #define drm_exec_retry_on_contention(exec)  \
> >      if (unlikely(drm_exec_is_contended(exec)))  \
> >      goto __drm_exec_retry
> > 
> > 
> > And then use it like:
> > 
> > drm_exec_until_all_locked(exec)
> > {
> >      ret = drm_exec_prepare_obj(exec, obj);
> >      drm_exec_retry_on_contention(exec);
> > }  
> 
> That would work, and I was about to suggest extending my proposal with
> a drm_exec_retry_on_contention() to support both use cases. The only
> downside is the fact you might be able to break out of a loop that has
> local variables, which will leak stack space.

Nevermind, brain fart on my end. It shouldn't leak any stack space, so
yeah, I think that's a good compromise.


Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Boris Brezillon
On Mon, 19 Jun 2023 12:44:06 +0200
Christian König  wrote:

> On 19.06.23 at 12:12, Boris Brezillon wrote:
> > [SNIP]
> > Note that the drm_exec_until_all_locked() helper I introduced is taking
> > an expression, so in theory, you don't have to define a separate
> > function.
> >
> > drm_exec_until_all_locked(&exec, {
> > /* inlined-code */
> > int ret;
> >
> > ret = blabla()
> > if (ret)
> > goto error;
> >
> > ...
> >
> > error:
> > /* return value. */
> > ret;
> > });
> >
> > This being said, as soon as you have several failure paths,
> > it makes things a lot easier/controllable if you make it a function,
> > and I honestly don't think the readability would suffer from having a
> > function defined just above the user. My main concern with the original
> > approach was the risk of calling continue/break_if_contended() in the
> > wrong place, and also the fact you can't really externalize things to
> > a function if you're looking for a cleaner split. At least with
> > drm_exec_until_all_locked() you can do both.  
> 
> Yeah, but that means that you can't use return inside your code block
> and instead have to define an error label for handling "normal"
> contention, which is what I'm trying to avoid here.

Sorry, didn't pay attention to this particular concern. Indeed, if you
want to return inside the expression, that's a problem.

> 
> How about:
> 
> #define drm_exec_until_all_locked(exec)    \
>      __drm_exec_retry: if (drm_exec_cleanup(exec))
> 
> 
> #define drm_exec_retry_on_contention(exec)  \
>      if (unlikely(drm_exec_is_contended(exec)))  \
>      goto __drm_exec_retry
> 
> 
> And then use it like:
> 
> drm_exec_until_all_locked(exec)
> {
>      ret = drm_exec_prepare_obj(exec, obj);
>      drm_exec_retry_on_contention(exec);
> }
> 
> The only problem I can see with this is that the __drm_exec_retry label 
> would be function local.

Yeah, I'm not sure it's safe to use non-local labels for that, because
as soon as you have more than one drm_exec_until_all_locked() call in a
given function it won't work. That's why I placed things in a block
with local labels, which in turn means you can't return directly,
unfortunately.


Re: [PATCH 01/13] drm: execution context for GEM buffers v4

2023-06-19 Thread Boris Brezillon
On Mon, 19 Jun 2023 14:29:23 +0200
Boris Brezillon  wrote:

> On Mon, 19 Jun 2023 12:44:06 +0200
> Christian König  wrote:
> 
> > On 19.06.23 at 12:12, Boris Brezillon wrote:  
> > > [SNIP]
> > > Note that the drm_exec_until_all_locked() helper I introduced is taking
> > > an expression, so in theory, you don't have to define a separate
> > > function.
> > >
> > >   drm_exec_until_all_locked(&exec, {
> > >   /* inlined-code */
> > >   int ret;
> > >
> > >   ret = blabla()
> > >   if (ret)
> > >   goto error;
> > >
> > >   ...
> > >
> > > error:
> > >   /* return value. */
> > >   ret;
> > >   });
> > >
> > > This being said, as soon as you have several failure paths,
> > > it makes things a lot easier/controllable if you make it a function,
> > > and I honestly don't think the readability would suffer from having a
> > > function defined just above the user. My main concern with the original
> > > approach was the risk of calling continue/break_if_contended() in the
> > > wrong place, and also the fact you can't really externalize things to
> > > a function if you're looking for a cleaner split. At least with
> > > drm_exec_until_all_locked() you can do both.
> > 
> > Yeah, but that means that you can't use return inside your code block
> > and instead have to define an error label for handling "normal"
> > contention, which is what I'm trying to avoid here.
> 
> Sorry, didn't pay attention to this particular concern. Indeed, if you
> want to return inside the expression, that's a problem.

Sorry, that's wrong again. Had trouble focusing yesterday...

So, returning directly from the expression block should be perfectly
fine. The only problem is breaking out of the retry loop early and
propagating the error, but that's no more or less problematic than it
was before. We just need the drm_exec_retry_on_contention() helper you
suggested, and a drm_exec_stop() that would go to some local
__drm_exec_stop label.

int ret = 0;

ret = drm_exec_until_all_locked(exec, ({
...
ret = drm_exec_prepare_obj(exec, objA, 1);
drm_exec_retry_on_contention(exec);
if (ret)
drm_exec_stop(exec, ret);
...

ret = drm_exec_prepare_obj(exec, objB, 1);
drm_exec_retry_on_contention(exec);
if (ret)
drm_exec_stop(exec, ret);

0;
}));

Which is pretty close to the syntax you defined initially, except for
the '0;' oddity at the end, which is ugly, I admit.


Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-20 Thread Boris Brezillon
On Tue, 20 Jun 2023 10:12:13 +0200
Christian König  wrote:

> > I think Boris's suggestion of having this through a common 
> > DRM_EXEC_FLAG_ALLOW_DUPLICATES flag fits well.  
> 
> No, again. The only driver which should accept duplicates is radeon; for
> all other drivers, especially new ones, duplicates should probably be
> rejected.
> 
> We only allow this for radeon because it is already UAPI. It could be that
> we need to do this for amdgpu as well, but I really hope we don't need this.

Just want to describe the use case we have: we support submission in
batch (several jobs passed to the submit ioctl) with a
submit-all-or-nothing model: if any of the job descriptions is passed
wrong args or causes an allocation error, we fail the whole group. In
the submission path, we want to prepare GEMs for all jobs. That means
adding enough fence slots for the number of job finished fences. Given not
all jobs will access the same set of BOs, I thought I could use
duplicates support to make my life easier, because otherwise I have to
collect all BOs upfront, store them in a temporary array, and keep
track of the number of fence slots needed for each of them. I guess
the other option would be to over-estimate the number of slots and make
it equal to num_jobs for all BOs.


Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-20 Thread Boris Brezillon
On Tue, 20 Jun 2023 10:44:26 +0200
Christian König  wrote:

> On 20.06.23 at 10:28, Boris Brezillon wrote:
> > On Tue, 20 Jun 2023 10:12:13 +0200
> > Christian König  wrote:
> >  
> >>> I think Boris's suggestion of having this through a common
> >>> DRM_EXEC_FLAG_ALLOW_DUPLICATES flag fits well.  
> >> No, again. The only driver which should accept duplicates is radeon, for
> >> all other drivers especially new ones duplicates should probably be
> >> rejected.
> >>
> >> We only allow this for radeon because it is already UAPI, could be that
> >> we need to do this for amdgpu as well but I really hope we don't need 
> >> this.  
> > Just want to describe the use case we have: we support submission in
> > batch (several jobs passed to the submit ioctl) with a
> > submit-all-or-nothing model: if any of the job descriptions is passed
> > wrong args or causes an allocation error, we fail the whole group. In
> > the submission path, we want to prepare GEMs for all jobs. That means
> > adding enough fence slots for the number of job finished fences. Given not
> > all jobs will access the same set of BOs, I thought I could use
> > duplicates support to make my life easier, because otherwise I have to
> > collect all BOs upfront, store them in a temporary array, and keep
> > track of the number of fence slots needed for each of them. I guess
> > the other option would be to over-estimate the number of slots and make
> > it equal to num_jobs for all BOs.  
> 
> Sounds pretty much what amdgpu is doing as well, but question is why 
> don't you give just one list of BOs? Do you really want to add the 
> fences that fine grained?

Actually, we don't give a list of BOs at all, we pass a VM, and lock
all BOs attached to the VM (similar to what Xe does). And, as all other
drivers being submitted recently, we use explicit sync, so most of
those VM BOs, except for the imported/exported ones, will be given a
BOOKKEEP fence.

The reason we need support for duplicates is because we also have
implicit BOs (like the HWRT object that's shared by the
geometry/fragment queues to pass data around), and those can be passed
to multiple jobs in a given batch and require special synchronization
(geometry job writes to them, fragment job reads from them, so we have
a reader/writer sync to express). I can of course de-duplicate upfront,
by parsing jobs and creating an array of BOs that need to be acquired
over the whole submission, but that's still one extra-step I'd prefer
to avoid, given the dma_resv framework allows us to figure it out at
lock time. I can also just deal with the EALREADY case in the driver
directly, it's not like it's super complicated anyway, just thought
other drivers would fall in the same situation, that's all.

> 
> For radeon it turned out that we just had stupid userspace which 
> sometimes mentioned a BO in the list twice.

Okay, that's not the same thing, indeed.

> 
> On the other hand, over-estimating the number of fences needed is 
> perfectly fine as well; that is rounded up to the next kvmalloc size or 
> even the next page size anyway.

Yeah, actually over-provisioning is not the most annoying part.
Iterating over jobs to collect 'meta'-BOs is, so if I can just rely on
EALREADY to detect that case and fall back to reserving an extra slot in
that situation, I'd prefer that.
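
To be clear about the fallback I have in mind, something along these
lines (sketch only; it assumes drm_exec_prepare_obj() returns -EALREADY
for an object that's already in the array, without reserving the extra
fence slots in that case):

        drm_exec_until_all_locked(&exec) {
                ret = drm_exec_prepare_obj(&exec, obj, 1);
                if (ret == -EALREADY) {
                        /* BO already locked earlier in the batch: just
                         * reserve one more fence slot on its resv.
                         */
                        ret = dma_resv_reserve_fences(obj->resv, 1);
                }
                drm_exec_retry_on_contention(&exec);
                if (ret)
                        break;
        }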


Re: [PATCH 06/13] drm/amdgpu: use the new drm_exec object for CS v2

2023-06-20 Thread Boris Brezillon
On Tue, 20 Jun 2023 11:14:51 +0200
Christian König  wrote:

> On 20.06.23 at 11:09, Boris Brezillon wrote:
> > On Tue, 20 Jun 2023 10:44:26 +0200
> > Christian König  wrote:
> >  
> >> On 20.06.23 at 10:28, Boris Brezillon wrote:  
> >>> On Tue, 20 Jun 2023 10:12:13 +0200
> >>> Christian König  wrote:
> >>> 
> >>>>> I think Boris's suggestion of having this through a common
> >>>>> DRM_EXEC_FLAG_ALLOW_DUPLICATES flag fits well.  
> >>>> No, again. The only driver which should accept duplicates is radeon, for
> >>>> all other drivers especially new ones duplicates should probably be
> >>>> rejected.
> >>>>
> >>>> We only allow this for radeon because it is already UAPI, could be that
> >>>> we need to do this for amdgpu as well but I really hope we don't need 
> >>>> this.  
> >>> Just want to describe the use case we have: we support submission in
> >>> batch (several jobs passed to the submit ioctl) with a
> >>> submit-all-or-nothing model: if any of the job descriptions is passed
> >>> wrong args or causes an allocation error, we fail the whole group. In
> >>> the submission path, we want to prepare GEMs for all jobs. That means
> >>> adding enough fence slots for the number of job finished fences. Given not
> >>> all jobs will access the same set of BOs, I thought I could use
> >>> duplicates support to make my life easier, because otherwise I have to
> >>> collect all BOs upfront, store them in a temporary array, and keep
> >>> track of the number of fence slots needed for each of them. I guess
> >>> the other option would be to over-estimate the number of slots and make
> >>> it equal to num_jobs for all BOs.  
> >> Sounds pretty much what amdgpu is doing as well, but question is why
> >> don't you give just one list of BOs? Do you really want to add the
> >> fences that fine grained?  
> > Actually, we don't give a list of BOs at all, we pass a VM, and lock
> > all BOs attached to the VM (similar to what Xe does). And, as all other
> > drivers being submitted recently, we use explicit sync, so most of
> > those VM BOs, except for the imported/exported ones, will be given a
> > BOOKKEEP fence.
> >
> > The reason we need support for duplicates is because we also have
> > implicit BOs (like the HWRT object that's shared by the
> > geometry/fragment queues to pass data around), and those can be passed
> > to multiple jobs in a given batch and require special synchronization
> > (geometry job writes to them, fragment job reads from them, so we have
> > a reader/writer sync to express). I can of course de-duplicate upfront,
> > by parsing jobs and creating an array of BOs that need to be acquired
> > over the whole submission, but that's still one extra-step I'd prefer
> > to avoid, given the dma_resv framework allows us to figure it out at
> > lock time. I can also just deal with the EALREADY case in the driver
> > directly, it's not like it's super complicated anyway, just thought
> > other drivers would fall in the same situation, that's all.  
> 
> Well as long as you just need to ignore EALREADY, that should be trivial 
> and doable.

Oh, yeah, that's all I need really. We probably don't want to add the
GEM object a second time in the array though, hence the goto
reserve_fences in my proposal when EALREADY is returned.

> 
> What radeon needs is to keep EALREADY BOs in a separate container 
> because we need to double check their properties to not break the UAPI.
> 
> I strongly think that this shouldn't be needed by any other driver.
> 
> Going to add a flag to ignore EALREADY which can be set during exec init.

Thanks!


Re: [PATCH drm-next v5 00/14] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

2023-06-20 Thread Boris Brezillon
same way other operations generated
> - hand the responsibility for mutual exclusion for a GEM's
>   drm_gpuva list to the user; simplified corresponding (un-)link functions
> 
>   Maple Tree:
> - I added two maple tree patches to the series, one to support custom tree
>   walk macros and one to hand the locking responsibility to the user of 
> the
>   GPUVA manager without pre-defined lockdep checks.
> 
> Changes in V3:
> ==
>   Nouveau:
> - Reworked the Nouveau VM_BIND UAPI to do the job cleanup (including page
>   table cleanup) within a workqueue rather than the job_free() callback of
>   the scheduler itself. A job_free() callback can stall the execution 
> (run()
>   callback) of the next job in the queue. Since the page table cleanup
>   requires to take the same locks as need to be taken for page table
>   allocation, doing it directly in the job_free() callback would still
>   violate the fence signalling critical path.
> - Separated Nouveau fence allocation and emit, such that we do not violate
>   the fence signalling critical path in EXEC jobs.
> - Implement "regions" (for handling sparse mappings through PDEs and dual
>   page tables) within Nouveau.
> - Drop the requirement for every mapping to be contained within a region.
> - Add necessary synchronization of VM_BIND job operation sequences in 
> order
>   to work around limitations in page table handling. This will be 
> addressed
>   in a future re-work of Nouveau's page table handling.
> - Fixed a couple of race conditions found through more testing. Thanks to
>   Dave for consistently trying to break it. :-)
> 
>   GPUVA Manager:
> - Implement pre-allocation capabilities for tree modifications within 
> fence
>   signalling critical sections.
> - Implement accessors to apply tree modifications while walking the 
> GPUVA
>   tree in order to actually support processing of drm_gpuva_ops through
>   callbacks in fence signalling critical sections rather than through
>   pre-allocated operation lists.
> - Remove merging of GPUVAs; the kernel has limited to no knowledge about
>   the semantics of mapping sequences. Hence, merging is purely 
> speculative.
>   It seems that gaining a significant (or at least a measurable) 
> performance
>   increase through merging is way more likely to happen when userspace is
>   responsible for merging mappings up to the next larger page size if
>   possible.
> - Since merging was removed, regions pretty much lose their right to 
> exist.
>   They might still be useful for handling dual page tables or similar
>   mechanisms, but since Nouveau seems to be the only driver having a need
>   for this for now, regions were removed from the GPUVA manager.
> - Fixed a couple of maple_tree related issues; thanks to Liam for helping 
> me
>   out.
> 
> Changes in V4:
> ==
>   Nouveau:
> - Refactored how specific VM_BIND and EXEC jobs are created and how their
>   arguments are passed to the generic job implementation.
> - Fixed a UAF race condition where bind job ops could have been freed
>   already while still waiting for a job cleanup to finish. This is due to
>   in certain cases we need to wait for mappings actually being unmapped
>   before creating sparse regions in the same area.
> - Re-based the code onto drm_exec v4 patch.
> 
>   GPUVA Manager:
> - Fixed a maple tree related bug when pre-allocating MA states.
>   (Boris Brezillon)
> - Made struct drm_gpuva_fn_ops a const object in all occurrences.
>   (Boris Brezillon)
> 
> Changes in V5:
> ==
>   Nouveau:
> - Link and unlink GPUVAs outside the fence signalling critical path in
>   nouveau_uvmm_bind_job_submit() holding the dma-resv lock. Mutual 
> exclusion
>   of BO evicts causing mapping invalidation and regular mapping operations
>   is ensured with dma-fences.
> 
>   GPUVA Manager:
> - Removed the separate GEMs GPUVA list lock. Link and unlink as well as
>   iterating the GEM's GPUVA list should be protected with the GEM's 
> dma-resv
>   lock instead.
> - Renamed DRM_GPUVA_EVICTED flag to DRM_GPUVA_INVALIDATED. Mappings do not
>   get evicted; they might get invalidated due to eviction.
> - Maple tree uses the 'unsigned long' type for node entries. While this
>   works for GPU VA spaces larger than 32-bit on 64-bit kernels, the GPU VA
>   space is limited to 32-bit on 32-bit kernels as well.
>   As long as we do not have a 64-bit capabl

Re: [PATCH 1/2] drm: execution context for GEM buffers v5

2023-06-21 Thread Boris Brezillon
Hi Christian,

On Wed, 21 Jun 2023 15:36:59 +0200
"Christian König"  wrote:

> This adds the infrastructure for an execution context for GEM buffers
> which is similar to the existing TTMs execbuf util and intended to replace
> it in the long term.
> 
> The basic functionality is that we abstract the necessary loop to lock
> many different GEM buffers with automated deadlock and duplicate handling.
> 
> v2: drop xarray and use a dynamically resized array instead; the locking
> overhead is unnecessary and measurable.
> v3: drop duplicate tracking, radeon is really the only one needing that.
> v4: fixes issues pointed out by Danilo, some typos in comments and a
> helper for lock arrays of GEM objects.
> v5: some suggestions by Boris Brezillon, especially just use one retry
> macro, drop loop in prepare_array, use flags instead of bool

One minor comment below, but otherwise, I think I'm happy with this version.

Reviewed-by: Boris Brezillon 

> +
> +/**
> + * drm_exec_prepare_array - helper to prepare an array of objects
> + * @exec: the drm_exec object with the state
> + * @objects: array of GEM object to prepare
> + * @num_objects: number of GEM objects in the array
> + * @num_fences: number of fences to reserve on each GEM object
> + *
> + * Prepares all GEM objects in an array, handles contention but aports on 
> first

   ^
   aborts

> + * error otherwise. Reserves @num_fences on each GEM object after locking it.

Either the documentation is wrong, or you unintentionally picked my
version. If that's the intended usage:

drm_exec_until_all_locked(exec) {
ret = drm_exec_prepare_array(exec, bos, num_bos, num_fences);
drm_exec_retry_on_contention(exec);
if (ret)
break;
}

you should drop the 'handles contention' part in the doc, and you
should probably give an example to show how it's supposed to be used.

> + *
> + * Returns: -EALREADY when object is already locked, -ENOMEM when memory
> + * allocation failed and zero for success.
> + */
> +int drm_exec_prepare_array(struct drm_exec *exec,
> +struct drm_gem_object **objects,
> +unsigned int num_objects,
> +unsigned int num_fences)
> +{
> + int ret;
> +
> + for (unsigned int i = 0; i < num_objects; ++i) {
> + ret = drm_exec_prepare_obj(exec, objects[i], num_fences);
> + if (unlikely(ret))
> + return ret;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(drm_exec_prepare_array);

[...]

> +/**
> + * drm_exec_until_all_locked - loop until all GEM objects are locked
> + * @exec: drm_exec object
> + *
> + * Core functionality of the drm_exec object. Loops until all GEM objects are
> + * locked and no more contention exists. At the beginning of the loop it is
> + * guaranteed that no GEM object is locked.
> + *
> + * Since labels can't be defined local to the loops body we use a jump 
> pointer
> + * to make sure that the retry is only used from within the loops body.
> + */
> +#define drm_exec_until_all_locked(exec)  \
> + for (void *__drm_exec_retry_ptr; ({ \
> + __label__ __drm_exec_retry; \
> +__drm_exec_retry:\
> + __drm_exec_retry_ptr = &&__drm_exec_retry;  \
> + drm_exec_cleanup(exec); \
> + });)
> +
> +/**
> + * drm_exec_retry_on_contention - restart the loop to grab all locks
> + * @exec: drm_exec object
> + *
> + * Control flow helper to continue when a contention was detected and we 
> need to
> + * clean up and re-start the loop to prepare all GEM objects.
> + */
> +#define drm_exec_retry_on_contention(exec)   \
> + if (unlikely(drm_exec_is_contended(exec)))  \
> + goto *__drm_exec_retry_ptr

Glad that this ended up working.

Regards,

Boris


Re: [PATCH v6] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-21 Thread Boris Brezillon
Hello Luben,

On Wed, 21 Jun 2023 09:56:40 -0400
Luben Tuikov  wrote:

> On 2023-06-19 03:19, Boris Brezillon wrote:
> > drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> > from the dependency array that was waited upon before
> > drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> > so we're basically waiting for all dependencies except one.
> > 
> > In theory, this wait shouldn't be needed because resources should have
> > their users registered to the dma_resv object, thus guaranteeing that
> > future jobs wanting to access these resources wait on all the previous
> > users (depending on the access type, of course). But we want to keep
> > these explicit waits in the kill entity path just in case.
> > 
> > Let's make sure we keep all dependencies in the array in
> > drm_sched_job_dependency(), so we can iterate over the array and wait
> > in drm_sched_entity_kill_jobs_cb().
> > 
> > We also make sure we wait on drm_sched_fence::finished if we were
> > originally asked to wait on drm_sched_fence::scheduled. In that case,
> > we assume the intent was to delegate the wait to the firmware/GPU or
> > rely on the pipelining done at the entity/scheduler level, but when
> > killing jobs, we really want to wait for completion not just scheduling.
> > 
> > v6:
> > - Back to v4 implementation
> > - Add Christian's R-b
> > 
> > v5:
> > - Flag deps on which we should only wait for the scheduled event
> >   at insertion time
> > 
> > v4:
> > - Fix commit message
> > - Fix a use-after-free bug
> > 
> > v3:
> > - Always wait for drm_sched_fence::finished fences in
> >   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
> > 
> > v2:
> > - Don't evict deps in drm_sched_job_dependency()  
> 
> Hmm, why is this in reverse chronological order?
> It's very confusing.

Dunno, that's how I've always ordered things, and a quick look at some
dri-devel patches [1][2] makes me think I'm not the only one to start
from the latest submission.

[1]https://lkml.org/lkml/2023/6/19/941
[2]https://lore.kernel.org/dri-devel/cover.1686729444.git.sandor...@nxp.com/T/#t

> 
> > 
> > Signed-off-by: Boris Brezillon 
> > Suggested-by: "Christian König" 
> > Reviewed-by: "Christian König"   
> 
> These three lines would usually come after the CCs.

Again, I think I've always inserted those tags before the Cc, but I can
re-order things if you prefer. Let me know if you want me to send a v7
addressing the Cc+changelog ordering.

Regards,

Boris

> 
> Regards,
> Luben
> 
> > Cc: Frank Binns 
> > Cc: Sarah Walker 
> > Cc: Donald Robson 
> > Cc: Luben Tuikov 
> > Cc: David Airlie 
> > Cc: Daniel Vetter 
> > Cc: Sumit Semwal 
> > Cc: "Christian König" 
> > ---
> >  drivers/gpu/drm/scheduler/sched_entity.c | 41 +++-
> >  1 file changed, 33 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
> > b/drivers/gpu/drm/scheduler/sched_entity.c
> > index 68e807ae136a..ec41d82d0141 100644
> > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > @@ -176,16 +176,32 @@ static void drm_sched_entity_kill_jobs_cb(struct 
> > dma_fence *f,
> >  {
> > struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> >  finish_cb);
> > -   int r;
> > +   unsigned long index;
> >  
> > dma_fence_put(f);
> >  
> > /* Wait for all dependencies to avoid data corruptions */
> > -   while (!xa_empty(&job->dependencies)) {
> > -   f = xa_erase(&job->dependencies, job->last_dependency++);
> > -   r = dma_fence_add_callback(f, &job->finish_cb,
> > -  drm_sched_entity_kill_jobs_cb);
> > -   if (!r)
> > +   xa_for_each(&job->dependencies, index, f) {
> > +   struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
> > +
> > +   if (s_fence && f == &s_fence->scheduled) {
> > +   /* The dependencies array had a reference on the 
> > scheduled
> > +* fence, and the finished fence refcount might have
> > +* dropped to zero. Use dma_fence_get_rcu() so we get
> > +* a NULL fence in that case.
> > +*/
> > +   f = dm

Re: [PATCH] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-21 Thread Boris Brezillon
Hi Christian,

On Tue, 13 Jun 2023 13:06:06 +0200
Christian König  wrote:

> On 13.06.23 at 11:46, Boris Brezillon wrote:
> > On Tue, 13 Jun 2023 11:44:24 +0200
> > Boris Brezillon  wrote:
> >  
> >> Drivers that can delegate waits to the firmware/GPU pass the scheduled
> >> fence to drm_sched_job_add_dependency(), and issue wait commands to
> >> the firmware/GPU at job submission time. For this to be possible, they
> >> need all their 'native' dependencies to have a valid parent since this
> >> is where the actual HW fence information are encoded.
> >>
> >> In drm_sched_main(), we currently call drm_sched_fence_set_parent()
> >> after drm_sched_fence_set_parent(), leaving a short period of time  
> > after drm_sched_fence_scheduled(), ...  
> 
> I was just about to complain, but yeah sounds like the right idea to me.
> 
> Just let me review the patch in more detail.

Did you have time to look at this patch in more detail? Should I send a
v2 fixing the mistake in the commit message?

Regards,

Boris


Re: [PATCH v6] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-21 Thread Boris Brezillon
On Wed, 21 Jun 2023 10:41:22 -0400
Luben Tuikov  wrote:

> On 2023-06-21 10:18, Boris Brezillon wrote:
> > Hello Luben,
> > 
> > On Wed, 21 Jun 2023 09:56:40 -0400
> > Luben Tuikov  wrote:
> >   
> >> On 2023-06-19 03:19, Boris Brezillon wrote:  
> >>> drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> >>> from the dependency array that was waited upon before
> >>> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> >>> so we're basically waiting for all dependencies except one.
> >>>
> >>> In theory, this wait shouldn't be needed because resources should have
> >>> their users registered to the dma_resv object, thus guaranteeing that
> >>> future jobs wanting to access these resources wait on all the previous
> >>> users (depending on the access type, of course). But we want to keep
> >>> these explicit waits in the kill entity path just in case.
> >>>
> >>> Let's make sure we keep all dependencies in the array in
> >>> drm_sched_job_dependency(), so we can iterate over the array and wait
> >>> in drm_sched_entity_kill_jobs_cb().
> >>>
> >>> We also make sure we wait on drm_sched_fence::finished if we were
> >>> originally asked to wait on drm_sched_fence::scheduled. In that case,
> >>> we assume the intent was to delegate the wait to the firmware/GPU or
> >>> rely on the pipelining done at the entity/scheduler level, but when
> >>> killing jobs, we really want to wait for completion not just scheduling.
> >>>
> >>> v6:
> >>> - Back to v4 implementation
> >>> - Add Christian's R-b
> >>>
> >>> v5:
> >>> - Flag deps on which we should only wait for the scheduled event
> >>>   at insertion time
> >>>
> >>> v4:
> >>> - Fix commit message
> >>> - Fix a use-after-free bug
> >>>
> >>> v3:
> >>> - Always wait for drm_sched_fence::finished fences in
> >>>   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
> >>>
> >>> v2:
> >>> - Don't evict deps in drm_sched_job_dependency()
> >>
> >> Hmm, why is this in reverse chronological order?
> >> It's very confusing.  
> > 
> > Dunno, that's how I've always ordered things, and quick look at some
> > dri-devel patches [1][2] makes me think I'm not the only one to start
> > from the latest submission.
> > 
> > [1]https://lkml.org/lkml/2023/6/19/941
> > [2]https://lore.kernel.org/dri-devel/cover.1686729444.git.sandor...@nxp.com/T/#t
> >   
> >>  
> >>>
> >>> Signed-off-by: Boris Brezillon 
> >>> Suggested-by: "Christian König" 
> >>> Reviewed-by: "Christian König" 
> >>
> >> These three lines would usually come after the CCs.  
> > 
> > Again, I think I've always inserted those tags before the Cc, but I can
> > re-order things if you prefer. Let me know if you want me to send a v7
> > addressing the Cc+changelog ordering.  
> 
> No, it's not necessary for this patch, but in the future I'd rather follow
> chronological ordering for the versions, and in the Cc list. It's similar
> to how the patch description follows (narrative text) and to how we reply
> back to emails, and prevalently in the kernel log in drm ("git log" should
> suffice).
> 
> Reading in chronological progression builds a narrative, a picture, in one's
> mind and makes it easy to see justifications for said narrative, or see 
> reasons
> to change the narrative.
> 
> That is, one can make a better decision knowing the full history, rather than
> only the latest change.
> 
> (And in fact when I read the version revision list, my eyes skip over v[X]
> and just read down, so I was wondering why and how Christian R-B the patch
> in v2, and it wasn't until I actually saw that they were ordered in reverse
> chronological order, which was in fact v6--listed first, which I'd assumed
> was listed last.)
> 
> Do you have access or do you know who is pushing this patch to drm-misc-fixes?

I can push it.



Re: [PATCH 1/2] drm: execution context for GEM buffers v5

2023-06-21 Thread Boris Brezillon
On Wed, 21 Jun 2023 15:36:59 +0200
"Christian König"  wrote:

> +/**
> + * drm_exec_until_all_locked - loop until all GEM objects are locked
> + * @exec: drm_exec object
> + *
> + * Core functionality of the drm_exec object. Loops until all GEM objects are
> + * locked and no more contention exists. At the beginning of the loop it is
> + * guaranteed that no GEM object is locked.
> + *
> + * Since labels can't be defined local to the loops body we use a jump 
> pointer
> + * to make sure that the retry is only used from within the loops body.
> + */
> +#define drm_exec_until_all_locked(exec)  \
> + for (void *__drm_exec_retry_ptr; ({ \
> + __label__ __drm_exec_retry; \

The warning reported by the bot on 'drm: add drm_exec selftests v4'
should be fixed with a

goto __drm_exec_retry;

placed here.

> +__drm_exec_retry:\
> + __drm_exec_retry_ptr = &&__drm_exec_retry;  \
> + drm_exec_cleanup(exec); \
> + });)


Re: [PATCH 1/2] drm: execution context for GEM buffers v5

2023-06-21 Thread Boris Brezillon
On Wed, 21 Jun 2023 18:51:59 +0200
Boris Brezillon  wrote:

> On Wed, 21 Jun 2023 15:36:59 +0200
> "Christian König"  wrote:
> 
> > +/**
> > + * drm_exec_until_all_locked - loop until all GEM objects are locked
> > + * @exec: drm_exec object
> > + *
> > + * Core functionality of the drm_exec object. Loops until all GEM objects 
> > are
> > + * locked and no more contention exists. At the beginning of the loop it is
> > + * guaranteed that no GEM object is locked.
> > + *
> > + * Since labels can't be defined local to the loops body we use a jump 
> > pointer
> > + * to make sure that the retry is only used from within the loops body.
> > + */
> > +#define drm_exec_until_all_locked(exec)\
> > +   for (void *__drm_exec_retry_ptr; ({ \
> > +   __label__ __drm_exec_retry; \  
> 
> The warning reported by the bot on 'drm: add drm_exec selftests v4'
> should be fixed with a
> 
>   goto __drm_exec_retry;
> 
> placed here.

Nevermind, it's complaining about __drm_exec_retry_ptr being set but
not used. Guess __maybe_unused could cover that.

> 
> > +__drm_exec_retry:  \
> > +   __drm_exec_retry_ptr = &&__drm_exec_retry;  \
> > +   drm_exec_cleanup(exec); \
> > +   });)  
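
Something like the below should do the trick (untested, just the v5
macro with the attribute added):

#define drm_exec_until_all_locked(exec)                         \
        for (void *__drm_exec_retry_ptr __maybe_unused; ({      \
                __label__ __drm_exec_retry;                     \
__drm_exec_retry:                                               \
                __drm_exec_retry_ptr = &&__drm_exec_retry;      \
                drm_exec_cleanup(exec);                         \
        });)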



[PATCH v2] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-22 Thread Boris Brezillon
Drivers that can delegate waits to the firmware/GPU pass the scheduled
fence to drm_sched_job_add_dependency(), and issue wait commands to
the firmware/GPU at job submission time. For this to be possible, they
need all their 'native' dependencies to have a valid parent since this
is where the actual HW fence information is encoded.

In drm_sched_main(), we currently call drm_sched_fence_set_parent()
after drm_sched_fence_scheduled(), leaving a short period of time
during which the job depending on this fence can be submitted.

Since setting parent and signaling the fence are two things that are
kinda related (you can't have a parent if the job hasn't been scheduled),
it probably makes sense to pass the parent fence to
drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
before it signals the scheduled fence.

v2:
* Fix commit message

Signed-off-by: Boris Brezillon 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 
---
 drivers/gpu/drm/scheduler/sched_fence.c | 40 +++--
 drivers/gpu/drm/scheduler/sched_main.c  |  3 +-
 include/drm/gpu_scheduler.h |  5 ++--
 3 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06cedfe4b486 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -48,8 +48,32 @@ static void __exit drm_sched_fence_slab_fini(void)
kmem_cache_destroy(sched_fence_slab);
 }
 
-void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
+static void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
+  struct dma_fence *fence)
 {
+   /*
+* smp_store_release() to ensure another thread racing us
+* in drm_sched_fence_set_deadline_finished() sees the
+* fence's parent set before test_bit()
+*/
+   smp_store_release(&s_fence->parent, dma_fence_get(fence));
+   if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
+&s_fence->finished.flags))
+   dma_fence_set_deadline(fence, s_fence->deadline);
+}
+
+void drm_sched_fence_scheduled(struct drm_sched_fence *fence,
+  struct dma_fence *parent)
+{
+   /* Set the parent before signaling the scheduled fence, such that,
+* any waiter expecting the parent to be filled after the job has
+* been scheduled (which is the case for drivers delegating waits
+* to some firmware) doesn't have to busy wait for parent to show
+* up.
+*/
+   if (!IS_ERR_OR_NULL(parent))
+   drm_sched_fence_set_parent(fence, parent);
+
dma_fence_signal(&fence->scheduled);
 }
 
@@ -181,20 +205,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct 
dma_fence *f)
 }
 EXPORT_SYMBOL(to_drm_sched_fence);
 
-void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
-   struct dma_fence *fence)
-{
-   /*
-* smp_store_release() to ensure another thread racing us
-* in drm_sched_fence_set_deadline_finished() sees the
-* fence's parent set before test_bit()
-*/
-   smp_store_release(&s_fence->parent, dma_fence_get(fence));
-   if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
-&s_fence->finished.flags))
-   dma_fence_set_deadline(fence, s_fence->deadline);
-}
-
 struct drm_sched_fence *drm_sched_fence_alloc(struct drm_sched_entity *entity,
  void *owner)
 {
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 7b2bfc10c1a5..506371c42745 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1043,10 +1043,9 @@ static int drm_sched_main(void *param)
trace_drm_run_job(sched_job, entity);
fence = sched->ops->run_job(sched_job);
complete_all(&entity->entity_idle);
-   drm_sched_fence_scheduled(s_fence);
+   drm_sched_fence_scheduled(s_fence, fence);
 
if (!IS_ERR_OR_NULL(fence)) {
-   drm_sched_fence_set_parent(s_fence, fence);
/* Drop for original kref_init of the fence */
dma_fence_put(fence);
 
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index e95b4837e5a3..f9544d9b670d 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -583,15 +583,14 @@ void drm_sched_entity_set_priority(struct 
drm_sched_entity *entity,
 bool drm_sched_entity_is_ready(struct drm_sched_entity *entity);
 int drm_sched_entity_error(s

Re: [PATCH v6] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-22 Thread Boris Brezillon
On Wed, 21 Jun 2023 11:03:48 -0400
Luben Tuikov  wrote:

> On 2023-06-21 10:53, Boris Brezillon wrote:
> > On Wed, 21 Jun 2023 10:41:22 -0400
> > Luben Tuikov  wrote:
> >   
> >> On 2023-06-21 10:18, Boris Brezillon wrote:  
> >>> Hello Luben,
> >>>
> >>> On Wed, 21 Jun 2023 09:56:40 -0400
> >>> Luben Tuikov  wrote:
> >>> 
> >>>> On 2023-06-19 03:19, Boris Brezillon wrote:
> >>>>> drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
> >>>>> from the dependency array that was waited upon before
> >>>>> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> >>>>> so we're basically waiting for all dependencies except one.
> >>>>>
> >>>>> In theory, this wait shouldn't be needed because resources should have
> >>>>> their users registered to the dma_resv object, thus guaranteeing that
> >>>>> future jobs wanting to access these resources wait on all the previous
> >>>>> users (depending on the access type, of course). But we want to keep
> >>>>> these explicit waits in the kill entity path just in case.
> >>>>>
> >>>>> Let's make sure we keep all dependencies in the array in
> >>>>> drm_sched_job_dependency(), so we can iterate over the array and wait
> >>>>> in drm_sched_entity_kill_jobs_cb().
> >>>>>
> >>>>> We also make sure we wait on drm_sched_fence::finished if we were
> >>>>> originally asked to wait on drm_sched_fence::scheduled. In that case,
> >>>>> we assume the intent was to delegate the wait to the firmware/GPU or
> >>>>> rely on the pipelining done at the entity/scheduler level, but when
> >>>>> killing jobs, we really want to wait for completion not just scheduling.
> >>>>>
> >>>>> v6:
> >>>>> - Back to v4 implementation
> >>>>> - Add Christian's R-b
> >>>>>
> >>>>> v5:
> >>>>> - Flag deps on which we should only wait for the scheduled event
> >>>>>   at insertion time
> >>>>>
> >>>>> v4:
> >>>>> - Fix commit message
> >>>>> - Fix a use-after-free bug
> >>>>>
> >>>>> v3:
> >>>>> - Always wait for drm_sched_fence::finished fences in
> >>>>>   drm_sched_entity_kill_jobs_cb() when we see a sched_fence
> >>>>>
> >>>>> v2:
> >>>>> - Don't evict deps in drm_sched_job_dependency()  
> >>>>
> >>>> Hmm, why is this in reverse chronological order?
> >>>> It's very confusing.
> >>>
> >>> Dunno, that's how I've always ordered things, and quick look at some
> >>> dri-devel patches [1][2] makes me think I'm not the only one to start
> >>> from the latest submission.
> >>>
> >>> [1]https://lkml.org/lkml/2023/6/19/941
> >>> [2]https://lore.kernel.org/dri-devel/cover.1686729444.git.sandor...@nxp.com/T/#t
> >>> 
> >>>>
> >>>>>
> >>>>> Signed-off-by: Boris Brezillon 
> >>>>> Suggested-by: "Christian König" 
> >>>>> Reviewed-by: "Christian König"   
> >>>>
> >>>> These three lines would usually come after the CCs.
> >>>
> >>> Again, I think I've always inserted those tags before the Cc, but I can
> >>> re-order things if you prefer. Let me know if you want me to send a v7
> >>> addressing the Cc+changelog ordering.
> >>
> >> No, it's not necessary for this patch, but in the future I'd rather follow
> >> chronological ordering for the versions, and in the Cc list. It's similar
> >> to how the patch description follows (narrative text) and to how we reply
> >> back to emails, and prevalently in the kernel log in drm ("git log" should
> >> suffice).
> >>
> >> Reading in chronological progression builds a narrative, a picture, in 
> >> one's
> >> mind and makes it easy to see justifications for said narrative, or see 
> >> reasons
> >> to change the narrative.
> >>
> >> That is, one can make a better decision knowing the full history, rather 
> >> than
> >> only the latest change.
> >>
> >> (And in fact when I read the version revision list, my eyes skip over v[X]
> >> and just read down, so I was wondering why and how Christian R-B the patch
> >> in v2, and it wasn't until I actually saw that they were ordered in reverse
> >> chronological order, which was in fact v6--listed first, which I'd assumed
> >> was listed last.)
> >>
> >> Do you have access or do you know who is pushing this patch to 
> >> drm-misc-fixes?  
> > 
> > I can push it.
> >   
> 
> Acked-by: Luben Tuikov 

Queued to drm-misc-fixes after re-ordering things in the commit message
as you suggested.

Regards,

Boris



Re: [PATCH drm-next v5 00/14] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

2023-06-22 Thread Boris Brezillon
Hi Danilo,

On Tue, 20 Jun 2023 14:46:07 +0200
Danilo Krummrich  wrote:

> > The only thing I'm worried about is the 'sync mapping requests have to
> > go through the async path and wait for all previous async requests to
> > be processed' problem I mentioned in one of your previous submission,
> > but I'm happy leave that for later.  
> 
> Yes, I'm aware of this limitation.
> 
> Let me quickly try to explain where this limitation comes from and how I 
> intend to address it.
> 
> In order to be able to allocate the required page tables for a mapping 
> request and in order to free corresponding page tables once the (async) 
> job finished I need to know the corresponding sequence of operations 
> (drm_gpuva_ops) to fulfill the mapping request.
> 
> This requires me to update the GPUVA space in the ioctl() rather than in 
> the async stage, because otherwise I would need to wait for previous 
> jobs to finish before being able to submit subsequent jobs to the job 
> queue, since I need an up to date view of the GPUVA space in order to 
> calculate the sequence of operations to fulfill a mapping request.
> 
> As a consequence all jobs need to be processed in the order they were 
> submitted, including synchronous jobs.
> 
> @Matt: I think you will have the same limitation with synchronous jobs 
> as your implementation in XE should be similar?
> 
> In order to address it I want to switch to using callbacks rather than 
> 'pre-allocated' drm_gpuva_ops and update the GPUVA space within the 
> asynchronous stage.
> This would allow me to 'fit' synchronous jobs 
> between jobs waiting in the async job queue. However, to do this I have 
> to re-work how the page table handling in Nouveau is implemented, since 
> this would require me to be able to manage the page tables without 
> knowing the exact sequence of operations to fulfill a mapping request.

Ok, so I think that's more or less what we're trying to do right
now in PowerVR.

- First, we make sure we reserve enough MMU page tables for a given map
  operation to succeed no matter the VM state in the VM_BIND job
  submission path (our VM_BIND ioctl). That means we're always
  over-provisioning and returning unused memory back when the operation
  is done if we end up using less memory.
- We pre-allocate for the maple-tree insertions.
- Then we map using drm_gpuva_sm_map() and the callbacks we provided in
  the drm_sched::run_job() path. We guarantee that no memory is
  allocated in that path thanks to the pre-allocation/reservation we've
  done at VM_BIND job submission time (rough sketch below).
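
A rough sketch of that two-phase flow (all pvr_example_* names and
types are made up for illustration; only drm_gpuva_sm_map() is from
this v5 series, and its exact prototype may still change):

struct pvr_example_vm_bind_job {
        struct pvr_example_vm *vm;      /* hypothetical VM wrapper */
        struct drm_gem_object *obj;     /* BO backing the new mapping */
        u64 addr, range, offset;        /* requested VA mapping */
};

/* VM_BIND ioctl path: reserve the worst case so run_job() never
 * allocates.
 */
static int pvr_example_vm_bind_prepare(struct pvr_example_vm_bind_job *job)
{
        int ret;

        /* Over-provision MMU page tables for the requested range. */
        ret = pvr_example_mmu_prealloc(job->vm, job->addr, job->range);
        if (ret)
                return ret;

        /* Pre-allocate for the maple-tree (GPUVA) insertion as well. */
        return pvr_example_va_prealloc(job->vm, job->addr, job->range);
}

/* drm_sched::run_job() path: apply the mapping, no allocation allowed. */
static int pvr_example_vm_bind_run(struct pvr_example_vm_bind_job *job)
{
        /* The sm_map callbacks consume the reservations made above. */
        return drm_gpuva_sm_map(&job->vm->mgr, job, job->addr, job->range,
                                job->obj, job->offset);
}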

The problem I see with this v5 is that:

1/ We now have a dma_resv_lock_held() in drm_gpuva_{link,unlink}(),
   which, in our case, is called in the async drm_sched::run_job() path,
   and we don't hold the lock in that path (it's been released just
   after the job submission).
2/ I'm worried that Liam's plan to only reserve what's actually needed
   based on the maple tree state is going to play against us, because
   the maple-tree is only modified at job exec time, and we might have
   several unmaps happening between the moment we created and queued the
   jobs, and the moment they actually get executed, meaning the
   maple-tree reservation might no longer fit the bill.

For issue #1, it shouldn't be too problematic if we use a regular lock to
insert to/remove from the GEM gpuva list.

For issue #2, I can see a way out if, instead of freeing gpuva nodes,
we flag those as unused when we see that something happening later in
the queue is going to map a section being unmapped. All of this implies
keeping access to already queued VM_BIND jobs (using the spsc queue at
the entity level is not practical), and iterating over them every time
a new sync or async job is queued to flag what needs to be retained. It
would obviously be easier if we could tell the maple-tree API
'provision as if the tree was empty', so all we have to do is just
over-provision for both the page tables and maple-tree insertion, and
free the unused mem when the operation is done.

Don't know if you already thought about that and/or have solutions to
solve these issues.

Regards,

Boris


Re: [PATCH drm-next v5 00/14] [RFC] DRM GPUVA Manager & Nouveau VM_BIND UAPI

2023-06-22 Thread Boris Brezillon
Hi Danilo,

On Thu, 22 Jun 2023 15:58:23 +0200
Danilo Krummrich  wrote:

> Hi Boris,
> 
> On 6/22/23 15:01, Boris Brezillon wrote:
> > Hi Danilo,
> > 
> > On Tue, 20 Jun 2023 14:46:07 +0200
> > Danilo Krummrich  wrote:
> >   
> >>> The only thing I'm worried about is the 'sync mapping requests have to
> >>> go through the async path and wait for all previous async requests to
> >>> be processed' problem I mentioned in one of your previous submission,
> >>> but I'm happy leave that for later.  
> >>
> >> Yes, I'm aware of this limitation.
> >>
> >> Let me quickly try to explain where this limitation comes from and how I
> >> intend to address it.
> >>
> >> In order to be able to allocate the required page tables for a mapping
> >> request and in order to free corresponding page tables once the (async)
> >> job finished I need to know the corresponding sequence of operations
> >> (drm_gpuva_ops) to fulfill the mapping request.
> >>
> >> This requires me to update the GPUVA space in the ioctl() rather than in
> >> the async stage, because otherwise I would need to wait for previous
> >> jobs to finish before being able to submit subsequent jobs to the job
> >> queue, since I need an up to date view of the GPUVA space in order to
> >> calculate the sequence of operations to fulfill a mapping request.
> >>
> >> As a consequence all jobs need to be processed in the order they were
> >> submitted, including synchronous jobs.
> >>
> >> @Matt: I think you will have the same limitation with synchronous jobs
> >> as your implementation in XE should be similar?
> >>
> >> In order to address it I want to switch to using callbacks rather than
> >> 'pre-allocated' drm_gpuva_ops and update the GPUVA space within the
> >> asynchronous stage.
> >> This would allow me to 'fit' synchronous jobs
> >> between jobs waiting in the async job queue. However, to do this I have
> >> to re-work how the page table handling in Nouveau is implemented, since
> >> this would require me to be able to manage the page tables without
> >> knowing the exact sequence of operations to fulfill a mapping request.  
> > 
> > Ok, so I think that's more or less what we're trying to do right
> > now in PowerVR.
> > 
> > - First, we make sure we reserve enough MMU page tables for a given map
> >operation to succeed no matter the VM state in the VM_BIND job
> >submission path (our VM_BIND ioctl). That means we're always
> >over-provisioning and returning unused memory back when the operation
> >is done if we end up using less memory.
> > - We pre-allocate for the maple-tree insertions.
> > - Then we map using drm_gpuva_sm_map() and the callbacks we provided in
> >the drm_sched::run_job() path. We guarantee that no memory is
> >allocated in that path thanks to the pre-allocation/reservation we've
> >done at VM_BIND job submission time.
> > 
> > The problem I see with this v5 is that:
> > 
> > 1/ We now have a dma_resv_lock_held() in drm_gpuva_{link,unlink}(),
> > which, in our case, is called in the async drm_sched::run_job() path,
> > and we don't hold the lock in that path (it's been released just
> > after the job submission).  
> 
> My solution to this, as by now, is to - in the same way we pre-allocate 
> - to just pre-link and pre-unlink. And then fix things up in the cleanup 
> path.
> 
> However, depending on the driver, this might require you to set a flag 
> in the driver specific structure (embedding struct drm_gpuva) whether 
> the gpuva is actually mapped (as in has active page table entries). 
> Maybe we could also just add such a flag to struct drm_gpuva. But yeah, 
> doesn't sound too nice to be honest...
> 
> > 2/ I'm worried that Liam's plan to only reserve what's actually needed
> > based on the maple tree state is going to play against us, because
> > the maple-tree is only modified at job exec time, and we might have
> > several unmaps happening between the moment we created and queued the
> > jobs, and the moment they actually get executed, meaning the
> > maple-tree reservation might no longer fit the bill.  
> 
> Yes, I'm aware and I explained to Liam in detail why we need the 
> mas_preallocate_worst_case() way of doing it.
> 
> See this mail: 
> https://lore.kernel.org/nouveau/68cd25de-e767-725e-2e7b

[PATCH v3] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-23 Thread Boris Brezillon
Drivers that can delegate waits to the firmware/GPU pass the scheduled
fence to drm_sched_job_add_dependency(), and issue wait commands to
the firmware/GPU at job submission time. For this to be possible, they
need all their 'native' dependencies to have a valid parent since this
is where the actual HW fence information is encoded.

In drm_sched_main(), we currently call drm_sched_fence_set_parent()
after drm_sched_fence_scheduled(), leaving a short period of time
during which the job depending on this fence can be submitted.

Since setting parent and signaling the fence are two things that are
kinda related (you can't have a parent if the job hasn't been scheduled),
it probably makes sense to pass the parent fence to
drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
before it signals the scheduled fence.

Here is a detailed description of the race we are fixing here:

Thread A                             Thread B

- calls drm_sched_fence_scheduled()
- signals s_fence->scheduled which
  wakes up thread B

                                     - entity dep signaled, checking
                                       the next dep
                                     - no more deps waiting
                                     - entity is picked for job
                                       submission by drm_gpu_scheduler
                                     - run_job() is called
                                     - run_job() tries to
                                       collect native fence info from
                                       s_fence->parent, but it's
                                       NULL =>
                                       BOOM, we can't do our native
                                       wait

- calls drm_sched_fence_set_parent()

v2:
* Fix commit message

v3:
* Add a detailed description of the race to the commit message
* Add Luben's R-b

Signed-off-by: Boris Brezillon 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 
Reviewed-by: Luben Tuikov 
---
 drivers/gpu/drm/scheduler/sched_fence.c | 40 +++--
 drivers/gpu/drm/scheduler/sched_main.c  |  3 +-
 include/drm/gpu_scheduler.h |  5 ++--
 3 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index fe9c6468e440..b6e70ddb4ee5 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -48,8 +48,32 @@ static void __exit drm_sched_fence_slab_fini(void)
kmem_cache_destroy(sched_fence_slab);
 }
 
-void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
+static void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
+  struct dma_fence *fence)
 {
+   /*
+* smp_store_release() to ensure another thread racing us
+* in drm_sched_fence_set_deadline_finished() sees the
+* fence's parent set before test_bit()
+*/
+   smp_store_release(&s_fence->parent, dma_fence_get(fence));
+   if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
+&s_fence->finished.flags))
+   dma_fence_set_deadline(fence, s_fence->deadline);
+}
+
+void drm_sched_fence_scheduled(struct drm_sched_fence *fence,
+  struct dma_fence *parent)
+{
+   /* Set the parent before signaling the scheduled fence, such that,
+* any waiter expecting the parent to be filled after the job has
+* been scheduled (which is the case for drivers delegating waits
+* to some firmware) doesn't have to busy wait for parent to show
+* up.
+*/
+   if (!IS_ERR_OR_NULL(parent))
+   drm_sched_fence_set_parent(fence, parent);
+
dma_fence_signal(&fence->scheduled);
 }
 
@@ -179,20 +203,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct 
dma_fence *f)
 }
 EXPORT_SYMBOL(to_drm_sched_fence);
 
-void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
-   struct dma_fence *fence)
-{
-   /*
-* smp_store_release() to ensure another thread racing us
-* in drm_sched_fence_set_deadline_finished() sees the
-* fence's parent set before test_bit()
-*/
-   smp_store_release(&s_fence->parent, dma_fence_get(fence));
-   if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
-&s_fence->finished.flags))
-   dma_fence_set_deadline(fence, s_fence->deadline);
-}
-
 struct drm_sched_fence *drm_sched_fence_alloc(struct drm_sched_entity *entity,
  void *owner)
 {
diff --git a/drivers/gpu/drm/scheduler/sched_

Re: [PATCH v3] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-23 Thread Boris Brezillon
On Fri, 23 Jun 2023 09:52:04 +0200
Boris Brezillon  wrote:

> Drivers that can delegate waits to the firmware/GPU pass the scheduled
> fence to drm_sched_job_add_dependency(), and issue wait commands to
> the firmware/GPU at job submission time. For this to be possible, they
> need all their 'native' dependencies to have a valid parent since this
> is where the actual HW fence information are encoded.
> 
> In drm_sched_main(), we currently call drm_sched_fence_set_parent()
> after drm_sched_fence_scheduled(), leaving a short period of time
> during which the job depending on this fence can be submitted.
> 
> Since setting parent and signaling the fence are two things that are
> kinda related (you can't have a parent if the job hasn't been scheduled),
> it probably makes sense to pass the parent fence to
> drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
> before it signals the scheduled fence.
> 
> Here is a detailed description of the race we are fixing here:
> 
> Thread A  Thread B
> 
> - calls drm_sched_fence_scheduled()
> - signals s_fence->scheduled which
>   wakes up thread B
> 
>   - entity dep signaled, checking
> the next dep
>   - no more deps waiting
>   - entity is picked for job
> submission by drm_gpu_scheduler
>   - run_job() is called
>   - run_job() tries to
> collect native fence info from
> s_fence->parent, but it's
> NULL =>
> BOOM, we can't do our native
> wait
> 
> - calls drm_sched_fence_set_parent()
> 
> v2:
> * Fix commit message
> 
> v3:
> * Add a detailed description of the race to the commit message
> * Add Luben's R-b
> 

FYI, I didn't put a Fixes tag because the various moves/modifications
that happened on this file will make it hard to backport anyway, and no
one complained about it so far. But if we want to have one, it would
probably be:

Fixes: 754ce0fa55c4 ("drm/amd: add parent for sched fence")

> Signed-off-by: Boris Brezillon 
> Cc: Frank Binns 
> Cc: Sarah Walker 
> Cc: Donald Robson 
> Cc: Luben Tuikov 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: Sumit Semwal 
> Cc: "Christian König" 
> Reviewed-by: Luben Tuikov 
> ---
>  drivers/gpu/drm/scheduler/sched_fence.c | 40 +++--
>  drivers/gpu/drm/scheduler/sched_main.c  |  3 +-
>  include/drm/gpu_scheduler.h |  5 ++--
>  3 files changed, 28 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
> b/drivers/gpu/drm/scheduler/sched_fence.c
> index fe9c6468e440..b6e70ddb4ee5 100644
> --- a/drivers/gpu/drm/scheduler/sched_fence.c
> +++ b/drivers/gpu/drm/scheduler/sched_fence.c
> @@ -48,8 +48,32 @@ static void __exit drm_sched_fence_slab_fini(void)
>   kmem_cache_destroy(sched_fence_slab);
>  }
>  
> -void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
> +static void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
> +struct dma_fence *fence)
>  {
> + /*
> +  * smp_store_release() to ensure another thread racing us
> +  * in drm_sched_fence_set_deadline_finished() sees the
> +  * fence's parent set before test_bit()
> +  */
> + smp_store_release(&s_fence->parent, dma_fence_get(fence));
> + if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
> +  &s_fence->finished.flags))
> + dma_fence_set_deadline(fence, s_fence->deadline);
> +}
> +
> +void drm_sched_fence_scheduled(struct drm_sched_fence *fence,
> +struct dma_fence *parent)
> +{
> + /* Set the parent before signaling the scheduled fence, such that,
> +  * any waiter expecting the parent to be filled after the job has
> +  * been scheduled (which is the case for drivers delegating waits
> +  * to some firmware) doesn't have to busy wait for parent to show
> +  * up.
> +  */
> + if (!IS_ERR_OR_NULL(parent))
> + drm_sched_fence_set_parent(fence, parent);
> +
>   dma_fence_signal(&fence->scheduled);
>  }
>  
> @@ -179,20 +203,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct 
> dma_fence *f)
>  }
>  EXPORT_S

Re: [PATCH v3] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-26 Thread Boris Brezillon
On Fri, 23 Jun 2023 14:37:57 -0400
Luben Tuikov  wrote:

> On 2023-06-23 04:03, Boris Brezillon wrote:
> > On Fri, 23 Jun 2023 09:52:04 +0200
> > Boris Brezillon  wrote:
> >   
> >> Drivers that can delegate waits to the firmware/GPU pass the scheduled
> >> fence to drm_sched_job_add_dependency(), and issue wait commands to
> >> the firmware/GPU at job submission time. For this to be possible, they
> >> need all their 'native' dependencies to have a valid parent since this
> >> is where the actual HW fence information are encoded.
> >>
> >> In drm_sched_main(), we currently call drm_sched_fence_set_parent()
> >> after drm_sched_fence_scheduled(), leaving a short period of time
> >> during which the job depending on this fence can be submitted.
> >>
> >> Since setting parent and signaling the fence are two things that are
> >> kinda related (you can't have a parent if the job hasn't been scheduled),
> >> it probably makes sense to pass the parent fence to
> >> drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
> >> before it signals the scheduled fence.
> >>
> >> Here is a detailed description of the race we are fixing here:
> >>
> >> Thread A   Thread B
> >>
> >> - calls drm_sched_fence_scheduled()
> >> - signals s_fence->scheduled which
> >>   wakes up thread B
> >>
> >>- entity dep signaled, checking
> >>  the next dep
> >>- no more deps waiting
> >>- entity is picked for job
> >>  submission by drm_gpu_scheduler
> >>- run_job() is called
> >>- run_job() tries to
> >>  collect native fence info from
> >>  s_fence->parent, but it's
> >>  NULL =>
> >>  BOOM, we can't do our native
> >>  wait
> >>
> >> - calls drm_sched_fence_set_parent()
> >>
> >> v2:
> >> * Fix commit message
> >>
> >> v3:
> >> * Add a detailed description of the race to the commit message
> >> * Add Luben's R-b
> >>  
> > 
> > FYI, I didn't put a Fixes tag because the various moves/modifications
> > that happened on this file will make it hard to backport anyway, and no
> > one complained about it so far. But if we want to have one, it would
> > probably be:
> > 
> > Fixes: 754ce0fa55c4 ("drm/amd: add parent for sched fence")
> >   
> 
> I agree with your assessment--the race fix doesn't seem to be pointing to
> or introduced by one particular change. Plus that fixes change is from 2016...
> So, we're good to go as is.

Queued to drm-misc-fixes.



Re: [PATCH v4 6/6] drm/shmem-helper: Switch to reservation lock

2023-06-26 Thread Boris Brezillon
Hi Dmitry,

On Tue, 30 May 2023 01:39:35 +0300
Dmitry Osipenko  wrote:

> Replace all drm-shmem locks with a GEM reservation lock. This makes locks
> consistent with dma-buf locking convention where importers are responsible
> for holding reservation lock for all operations performed over dma-bufs,
> preventing deadlock between dma-buf importers and exporters.

I've rebased some of my work on drm-misc-next this morning and noticed
that the drm_gem_shmem_get_pages() I was using to pin pages no longer
exists, so I ended up looking at this patch to check what I should use
instead, and I have a few questions/comments.

> 
> Suggested-by: Daniel Vetter 
> Acked-by: Thomas Zimmermann 
> Reviewed-by: Emil Velikov 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c| 210 --
>  drivers/gpu/drm/lima/lima_gem.c   |   8 +-
>  drivers/gpu/drm/panfrost/panfrost_drv.c   |   7 +-
>  .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |   6 +-
>  drivers/gpu/drm/panfrost/panfrost_mmu.c   |  19 +-
>  include/drm/drm_gem_shmem_helper.h|  14 +-
>  6 files changed, 116 insertions(+), 148 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 4ea6507a77e5..a783d2245599 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -88,8 +88,6 @@ __drm_gem_shmem_create(struct drm_device *dev, size_t size, 
> bool private)
>   if (ret)
>   goto err_release;
>  
> - mutex_init(&shmem->pages_lock);
> - mutex_init(&shmem->vmap_lock);
>   INIT_LIST_HEAD(&shmem->madv_list);
>  
>   if (!private) {
> @@ -141,11 +139,13 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object 
> *shmem)
>  {
>   struct drm_gem_object *obj = &shmem->base;
>  
> - drm_WARN_ON(obj->dev, shmem->vmap_use_count);
> -
>   if (obj->import_attach) {
>   drm_prime_gem_destroy(obj, shmem->sgt);
>   } else {
> + dma_resv_lock(shmem->base.resv, NULL);
> +
> + drm_WARN_ON(obj->dev, shmem->vmap_use_count);
> +
>   if (shmem->sgt) {
>   dma_unmap_sgtable(obj->dev->dev, shmem->sgt,
> DMA_BIDIRECTIONAL, 0);
> @@ -154,22 +154,24 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object 
> *shmem)
>   }
>   if (shmem->pages)
>   drm_gem_shmem_put_pages(shmem);
> - }
>  
> - drm_WARN_ON(obj->dev, shmem->pages_use_count);
> + drm_WARN_ON(obj->dev, shmem->pages_use_count);
> +
> + dma_resv_unlock(shmem->base.resv);
> + }
>  
>   drm_gem_object_release(obj);
> - mutex_destroy(&shmem->pages_lock);
> - mutex_destroy(&shmem->vmap_lock);
>   kfree(shmem);
>  }
>  EXPORT_SYMBOL_GPL(drm_gem_shmem_free);
>  
> -static int drm_gem_shmem_get_pages_locked(struct drm_gem_shmem_object *shmem)
> +static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)

I find this name change confusing, because the function requires the
GEM resv lock to be held, and the _locked suffix was making it pretty
clear.

>  {
>   struct drm_gem_object *obj = &shmem->base;
>   struct page **pages;
>  
> + dma_resv_assert_held(shmem->base.resv);
> +
>   if (shmem->pages_use_count++ > 0)
>   return 0;
>  
> @@ -197,35 +199,16 @@ static int drm_gem_shmem_get_pages_locked(struct 
> drm_gem_shmem_object *shmem)
>  }
>  
>  /*
> - * drm_gem_shmem_get_pages - Allocate backing pages for a shmem GEM object
> + * drm_gem_shmem_put_pages - Decrease use count on the backing pages for a 
> shmem GEM object
>   * @shmem: shmem GEM object
>   *
> - * This function makes sure that backing pages exists for the shmem GEM 
> object
> - * and increases the use count.
> - *
> - * Returns:
> - * 0 on success or a negative error code on failure.
> + * This function decreases the use count and puts the backing pages when use 
> drops to zero.
>   */
> -int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
> +void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)

Same comment about the name change. That's even more confusing since
this function was previously taking care of the locking. Also not sure
why you'd want to expose this _put() helper when the _get() helper is
private.

>  {
>   struct drm_gem_object *obj = &shmem->base;
> - int ret;
>  
> - drm_WARN_ON(obj->dev, obj->import_attach);
> -
> - ret = mutex_lock_interruptible(&shmem->pages_lock);
> - if (ret)
> - return ret;
> - ret = drm_gem_shmem_get_pages_locked(shmem);
> - mutex_unlock(&shmem->pages_lock);
> -
> - return ret;
> -}
> -EXPORT_SYMBOL(drm_gem_shmem_get_pages);
> -
> -static void drm_gem_shmem_put_pages_locked(struct drm_gem_shmem_object 
> *shmem)
> -{
> - struct drm_gem_object *obj = &shmem->base;
> + dma_resv_assert_held(shmem

Re: [PATCH v4 6/6] drm/shmem-helper: Switch to reservation lock

2023-06-26 Thread Boris Brezillon
On Mon, 26 Jun 2023 11:40:14 +0200
Boris Brezillon  wrote:

> Hi Dmitry,
> 
> On Tue, 30 May 2023 01:39:35 +0300
> Dmitry Osipenko  wrote:
> 
> > Replace all drm-shmem locks with a GEM reservation lock. This makes locks
> > consistent with dma-buf locking convention where importers are responsible
> > for holding reservation lock for all operations performed over dma-bufs,
> > preventing deadlock between dma-buf importers and exporters.  
> 
> I've rebased some of my work on drm-misc-next this morning and noticed
> that the drm_gem_shmem_get_pages() I was using to pin pages no longer
> exists, so I ended looking at this patch to check what I should use
> instead, and I have a few questions/comments.
> 
> > 
> > Suggested-by: Daniel Vetter 
> > Acked-by: Thomas Zimmermann 
> > Reviewed-by: Emil Velikov 
> > Signed-off-by: Dmitry Osipenko 
> > ---
> >  drivers/gpu/drm/drm_gem_shmem_helper.c| 210 --
> >  drivers/gpu/drm/lima/lima_gem.c   |   8 +-
> >  drivers/gpu/drm/panfrost/panfrost_drv.c   |   7 +-
> >  .../gpu/drm/panfrost/panfrost_gem_shrinker.c  |   6 +-
> >  drivers/gpu/drm/panfrost/panfrost_mmu.c   |  19 +-
> >  include/drm/drm_gem_shmem_helper.h|  14 +-
> >  6 files changed, 116 insertions(+), 148 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> > b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 4ea6507a77e5..a783d2245599 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -88,8 +88,6 @@ __drm_gem_shmem_create(struct drm_device *dev, size_t 
> > size, bool private)
> > if (ret)
> > goto err_release;
> >  
> > -   mutex_init(&shmem->pages_lock);
> > -   mutex_init(&shmem->vmap_lock);
> > INIT_LIST_HEAD(&shmem->madv_list);
> >  
> > if (!private) {
> > @@ -141,11 +139,13 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object 
> > *shmem)
> >  {
> > struct drm_gem_object *obj = &shmem->base;
> >  
> > -   drm_WARN_ON(obj->dev, shmem->vmap_use_count);
> > -
> > if (obj->import_attach) {
> > drm_prime_gem_destroy(obj, shmem->sgt);
> > } else {
> > +   dma_resv_lock(shmem->base.resv, NULL);
> > +
> > +   drm_WARN_ON(obj->dev, shmem->vmap_use_count);
> > +
> > if (shmem->sgt) {
> > dma_unmap_sgtable(obj->dev->dev, shmem->sgt,
> >   DMA_BIDIRECTIONAL, 0);
> > @@ -154,22 +154,24 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object 
> > *shmem)
> > }
> > if (shmem->pages)
> > drm_gem_shmem_put_pages(shmem);
> > -   }
> >  
> > -   drm_WARN_ON(obj->dev, shmem->pages_use_count);
> > +   drm_WARN_ON(obj->dev, shmem->pages_use_count);
> > +
> > +   dma_resv_unlock(shmem->base.resv);
> > +   }
> >  
> > drm_gem_object_release(obj);
> > -   mutex_destroy(&shmem->pages_lock);
> > -   mutex_destroy(&shmem->vmap_lock);
> > kfree(shmem);
> >  }
> >  EXPORT_SYMBOL_GPL(drm_gem_shmem_free);
> >  
> > -static int drm_gem_shmem_get_pages_locked(struct drm_gem_shmem_object 
> > *shmem)
> > +static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)  
> 
> I find this name change confusing, because the function requires the
> GEM resv lock to be held, and the _locked suffix was making it pretty
> clear.
> 
> >  {
> > struct drm_gem_object *obj = &shmem->base;
> > struct page **pages;
> >  
> > +   dma_resv_assert_held(shmem->base.resv);
> > +
> > if (shmem->pages_use_count++ > 0)
> > return 0;
> >  
> > @@ -197,35 +199,16 @@ static int drm_gem_shmem_get_pages_locked(struct 
> > drm_gem_shmem_object *shmem)
> >  }
> >  
> >  /*
> > - * drm_gem_shmem_get_pages - Allocate backing pages for a shmem GEM object
> > + * drm_gem_shmem_put_pages - Decrease use count on the backing pages for a 
> > shmem GEM object
> >   * @shmem: shmem GEM object
> >   *
> > - * This function makes sure that backing pages exists for the shmem GEM 
> > object
> > - * and increases the use count.
> > - *
> > - * Returns:
> > - * 0 on success or a negative error code on failure.
> > + * This function decreases the use count and pu

[PATCH 3/5] drm/shmem-helper: Inline drm_gem_shmem_{get,put}_pages()

2023-06-26 Thread Boris Brezillon
Move the drm_gem_shmem_{get,put}_pages() code to
drm_gem_shmem_{pin,unpin}_locked().

Signed-off-by: Boris Brezillon 
Cc: Daniel Vetter 
Cc: Thomas Zimmermann 
Cc: Emil Velikov 
Cc: Dmitry Osipenko 
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 108 ++---
 1 file changed, 41 insertions(+), 67 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index d6fc034164c0..f406556e42e0 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -128,46 +128,7 @@ struct drm_gem_shmem_object *drm_gem_shmem_create(struct 
drm_device *dev, size_t
 }
 EXPORT_SYMBOL_GPL(drm_gem_shmem_create);
 
-static void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);
-
-/**
- * drm_gem_shmem_free - Free resources associated with a shmem GEM object
- * @shmem: shmem GEM object to free
- *
- * This function cleans up the GEM object state and frees the memory used to
- * store the object itself.
- */
-void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
-{
-   struct drm_gem_object *obj = &shmem->base;
-
-   if (obj->import_attach) {
-   drm_prime_gem_destroy(obj, shmem->sgt);
-   } else {
-   dma_resv_lock(shmem->base.resv, NULL);
-
-   drm_WARN_ON(obj->dev, shmem->vmap_use_count);
-
-   if (shmem->sgt) {
-   dma_unmap_sgtable(obj->dev->dev, shmem->sgt,
- DMA_BIDIRECTIONAL, 0);
-   sg_free_table(shmem->sgt);
-   kfree(shmem->sgt);
-   }
-   if (shmem->pages)
-   drm_gem_shmem_put_pages(shmem);
-
-   drm_WARN_ON(obj->dev, shmem->pages_use_count);
-
-   dma_resv_unlock(shmem->base.resv);
-   }
-
-   drm_gem_object_release(obj);
-   kfree(shmem);
-}
-EXPORT_SYMBOL_GPL(drm_gem_shmem_free);
-
-static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
+static int drm_gem_shmem_pin_locked(struct drm_gem_shmem_object *shmem)
 {
struct drm_gem_object *obj = &shmem->base;
struct page **pages;
@@ -200,13 +161,7 @@ static int drm_gem_shmem_get_pages(struct 
drm_gem_shmem_object *shmem)
return 0;
 }
 
-/*
- * drm_gem_shmem_put_pages - Decrease use count on the backing pages for a 
shmem GEM object
- * @shmem: shmem GEM object
- *
- * This function decreases the use count and puts the backing pages when use 
drops to zero.
- */
-static void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
+static void drm_gem_shmem_unpin_locked(struct drm_gem_shmem_object *shmem)
 {
struct drm_gem_object *obj = &shmem->base;
 
@@ -229,23 +184,42 @@ static void drm_gem_shmem_put_pages(struct 
drm_gem_shmem_object *shmem)
shmem->pages = NULL;
 }
 
-static int drm_gem_shmem_pin_locked(struct drm_gem_shmem_object *shmem)
+/**
+ * drm_gem_shmem_free - Free resources associated with a shmem GEM object
+ * @shmem: shmem GEM object to free
+ *
+ * This function cleans up the GEM object state and frees the memory used to
+ * store the object itself.
+ */
+void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
 {
-   int ret;
+   struct drm_gem_object *obj = &shmem->base;
 
-   dma_resv_assert_held(shmem->base.resv);
+   if (obj->import_attach) {
+   drm_prime_gem_destroy(obj, shmem->sgt);
+   } else {
+   dma_resv_lock(shmem->base.resv, NULL);
 
-   ret = drm_gem_shmem_get_pages(shmem);
+   drm_WARN_ON(obj->dev, shmem->vmap_use_count);
 
-   return ret;
-}
-
-static void drm_gem_shmem_unpin_locked(struct drm_gem_shmem_object *shmem)
-{
-   dma_resv_assert_held(shmem->base.resv);
-
-   drm_gem_shmem_put_pages(shmem);
+   if (shmem->sgt) {
+   dma_unmap_sgtable(obj->dev->dev, shmem->sgt,
+ DMA_BIDIRECTIONAL, 0);
+   sg_free_table(shmem->sgt);
+   kfree(shmem->sgt);
+   }
+   if (shmem->pages)
+   drm_gem_shmem_unpin_locked(shmem);
+
+   drm_WARN_ON(obj->dev, shmem->pages_use_count);
+
+   dma_resv_unlock(shmem->base.resv);
+   }
+
+   drm_gem_object_release(obj);
+   kfree(shmem);
 }
+EXPORT_SYMBOL_GPL(drm_gem_shmem_free);
 
 /**
  * drm_gem_shmem_pin - Pin backing pages for a shmem GEM object
@@ -332,7 +306,7 @@ int drm_gem_shmem_vmap(struct drm_gem_shmem_object *shmem,
return 0;
}
 
-   ret = drm_gem_shmem_get_pages(shmem);
+   ret = drm_gem_shmem_pin_locked(shmem);
if (ret)
goto err_zero_use;
 
@@ -355,7 +329,7 @@ int drm_gem_shmem_vmap(struc

[PATCH 0/5] drm/shmem-helper: Follow-up on 'Switch to reservation lock'

2023-06-26 Thread Boris Brezillon
Hello,

As mentioned here [1], after rebasing some of my work on
drm-misc-next this morning I noticed that the
drm_gem_shmem_get_pages() I was using to pin pages to a GEM no longer
exists, so I ended up looking at 21aa27ddc582 ("drm/shmem-helper: Switch
to reservation lock") and came up with a few changes to help clarify
the situation.

Note that we will soon need to have drm_gem_shmem_[un]pin_locked()
exposed for the PowerVR and new Mali drivers so we can pin memory
after we've acquired the GEM locks using drm_exec. Not entirely sure
if this should take the form of some generic
drm_gem_[un]pin[_unlocked]() helpers like we have for v[un]map()
operations, or if this should stay shmem-specific.
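
To make that more concrete, here is a hedged sketch (not an existing
helper) of the shape a generic drm_gem_pin_unlocked() could take,
dispatching to drm_gem_object_funcs::pin under the resv lock:

#include <drm/drm_gem.h>
#include <linux/dma-resv.h>

static int drm_gem_pin_unlocked(struct drm_gem_object *obj)
{
        int ret;

        /* Nothing to do for drivers that don't implement ::pin(). */
        if (!obj->funcs || !obj->funcs->pin)
                return 0;

        ret = dma_resv_lock_interruptible(obj->resv, NULL);
        if (ret)
                return ret;

        ret = obj->funcs->pin(obj);
        dma_resv_unlock(obj->resv);

        return ret;
}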

Regards,

Boris

[1] https://patchwork.freedesktop.org/patch/539994/

Cc: Daniel Vetter 
Cc: Thomas Zimmermann 
Cc: Emil Velikov 
Cc: Dmitry Osipenko 

Boris Brezillon (5):
  drm/panfrost: Stop using drm_gem_shmem_put_pages()
  drm/shmem-helper: Stop exposing drm_gem_shmem_put_pages()
  drm/shmem-helper: Inline drm_gem_shmem_{get,put}_pages()
  drm/shmem-helper: Make dma_resv_assert_held() unconditional in
drm_gem_shmem_v[un]map()
  drm/shmem-helper: Clarify drm_gem_shmem_v[un]map() usage

 drivers/gpu/drm/drm_gem_shmem_helper.c  | 125 +++-
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  13 ++-
 include/drm/drm_gem_shmem_helper.h  |   1 -
 3 files changed, 64 insertions(+), 75 deletions(-)

-- 
2.41.0



[PATCH 4/5] drm/shmem-helper: Make dma_resv_assert_held() unconditional in drm_gem_shmem_v[un]map()

2023-06-26 Thread Boris Brezillon
dma_resv lock should be held in both the dma-buf and native GEM cases,
so let's just move the dma_resv_assert_held() check out of the !dma-buf
block.

Signed-off-by: Boris Brezillon 
Cc: Daniel Vetter 
Cc: Thomas Zimmermann 
Cc: Emil Velikov 
Cc: Dmitry Osipenko 
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index f406556e42e0..2b8a32f6b656 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -288,6 +288,8 @@ int drm_gem_shmem_vmap(struct drm_gem_shmem_object *shmem,
struct drm_gem_object *obj = &shmem->base;
int ret = 0;
 
+   dma_resv_assert_held(shmem->base.resv);
+
if (obj->import_attach) {
ret = dma_buf_vmap(obj->import_attach->dmabuf, map);
if (!ret) {
@@ -299,8 +301,6 @@ int drm_gem_shmem_vmap(struct drm_gem_shmem_object *shmem,
} else {
pgprot_t prot = PAGE_KERNEL;
 
-   dma_resv_assert_held(shmem->base.resv);
-
if (shmem->vmap_use_count++ > 0) {
iosys_map_set_vaddr(map, shmem->vaddr);
return 0;
@@ -354,11 +354,11 @@ void drm_gem_shmem_vunmap(struct drm_gem_shmem_object 
*shmem,
 {
struct drm_gem_object *obj = &shmem->base;
 
+   dma_resv_assert_held(shmem->base.resv);
+
if (obj->import_attach) {
dma_buf_vunmap(obj->import_attach->dmabuf, map);
} else {
-   dma_resv_assert_held(shmem->base.resv);
-
if (drm_WARN_ON_ONCE(obj->dev, !shmem->vmap_use_count))
return;
 
-- 
2.41.0



[PATCH 5/5] drm/shmem-helper: Clarify drm_gem_shmem_v[un]map() usage

2023-06-26 Thread Boris Brezillon
Drivers are not supposed to call these functions directly when they
want to map/unmap a GEM in kernel space. They should instead go
through drm_gem_v[un]map[_unlocked]() that will forward the request
to drm_gem_object_funcs::v[un]map() which in turn will call
drm_gem_shmem_v[un]map().

Let's clarify that in the functions doc.
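
For reference, a minimal hedged usage sketch of the recommended path
(example_process() is a placeholder, error handling trimmed):

static int example_access_bo(struct drm_gem_object *obj)
{
        struct iosys_map map;
        int ret;

        /* Takes the resv lock and dispatches to ::vmap(), which is
         * drm_gem_shmem_vmap() for shmem-backed objects.
         */
        ret = drm_gem_vmap_unlocked(obj, &map);
        if (ret)
                return ret;

        example_process(map.vaddr, obj->size);  /* placeholder */

        drm_gem_vunmap_unlocked(obj, &map);
        return 0;
}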

Signed-off-by: Boris Brezillon 
Cc: Daniel Vetter 
Cc: Thomas Zimmermann 
Cc: Emil Velikov 
Cc: Dmitry Osipenko 
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 2b8a32f6b656..daada172fe70 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -279,6 +279,11 @@ EXPORT_SYMBOL(drm_gem_shmem_unpin);
  *
  * Acquired mappings should be cleaned up by calling drm_gem_shmem_vunmap().
  *
+ * This function is not meant to be used directly, but rather used as a helper
+ * to implement driver-specific versions of drm_gem_object_funcs::vmap(). If
+ * you need to vmap() a GEM object from your driver, use
+ * drm_gem_vmap[_unlocked]() instead.
+ *
  * Returns:
  * 0 on success or a negative error code on failure.
  */
@@ -348,6 +353,11 @@ EXPORT_SYMBOL(drm_gem_shmem_vmap);
  *
  * This function hides the differences between dma-buf imported and natively
  * allocated objects.
+ *
+ * This function is not meant to be used directly, but rather used as a helper
+ * to implement driver-specific versions of drm_gem_object_funcs::vunmap(). If
+ * you need to vunmap() a GEM object from your driver, use
+ * drm_gem_vunmap[_unlocked]() instead.
  */
 void drm_gem_shmem_vunmap(struct drm_gem_shmem_object *shmem,
  struct iosys_map *map)
-- 
2.41.0



[PATCH 1/5] drm/panfrost: Stop using drm_gem_shmem_put_pages()

2023-06-26 Thread Boris Brezillon
We want to get rid of this helper function, so let's use
drm_gem_shmem_unpin() and move this call out of the
dma_resv-locked section.

Signed-off-by: Boris Brezillon 
Cc: Daniel Vetter 
Cc: Thomas Zimmermann 
Cc: Emil Velikov 
Cc: Dmitry Osipenko 
Cc: Rob Herring 
Cc: Steven Price 
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index c0123d09f699..0b12f03ef0be 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -447,6 +447,7 @@ static int panfrost_mmu_map_fault_addr(struct 
panfrost_device *pfdev, int as,
pgoff_t page_offset;
struct sg_table *sgt;
struct page **pages;
+   bool pinned = false;
 
bomapping = addr_to_mapping(pfdev, as, addr);
if (!bomapping)
@@ -488,12 +489,14 @@ static int panfrost_mmu_map_fault_addr(struct 
panfrost_device *pfdev, int as,
}
bo->base.pages = pages;
bo->base.pages_use_count = 1;
+   pinned = true;
} else {
pages = bo->base.pages;
if (pages[page_offset]) {
/* Pages are already mapped, bail out. */
goto out;
}
+   pinned = true;
}
 
mapping = bo->base.base.filp->f_mapping;
@@ -504,7 +507,7 @@ static int panfrost_mmu_map_fault_addr(struct 
panfrost_device *pfdev, int as,
if (IS_ERR(pages[i])) {
ret = PTR_ERR(pages[i]);
pages[i] = NULL;
-   goto err_pages;
+   goto err_unlock;
}
}
 
@@ -512,7 +515,7 @@ static int panfrost_mmu_map_fault_addr(struct 
panfrost_device *pfdev, int as,
ret = sg_alloc_table_from_pages(sgt, pages + page_offset,
NUM_FAULT_PAGES, 0, SZ_2M, GFP_KERNEL);
if (ret)
-   goto err_pages;
+   goto err_unlock;
 
ret = dma_map_sgtable(pfdev->dev, sgt, DMA_BIDIRECTIONAL, 0);
if (ret)
@@ -534,10 +537,12 @@ static int panfrost_mmu_map_fault_addr(struct 
panfrost_device *pfdev, int as,
 
 err_map:
sg_free_table(sgt);
-err_pages:
-   drm_gem_shmem_put_pages(&bo->base);
 err_unlock:
dma_resv_unlock(obj->resv);
+
+   if (ret && pinned)
+   drm_gem_shmem_unpin(&bo->base);
+
 err_bo:
panfrost_gem_mapping_put(bomapping);
return ret;
-- 
2.41.0



[PATCH 2/5] drm/shmem-helper: Stop exposing drm_gem_shmem_put_pages()

2023-06-26 Thread Boris Brezillon
The last user (panfrost) moved to drm_gem_shmem_unpin(), so it's now
safe to make this function private.

Signed-off-by: Boris Brezillon 
Cc: Daniel Vetter 
Cc: Thomas Zimmermann 
Cc: Emil Velikov 
Cc: Dmitry Osipenko 
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 5 +++--
 include/drm/drm_gem_shmem_helper.h | 1 -
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index a783d2245599..d6fc034164c0 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -128,6 +128,8 @@ struct drm_gem_shmem_object *drm_gem_shmem_create(struct 
drm_device *dev, size_t
 }
 EXPORT_SYMBOL_GPL(drm_gem_shmem_create);
 
+static void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);
+
 /**
  * drm_gem_shmem_free - Free resources associated with a shmem GEM object
  * @shmem: shmem GEM object to free
@@ -204,7 +206,7 @@ static int drm_gem_shmem_get_pages(struct 
drm_gem_shmem_object *shmem)
  *
  * This function decreases the use count and puts the backing pages when use 
drops to zero.
  */
-void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
+static void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
 {
struct drm_gem_object *obj = &shmem->base;
 
@@ -226,7 +228,6 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object 
*shmem)
  shmem->pages_mark_accessed_on_put);
shmem->pages = NULL;
 }
-EXPORT_SYMBOL(drm_gem_shmem_put_pages);
 
 static int drm_gem_shmem_pin_locked(struct drm_gem_shmem_object *shmem)
 {
diff --git a/include/drm/drm_gem_shmem_helper.h 
b/include/drm/drm_gem_shmem_helper.h
index 2867d2aba88b..f55f8739acc0 100644
--- a/include/drm/drm_gem_shmem_helper.h
+++ b/include/drm/drm_gem_shmem_helper.h
@@ -99,7 +99,6 @@ struct drm_gem_shmem_object {
 struct drm_gem_shmem_object *drm_gem_shmem_create(struct drm_device *dev, 
size_t size);
 void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem);
 
-void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);
 int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem);
 void drm_gem_shmem_unpin(struct drm_gem_shmem_object *shmem);
 int drm_gem_shmem_vmap(struct drm_gem_shmem_object *shmem,
-- 
2.41.0



Re: [PATCH 1/5] drm/panfrost: Stop using drm_gem_shmem_put_pages()

2023-06-26 Thread Boris Brezillon
On Mon, 26 Jun 2023 16:20:53 +0300
Dmitry Osipenko  wrote:

> On 6/26/23 15:02, Boris Brezillon wrote:
> > -err_pages:
> > -   drm_gem_shmem_put_pages(&bo->base);
> >  err_unlock:
> > dma_resv_unlock(obj->resv);
> > +
> > +   if (ret && pinned)
> > +   drm_gem_shmem_unpin(&bo->base);  
> 
> The drm_gem_shmem_unpin() was supposed to be used only in conjunction
> with drm_gem_shmem_pin(). I've a pending patch to enable the pin/unpin
> refcounting needed by drm-shmem shrinker, it will prohibit invocation of
> unpin without a previous pin.

That driver is a bit special in that, in the growable BO case
(AKA pin-on-demand), the driver replaces the drm_gem_shmem_pin()
implementation with a custom one (the logic in
panfrost_mmu_map_fault_addr()), but still relies on the
default implementation to release things. We do increment the
pages_use_count manually to make sure the drm_gem_shmem_unpin() is
balanced.

> 
> I'm wondering whether it will be okay to simply remove
> drm_gem_shmem_put_pages() from the Panfrost code, letting pages to be
> kept allocated in a error case. They will be freed once BO is destroyed.
> 

I'm pretty sure the implementation will then complain about unbalanced
pin/unpin (or get_pages/put_pages) if we do that. I guess one option
would be to completely bypass drm_gem_shmem_[un]pin() for growable BOs
and manage the pages separately at the panfrost_gem_object level, but
the original idea was probably to re-use some of the fields/logic we
had in drm_gem_shmem_object and make partial pinning as close as
possible to regular pinning. Another option would be to teach the shmem
about partial pinning, but I'm not sure we want to expose such a
feature.


Re: [PATCH v13 03/10] drm/shmem-helper: Add pages_pin_count field

2023-06-26 Thread Boris Brezillon
Hi Dmitry,

Sorry for chiming in only now :-/.

On Tue, 14 Mar 2023 05:26:52 +0300
Dmitry Osipenko  wrote:

> And new pages_pin_count field to struct drm_gem_shmem_object that will
> determine whether pages are evictable by memory shrinker. The pages will
> be evictable only when pages_pin_count=0. This patch prepares code for
> addition of the memory shrinker that will utilize the new field.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 7 +++
>  include/drm/drm_gem_shmem_helper.h | 9 +
>  2 files changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 4da9c9c39b9a..81d61791f874 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -277,6 +277,8 @@ static int drm_gem_shmem_pin_locked(struct 
> drm_gem_shmem_object *shmem)
>   drm_WARN_ON(obj->dev, obj->import_attach);
>  
>   ret = drm_gem_shmem_get_pages(shmem);
> + if (!ret)
> + shmem->pages_pin_count++;
>  
>   return ret;
>  }
> @@ -289,7 +291,12 @@ static void drm_gem_shmem_unpin_locked(struct 
> drm_gem_shmem_object *shmem)
>  
>   drm_WARN_ON(obj->dev, obj->import_attach);
>  
> + if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_pin_count))
> + return;
> +
>   drm_gem_shmem_put_pages(shmem);
> +
> + shmem->pages_pin_count--;
>  }
>  
>  /**
> diff --git a/include/drm/drm_gem_shmem_helper.h 
> b/include/drm/drm_gem_shmem_helper.h
> index 20ddcd799df9..7d823c9fc480 100644
> --- a/include/drm/drm_gem_shmem_helper.h
> +++ b/include/drm/drm_gem_shmem_helper.h
> @@ -39,6 +39,15 @@ struct drm_gem_shmem_object {
>*/
>   unsigned int pages_use_count;
>  
> + /**
> +  * @pages_pin_count:
> +  *
> +  * Reference count on the pinned pages table.
> +  * The pages allowed to be evicted by memory shrinker
> +  * only when the count is zero.
> +  */
> + unsigned int pages_pin_count;

s/pages_pin_count/pin_count/ ?

And do we really need both pages_pin_count and pages_use_count? Looks
like they both serve the same purpose, with one exception:
pages_use_count is also incremented in the get_pages_sgt_locked() path,
but you probably don't want it to prevent GEM eviction. Assuming
your goal with this pin_count field is to check if a GEM object is
evictable, it can be done with something like

bool
drm_gem_shmem_is_evictable_locked(struct drm_gem_shmem_object *shmem)
{
dma_resv_assert_held(shmem->base.resv);

return shmem->pages_use_count == (shmem->sgt ? 1 : 0);
}

I mean, I'm not against renaming pages_use_count into pin_count, but,
unless I'm missing something, I don't see a good reason to keep both.

Regards,

Boris


Re: [PATCH v13 03/10] drm/shmem-helper: Add pages_pin_count field

2023-06-26 Thread Boris Brezillon
On Mon, 26 Jun 2023 17:04:57 +0200
Boris Brezillon  wrote:

> Hi Dmitry,
> 
> Sorry for chiming in only now :-/.
> 
> On Tue, 14 Mar 2023 05:26:52 +0300
> Dmitry Osipenko  wrote:
> 
> > And new pages_pin_count field to struct drm_gem_shmem_object that will
> > determine whether pages are evictable by memory shrinker. The pages will
> > be evictable only when pages_pin_count=0. This patch prepares code for
> > addition of the memory shrinker that will utilize the new field.
> > 
> > Signed-off-by: Dmitry Osipenko 
> > ---
> >  drivers/gpu/drm/drm_gem_shmem_helper.c | 7 +++
> >  include/drm/drm_gem_shmem_helper.h | 9 +
> >  2 files changed, 16 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> > b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 4da9c9c39b9a..81d61791f874 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -277,6 +277,8 @@ static int drm_gem_shmem_pin_locked(struct 
> > drm_gem_shmem_object *shmem)
> > drm_WARN_ON(obj->dev, obj->import_attach);
> >  
> > ret = drm_gem_shmem_get_pages(shmem);
> > +   if (!ret)
> > +   shmem->pages_pin_count++;
> >  
> > return ret;
> >  }
> > @@ -289,7 +291,12 @@ static void drm_gem_shmem_unpin_locked(struct 
> > drm_gem_shmem_object *shmem)
> >  
> > drm_WARN_ON(obj->dev, obj->import_attach);
> >  
> > +   if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_pin_count))
> > +   return;
> > +
> > drm_gem_shmem_put_pages(shmem);
> > +
> > +   shmem->pages_pin_count--;
> >  }
> >  
> >  /**
> > diff --git a/include/drm/drm_gem_shmem_helper.h 
> > b/include/drm/drm_gem_shmem_helper.h
> > index 20ddcd799df9..7d823c9fc480 100644
> > --- a/include/drm/drm_gem_shmem_helper.h
> > +++ b/include/drm/drm_gem_shmem_helper.h
> > @@ -39,6 +39,15 @@ struct drm_gem_shmem_object {
> >  */
> > unsigned int pages_use_count;
> >  
> > +   /**
> > +* @pages_pin_count:
> > +*
> > +* Reference count on the pinned pages table.
> > +* The pages allowed to be evicted by memory shrinker
> > +* only when the count is zero.
> > +*/
> > +   unsigned int pages_pin_count;  
> 
> s/pages_pin_count/pin_count/ ?
> 
> And do we really need both pages_pin_count and pages_use_count. Looks
> like they both serve the same purpose, with one exception:
> pages_use_count is also incremented in the get_pages_sgt_locked() path,
> but you probably don't want it to prevent GEM eviction. Assuming
> your goal with this pin_count field is to check if a GEM object is
> evictable, it can be done with something like
> 
> bool
> drm_gem_shmem_is_evictable_locked(struct drm_gem_shmem_object *shmem)
> {
>   dma_resv_assert_held(shmem->base.resv);
> 
>   return shmem->pages_use_count == (shmem->sgt ? 1 : 0);
> }
> 
> I mean, I'm not against renaming pages_use_count into pin_count, but,
> unless I'm missing something, I don't see a good reason to keep both.

My bad, I think I found one place calling drm_gem_shmem_get_pages()
where we want pin_count and pages_use_count to differ:
drm_gem_shmem_mmap(). We certainly don't want userspace mappings to
prevent eviction.
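
A hedged sketch of what that implies for the evictability check (field
names follow your series; this is not actual patch code):

static bool example_shmem_is_evictable(struct drm_gem_shmem_object *shmem)
{
        dma_resv_assert_held(shmem->base.resv);

        /* mmap() takes a pages reference but no pin, so evictability
         * has to be keyed on the dedicated pin counter rather than on
         * pages_use_count alone.
         */
        return shmem->pages_pin_count == 0;
}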


Re: [PATCH 3/5] drm/shmem-helper: Inline drm_gem_shmem_{get,put}_pages()

2023-06-26 Thread Boris Brezillon
On Mon, 26 Jun 2023 14:02:45 +0200
Boris Brezillon  wrote:

> Move code drm_gem_shmem_{get,put}_pages() code to
> drm_gem_shmem_{pin,unpin}_locked().

After having a closer look at 'Add generic memory shrinker to VirtIO-GPU
and  Panfrost DRM drivers', I realize that's not what we want. We must
differentiate hard-pinning (as in, can't be evicted until all users
give up the ref they have) and soft-pinning (users can survive a
swapout, basically userspace mappings created through
drm_gem_shmem_mmap()).

> 
> Signed-off-by: Boris Brezillon 
> Cc: Daniel Vetter 
> Cc: Thomas Zimmermann 
> Cc: Emil Velikov 
> Cc: Dmitry Osipenko 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 108 ++---
>  1 file changed, 41 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index d6fc034164c0..f406556e42e0 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -128,46 +128,7 @@ struct drm_gem_shmem_object *drm_gem_shmem_create(struct 
> drm_device *dev, size_t
>  }
>  EXPORT_SYMBOL_GPL(drm_gem_shmem_create);
>  
> -static void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem);
> -
> -/**
> - * drm_gem_shmem_free - Free resources associated with a shmem GEM object
> - * @shmem: shmem GEM object to free
> - *
> - * This function cleans up the GEM object state and frees the memory used to
> - * store the object itself.
> - */
> -void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
> -{
> - struct drm_gem_object *obj = &shmem->base;
> -
> - if (obj->import_attach) {
> - drm_prime_gem_destroy(obj, shmem->sgt);
> - } else {
> - dma_resv_lock(shmem->base.resv, NULL);
> -
> - drm_WARN_ON(obj->dev, shmem->vmap_use_count);
> -
> - if (shmem->sgt) {
> - dma_unmap_sgtable(obj->dev->dev, shmem->sgt,
> -   DMA_BIDIRECTIONAL, 0);
> - sg_free_table(shmem->sgt);
> - kfree(shmem->sgt);
> - }
> - if (shmem->pages)
> - drm_gem_shmem_put_pages(shmem);
> -
> - drm_WARN_ON(obj->dev, shmem->pages_use_count);
> -
> - dma_resv_unlock(shmem->base.resv);
> - }
> -
> - drm_gem_object_release(obj);
> - kfree(shmem);
> -}
> -EXPORT_SYMBOL_GPL(drm_gem_shmem_free);
> -
> -static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
> +static int drm_gem_shmem_pin_locked(struct drm_gem_shmem_object *shmem)
>  {
>   struct drm_gem_object *obj = &shmem->base;
>   struct page **pages;
> @@ -200,13 +161,7 @@ static int drm_gem_shmem_get_pages(struct 
> drm_gem_shmem_object *shmem)
>   return 0;
>  }
>  
> -/*
> - * drm_gem_shmem_put_pages - Decrease use count on the backing pages for a 
> shmem GEM object
> - * @shmem: shmem GEM object
> - *
> - * This function decreases the use count and puts the backing pages when use 
> drops to zero.
> - */
> -static void drm_gem_shmem_put_pages(struct drm_gem_shmem_object *shmem)
> +static void drm_gem_shmem_unpin_locked(struct drm_gem_shmem_object *shmem)
>  {
>   struct drm_gem_object *obj = &shmem->base;
>  
> @@ -229,23 +184,42 @@ static void drm_gem_shmem_put_pages(struct 
> drm_gem_shmem_object *shmem)
>   shmem->pages = NULL;
>  }
>  
> -static int drm_gem_shmem_pin_locked(struct drm_gem_shmem_object *shmem)
> +/**
> + * drm_gem_shmem_free - Free resources associated with a shmem GEM object
> + * @shmem: shmem GEM object to free
> + *
> + * This function cleans up the GEM object state and frees the memory used to
> + * store the object itself.
> + */
> +void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
>  {
> - int ret;
> + struct drm_gem_object *obj = &shmem->base;
>  
> - dma_resv_assert_held(shmem->base.resv);
> + if (obj->import_attach) {
> + drm_prime_gem_destroy(obj, shmem->sgt);
> + } else {
> + dma_resv_lock(shmem->base.resv, NULL);
>  
> - ret = drm_gem_shmem_get_pages(shmem);
> + drm_WARN_ON(obj->dev, shmem->vmap_use_count);
>  
> - return ret;
> -}
> -
> -static void drm_gem_shmem_unpin_locked(struct drm_gem_shmem_object *shmem)
> -{
> - dma_resv_assert_held(shmem->base.resv);
> -
> - drm_gem_shmem_put_pages(shmem);
> + if (shmem->sgt) {
> + dma_unmap_sgtable(obj->dev->dev, shmem->sgt,

Re: [PATCH 1/5] drm/panfrost: Stop using drm_gem_shmem_put_pages()

2023-06-26 Thread Boris Brezillon
On Mon, 26 Jun 2023 16:20:53 +0300
Dmitry Osipenko  wrote:

> On 6/26/23 15:02, Boris Brezillon wrote:
> > -err_pages:
> > -   drm_gem_shmem_put_pages(&bo->base);
> >  err_unlock:
> > dma_resv_unlock(obj->resv);
> > +
> > +   if (ret && pinned)
> > +   drm_gem_shmem_unpin(&bo->base);  
> 
> The drm_gem_shmem_unpin() was supposed to be used only in conjunction
> with drm_gem_shmem_pin(). I've a pending patch to enable the pin/unpin
> refcounting needed by drm-shmem shrinker, it will prohibit invocation of
> unpin without a previous pin.
> 
> I'm wondering whether it will be okay to simply remove
> drm_gem_shmem_put_pages() from the Panfrost code, letting the pages be
> kept allocated in the error case. They will be freed once the BO is destroyed.
> 

Okay, so after looking at your shmem-shrinker series, I confirm we need
to take a pin ref here (hard-pin), otherwise the buffer might be
evicted before the GPU is done, especially after you drop gpu_usecount
and use only pin_count to check whether a GEM object can be evicted or
not.


Re: [PATCH 1/5] drm/panfrost: Stop using drm_gem_shmem_put_pages()

2023-06-26 Thread Boris Brezillon
On Mon, 26 Jun 2023 19:06:55 +0300
Dmitry Osipenko  wrote:

> On 6/26/23 18:43, Boris Brezillon wrote:
> > On Mon, 26 Jun 2023 16:20:53 +0300
> > Dmitry Osipenko  wrote:
> >   
> >> On 6/26/23 15:02, Boris Brezillon wrote:  
> >>> -err_pages:
> >>> - drm_gem_shmem_put_pages(&bo->base);
> >>>  err_unlock:
> >>>   dma_resv_unlock(obj->resv);
> >>> +
> >>> + if (ret && pinned)
> >>> + drm_gem_shmem_unpin(&bo->base);
> >>
> >> The drm_gem_shmem_unpin() was supposed to be used only in conjunction
> >> with drm_gem_shmem_pin(). I've a pending patch to enable the pin/unpin
> >> refcounting needed by drm-shmem shrinker, it will prohibit invocation of
> >> unpin without a previous pin.
> >>
> >> I'm wondering whether it will be okay to simply remove
> >> drm_gem_shmem_put_pages() from the Panfrost code, letting the pages be
> >> kept allocated in the error case. They will be freed once the BO is destroyed.
> >>  
> > 
> > Okay, so after looking at your shmem-shrinker series, I confirm we need
> > to take a pin ref here (hard-pin), otherwise the buffer might be
> > evicted before the GPU is done, especially after you drop gpu_usecount
> > and use only pin_count to check whether a GEM object can be evicted or
> > not.  
> 
> See the drm_gem_evict() [1], it checks whether GEM is busy, preventing
> BO eviction while it is in-use by GPU. Note that in case of Panfrost,
> shrinker isn't enabled for growable BOs.

Okay, we should be good then, sorry for the confusion.


Re: [PATCH drm-next v6 02/13] drm: manager to keep track of GPUs VA mappings

2023-06-30 Thread Boris Brezillon
Hi Danilo,

On Fri, 30 Jun 2023 00:25:18 +0200
Danilo Krummrich  wrote:

> + *   int driver_gpuva_remap(struct drm_gpuva_op *op, void *__ctx)
> + *   {
> + *   struct driver_context *ctx = __ctx;
> + *
> + *   drm_gpuva_remap(ctx->prev_va, ctx->next_va, &op->remap);
> + *
> + *   drm_gpuva_unlink(op->remap.unmap->va);
> + *   kfree(op->remap.unmap->va);
> + *
> + *   if (op->remap.prev) {
> + *   drm_gpuva_link(ctx->prev_va);

I ended up switching to dma_resv-based locking for the GEMs and I
wonder what the locking is supposed to look like in the async-mapping
case, where we insert/remove the VA nodes in the drm_sched::run_job()
path.

What I have right now is something like:

dma_resv_lock(vm->resv);

// split done in drm_gpuva_sm_map(), each iteration
// of the loop is a call to the driver ->[re,un]map()
// hook
for_each_sub_op() {

// Private BOs have their resv field pointing to the
// VM resv and we take the VM resv lock before calling
// drm_gpuva_sm_map()
if (vm->resv != gem->resv)
dma_resv_lock(gem->resv);

drm_gpuva_[un]link(va);
gem_[un]pin(gem);

if (vm->resv != gem->resv)
dma_resv_unlock(gem->resv);
}

dma_resv_unlock(vm->resv);

In practice, I don't expect things to deadlock, because the VM resv is
not supposed to be taken outside the VM context and the locking order
is always the same (VM lock first, and then each shared BO
taken/released independently), but I'm not super thrilled by this
nested lock, and I'm wondering if we shouldn't have a pass collecting
locks in a drm_exec context first, and then have
the operations executed. IOW, something like that:

drm_exec_init(exec, DRM_EXEC_IGNORE_DUPLICATES)
drm_exec_until_all_locked(exec) {
// Dummy GEM is the dummy GEM object I use to make the VM
// participate in the locking without having to teach
// drm_exec how to deal with raw dma_resv objects.
ret = drm_exec_lock_obj(exec, vm->dummy_gem);
drm_exec_retry_on_contention(exec);
if (ret)
return ret;

// Could take the form of drm_gpuva_sm_[un]map_acquire_locks()
// helpers
for_each_sub_op() {
ret = drm_exec_lock_obj(exec, gem);
if (ret)
return ret;
}
}

// each iteration of the loop is a call to the driver
// ->[re,un]map() hook
for_each_sub_op() {
...
gem_[un]pin_locked(gem);
drm_gpuva_[un]link(va);
...
}

drm_exec_fini(exec);

Don't know if I got this right, or if I'm just confused again by how
the drm_gpuva API is supposed to be used.

Regards,

Boris

> + *   ctx->prev_va = NULL;
> + *   }
> + *
> + *   if (op->remap.next) {
> + *   drm_gpuva_link(ctx->next_va);
> + *   ctx->next_va = NULL;
> + *   }
> + *
> + *   return 0;
> + *   }


Re: [PATCH drm-next v6 02/13] drm: manager to keep track of GPUs VA mappings

2023-06-30 Thread Boris Brezillon
On Fri, 30 Jun 2023 10:02:52 +0200
Boris Brezillon  wrote:

> Hi Danilo,
> 
> On Fri, 30 Jun 2023 00:25:18 +0200
> Danilo Krummrich  wrote:
> 
> > + * int driver_gpuva_remap(struct drm_gpuva_op *op, void *__ctx)
> > + * {
> > + * struct driver_context *ctx = __ctx;
> > + *
> > + * drm_gpuva_remap(ctx->prev_va, ctx->next_va, &op->remap);
> > + *
> > + * drm_gpuva_unlink(op->remap.unmap->va);
> > + * kfree(op->remap.unmap->va);
> > + *
> > + * if (op->remap.prev) {
> > + * drm_gpuva_link(ctx->prev_va);
> 
> I ended up switching to dma_resv-based locking for the GEMs and I
> wonder what the locking is supposed to look like in the async-mapping
> case, where we insert/remove the VA nodes in the drm_sched::run_job()
> path.
> 
> What I have right now is something like:
> 
>   dma_resv_lock(vm->resv);
> 
>   // split done in drm_gpuva_sm_map(), each iteration
>   // of the loop is a call to the driver ->[re,un]map()
>   // hook
>   for_each_sub_op() {
>   
>   // Private BOs have their resv field pointing to the
>   // VM resv and we take the VM resv lock before calling
>   // drm_gpuva_sm_map()
>   if (vm->resv != gem->resv)
>   dma_resv_lock(gem->resv);
> 
>   drm_gpuva_[un]link(va);
>   gem_[un]pin(gem);
> 
>   if (vm->resv != gem->resv)
>   dma_resv_unlock(gem->resv);
>   }
> 
>   dma_resv_unlock(vm->resv);
> 
> In practice, I don't expect things to deadlock, because the VM resv is
> not supposed to be taken outside the VM context and the locking order
> is always the same (VM lock first, and then each shared BO
> taken/released independently), but I'm not super thrilled by this
> nested lock, and I'm wondering if we shouldn't have a pass collecting
> locks in a drm_exec context first, and then have
> the operations executed. IOW, something like that:
> 
>   drm_exec_init(exec, DRM_EXEC_IGNORE_DUPLICATES)
>   drm_exec_until_all_locked(exec) {
>   // Dummy GEM is the dummy GEM object I use to make the VM
>   // participate in the locking without having to teach
>   // drm_exec how to deal with raw dma_resv objects.
>   ret = drm_exec_lock_obj(exec, vm->dummy_gem);
>   drm_exec_retry_on_contention(exec);
>   if (ret)
>   return ret;
> 
>   // Could take the form of drm_gpuva_sm_[un]map_acquire_locks()
>   // helpers
>   for_each_sub_op() {
>   ret = drm_exec_lock_obj(exec, gem);
>   if (ret)
>   return ret;
>   }
>   }
> 
>   // each iteration of the loop is a call to the driver
>   // ->[re,un]map() hook
>   for_each_sub_op() {
>   ...
>   gem_[un]pin_locked(gem);

Just wanted to clarify that the pages have been pinned at VM_BIND job
creation time, so this gem_pin_locked() call is effectively just a
pin_count++, not the whole page allocation, which we don't want to
happen in a dma-signaling path.

>   drm_gpuva_[un]link(va);
>   ...
>   }
> 
>   drm_exec_fini(exec);


Re: [PATCH drm-next v6 02/13] drm: manager to keep track of GPUs VA mappings

2023-06-30 Thread Boris Brezillon
On Fri, 30 Jun 2023 10:02:52 +0200
Boris Brezillon  wrote:

> In practice, I don't expect things to deadlock, because the VM resv is
> not supposed to be taken outside the VM context and the locking order
> is always the same (VM lock first, and then each shared BO
> taken/released independently), but I'm not super thrilled by this
> nested lock, and I'm wondering if we shouldn't have a pass collecting
> locks in a drm_exec context first, and then have
> the operations executed. IOW, something like that:
> 
>   drm_exec_init(exec, DRM_EXEC_IGNORE_DUPLICATES)
>   drm_exec_until_all_locked(exec) {
>   // Dummy GEM is the dummy GEM object I use to make the VM
>   // participate in the locking without having to teach
>   // drm_exec how to deal with raw dma_resv objects.
>   ret = drm_exec_lock_obj(exec, vm->dummy_gem);
>   drm_exec_retry_on_contention(exec);
>   if (ret)
>   return ret;
> 
>   // Could take the form of drm_gpuva_sm_[un]map_acquire_locks()
>   // helpers

Nevermind, I implemented a driver specific acquire_op_locks(), and it's
fairly simple with the gpuva iter (we just have to iterate over all VAs
covered by the operation range and call drm_exec_lock_obj() on the GEM
attached to these VAs), so it's probably not worth providing a generic
helper for that.
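
For reference, such a driver-specific helper might look roughly like the
sketch below (the range iterator name and exact gpuva-manager API may
differ from the version under discussion):

static int acquire_op_locks(struct drm_exec *exec,
                            struct drm_gpuva_manager *mgr,
                            u64 addr, u64 range)
{
        struct drm_gpuva *va;
        int ret;

        /* Lock the GEM object backing every VA that overlaps the
         * [addr, addr + range) operation range.
         */
        drm_gpuva_for_each_va_range(va, mgr, addr, addr + range) {
                ret = drm_exec_lock_obj(exec, va->gem.obj);
                if (ret)
                        return ret;
        }

        return 0;
}

The caller would run this from inside drm_exec_until_all_locked() and use
drm_exec_retry_on_contention() to restart on contention, as in the earlier
examples.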


[PATCH] drm/managed: Define drmm_mutex_init() as a macro to fix lockdep

2023-05-19 Thread Boris Brezillon
drmm_mutex_init() needs to be defined as a macro if we want
lockdep to classify locks properly. If we don't do that, all locks
will be considered as belonging to the same lock class, leading to
false positive deadlock reports.

Signed-off-by: Boris Brezillon 
Reported-by: Sarah Walker 
---
 drivers/gpu/drm/drm_managed.c | 26 --
 include/drm/drm_managed.h | 30 +-
 2 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/drm_managed.c b/drivers/gpu/drm/drm_managed.c
index 4cf214de50c4..71c49819a7a2 100644
--- a/drivers/gpu/drm/drm_managed.c
+++ b/drivers/gpu/drm/drm_managed.c
@@ -263,29 +263,3 @@ void drmm_kfree(struct drm_device *dev, void *data)
free_dr(dr_match);
 }
 EXPORT_SYMBOL(drmm_kfree);
-
-static void drmm_mutex_release(struct drm_device *dev, void *res)
-{
-   struct mutex *lock = res;
-
-   mutex_destroy(lock);
-}
-
-/**
- * drmm_mutex_init - &drm_device-managed mutex_init()
- * @dev: DRM device
- * @lock: lock to be initialized
- *
- * Returns:
- * 0 on success, or a negative errno code otherwise.
- *
- * This is a &drm_device-managed version of mutex_init(). The initialized
- * lock is automatically destroyed on the final drm_dev_put().
- */
-int drmm_mutex_init(struct drm_device *dev, struct mutex *lock)
-{
-   mutex_init(lock);
-
-   return drmm_add_action_or_reset(dev, drmm_mutex_release, lock);
-}
-EXPORT_SYMBOL(drmm_mutex_init);
diff --git a/include/drm/drm_managed.h b/include/drm/drm_managed.h
index 359883942612..87ffb92a16ba 100644
--- a/include/drm/drm_managed.h
+++ b/include/drm/drm_managed.h
@@ -105,6 +105,34 @@ char *drmm_kstrdup(struct drm_device *dev, const char *s, 
gfp_t gfp);
 
 void drmm_kfree(struct drm_device *dev, void *data);
 
-int drmm_mutex_init(struct drm_device *dev, struct mutex *lock);
+/* Private function, don't use. */
+static inline void __drmm_mutex_release(struct drm_device *dev, void *res)
+{
+   struct mutex *lock = res;
+
+   mutex_destroy(lock);
+}
+
+/**
+ * drmm_mutex_init - &drm_device-managed mutex_init()
+ * @dev: DRM device
+ * @lock: lock to be initialized
+ *
+ * Returns:
+ * 0 on success, or a negative errno code otherwise.
+ *
+ * This is a &drm_device-managed version of mutex_init(). The initialized
+ * lock is automatically destroyed on the final drm_dev_put().
+ *
+ * This needs to be defined as a macro to let lockdep classify locks
+ * properly. If we don't do that, all locks will be considered as
+ * belonging to the same lock class, leading to false positive lockdep
+ * reports.
+ */
+#define drmm_mutex_init(dev, lock) \
+   ({\
+   mutex_init(lock); \
+   drmm_add_action_or_reset(dev, __drmm_mutex_release, lock); \
+   })
 
 #endif
-- 
2.40.1
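
To illustrate the lockdep point, here is a hypothetical driver using the
macro version (driver and lock names are invented; only the per-call-site
classification matters):

struct my_device {
        struct drm_device drm;
        struct mutex sched_lock;
        struct mutex mmu_lock;
};

static int my_device_init_locks(struct my_device *mdev)
{
        int ret;

        /* With the macro form, mutex_init() is expanded at each call site,
         * so sched_lock and mmu_lock get distinct lockdep classes and
         * nesting one inside the other no longer produces a false positive.
         */
        ret = drmm_mutex_init(&mdev->drm, &mdev->sched_lock);
        if (ret)
                return ret;

        return drmm_mutex_init(&mdev->drm, &mdev->mmu_lock);
}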



Re: [PATCH v2] drm: fix drmm_mutex_init()

2023-05-19 Thread Boris Brezillon
On Fri, 19 May 2023 10:07:33 +0100
Matthew Auld  wrote:

> In mutex_init() lockdep identifies a lock by defining a special static
> key for each lock class. However if we wrap the macro in a function,
> like in drmm_mutex_init(), we end up generating:
> 
> int drmm_mutex_init(struct drm_device *dev, struct mutex *lock)
> {
>   static struct lock_class_key __key;
> 
>   __mutex_init((lock), "lock", &__key);
>   
> }
> 
> The static __key here is what lockdep uses to identify the lock class,
> however since this is just a normal function the key here will be
> created once, where all callers then use the same key. In effect the
> mutex->depmap.key will be the same pointer for different
> drmm_mutex_init() callers. This then results in impossible lockdep
> splats since lockdep thinks completely unrelated locks are the same lock
> class.
> 
> To fix this turn drmm_mutex_init() into a macro such that it generates a
> different "static struct lock_class_key __key" for each invocation,
> which looks to be inline with what mutex_init() wants.
> 
> v2:
>   - Revamp the commit message with clearer explanation of the issue.
>   - Rather export __drmm_mutex_release() than static inline.
> 
> Reported-by: Thomas Hellström 
> Reported-by: Sarah Walker 
> Fixes: e13f13e039dc ("drm: Add DRM-managed mutex_init()")
> Cc: Stanislaw Gruszka 
> Cc: Boris Brezillon 

Reviewed-by: Boris Brezillon 

> Cc: Thomas Zimmermann 
> Cc: Jocelyn Falempe 
> Cc: Daniel Vetter 
> Cc: dri-devel@lists.freedesktop.org
> Signed-off-by: Matthew Auld 
> ---
>  drivers/gpu/drm/drm_managed.c | 22 ++
>  include/drm/drm_managed.h | 18 +-
>  2 files changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_managed.c b/drivers/gpu/drm/drm_managed.c
> index 4cf214de50c4..c21c3f623033 100644
> --- a/drivers/gpu/drm/drm_managed.c
> +++ b/drivers/gpu/drm/drm_managed.c
> @@ -264,28 +264,10 @@ void drmm_kfree(struct drm_device *dev, void *data)
>  }
>  EXPORT_SYMBOL(drmm_kfree);
>  
> -static void drmm_mutex_release(struct drm_device *dev, void *res)
> +void __drmm_mutex_release(struct drm_device *dev, void *res)
>  {
>   struct mutex *lock = res;
>  
>   mutex_destroy(lock);
>  }
> -
> -/**
> - * drmm_mutex_init - &drm_device-managed mutex_init()
> - * @dev: DRM device
> - * @lock: lock to be initialized
> - *
> - * Returns:
> - * 0 on success, or a negative errno code otherwise.
> - *
> - * This is a &drm_device-managed version of mutex_init(). The initialized
> - * lock is automatically destroyed on the final drm_dev_put().
> - */
> -int drmm_mutex_init(struct drm_device *dev, struct mutex *lock)
> -{
> - mutex_init(lock);
> -
> - return drmm_add_action_or_reset(dev, drmm_mutex_release, lock);
> -}
> -EXPORT_SYMBOL(drmm_mutex_init);
> +EXPORT_SYMBOL(__drmm_mutex_release);
> diff --git a/include/drm/drm_managed.h b/include/drm/drm_managed.h
> index 359883942612..ad08f834af40 100644
> --- a/include/drm/drm_managed.h
> +++ b/include/drm/drm_managed.h
> @@ -105,6 +105,22 @@ char *drmm_kstrdup(struct drm_device *dev, const char 
> *s, gfp_t gfp);
>  
>  void drmm_kfree(struct drm_device *dev, void *data);
>  
> -int drmm_mutex_init(struct drm_device *dev, struct mutex *lock);
> +void __drmm_mutex_release(struct drm_device *dev, void *res);
> +
> +/**
> + * drmm_mutex_init - &drm_device-managed mutex_init()
> + * @dev: DRM device
> + * @lock: lock to be initialized
> + *
> + * Returns:
> + * 0 on success, or a negative errno code otherwise.
> + *
> + * This is a &drm_device-managed version of mutex_init(). The initialized
> + * lock is automatically destroyed on the final drm_dev_put().
> + */
> +#define drmm_mutex_init(dev, lock) ({ \
> + mutex_init(lock);\
> + drmm_add_action_or_reset(dev, __drmm_mutex_release, lock);   \
> +})\
>  
>  #endif



Re: [PATCH] drm/managed: Define drmm_mutex_init() as a macro to fix lockdep

2023-05-19 Thread Boris Brezillon
On Fri, 19 May 2023 10:05:27 +0100
Matthew Auld  wrote:

> On Fri, 19 May 2023 at 09:55, Boris Brezillon
>  wrote:
> >
> > drmm_mutex_init() needs to be defined as a macro if we want
> > lockdep to classify locks properly. If we don't do that, all locks
> > will be considered as belonging to the same lock class, leading to
> > false positive deadlock reports.
> >
> > Signed-off-by: Boris Brezillon 
> > Reported-by: Sarah Walker   
> 
> Yeah, we also encountered the same issue. Patch is here:
> https://patchwork.freedesktop.org/patch/537605/?series=117891&rev=2

Cool! Added my R-b to this patch.


Re: [PATCH v2] drm/panfrost: Sync IRQ by job's timeout handler

2023-07-23 Thread Boris Brezillon
On Sun, 23 Jul 2023 03:01:42 +0300
Dmitry Osipenko  wrote:

> The Panfrost IRQ handler may get stuck for a long time; for example, this
> happens when there is a bad HDMI connection and the HDMI handler takes a long
> time to finish processing, holding up Panfrost. Make Panfrost's job timeout
> handler sync the IRQ before checking the fence signal status in order to
> prevent spurious job timeouts due to slow IRQ processing.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
> 
> Changelog:
> 
> v2: - Moved synchronize_irq() after first signal-check to avoid unnecessary
>   blocking on syncing.
> 
> - Added warn message about high interrupt latency.
> 
>  drivers/gpu/drm/panfrost/panfrost_job.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> b/drivers/gpu/drm/panfrost/panfrost_job.c
> index dbc597ab46fb..a7663d7847a2 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -720,6 +720,13 @@ static enum drm_gpu_sched_stat 
> panfrost_job_timedout(struct drm_sched_job
>   if (dma_fence_is_signaled(job->done_fence))
>   return DRM_GPU_SCHED_STAT_NOMINAL;
>  
> + synchronize_irq(pfdev->js->irq);

Can we add a comment here explaining why we're doing that?

> +
> + if (dma_fence_is_signaled(job->done_fence)) {
> + dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
> + return DRM_GPU_SCHED_STAT_NOMINAL;
> + }
> +
>   dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, 
> status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>   js,
>   job_read(pfdev, JS_CONFIG(js)),
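
A sketch of the kind of comment being requested above (the rationale is
taken from the commit message, so treat the wording as an assumption):

        /* The IRQ handler may take a long time to run, e.g. when it is
         * held up by other slow interrupt processing. Sync with it before
         * re-checking the fence, so a job that actually completed is not
         * reported as a GPU timeout.
         */
        synchronize_irq(pfdev->js->irq);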



Re: [drm-misc:for-linux-next 2/2] drivers/gpu/drm/drm_debugfs.c:212:33: sparse: sparse: non size-preserving pointer to integer cast

2023-07-24 Thread Boris Brezillon
On Fri, 21 Jul 2023 02:06:16 +0800
kernel test robot  wrote:

> tree:   git://anongit.freedesktop.org/drm/drm-misc for-linux-next
> head:   c7a472297169156252a50d76965eb36b081186e2
> commit: 4f66feeab173bd73e71028b8c2e1dcea07e32dd5 [2/2] drm: debugfs: provide 
> infrastructure to dump a DRM GPU VA space
> config: i386-randconfig-r092-20230720 
> (https://download.01.org/0day-ci/archive/20230721/202307210230.t2onm5g0-...@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce: 
> (https://download.01.org/0day-ci/archive/20230721/202307210230.t2onm5g0-...@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version 
> of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot 
> | Closes: 
> https://lore.kernel.org/oe-kbuild-all/202307210230.t2onm5g0-...@intel.com/
> 
> sparse warnings: (new ones prefixed by >>)
> >> drivers/gpu/drm/drm_debugfs.c:212:33: sparse: sparse: non size-preserving 
> >> pointer to integer cast  
> 
> vim +212 drivers/gpu/drm/drm_debugfs.c
> 
>178
>179/**
>180 * drm_debugfs_gpuva_info - dump the given DRM GPU VA space
>181 * @m: pointer to the &seq_file to write
>182 * @mgr: the &drm_gpuva_manager representing the GPU VA space
>183 *
>184 * Dumps the GPU VA mappings of a given DRM GPU VA manager.
>185 *
>186 * For each DRM GPU VA space drivers should call this function 
> from their
>187 * &drm_info_list's show callback.
>188 *
>189 * Returns: 0 on success, -ENODEV if the &mgr is not initialized
>190 */
>191int drm_debugfs_gpuva_info(struct seq_file *m,
>192   struct drm_gpuva_manager *mgr)
>193{
>194struct drm_gpuva *va, *kva = &mgr->kernel_alloc_node;
>195
>196if (!mgr->name)
>197return -ENODEV;
>198
>199seq_printf(m, "DRM GPU VA space (%s) 
> [0x%016llx;0x%016llx]\n",
>200   mgr->name, mgr->mm_start, mgr->mm_start + 
> mgr->mm_range);
>201seq_printf(m, "Kernel reserved node 
> [0x%016llx;0x%016llx]\n",
>202   kva->va.addr, kva->va.addr + kva->va.range);
>203seq_puts(m, "\n");
>204seq_puts(m, " VAs | start  | range  
> | end| object | object offset\n");
>205seq_puts(m, 
> "-\n");
>206drm_gpuva_for_each_va(va, mgr) {
>207if (unlikely(va == kva))
>208continue;
>209
>210seq_printf(m, " | 0x%016llx | 0x%016llx | 
> 0x%016llx | 0x%016llx | 0x%016llx\n",
>211   va->va.addr, va->va.range, 
> va->va.addr + va->va.range,
>  > 212   (u64)va->gem.obj, va->gem.offset);  

Oops, I didn't notice it when reviewing. You're leaking a kernel address
to user space here. You should probably use %p to print the GEM object
address, and add `no_hash_pointers` to your cmdline when you want to
debug things.

>213}
>214
>215return 0;
>216}
>217EXPORT_SYMBOL(drm_debugfs_gpuva_info);
>218
> 



[PATCH] drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()

2023-07-24 Thread Boris Brezillon
The dma-buf backend is supposed to provide its own vm_ops, but some
implementations just have nothing special to do and leave vm_ops
untouched, probably expecting this field to be zero initialized (this
is the case with the system_heap implementation for instance).
Let's reset vma->vm_ops to NULL to keep things working with these
implementations.

Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported dma-buf")
Cc: 
Cc: Daniel Vetter 
Reported-by: Roman Stratiienko 
Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/drm_gem_shmem_helper.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index 4ea6507a77e5..baaf0e0feb06 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -623,7 +623,13 @@ int drm_gem_shmem_mmap(struct drm_gem_shmem_object *shmem, 
struct vm_area_struct
int ret;
 
if (obj->import_attach) {
+   /* Reset both vm_ops and vm_private_data, so we don't end up 
with
+* vm_ops pointing to our implementation if the dma-buf backend
+* doesn't set those fields.
+*/
vma->vm_private_data = NULL;
+   vma->vm_ops = NULL;
+
ret = dma_buf_mmap(obj->dma_buf, vma, 0);
 
/* Drop the reference drm_gem_mmap_obj() acquired.*/
-- 
2.41.0



Re: [PATCH] drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()

2023-07-25 Thread Boris Brezillon
On Mon, 24 Jul 2023 13:26:10 +0200
Boris Brezillon  wrote:

> The dma-buf backend is supposed to provide its own vm_ops, but some
> implementations just have nothing special to do and leave vm_ops
> untouched, probably expecting this field to be zero initialized (this
> is the case with the system_heap implementation for instance).
> Let's reset vma->vm_ops to NULL to keep things working with these
> implementations.
> 
> Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported dma-buf")
> Cc: 
> Cc: Daniel Vetter 
> Reported-by: Roman Stratiienko 

Adding Roman's tested-by coming from [1]

Tested-by: Roman Stratiienko 

[1]https://gitlab.freedesktop.org/mesa/mesa/-/issues/9416#note_2013722

> Signed-off-by: Boris Brezillon 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 4ea6507a77e5..baaf0e0feb06 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -623,7 +623,13 @@ int drm_gem_shmem_mmap(struct drm_gem_shmem_object 
> *shmem, struct vm_area_struct
>   int ret;
>  
>   if (obj->import_attach) {
> + /* Reset both vm_ops and vm_private_data, so we don't end up 
> with
> +  * vm_ops pointing to our implementation if the dma-buf backend
> +  * doesn't set those fields.
> +  */
>   vma->vm_private_data = NULL;
> + vma->vm_ops = NULL;
> +
>   ret = dma_buf_mmap(obj->dma_buf, vma, 0);
>  
>   /* Drop the reference drm_gem_mmap_obj() acquired.*/



Re: [PATCH v14 01/12] drm/shmem-helper: Factor out pages alloc/release from drm_gem_shmem_get/put_pages()

2023-07-25 Thread Boris Brezillon
On Sun, 23 Jul 2023 02:47:35 +0300
Dmitry Osipenko  wrote:

> Factor out the page allocation from drm_gem_shmem_get_pages() into a
> drm_gem_shmem_acquire_pages() function, and do the same for the put_pages()
> path, in preparation for the addition of shrinker support to drm-shmem.
> 
> Once the shrinker is added, pages_use_count > 0 will no longer determine
> whether pages are pinned, because pages could be swapped out by the shrinker
> while pages_use_count remains greater than 0. We will add a
> new pages_pin_count field in a later patch.
> 
> The new common drm_gem_shmem_acquire/release_pages() will be used by
> shrinker code for performing the page swapping.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 65 --
>  1 file changed, 52 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index a783d2245599..267153853e2c 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -165,21 +165,26 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object 
> *shmem)
>  }
>  EXPORT_SYMBOL_GPL(drm_gem_shmem_free);
>  
> -static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
> +static int
> +drm_gem_shmem_acquire_pages(struct drm_gem_shmem_object *shmem)
>  {
>   struct drm_gem_object *obj = &shmem->base;
>   struct page **pages;
>  
>   dma_resv_assert_held(shmem->base.resv);

Not directly related to this patch, but can we start using _locked
suffixes for any function that's expecting the dma-resv lock to be held?

>  
> - if (shmem->pages_use_count++ > 0)
> - return 0;
> + if (shmem->madv < 0) {
> + drm_WARN_ON(obj->dev, shmem->pages);
> + return -ENOMEM;
> + }
> +
> + if (drm_WARN_ON(obj->dev, !shmem->pages_use_count))
> + return -EINVAL;
>  
>   pages = drm_gem_get_pages(obj);
>   if (IS_ERR(pages)) {
>   drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
>   PTR_ERR(pages));
> - shmem->pages_use_count = 0;
>   return PTR_ERR(pages);
>   }
>  
> @@ -198,6 +203,48 @@ static int drm_gem_shmem_get_pages(struct 
> drm_gem_shmem_object *shmem)
>   return 0;
>  }
>  
> +static int drm_gem_shmem_get_pages(struct drm_gem_shmem_object *shmem)
> +{
> + int err;
> +
> + dma_resv_assert_held(shmem->base.resv);
> +
> + if (shmem->madv < 0)
> + return -ENOMEM;
> +
> + if (shmem->pages_use_count++ > 0)
> + return 0;
> +
> + err = drm_gem_shmem_acquire_pages(shmem);
> + if (err)
> + goto err_zero_use;
> +
> + return 0;
> +
> +err_zero_use:
> + shmem->pages_use_count = 0;
> +
> + return err;
> +}
> +
> +static void
> +drm_gem_shmem_release_pages(struct drm_gem_shmem_object *shmem)
> +{
> + struct drm_gem_object *obj = &shmem->base;
> +
> + dma_resv_assert_held(shmem->base.resv);
> +
> +#ifdef CONFIG_X86
> + if (shmem->map_wc)
> + set_pages_array_wb(shmem->pages, obj->size >> PAGE_SHIFT);
> +#endif
> +
> + drm_gem_put_pages(obj, shmem->pages,
> +   shmem->pages_mark_dirty_on_put,
> +   shmem->pages_mark_accessed_on_put);
> + shmem->pages = NULL;
> +}
> +
>  /*
>   * drm_gem_shmem_put_pages - Decrease use count on the backing pages for a 
> shmem GEM object
>   * @shmem: shmem GEM object
> @@ -216,15 +263,7 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object 
> *shmem)
>   if (--shmem->pages_use_count > 0)
>   return;
>  
> -#ifdef CONFIG_X86
> - if (shmem->map_wc)
> - set_pages_array_wb(shmem->pages, obj->size >> PAGE_SHIFT);
> -#endif
> -
> - drm_gem_put_pages(obj, shmem->pages,
> -   shmem->pages_mark_dirty_on_put,
> -   shmem->pages_mark_accessed_on_put);
> - shmem->pages = NULL;
> + drm_gem_shmem_release_pages(shmem);
>  }
>  EXPORT_SYMBOL(drm_gem_shmem_put_pages);
>  



Re: [PATCH v14 02/12] drm/shmem-helper: Add pages_pin_count field

2023-07-25 Thread Boris Brezillon
On Sun, 23 Jul 2023 02:47:36 +0300
Dmitry Osipenko  wrote:

> Add a new pages_pin_count field to struct drm_gem_shmem_object that will
> determine whether pages are evictable by the memory shrinker. The pages will
> be evictable only when pages_pin_count=0. This patch prepares code for
> addition of the memory shrinker that will utilize the new field.
> 
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 9 +
>  include/drm/drm_gem_shmem_helper.h | 9 +
>  2 files changed, 18 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 267153853e2c..42ba201dda50 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -274,15 +274,24 @@ static int drm_gem_shmem_pin_locked(struct 
> drm_gem_shmem_object *shmem)
>   dma_resv_assert_held(shmem->base.resv);
>  
>   ret = drm_gem_shmem_get_pages(shmem);
> + if (!ret)
> + shmem->pages_pin_count++;
>  
>   return ret;
>  }
>  
>  static void drm_gem_shmem_unpin_locked(struct drm_gem_shmem_object *shmem)
>  {
> + struct drm_gem_object *obj = &shmem->base;
> +
>   dma_resv_assert_held(shmem->base.resv);
>  
> + if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_pin_count))
> + return;
> +
>   drm_gem_shmem_put_pages(shmem);
> +
> + shmem->pages_pin_count--;
>  }
>  
>  /**
> diff --git a/include/drm/drm_gem_shmem_helper.h 
> b/include/drm/drm_gem_shmem_helper.h
> index bf0c31aa8fbe..7111f5743006 100644
> --- a/include/drm/drm_gem_shmem_helper.h
> +++ b/include/drm/drm_gem_shmem_helper.h
> @@ -39,6 +39,15 @@ struct drm_gem_shmem_object {
>*/
>   unsigned int pages_use_count;
>  
> + /**
> +  * @pages_pin_count:
> +  *
> +  * Reference count on the pinned pages table.
> +  * The pages allowed to be evicted by memory shrinker
> +  * only when the count is zero.
> +  */
> + unsigned int pages_pin_count;

Can we make it an atomic_t, so we can avoid taking the lock when the
GEM has already been pinned. That's something I need to be able to grab
a pin-ref in a path where the GEM resv lock is already held[1]. We could
of course expose the locked version, but in my case, I want to enforce
the fact the GEM has been pinned before the drm_gem_shmem_pin() call in
the section protected by the resv lock, so catching a "refcount 0 -> 1"
situation would be useful. Beside, using an atomic to avoid the
lock/unlock dance when refcount > 1 might be beneficial to everyone.

[1]https://gitlab.freedesktop.org/bbrezillon/linux/-/commit/4420fa0d5768ebdc35b34d58d4ae5fad9fbb93f9

> +
>   /**
>* @madv: State for madvise
>*



Re: [PATCH v14 10/12] drm/shmem-helper: Refactor locked/unlocked functions

2023-07-25 Thread Boris Brezillon
On Sun, 23 Jul 2023 02:47:44 +0300
Dmitry Osipenko  wrote:

> Add locked/unlocked postfixes to drm-shmem function names to make clear
> where reservation lock is taken and where not.

Uh, ignore my comment on patch 1 then...

> Add more common helpers to drm_gem_shmem_helper.h

I'd do the renaming and exporting in separate patches.

> 
> Suggested-by: Boris Brezillon 
> Signed-off-by: Dmitry Osipenko 


Re: [PATCH v14 12/12] drm/gem: Add _unlocked postfix to drm_gem_pin/unpin()

2023-07-25 Thread Boris Brezillon
On Sun, 23 Jul 2023 02:47:46 +0300
Dmitry Osipenko  wrote:

> Make clear that drm_gem_pin/unpin() functions take reservation lock by
> adding _unlocked postfix to the function names.
> 
> Suggested-by: Boris Brezillon 
> Signed-off-by: Dmitry Osipenko 

I'm still a bit confused by the fact we sometimes use the
xxx[_locked]() pattern (version without the _locked suffix takes the
lock) and other times the xxx[_unlocked]() pattern (version with the
_unlocked suffix takes the lock). It'd be good to choose one pattern and
stick to it, at least for all core functions...

> ---
>  drivers/gpu/drm/drm_gem.c  | 4 ++--
>  drivers/gpu/drm/drm_internal.h | 4 ++--
>  drivers/gpu/drm/drm_prime.c| 4 ++--
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
> index c18686f434d4..805eb0d85297 100644
> --- a/drivers/gpu/drm/drm_gem.c
> +++ b/drivers/gpu/drm/drm_gem.c
> @@ -1146,7 +1146,7 @@ void drm_gem_print_info(struct drm_printer *p, unsigned 
> int indent,
>   obj->funcs->print_info(p, indent, obj);
>  }
>  
> -int drm_gem_pin(struct drm_gem_object *obj)
> +int drm_gem_pin_unlocked(struct drm_gem_object *obj)
>  {
>   if (obj->funcs->pin)
>   return obj->funcs->pin(obj);
> @@ -1154,7 +1154,7 @@ int drm_gem_pin(struct drm_gem_object *obj)
>   return 0;
>  }
>  
> -void drm_gem_unpin(struct drm_gem_object *obj)
> +void drm_gem_unpin_unlocked(struct drm_gem_object *obj)
>  {
>   if (obj->funcs->unpin)
>   obj->funcs->unpin(obj);
> diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
> index d7e023bbb0d5..80f5bd1da8fd 100644
> --- a/drivers/gpu/drm/drm_internal.h
> +++ b/drivers/gpu/drm/drm_internal.h
> @@ -173,8 +173,8 @@ void drm_gem_release(struct drm_device *dev, struct 
> drm_file *file_private);
>  void drm_gem_print_info(struct drm_printer *p, unsigned int indent,
>   const struct drm_gem_object *obj);
>  
> -int drm_gem_pin(struct drm_gem_object *obj);
> -void drm_gem_unpin(struct drm_gem_object *obj);
> +int drm_gem_pin_unlocked(struct drm_gem_object *obj);
> +void drm_gem_unpin_unlocked(struct drm_gem_object *obj);
>  int drm_gem_vmap(struct drm_gem_object *obj, struct iosys_map *map);
>  void drm_gem_vunmap(struct drm_gem_object *obj, struct iosys_map *map);
>  
> diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
> index 63b709a67471..8145b49e95ff 100644
> --- a/drivers/gpu/drm/drm_prime.c
> +++ b/drivers/gpu/drm/drm_prime.c
> @@ -583,7 +583,7 @@ int drm_gem_map_attach(struct dma_buf *dma_buf,
>   if (!obj->funcs->get_sg_table)
>   return -ENOSYS;
>  
> - return drm_gem_pin(obj);
> + return drm_gem_pin_unlocked(obj);
>  }
>  EXPORT_SYMBOL(drm_gem_map_attach);
>  
> @@ -601,7 +601,7 @@ void drm_gem_map_detach(struct dma_buf *dma_buf,
>  {
>   struct drm_gem_object *obj = dma_buf->priv;
>  
> - drm_gem_unpin(obj);
> + drm_gem_unpin_unlocked(obj);
>  }
>  EXPORT_SYMBOL(drm_gem_map_detach);
>  



Re: [PATCH v14 10/12] drm/shmem-helper: Refactor locked/unlocked functions

2023-07-25 Thread Boris Brezillon
On Tue, 25 Jul 2023 09:47:02 +0200
Boris Brezillon  wrote:

> On Sun, 23 Jul 2023 02:47:44 +0300
> Dmitry Osipenko  wrote:
> 
> > Add locked/unlocked postfixes to drm-shmem function names to make clear
> > where reservation lock is taken and where not.  
> 
> Uh, ignore my comment on patch 1 then...
> 
> > Add more common helpers to drm_gem_shmem_helper.h  
> 
> I'd do the renaming and exporting in separate patches.

Actually, I'd refrain from exporting functions until someone needs
them, as you rightfully pointed out in your previous reply.

> 
> > 
> > Suggested-by: Boris Brezillon 
> > Signed-off-by: Dmitry Osipenko   



Re: [PATCH v14 02/12] drm/shmem-helper: Add pages_pin_count field

2023-07-25 Thread Boris Brezillon
On Tue, 25 Jul 2023 09:27:09 +0200
Boris Brezillon  wrote:

> On Sun, 23 Jul 2023 02:47:36 +0300
> Dmitry Osipenko  wrote:
> 
> > Add a new pages_pin_count field to struct drm_gem_shmem_object that will
> > determine whether pages are evictable by the memory shrinker. The pages will
> > be evictable only when pages_pin_count=0. This patch prepares code for
> > addition of the memory shrinker that will utilize the new field.
> > 
> > Signed-off-by: Dmitry Osipenko 
> > ---
> >  drivers/gpu/drm/drm_gem_shmem_helper.c | 9 +
> >  include/drm/drm_gem_shmem_helper.h | 9 +
> >  2 files changed, 18 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> > b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 267153853e2c..42ba201dda50 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -274,15 +274,24 @@ static int drm_gem_shmem_pin_locked(struct 
> > drm_gem_shmem_object *shmem)
> > dma_resv_assert_held(shmem->base.resv);
> >  
> > ret = drm_gem_shmem_get_pages(shmem);
> > +   if (!ret)
> > +   shmem->pages_pin_count++;
> >  
> > return ret;
> >  }
> >  
> >  static void drm_gem_shmem_unpin_locked(struct drm_gem_shmem_object *shmem)
> >  {
> > +   struct drm_gem_object *obj = &shmem->base;
> > +
> > dma_resv_assert_held(shmem->base.resv);
> >  
> > +   if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_pin_count))
> > +   return;
> > +
> > drm_gem_shmem_put_pages(shmem);
> > +
> > +   shmem->pages_pin_count--;
> >  }
> >  
> >  /**
> > diff --git a/include/drm/drm_gem_shmem_helper.h 
> > b/include/drm/drm_gem_shmem_helper.h
> > index bf0c31aa8fbe..7111f5743006 100644
> > --- a/include/drm/drm_gem_shmem_helper.h
> > +++ b/include/drm/drm_gem_shmem_helper.h
> > @@ -39,6 +39,15 @@ struct drm_gem_shmem_object {
> >  */
> > unsigned int pages_use_count;
> >  
> > +   /**
> > +* @pages_pin_count:
> > +*
> > +* Reference count on the pinned pages table.
> > +* The pages allowed to be evicted by memory shrinker
> > +* only when the count is zero.
> > +*/
> > +   unsigned int pages_pin_count;  
> 
> Can we make it an atomic_t, so we can avoid taking the lock when the
> GEM has already been pinned. That's something I need to be able to grab
> a pin-ref in a path where the GEM resv lock is already held[1]. We could
> of course expose the locked version,

My bad, that's actually not true. The problem is not that I call
drm_gem_shmem_pin() with the resv lock already held, but that I call
drm_gem_shmem_pin() in a dma-signaling path where I'm not allowed to
take a resv lock. I know for sure pin_count > 0, because all GEM objects
mapped to a VM have their memory pinned right now, and this should
stand until we decide to add support for live-GEM eviction, at which
point we'll probably have a way to detect when a GEM is evicted, and
avoid calling drm_gem_shmem_pin() on it.

TLDR; I can't trade the atomic_t for a drm_gem_shmem_pin_locked(),
because that wouldn't solve my problem. The other solution would be to
add an atomic_t at the driver-GEM level, and only call
drm_gem_shmem_[un]pin() on 0 <-> 1 transitions, but I thought using an
atomic at the GEM-shmem level, to avoid locking when we can, would be
beneficial to the rest of the eco-system. Let me know if that's not an
option, and I'll go back to the driver-specific atomic_t.

> but in my case, I want to enforce
> the fact the GEM has been pinned before the drm_gem_shmem_pin() call in
> the section protected by the resv lock, so catching a "refcount 0 -> 1"
> situation would be useful. Beside, using an atomic to avoid the
> lock/unlock dance when refcount > 1 might be beneficial to everyone.
> 
> [1]https://gitlab.freedesktop.org/bbrezillon/linux/-/commit/4420fa0d5768ebdc35b34d58d4ae5fad9fbb93f9
> 
> > +
> > /**
> >  * @madv: State for madvise
> >  *  
> 



Re: [PATCH] drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()

2023-07-26 Thread Boris Brezillon
On Tue, 25 Jul 2023 20:50:43 +0200
Thomas Zimmermann  wrote:

> Hi
> 
> Am 24.07.23 um 13:26 schrieb Boris Brezillon:
> > The dma-buf backend is supposed to provide its own vm_ops, but some
> > implementations just have nothing special to do and leave vm_ops
> > untouched, probably expecting this field to be zero initialized (this
> > is the case with the system_heap implementation for instance).
> > Let's reset vma->vm_ops to NULL to keep things working with these
> > implementations.  
> 
> Thanks for your patch. This bug could affect a number of GEM 
> implementations.

The one I found that is probably hit by the same problem is
exynos_drm_gem.c, but there might be others...

> Instead of fixing this individually, could we set the 
> fields conditionally at
> 
>  
> https://elixir.bootlin.com/linux/v6.4/source/drivers/gpu/drm/drm_gem.c#L1042
> 
> ?
> 
> Something like
> 
>if (!object->import_attach) {

I guess you meant the opposite: if (object->import_attach)

>  vma->priv =
>  vma->ops =
>}

I suspect it will break other drivers relying on the fact vma->vm_ops
is auto-magically assigned to obj->funcs->vm_ops, even for prime
buffers. The one I'm looking at right now is amdgpu: it has its own way
of mapping imported dma-bufs, and resetting vma->vm_ops to NULL means
the ttm layer will fall back to the default ttm_bo_vm_ops, which is not
what amdgpu wants.

AFAICT, etnaviv is in the same situation, though it's probably easier
to fix, given the open/close hooks for imported objects don't do much.

TLDR; yes, it'd be great to have this 'fix' moved to the core level, or
even have a dedicated path for dma-buf objects, but I fear it's going
to fall apart if we do that.

One option would be to add a dma_buf_vm_ops field to
drm_gem_object_funcs, add a
DRM_GEM_OBJ_FUNCS_SET_VM_OPS(vm_ops, dma_buf_vm_ops) macro that would
assign both dma_buf_vm_ops and vm_ops, patch all existing drivers
to use this macro (mechanical change where we assign both fields to the
same value, so we don't break anything, but don't fix broken
implementations either). Once this is in place, we can have the
following in drm_gem_mmap_obj():

vma->vm_ops = object->import_attach ?
  object->funcs->dma_buf_vm_ops :
  object->funcs->vm_ops;
vma->vm_private_data = vma->vm_ops ? obj : NULL;

And then we can specialize the shmem and exynos implementations
(actually, any implementation that's entirely deferring the mmap to the
dma-buf layer), so they explicitly set dma_buf_vm_ops to NULL.
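
Roughly, the proposal above could look like the sketch below (purely an
illustration: the dma_buf_vm_ops field and the macro are the idea floated
in this mail, not an existing drm_gem_object_funcs API, and the driver
symbols are invented):

#define DRM_GEM_OBJ_FUNCS_SET_VM_OPS(_vm_ops, _dma_buf_vm_ops) \
        .vm_ops = (_vm_ops),                                    \
        .dma_buf_vm_ops = (_dma_buf_vm_ops)

/* Drivers that fully defer imported-buffer mmap to the dma-buf exporter
 * (shmem, exynos, ...) would pass NULL for the dma-buf variant:
 */
static const struct drm_gem_object_funcs my_gem_funcs = {
        .free = my_gem_free,
        DRM_GEM_OBJ_FUNCS_SET_VM_OPS(&my_gem_vm_ops, NULL),
};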

Honestly, I'm not sure this is better than manually assigning
vma->vm_ops to NULL in the driver mmap function, but at least people
will have to consider it when they write their driver ('do I want
the same mmap behavior for dmabuf and !dmabuf?').

Anyway, I think this fix is worth applying, because it's self-contained
and easy to backport. We can discuss and sort out how we want to fix the
problem more generically later on.

> 
> plus a descriptive comment like the one you have in your patch.
> 
> Best regards
> Thomas
> 
> > 
> > Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported 
> > dma-buf")
> > Cc: 
> > Cc: Daniel Vetter 
> > Reported-by: Roman Stratiienko 
> > Signed-off-by: Boris Brezillon 
> > ---
> >   drivers/gpu/drm/drm_gem_shmem_helper.c | 6 ++
> >   1 file changed, 6 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
> > b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > index 4ea6507a77e5..baaf0e0feb06 100644
> > --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> > +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> > @@ -623,7 +623,13 @@ int drm_gem_shmem_mmap(struct drm_gem_shmem_object 
> > *shmem, struct vm_area_struct
> > int ret;
> >   
> > if (obj->import_attach) {
> > +   /* Reset both vm_ops and vm_private_data, so we don't end up 
> > with
> > +* vm_ops pointing to our implementation if the dma-buf backend
> > +* doesn't set those fields.
> > +*/
> > vma->vm_private_data = NULL;
> > +   vma->vm_ops = NULL;
> > +
> > ret = dma_buf_mmap(obj->dma_buf, vma, 0);
> >   
> > /* Drop the reference drm_gem_mmap_obj() acquired.*/  
> 



Re: [drm-misc:for-linux-next 2/2] drivers/gpu/drm/drm_debugfs.c:212:33: sparse: sparse: non size-preserving pointer to integer cast

2023-07-26 Thread Boris Brezillon
On Wed, 26 Jul 2023 00:25:36 +0200
Danilo Krummrich  wrote:

> On 7/24/23 09:27, Boris Brezillon wrote:
> > On Fri, 21 Jul 2023 02:06:16 +0800
> > kernel test robot  wrote:
> >   
> >> tree:   git://anongit.freedesktop.org/drm/drm-misc for-linux-next
> >> head:   c7a472297169156252a50d76965eb36b081186e2
> >> commit: 4f66feeab173bd73e71028b8c2e1dcea07e32dd5 [2/2] drm: debugfs: 
> >> provide infrastructure to dump a DRM GPU VA space
> >> config: i386-randconfig-r092-20230720 
> >> (https://download.01.org/0day-ci/archive/20230721/202307210230.t2onm5g0-...@intel.com/config)
> >> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> >> reproduce: 
> >> (https://download.01.org/0day-ci/archive/20230721/202307210230.t2onm5g0-...@intel.com/reproduce)
> >>
> >> If you fix the issue in a separate patch/commit (i.e. not just a new 
> >> version of
> >> the same patch/commit), kindly add following tags
> >> | Reported-by: kernel test robot 
> >> | Closes: 
> >> https://lore.kernel.org/oe-kbuild-all/202307210230.t2onm5g0-...@intel.com/
> >>
> >> sparse warnings: (new ones prefixed by >>)  
> >>>> drivers/gpu/drm/drm_debugfs.c:212:33: sparse: sparse: non 
> >>>> size-preserving pointer to integer cast  
> >>
> >> vim +212 drivers/gpu/drm/drm_debugfs.c
> >>
> >> 178
> >> 179/**
> >> 180 * drm_debugfs_gpuva_info - dump the given DRM GPU VA space
> >> 181 * @m: pointer to the &seq_file to write
> >> 182 * @mgr: the &drm_gpuva_manager representing the GPU VA space
> >> 183 *
> >> 184 * Dumps the GPU VA mappings of a given DRM GPU VA manager.
> >> 185 *
> >> 186 * For each DRM GPU VA space drivers should call this function 
> >> from their
> >> 187 * &drm_info_list's show callback.
> >> 188 *
> >> 189 * Returns: 0 on success, -ENODEV if the &mgr is not initialized
> >> 190 */
> >> 191int drm_debugfs_gpuva_info(struct seq_file *m,
> >> 192   struct drm_gpuva_manager *mgr)
> >> 193{
> >> 194struct drm_gpuva *va, *kva = &mgr->kernel_alloc_node;
> >> 195
> >> 196if (!mgr->name)
> >> 197return -ENODEV;
> >> 198
> >> 199seq_printf(m, "DRM GPU VA space (%s) 
> >> [0x%016llx;0x%016llx]\n",
> >> 200   mgr->name, mgr->mm_start, mgr->mm_start + 
> >> mgr->mm_range);
> >> 201seq_printf(m, "Kernel reserved node 
> >> [0x%016llx;0x%016llx]\n",
> >> 202   kva->va.addr, kva->va.addr + kva->va.range);
> >> 203seq_puts(m, "\n");
> >> 204seq_puts(m, " VAs | start  | range  
> >> | end| object | object offset\n");
> >> 205seq_puts(m, 
> >> "-\n");
> >> 206drm_gpuva_for_each_va(va, mgr) {
> >> 207if (unlikely(va == kva))
> >> 208continue;
> >> 209
> >> 210seq_printf(m, " | 0x%016llx | 0x%016llx | 
> >> 0x%016llx | 0x%016llx | 0x%016llx\n",
> >> 211   va->va.addr, va->va.range, 
> >> va->va.addr + va->va.range,  
> >>   > 212   (u64)va->gem.obj, va->gem.offset);  
> > 
> > Oops, I didn't notice it when reviewing. You're leaking a kernel address
> > to user space here. You should probably use %p to print the GEM object
> > address, and add `no_hash_pointers` to your cmdline when you want to
> > debug things.  
> 
> %p doesn't really work well in terms of formatting, plus for debugfs I 
> thought this might be fine. I could maybe use ptr_to_hashval(), but then 
> 'no_hash_pointers' wouldn't do anything for it.

Right, it's probably fine for debugfs indeed. Guess the uintptr_t cast
Steve suggested is the right fix then.
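
For reference, the suggested cast would look something like this (a sketch
of the fix, keeping the existing format string):

        seq_printf(m, " | 0x%016llx | 0x%016llx | 0x%016llx | 0x%016llx | 0x%016llx\n",
                   va->va.addr, va->va.range, va->va.addr + va->va.range,
                   (u64)(uintptr_t)va->gem.obj, va->gem.offset);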


Re: [PATCH drm-misc-next v8 01/12] drm: manager to keep track of GPUs VA mappings

2023-07-28 Thread Boris Brezillon
On Fri, 28 Jul 2023 13:31:36 +0200
Maxime Ripard  wrote:

> Hi Danilo,
> 
> On Thu, Jul 20, 2023 at 02:14:22AM +0200, Danilo Krummrich wrote:
> > Add infrastructure to keep track of GPU virtual address (VA) mappings
> > with a decicated VA space manager implementation.
> > 
> > New UAPIs, motivated by Vulkan sparse memory bindings graphics drivers
> > start implementing, allow userspace applications to request multiple and
> > arbitrary GPU VA mappings of buffer objects. The DRM GPU VA manager is
> > intended to serve the following purposes in this context.
> > 
> > 1) Provide infrastructure to track GPU VA allocations and mappings,
> >making using an interval tree (RB-tree).
> > 
> > 2) Generically connect GPU VA mappings to their backing buffers, in
> >particular DRM GEM objects.
> > 
> > 3) Provide a common implementation to perform more complex mapping
> >operations on the GPU VA space. In particular splitting and merging
> >of GPU VA mappings, e.g. for intersecting mapping requests or partial
> >    unmap requests.
> > 
> > Acked-by: Thomas Hellström 
> > Acked-by: Matthew Brost 
> > Reviewed-by: Boris Brezillon 
> > Tested-by: Matthew Brost 
> > Tested-by: Donald Robson 
> > Suggested-by: Dave Airlie 
> > Signed-off-by: Danilo Krummrich   
> 
> For some reason this breaks the drm_exec kunit patches:

Fix available here [1].

[1]https://lore.kernel.org/dri-devel/cbf4ccf9-8131-27a0-332c-694286634...@igalia.com/T/#t


Re: [PATCH v2] drm: fix indirect goto into statement expression UB

2023-07-31 Thread Boris Brezillon
On Fri, 28 Jul 2023 10:17:57 -0700
Nathan Chancellor  wrote:

> + people from trailers of 09593216bff1
> 
> On Thu, Jul 27, 2023 at 03:50:58PM -0700, ndesaulni...@google.com wrote:
> > A new diagnostic in clang-17 now produces the following build error:
> > 
> > drivers/gpu/drm/tests/drm_exec_test.c:41:3: error: cannot jump from this
> > indirect goto statement to one of its possible targets
> >41 | drm_exec_retry_on_contention(&exec);
> >   | ^
> > include/drm/drm_exec.h:96:4: note: expanded from macro
> > 'drm_exec_retry_on_contention'
> >96 | goto *__drm_exec_retry_ptr;
> >   | ^
> > drivers/gpu/drm/tests/drm_exec_test.c:39:2: note: possible target of
> > indirect goto statement
> >39 | drm_exec_until_all_locked(&exec) {
> >   | ^
> > include/drm/drm_exec.h:79:33: note: expanded from macro
> > 'drm_exec_until_all_locked'
> >79 | __label__ __drm_exec_retry;
> > drivers/gpu/drm/tests/drm_exec_test.c:39:2: note: jump enters a
> > statement expression
> > 
> > The GCC manually currently states that:  
> 
>   ^ manual
> 
> > >> Jumping into a statement expression with a computed goto (see Labels
> > >> as Values) has undefined behavior.  
> > 
> > So the diagnostic appears correct, even if codegen happened to produce
> > working code.
> > 
> > Looking closer at this code, while the original combination of statement
> > expression, local label, and computed/indirect goto GNU C expressions
> > were clever, a simple while loop and continue block might have sufficed.
> > 
> > This approach might not work as expected if drm_exec_until_all_locked
> > "loops" can be nested, but that doesn't appear to be an existing use
> > case in the codebase.

Hm, that's exactly the sort of things we were trying to be robust
against with the original approach. With this version, we're back to a
situation where

drm_exec_until_all_locked(exec) {
for (...) {
drm_exec_retry_on_contention(exec);
}
}

doesn't do what we expect it to do, and that's a use case we want to
support.
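
To spell out the concern: with the while()-based macros from this patch,
drm_exec_retry_on_contention() expands to a plain `continue`, which binds
to the innermost enclosing loop. In the nested pattern above that is the
driver's inner loop, not the drm_exec loop, so on contention the rest of
the inner loop keeps executing in the contended state instead of the whole
locking pass being restarted right away. A rough expansion of that pattern
(sketch only):

static int lock_all(struct drm_exec *exec, struct drm_gem_object **bos,
                    unsigned int num_bos)
{
        unsigned int i;
        int ret;

        while (drm_exec_cleanup(exec)) {
                for (i = 0; i < num_bos; i++) {
                        ret = drm_exec_lock_obj(exec, bos[i]);
                        if (unlikely(drm_exec_is_contended(exec)))
                                continue; /* continues the for loop, not the drm_exec loop! */
                        if (ret)
                                return ret;
                }
        }

        return 0;
}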

> > 
> > Fixes: commit 09593216bff1 ("drm: execution context for GEM buffers v7")
> > Link: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html
> > Link: https://github.com/ClangBuiltLinux/linux/issues/1890
> > Link: 
> > https://github.com/llvm/llvm-project/commit/20219106060208f0c2f5d096eb3aed7b712f5067
> > Reported-by: Nathan Chancellor 
> > Reported-by: Naresh Kamboju 
> > Signed-off-by: Nick Desaulniers   
> 
> Thanks for the patch!
> 
> Tested-by: Nathan Chancellor  # build
> 
> > ---
> > Changes in v2:
> > Fix the continue to be outside of the do while
> > - Link to v1: 
> > https://lore.kernel.org/r/20230727-amdgpu-v1-1-a95690e75...@google.com
> > ---
> >  include/drm/drm_exec.h | 21 +
> >  1 file changed, 5 insertions(+), 16 deletions(-)
> > 
> > diff --git a/include/drm/drm_exec.h b/include/drm/drm_exec.h
> > index 73205afec162..fa1cc5c3d021 100644
> > --- a/include/drm/drm_exec.h
> > +++ b/include/drm/drm_exec.h
> > @@ -70,18 +70,8 @@ struct drm_exec {
> >   * Core functionality of the drm_exec object. Loops until all GEM objects 
> > are
> >   * locked and no more contention exists. At the beginning of the loop it is
> >   * guaranteed that no GEM object is locked.
> > - *
> > - * Since labels can't be defined local to the loops body we use a jump 
> > pointer
> > - * to make sure that the retry is only used from within the loops body.
> >   */
> > -#define drm_exec_until_all_locked(exec)\
> > -   for (void *__drm_exec_retry_ptr; ({ \
> > -   __label__ __drm_exec_retry; \
> > -__drm_exec_retry:  \
> > -   __drm_exec_retry_ptr = &&__drm_exec_retry;  \
> > -   (void)__drm_exec_retry_ptr; \
> > -   drm_exec_cleanup(exec); \
> > -   });)
> > +#define drm_exec_until_all_locked(exec)while(drm_exec_cleanup(exec))
> >  
> >  /**
> >   * drm_exec_retry_on_contention - restart the loop to grap all locks
> > @@ -90,11 +80,10 @@ __drm_exec_retry:   
> > \
> >   * Control flow helper to continue when a contention was detected and we 
> > need to
> >   * clean up and re-start the loop to prepare all GEM objects.
> >   */
> > -#define drm_exec_retry_on_contention(exec) \
> > -   do {\
> > -   if (unlikely(drm_exec_is_contended(exec)))  \
> > -   goto *__drm_exec_retry_ptr; \
> > -   } while (0)
> > +#define drm_exec_retry_on_contention(exec) \
> > +   if (unlikely(drm_exec_is_contended(exec)))  \
> > +   continue;   \
> > +   do {} while (0)

Re: [RFC PATCH 06/10] drm/sched: Submit job before starting TDR

2023-07-31 Thread Boris Brezillon
+the PVR devs

On Mon, 31 Jul 2023 01:00:59 +
Matthew Brost  wrote:

> On Thu, May 04, 2023 at 01:23:05AM -0400, Luben Tuikov wrote:
> > On 2023-04-03 20:22, Matthew Brost wrote:  
> > > If the TDR is set to a value, it can fire before a job is submitted in
> > > drm_sched_main. The job should be always be submitted before the TDR
> > > fires, fix this ordering.
> > > 
> > > Signed-off-by: Matthew Brost 
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 6ae710017024..4eac02d212c1 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1150,10 +1150,10 @@ static void drm_sched_main(struct work_struct *w)
> > >   s_fence = sched_job->s_fence;
> > >  
> > >   atomic_inc(&sched->hw_rq_count);
> > > - drm_sched_job_begin(sched_job);
> > >  
> > >   trace_drm_run_job(sched_job, entity);
> > >   fence = sched->ops->run_job(sched_job);
> > > + drm_sched_job_begin(sched_job);
> > >   complete_all(&entity->entity_idle);
> > >   drm_sched_fence_scheduled(s_fence);
> > >
> > 
> > Not sure if this is correct. In drm_sched_job_begin() we add the job to the 
> > "pending_list"
> > (meaning it is pending execution in the hardware) and we also start a 
> > timeout timer. Both
> > of those should be started before the job is given to the hardware.
> >   
> 
> The correct solution is probably add to pending list before run_job()
> and kick TDR after run_job().

This would make the PVR driver simpler too. Right now, the driver
iterates over the pending job list to signal jobs' done_fences, but
there's a race between the interrupt handler (that's iterating over
this list to signal fences) and the drm_sched logic (that's inserting
the job in the pending_list after run_job() returns). The race is taken
care of with an additional field pointing to the last submitted
job [1], but if we can get rid of that logic, that's for the best.

[1]https://gitlab.freedesktop.org/frankbinns/powervr/-/blob/powervr-next/drivers/gpu/drm/imagination/pvr_queue.h#L119
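
For reference, a rough sketch of the ordering Matthew suggests in
drm_sched_main(), assuming drm_sched_job_begin() were split into a
'add to pending_list' part and a 'start the timeout' part (the two
helper names below are hypothetical, not existing functions):

	atomic_inc(&sched->hw_rq_count);
	drm_sched_job_add_to_pending(sched_job);	/* hypothetical: list insertion only */

	trace_drm_run_job(sched_job, entity);
	fence = sched->ops->run_job(sched_job);

	drm_sched_job_start_timeout(sched_job);		/* hypothetical: kick the TDR */
	complete_all(&entity->entity_idle);
	drm_sched_fence_scheduled(s_fence);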


Re: [PATCH v14 02/12] drm/shmem-helper: Add pages_pin_count field

2023-07-31 Thread Boris Brezillon
+Danilo, to confirm my understanding of the gpuva remap operation is
correct.

On Mon, 31 Jul 2023 15:27:31 +0300
Dmitry Osipenko  wrote:

> On 7/25/23 11:32, Boris Brezillon wrote:
> >> Can we make it an atomic_t, so we can avoid taking the lock when the
> >> GEM has already been pinned. That's something I need to be able to grab
> >> a pin-ref in a path where the GEM resv lock is already held[1]. We could
> >> of course expose the locked version,  
> > My bad, that's actually not true. The problem is not that I call
> > drm_gem_shmem_pin() with the resv lock already held, but that I call
> > drm_gem_shmem_pin() in a dma-signaling path where I'm not allowed to
> > take a resv lock. I know for sure pin_count > 0, because all GEM objects
> > mapped to a VM have their memory pinned right now, and this should
> > stand until we decide to add support for live-GEM eviction, at which
> > point we'll probably have a way to detect when a GEM is evicted, and
> > avoid calling drm_gem_shmem_pin() on it.
> > 
> > TLDR; I can't trade the atomic_t for a drm_gem_shmem_pin_locked(),
> > because that wouldn't solve my problem. The other solution would be to
> > add an atomic_t at the driver-GEM level, and only call
> > drm_gem_shmem_[un]pin() on 0 <-> 1 transitions, but I thought using an
> > atomic at the GEM-shmem level, to avoid locking when we can, would be
> > beneficial to the rest of the eco-system. Let me know if that's not an
> > option, and I'll go back to the driver-specific atomic_t.  
> 
> Could you please explain why you need to pin a GEM in a signal handler?
> This is not something drivers usually do or need to do. You likely also
> shouldn't need to detect that a GEM is evicted in your driver. I'd expect
> that Panthor shouldn't differ from Panfrost in regards to how GEM memory
> management is done, and Panfrost doesn't need to do anything special.

Panthor VM management is completely different, and the case I'm
referring to is 'asynchronous VM_BIND': mapping a GEM object to a GPU VM
asynchronously, so we can make it depend on other operations, encoded as
syncobjs passed to the VM_BIND operation.

Here is the workflow we have for this use case:

1. Create + push a VM_BIND job to the VM_BIND queue (a drm_sched_entity
that's taking care of asynchronous VM map/unmap operations). Because
this operation is asynchronous, and the execution itself happens in a
dma-signaling path (drm_sched::run_job()), we need to pre-allocate the
MMU page tables for the worst case scenario, and make sure the GEM pages
are pinned at job creation time.

2. The VM operation itself is executed when all dependencies are met
(drm_sched calls run_job()). In case of a map operation, we call
drm_gpuva_sm_map(), which might split the map operation into
remap+unmap+map ones if the region being mapped covers a region
that was previously mapped to a different GEM object or a different
portion of the same GEM object (see the gpuva_mgr doc [1]). A
remap operation is just a way to split an existing mapping into 2
mappings covering the left/right sides of the previous mapping, plus a
hole in the middle. This means that our VM mapping object (drm_gpuva), which
was pointing to a GEM object that had its pages pinned, is now turned
into 2 mapping objects, and we need to make sure those 2 mappings own a
reference to the pages, otherwise we'll have an unbalanced refcount
when we release those 2 mappings further down the road.

3. Release resources attached to mappings that were removed (that
includes releasing the ref we had on GEM pages) and free the mapping
objects. We do that asynchronously, outside of the dma-signaling path.

> 
> Note that patch #14 makes locked pin/unpin functions public and turns
> the unlocked variants into helpers, you'll be able to experiment with
> these funcs in the Panthor driver.

Unfortunately, those won't help. I really need a way to increment the
refcount without holding the lock, because we're in a dma-signaling
path when we call drm_gpuva_sm_map(). Note that I could live with a
drm_shmem_gem_pin_if_already_pinned() variant that would return NULL if
pin_count == 0 instead of trying to acquire the lock, but I'd still
need this refcount to be an atomic_t.
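
A minimal sketch of what such a variant could look like, assuming
pages_use_count is turned into an atomic_t as proposed in this series
(the name follows the shmem-helper convention and the bool return is a
simplification; none of this is an existing helper):

	/* Grab a pages ref only if the GEM is already pinned; never takes the
	 * resv lock, so it can be called from a dma-fence signaling path.
	 */
	static inline bool
	drm_gem_shmem_pin_if_already_pinned(struct drm_gem_shmem_object *shmem)
	{
		return atomic_inc_not_zero(&shmem->pages_use_count);
	}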

As I said, an alternative to this approach would be to have a separate
atomic refcount at the panthor_gem_object level, but I feel like we'd
just be duplicating something that exists already.

[1]https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/drm_gpuva_mgr.c#n67


Re: [PATCH 1/2] drm/exec: use unique instead of local label

2023-07-31 Thread Boris Brezillon
On Mon, 31 Jul 2023 08:31:19 -0700
Nathan Chancellor  wrote:

> On Mon, Jul 31, 2023 at 02:36:24PM +0200, Christian König wrote:
> > GCC forbids jumping to labels in loop conditions, and a new clang
> > check stumbled over this.
> > 
> > So instead of using a local label inside the loop condition, use a
> > unique label outside of it.
> > 
> > Fixes: commit 09593216bff1 ("drm: execution context for GEM buffers v7")
> > Link: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html
> > Link: https://github.com/ClangBuiltLinux/linux/issues/1890
> > Link: 
> > https://github.com/llvm/llvm-project/commit/20219106060208f0c2f5d096eb3aed7b712f5067
> > Reported-by: Nathan Chancellor 
> > Reported-by: Naresh Kamboju 
> > CC: Boris Brezillon 

Reviewed-by: Boris Brezillon 

> > Signed-off-by: Christian König   
> 
> Passes my build tests and I inspected the preprocessed output to make
> sure it should work. I ran the KUnit tests, which all pass (although [1]
> is needed to fix a tangential issue):
> 
> Tested-by: Nathan Chancellor 
> 
> Thanks for fixing this!
> 
> [1]: https://lore.kernel.org/20230728183400.306193-1-arthurgri...@riseup.net/
> 
> > ---
> >  include/drm/drm_exec.h | 14 +++---
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/drm/drm_exec.h b/include/drm/drm_exec.h
> > index 73205afec162..e0462361adf9 100644
> > --- a/include/drm/drm_exec.h
> > +++ b/include/drm/drm_exec.h
> > @@ -3,6 +3,7 @@
> >  #ifndef __DRM_EXEC_H__
> >  #define __DRM_EXEC_H__
> >  
> > +#include 
> >  #include 
> >  
> >  #define DRM_EXEC_INTERRUPTIBLE_WAITBIT(0)
> > @@ -74,13 +75,12 @@ struct drm_exec {
> >   * Since labels can't be defined local to the loops body we use a jump 
> > pointer
> >   * to make sure that the retry is only used from within the loops body.
> >   */
> > -#define drm_exec_until_all_locked(exec)\
> > -   for (void *__drm_exec_retry_ptr; ({ \
> > -   __label__ __drm_exec_retry; \
> > -__drm_exec_retry:  \
> > -   __drm_exec_retry_ptr = &&__drm_exec_retry;  \
> > -   (void)__drm_exec_retry_ptr; \
> > -   drm_exec_cleanup(exec); \
> > +#define drm_exec_until_all_locked(exec)			\
> > +__PASTE(__drm_exec_, __LINE__):				\
> > +   for (void *__drm_exec_retry_ptr; ({ \
> > +   __drm_exec_retry_ptr = &&__PASTE(__drm_exec_, __LINE__);\
> > +   (void)__drm_exec_retry_ptr; \
> > +   drm_exec_cleanup(exec); \
> > });)
> >  
> >  /**
> > -- 
> > 2.34.1
> > 
> >   



Re: [PATCH v3] drm/panfrost: Sync IRQ by job's timeout handler

2023-08-01 Thread Boris Brezillon
On Tue,  1 Aug 2023 03:14:27 +0300
Dmitry Osipenko  wrote:

> The Panfrost IRQ handler may get stuck for a long time, for example when
> there is a bad HDMI connection and the HDMI handler takes a long time to
> finish processing, holding up Panfrost. Make Panfrost's job timeout handler
> sync the IRQ before checking the fence signal status in order to prevent
> spurious job timeouts due to slow IRQ processing.
> 
> Reviewed-by: Steven Price 
> Reviewed-by: AngeloGioacchino Del Regno 
> 
> Tested-by: AngeloGioacchino Del Regno 
>  # MediaTek MT8192 and MT8195 
> Chromebooks:
> Signed-off-by: Dmitry Osipenko 

Reviewed-by: Boris Brezillon 

Just a couple nits below.

> ---
> 
> Changelog:
> 
> v3: - Added comment to the code as was suggested by Boris
> 
> - Added r-b/t-b from Steven and Angelo
> 
> v2: - Moved synchronize_irq() after first signal-check to avoid unnecessary
>   blocking on syncing.
> 
> - Added warn message about high interrupt latency.
> 
>  drivers/gpu/drm/panfrost/panfrost_job.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
> b/drivers/gpu/drm/panfrost/panfrost_job.c
> index dbc597ab46fb..ea1149354f9d 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -720,6 +720,21 @@ static enum drm_gpu_sched_stat 
> panfrost_job_timedout(struct drm_sched_job
>   if (dma_fence_is_signaled(job->done_fence))
>   return DRM_GPU_SCHED_STAT_NOMINAL;
>  
> + /*
> +  * Panfrost IRQ handler may take long time to process if there is

^ may take a long time to process an
interrupt if there is ...

> +  * another IRQ handler hogging the processing. For example, HDMI
> +  * may stuck in IRQ handler for a significant time in a case of bad

For example, the HDMI encoder driver might be stuck in the IRQ
handler ...

> +  * cable connection. In order to catch such cases and not report
> +  * spurious Panfrost job timeouts, synchronize the IRQ handler and
> +  * re-check the fence status.
> +  */



> + synchronize_irq(pfdev->js->irq);
> +
> + if (dma_fence_is_signaled(job->done_fence)) {
> + dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
> + return DRM_GPU_SCHED_STAT_NOMINAL;
> + }
> +
>   dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, 
> status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>   js,
>   job_read(pfdev, JS_CONFIG(js)),



Re: [PATCH 1/2] drm/exec: use unique instead of local label

2023-08-02 Thread Boris Brezillon
On Tue, 1 Aug 2023 13:35:13 -0700
Nick Desaulniers  wrote:

> On Mon, Jul 31, 2023 at 5:36 AM Christian König
>  wrote:
> >
> > GCC forbids jumping to labels in loop conditions, and a new clang
> > check stumbled over this.
> >
> > So instead of using a local label inside the loop condition, use a
> > unique label outside of it.
> >
> > Fixes: commit 09593216bff1 ("drm: execution context for GEM buffers v7")
> > Link: https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html
> > Link: https://github.com/ClangBuiltLinux/linux/issues/1890
> > Link: 
> > https://github.com/llvm/llvm-project/commit/20219106060208f0c2f5d096eb3aed7b712f5067
> > Reported-by: Nathan Chancellor 
> > Reported-by: Naresh Kamboju 
> > CC: Boris Brezillon 
> > Signed-off-by: Christian König   
> 
> Works for me; thanks for the patch!
> Reviewed-by: Nick Desaulniers 
> 
> I suspect it's possible to change the indirect goto into a direct goto
> with some further refactoring (macros can take block statements; if
> drm_exec_until_all_locked accepted a block statement arg then you
> could introduce a new scope, and a new local label to that scope, then
> just use direct goto),

Maybe I'm wrong, but this sounds like the version I proposed here [1].

> but this will probably apply cleaner. (oh, is
> 09593216bff1 only in next at the moment? The AuthorDate threw me.)
> 
> There are some curious cases where __attribute__((cleanup())) doesn't
> mesh well with indirect gotos.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=37722
> 
> May not ever be a problem here...

[1]https://patchwork.freedesktop.org/patch/543077/
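
For the record, a rough sketch of the block-statement-argument approach
Nick describes might look like the following (purely hypothetical, not
what was merged, and not necessarily identical to [1]). The body is
expanded inside the block that declares the local label, so the direct
goto in drm_exec_retry_on_contention() resolves to it:

#define drm_exec_until_all_locked(exec, ...)			\
	do {							\
		__label__ __drm_exec_retry;			\
__drm_exec_retry:						\
		while (drm_exec_cleanup(exec)) {		\
			__VA_ARGS__				\
		}						\
	} while (0)

#define drm_exec_retry_on_contention(exec)			\
	do {							\
		if (unlikely(drm_exec_is_contended(exec)))	\
			goto __drm_exec_retry;			\
	} while (0)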


Re: [PATCH v14 02/12] drm/shmem-helper: Add pages_pin_count field

2023-08-02 Thread Boris Brezillon
On Wed, 2 Aug 2023 04:31:52 +0200
Danilo Krummrich  wrote:

> On 7/31/23 15:35, Boris Brezillon wrote:
> > +Danilo, to confirm my understanding of the gpuva remap operation is
> > correct.  
> 
> Your understanding is correct.
> 
> Unfortunately, re-mapping things has such implications.
> 
> I'm currently working on tracking external GEM objects in the GPUVA 
> manager, where, ideally, you'd want to add the extobj to the VM when the 
> first mapping being backed by this GEM is created and removed when the 
> last mapping being backed by this GEM is removed. Hence, extobjs need to 
> be ref-counted based on how many mappings they back.

Uh, right. I went for a much simpler (but also less efficient) approach
where I basically track things at the mapping level (my panthor_vma
object, which inherits from drm_gpuva, has a list node so it can be
inserted in a shared_bos list tracked at the VM level), instead of the
GEM level. So we'd basically be trying to acquire resv locks multiple
times and reserving multiple slots if the same shared GEM is mapped
multiple times. With the IGNORE_DUPLICATES flag passed to drm_exec,
that works, but it might not be ideal if we expect shared BOs to be
mapped multiple times in the same VM.

> 
> However, when re-mapping such a mapping, the reference counter might 
> drop to 0 temporarily and the slot of the data structure tracking the 
> extobj is cleaned up and needs to be re-allocated. Surely, we could just 
> increase the reference count while re-mapping or for the whole 
> transaction (job), but this would make the API kinda bulky.

With things happening in the dma-signaling path, we'd need to
pre-allocate this shared-bo container object anyway, because we can't
assume there will be one available by the time we get to run the VM
operation. So I think it's safe to assume that, even if the unmap part
of the remap operation drops the last ref of this container object, when
you get to map the same BO again, you'll have another container to play
with. It's just a matter of pre-allocating one more thing when
bo_is_shared==true && op==map, I think.
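
In pseudo-code, that pre-allocation rule would boil down to something
like this at VM_BIND job creation time, outside the signaling path (the
field and helper names are made up for illustration; Panthor's actual
structures may differ):

	if (op_is_map && bo_is_shared) {
		/* Worst case: the unmap part of a remap drops the last ref on
		 * the existing extobj container, so the map part needs a fresh
		 * one ready to go.
		 */
		op->prealloc_extobj = kzalloc(sizeof(*op->prealloc_extobj), GFP_KERNEL);
		if (!op->prealloc_extobj)
			return -ENOMEM;
	}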


[RFC PATCH] drm/sched: Wait for the currently popped dependency in kill_jobs_cb()

2023-06-07 Thread Boris Brezillon
If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed
to wait on all the external dependencies (those added to
drm_sched_job::dependencies) before signaling the job finished fence.
This is done this way to prevent jobs depending on these canceled jobs
from considering the resources they want to access as ready, when
they're actually still used by other components, thus leading to
potential data corruptions.

The problem is, the kill_jobs logic is omitting the last fence popped
from the dependencies array that was waited upon before
drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
so we're basically waiting for all dependencies except one.

This is an attempt at fixing that, but also an opportunity to make sure
I understood the drm_sched_entity_kill(), because I'm not 100% sure if
skipping this currently popped dependency was intentional or not. I can't
see a good reason why we'd want to skip that one, but I might be missing
something.

Signed-off-by: Boris Brezillon 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 
---
Stumbled across this issue while working on native dependency support
with Donald (which will be posted soon). Flagged as RFC, because I'm
not sure this is legit, and also not sure we want to fix it this way.
I tried re-using drm_sched_entity::dependency, but it's a bit of a mess
because of the asynchronousity of the wait, and the fact we use
drm_sched_entity::dependency to know if we have a clear_dep()
callback registered, so we can easily reset it without removing the
callback.
---
 drivers/gpu/drm/scheduler/sched_entity.c | 40 ++--
 drivers/gpu/drm/scheduler/sched_main.c   |  3 ++
 include/drm/gpu_scheduler.h  | 12 +++
 3 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 68e807ae136a..3821f9adf7bd 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -176,20 +176,35 @@ static void drm_sched_entity_kill_jobs_cb(struct 
dma_fence *f,
 {
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 finish_cb);
-   int r;
 
dma_fence_put(f);
 
-   /* Wait for all dependencies to avoid data corruptions */
-   while (!xa_empty(&job->dependencies)) {
-   f = xa_erase(&job->dependencies, job->last_dependency++);
-   r = dma_fence_add_callback(f, &job->finish_cb,
-  drm_sched_entity_kill_jobs_cb);
-   if (!r)
+   /* Wait for all remaining dependencies to avoid data corruptions.
+*
+* We first check the last dependency popped from job->dependencies,
+* and then walk job->dependencies.
+*
+* Note that we don't wait on the last fence returned by
+* drm_gpu_scheduler_ops::prepare_job(), nor do we call
+* drm_gpu_scheduler_ops::prepare_job() to empty the list of potential
+* internal dependencies the driver might want to wait on before
+* scheduling the job. We simply assume skipping internal dependencies
+* can't cause data corruption on resources passed to the job.
+*/
+   do {
+   f = job->cur_dep;
+
+   if (!xa_empty(&job->dependencies))
+   job->cur_dep = xa_erase(&job->dependencies, 
job->last_dependency++);
+   else
+   job->cur_dep = NULL;
+
+   if (f &&
+   !dma_fence_add_callback(f, &job->finish_cb, 
drm_sched_entity_kill_jobs_cb))
return;
 
dma_fence_put(f);
-   }
+   } while (job->cur_dep);
 
INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
schedule_work(&job->work);
@@ -415,8 +430,13 @@ static struct dma_fence *
 drm_sched_job_dependency(struct drm_sched_job *job,
 struct drm_sched_entity *entity)
 {
-   if (!xa_empty(&job->dependencies))
-   return xa_erase(&job->dependencies, job->last_dependency++);
+   dma_fence_put(job->cur_dep);
+   job->cur_dep = NULL;
+
+   if (!xa_empty(&job->dependencies)) {
+   job->cur_dep = xa_erase(&job->dependencies, 
job->last_dependency++);
+   return dma_fence_get(job->cur_dep);
+   }
 
if (job->sched->ops->prepare_job)
return job->sched->ops->prepare_job(job, entity);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 394010a60821..70ab60e5760c 100644

Re: [RFC PATCH] drm/sched: Wait for the currently popped dependency in kill_jobs_cb()

2023-06-08 Thread Boris Brezillon
On Thu,  8 Jun 2023 08:55:51 +0200
Boris Brezillon  wrote:

> If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed
> to wait on all the external dependencies (those added to
> drm_sched_job::dependencies) before signaling the job finished fence.
> This is done this way to prevent jobs depending on these canceled jobs
> from considering the resources they want to access as ready, when
> they're actually still used by other components, thus leading to
> potential data corruptions.
> 
> The problem is, the kill_jobs logic is omitting the last fence popped
> from the dependencies array that was waited upon before
> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> so we're basically waiting for all dependencies except one.
> 
> This is an attempt at fixing that, but also an opportunity to make sure
> I understood the drm_sched_entity_kill(), because I'm not 100% sure if

   ^ the drm_sched_entity_kill() logic correctly, ...

> skipping this currently popped dependency was intentional or not. I can't
> see a good reason why we'd want to skip that one, but I might be missing
> something.
> 
> Signed-off-by: Boris Brezillon 
> Cc: Frank Binns 
> Cc: Sarah Walker 
> Cc: Donald Robson 
> Cc: Luben Tuikov 
> Cc: David Airlie 
> Cc: Daniel Vetter 
> Cc: Sumit Semwal 
> Cc: "Christian König" 
> ---
> Stumbled across this issue while working on native dependency support
> with Donald (which will be posted soon). Flagged as RFC, because I'm
> not sure this is legit, and also not sure we want to fix it this way.
> I tried re-using drm_sched_entity::dependency, but it's a bit of a mess
> because of the asynchronousity of the wait, and the fact we use
> drm_sched_entity::dependency to know if we have a clear_dep()
> callback registered, so we can easily reset it without removing the

  ^ we can't ...

> callback.


Re: [RFC PATCH 01/10] drm/sched: Convert drm scheduler to use a work queue rather than kthread

2023-06-08 Thread Boris Brezillon
Hi Matthew,

On Mon,  3 Apr 2023 17:22:02 -0700
Matthew Brost  wrote:

> -static int drm_sched_main(void *param)
> +static void drm_sched_main(struct work_struct *w)
>  {
> - struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
> + struct drm_gpu_scheduler *sched =
> + container_of(w, struct drm_gpu_scheduler, work_run);
>   int r;
>  
> - sched_set_fifo_low(current);
> -
> - while (!kthread_should_stop()) {
> - struct drm_sched_entity *entity = NULL;
> + while (!READ_ONCE(sched->pause_run_wq)) {

During an informal discussion on IRC I mentioned that this loop might
become problematic if all the 1:1 entities share the same wq
(especially if it's an ordered wq), and one of them is getting passed a
lot of requests. Just wanted to tell you that we've hit that case in
PowerVR:

Geometry and fragment queues get passed X requests respectively, each
pair of requests corresponding to a rendering operation. Because we're
using an ordered wq (which I know we shouldn't do, and I intend to
fix that, but I think it shows the problem exists by making it more
visible), all geometry requests get submitted first, then come the
fragment requests. It turns out the submission time is non-negligible
compared to the geometry job execution time, and geometry jobs end up
generating data for the fragment jobs that is not consumed fast enough
by the fragment jobs to allow the following geometry jobs to re-use the
same portion of memory, leading to on-demand allocation of extra memory
chunks that wouldn't happen if submissions were interleaved.

I know you were not fundamentally opposed to killing this loop and doing
one iteration at a time (you even provided a patch doing that), just
wanted to share my findings to prove this is not just a theoretical
issue, and the lack of fairness in the submission path can cause trouble
in practice.

Best Regards,

Boris

> + struct drm_sched_entity *entity;
>   struct drm_sched_fence *s_fence;
>   struct drm_sched_job *sched_job;
>   struct dma_fence *fence;
> - struct drm_sched_job *cleanup_job = NULL;
> + struct drm_sched_job *cleanup_job;
>  
> - wait_event_interruptible(sched->wake_up_worker,
> -  (cleanup_job = 
> drm_sched_get_cleanup_job(sched)) ||
> -  (!drm_sched_blocked(sched) &&
> -   (entity = 
> drm_sched_select_entity(sched))) ||
> -  kthread_should_stop());
> + cleanup_job = drm_sched_get_cleanup_job(sched);
> + entity = drm_sched_select_entity(sched);
>  
>   if (cleanup_job)
>   sched->ops->free_job(cleanup_job);
>  
> - if (!entity)
> + if (!entity) {
> + if (!cleanup_job)
> + break;
>   continue;
> + }
>  
>   sched_job = drm_sched_entity_pop_job(entity);
>  
>   if (!sched_job) {
>   complete_all(&entity->entity_idle);
> + if (!cleanup_job)
> + break;
>   continue;
>   }
>  
> @@ -1055,14 +1083,14 @@ static int drm_sched_main(void *param)
> r);
>   } else {
>   if (IS_ERR(fence))
> - dma_fence_set_error(&s_fence->finished, 
> PTR_ERR(fence));
> + dma_fence_set_error(&s_fence->finished,
> + PTR_ERR(fence));
>  
>   drm_sched_job_done(sched_job);
>   }
>  
>   wake_up(&sched->job_scheduled);
>   }
> - return 0;
>  }


Re: [RFC PATCH] drm/sched: Wait for the currently popped dependency in kill_jobs_cb()

2023-06-09 Thread Boris Brezillon
Hello Christian,

On Fri, 9 Jun 2023 13:53:59 +0200
Christian König  wrote:

> Am 08.06.23 um 08:55 schrieb Boris Brezillon:
> > If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed
> > to wait on all the external dependencies (those added to
> > drm_sched_job::dependencies) before signaling the job finished fence.
> > This is done this way to prevent jobs depending on these canceled jobs
> > from considering the resources they want to access as ready, when
> > they're actually still used by other components, thus leading to
> > potential data corruptions.
> >
> > The problem is, the kill_jobs logic is omitting the last fence popped
> > from the dependencies array that was waited upon before
> > drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> > so we're basically waiting for all dependencies except one.
> >
> > This is an attempt at fixing that, but also an opportunity to make sure
> > I understood the drm_sched_entity_kill(), because I'm not 100% sure if
> > skipping this currently popped dependency was intentional or not. I can't
> > see a good reason why we'd want to skip that one, but I might be missing
> > something.
> >
> > Signed-off-by: Boris Brezillon 
> > Cc: Frank Binns 
> > Cc: Sarah Walker 
> > Cc: Donald Robson 
> > Cc: Luben Tuikov 
> > Cc: David Airlie 
> > Cc: Daniel Vetter 
> > Cc: Sumit Semwal 
> > Cc: "Christian König" 
> > ---
> > Stumbled across this issue while working on native dependency support
> > with Donald (which will be posted soon). Flagged as RFC, because I'm
> > not sure this is legit, and also not sure we want to fix it this way.
> > I tried re-using drm_sched_entity::dependency, but it's a bit of a mess
> > because of the asynchronousity of the wait, and the fact we use
> > drm_sched_entity::dependency to know if we have a clear_dep()
> > callback registered, so we can easily reset it without removing the
> > callback.  
> 
> Well yes, that's a known problem. But this is really not the right 
> approach to fixing this.
> 
> Trying to wait for all the dependencies before killing jobs was added 
> because of the way we kept track of dma_fences in dma_resv objects. 
> Basically adding exclusive fences removed all other fences leading to a 
> bit fragile memory management.

Okay.

> 
> This handling was removed by now and so the workaround for waiting for 
> dependencies is not really necessary any more, but I think it is still 
> better to do so.
> 
> The right approach of getting this waiting for dependencies completely 
> straight is also not to touch entity->dependency in any way, but to stop 
> removing them from the XA in drm_sched_job_dependency(). Otherwise you 
> don't catch the pipeline optimized ones either.

Do you want me to post a v2 doing that, or should I forget about it?
If we decide to keep things like that, it might be good to at least
add a comment explaining why we don't care.

Regards,

Boris


[PATCH v2] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-09 Thread Boris Brezillon
drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
from the dependency array that was waited upon before
drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
so we're basically waiting for all dependencies except one.

In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.

Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().

Signed-off-by: Boris Brezillon 
Suggested-by: "Christian König" 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 
---
 drivers/gpu/drm/scheduler/sched_entity.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 68e807ae136a..e1b437e66f3c 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -176,13 +176,14 @@ static void drm_sched_entity_kill_jobs_cb(struct 
dma_fence *f,
 {
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 finish_cb);
+   unsigned long index;
int r;
 
dma_fence_put(f);
 
/* Wait for all dependencies to avoid data corruptions */
-   while (!xa_empty(&job->dependencies)) {
-   f = xa_erase(&job->dependencies, job->last_dependency++);
+   xa_for_each(&job->dependencies, index, f) {
+   xa_erase(&job->dependencies, index);
r = dma_fence_add_callback(f, &job->finish_cb,
   drm_sched_entity_kill_jobs_cb);
if (!r)
@@ -415,8 +416,17 @@ static struct dma_fence *
 drm_sched_job_dependency(struct drm_sched_job *job,
 struct drm_sched_entity *entity)
 {
-   if (!xa_empty(&job->dependencies))
-   return xa_erase(&job->dependencies, job->last_dependency++);
+   struct dma_fence *f;
+
+   /* We keep the fence around, so we can iterate over all dependencies
+* in drm_sched_entity_kill_jobs_cb() to make all deps are signaled
+* before killing the job.
+*/
+   f = xa_load(&job->dependencies, job->last_dependency);
+   if (f) {
+   job->last_dependency++;
+   return dma_fence_get(f);
+   }
 
if (job->sched->ops->prepare_job)
return job->sched->ops->prepare_job(job, entity);
-- 
2.40.1



Re: [PATCH] drm/sched: Add native dependency support to drm_sched

2023-06-12 Thread Boris Brezillon
Hi Donald,

On Thu, 8 Jun 2023 13:23:26 +
Donald Robson  wrote:

>  /**
>   * drm_sched_job_arm - arm a scheduler job for execution
>   * @job: scheduler job to arm
> @@ -669,6 +755,7 @@ void drm_sched_job_arm(struct drm_sched_job *job)
>   job->s_priority = entity->rq - sched->sched_rq;
>   job->id = atomic64_inc_return(&sched->job_id_count);
>  
> + drm_sched_sort_native_deps(job);

If we get [1] accepted, we no longer need to sort the array. We can
just skip native dependencies as we iterate over the array in
drm_sched_job_dependency() with something like:

   f = xa_load(&job->dependencies, job->last_dependency);
   while (f) {
   struct drm_sched_fence *s_fence;
   struct dma_fence *scheduled_fence;

   job->last_dependency++;

   /* Not a native dependency, return the fence directly. */
   if (!job->sched->ops->dependency_is_native ||
   !job->sched->ops->dependency_is_native(f))
   return dma_fence_get(f);

   /*
* If the native fence is a drm_sched_fence object, we
* ensure the job has been submitted so drm_sched_fence
* ::parent points to a valid dma_fence object.
*/
   s_fence = to_drm_sched_fence(f);
   scheduled_fence = s_fence ?
 dma_fence_get_rcu(&s_fence->scheduled) :
 NULL;

   if (scheduled_fence)
   return scheduled_fence;

   /* Otherwise we skip the native fence and check the next fence. 
*/
   f = xa_load(&job->dependencies, job->last_dependency);
}

And, in the driver, when you get to submit the job, you can gather
the native deps with a simple xa_for_each() loop:

xa_for_each(&job->dependencies, index, f) {
/* If the fence is not signaled, it must be a native fence,
 * because drm_sched_entity waited for all non-native ones.
 */
if (!dma_fence_is_signaled(f))
// DO SOMETHING
}

>   drm_sched_fence_init(job->s_fence, job->entity);
>  }

Regards,

Boris

[1]https://patchwork.freedesktop.org/patch/541956/


Re: [PATCH] drm/sched: Add native dependency support to drm_sched

2023-06-12 Thread Boris Brezillon
Hi Christian,

On Mon, 12 Jun 2023 15:16:03 +0200
Christian König  wrote:

> Am 08.06.23 um 15:23 schrieb Donald Robson:
> > This patch adds support for 'native' dependencies to DRM scheduler.  In
> > drivers that use a firmware based scheduler there are performance gains
> > to be had by allowing waits to happen in the firmware, as this reduces
> > the latency between signalling and job submission.  
> 
> Well, if I'm not completely mistaken this patch here is superfluous 
> since we already use that functionality.
> 
> This strongly sounds like the HW dependencies we have in amdgpu. See 
> AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES.

I'll look at it in more details. Thanks for the pointer.

> 
> Basically userspace can instead of giving a hard dependency to finish 
> something before the current submission starts also give a soft 
> dependency and only let the other submission be scheduled.
> 
> This way you can then wait for the firmware for certain operations of 
> the previous submission to complete by comparing memory or registers.
> 
> You don't necessarily need to give control over this to userspace, if 
> your kernel driver can determine a fw assisted wait by itself that 
> should also work fine.

That's what we did initially. We had a separate 'native_deps' xarray in
pvr_job that we were managing ourselves, and that worked fine, except
for the kill_entity() stuff. If you don't wait for those
'native-fences', you're potentially signaling the job finished fence
earlier than it should.

Just to make sure we're on the same page, the native fences we
have here are really dma_fences that can be waited upon FW side:
they're not exposed to userspace, the GPU can't access the memory
region containing the counter (it's visible to the FW VM only, and
a kernel side CPU mapping), and we do make sure they signal in finite
time thanks to the job timeout. Feels a bit different compared to
USER_FENCEs most GPUs have nowadays, on which you don't have this
isolation guarantee, and which, AFAIU, are currently used to do some
advanced userspace driven scheduling between queues belonging to the
same context. My understanding, after discussing it with Daniel a few
weeks back, was that exposing USER_FENCEs as dma_fences was risky,
especially if they're used to do inter-context synchronization,
but the FW-visible-only ones were okay to expose as dma_fences. Maybe I
misunderstood what he suggested.

I'm done with this digression, now back to the original topic: we can of
course wait for all those native fences before calling
drm_sched_entity_destroy(), but it's a bit weird to do a partial
wait in the driver while the entity is still active (pretty sure that's
racy anyway), and then delegate the rest to the core.

If we decide we don't care about waiting for native fences when
killing jobs in the kill_entity() path, because we assume the dma_resv is
covering us, that's fine, but then I don't really see why
drm_sched_entity_kill() should wait at all, because this 'should wait,
but maybe not for all your deps' behavior is quite confusing.

Regards,

Boris


Re: [PATCH] drm/sched: Add native dependency support to drm_sched

2023-06-12 Thread Boris Brezillon
On Mon, 12 Jun 2023 16:59:02 +0200
Boris Brezillon  wrote:

> Hi Christian,
> 
> On Mon, 12 Jun 2023 15:16:03 +0200
> Christian König  wrote:
> 
> > Am 08.06.23 um 15:23 schrieb Donald Robson:  
> > > This patch adds support for 'native' dependencies to DRM scheduler.  In
> > > drivers that use a firmware based scheduler there are performance gains
> > > to be had by allowing waits to happen in the firmware, as this reduces
> > > the latency between signalling and job submission.
> > 
> > Well, if I'm not completely mistaken this patch here is superfluous 
> > since we already use that functionality.
> > 
> > This strongly sounds like the HW dependencies we have in amdgpu. See 
> > AMDGPU_CHUNK_ID_SCHEDULED_DEPENDENCIES.  
> 
> I'll look at it in more details. Thanks for the pointer.

I had a quick look, and it looks pretty similar, indeed.

> 
> > 
> > Basically userspace can instead of giving a hard dependency to finish 
> > something before the current submission starts also give a soft 
> > dependency and only let the other submission be scheduled.
> > 
> > This way you can then wait for the firmware for certain operations of 
> > the previous submission to complete by comparing memory or registers.
> > 
> > You don't necessarily need to give control over this to userspace, if 
> > your kernel driver can determine a fw assisted wait by itself that 
> > should also work fine.  
> 
> That's what we did initially. We had a separate 'native_deps' xarray in
> pvr_job that we were managing ourselves, and that worked fine, except
> for the kill_entity() stuff. If you don't wait for those
> 'native-fences', you're potentially signaling the job finished fence
> earlier than it should.

Hm, I think we could get drm_sched_entity_kill_jobs_cb() to do the
right thing here without teaching drm_sched about native deps. If we
turn back scheduled fences into finished fences in
drm_sched_entity_kill_jobs_cb(), this should work:

static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
  struct dma_fence_cb *cb)
{
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 finish_cb);
unsigned long index;
int r;

dma_fence_put(f);

/* Wait for all dependencies to avoid data corruptions */
xa_for_each(&job->dependencies, index, f) {
struct drm_sched_fence *s_fence = to_drm_sched_fence(f);

/* Make sure we wait for the finished fence here, so we can
 * guarantee that any job we depend on that is still accessing
 * resources is done before we signal this job finished fence
 * and unblock further accesses on these resources.
 */
if (s_fence && f == &s_fence->scheduled)
f = &s_fence->finished;

xa_erase(&job->dependencies, index);
r = dma_fence_add_callback(f, &job->finish_cb,
   drm_sched_entity_kill_jobs_cb);
if (!r)
return;

dma_fence_put(f);
}

INIT_WORK(&job->work, drm_sched_entity_kill_jobs_work);
schedule_work(&job->work);
}

Then, for native fences, we just have to add the scheduled fence to the deps
array, as you do (and as we did in our first version), and we should be good.
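
Concretely, on the driver side that would mean something along these
lines when registering a dependency (a sketch only; the
can_be_waited_by_fw() check is a hypothetical driver-specific helper):

	struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);

	/* For FW-waitable deps, depend on the scheduled fence only: the
	 * real wait is emitted as a FW wait command at submission time.
	 */
	if (s_fence && can_be_waited_by_fw(fence))
		err = drm_sched_job_add_dependency(job, dma_fence_get(&s_fence->scheduled));
	else
		err = drm_sched_job_add_dependency(job, dma_fence_get(fence));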


[PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-13 Thread Boris Brezillon
drm_sched_entity_kill_jobs_cb() logic is omitting the last fence popped
from the dependency array that was waited upon before
drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
so we're basically waiting for all dependencies except one.

In theory, this wait shouldn't be needed because resources should have
their users registered to the dma_resv object, thus guaranteeing that
future jobs wanting to access these resources wait on all the previous
users (depending on the access type, of course). But we want to keep
these explicit waits in the kill entity path just in case.

Let's make sure we keep all dependencies in the array in
drm_sched_job_dependency(), so we can iterate over the array and wait
in drm_sched_entity_kill_jobs_cb().

We also make sure we wait on drm_sched_fence::finished if we were asked
to wait on drm_sched_fence::scheduled, but the intent was probably to
delegate the wait to the GPU, but we want resources to be completely
idle when killing jobs.

v3:
- Always wait for drm_sched_fence::finished fences in
  drm_sched_entity_kill_jobs_cb() when we see a sched_fence

v2:
- Don't evict deps in drm_sched_job_dependency()

Signed-off-by: Boris Brezillon 
Suggested-by: "Christian König" 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 
---
 drivers/gpu/drm/scheduler/sched_entity.c | 28 
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
b/drivers/gpu/drm/scheduler/sched_entity.c
index 68e807ae136a..bc1bc3d47f7f 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -176,13 +176,24 @@ static void drm_sched_entity_kill_jobs_cb(struct 
dma_fence *f,
 {
struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
 finish_cb);
+   unsigned long index;
int r;
 
dma_fence_put(f);
 
/* Wait for all dependencies to avoid data corruptions */
-   while (!xa_empty(&job->dependencies)) {
-   f = xa_erase(&job->dependencies, job->last_dependency++);
+   xa_for_each(&job->dependencies, index, f) {
+   struct drm_sched_fence *s_fence = to_drm_sched_fence(f);
+
+   /* Make sure we wait for the finished fence here, so we can
+* guarantee that any job we depend on that is still accessing
+* resources is done before we signal this job finished fence
+* and unblock further accesses on those resources.
+*/
+   if (s_fence && f == &s_fence->scheduled)
+   f = &s_fence->finished;
+
+   xa_erase(&job->dependencies, index);
r = dma_fence_add_callback(f, &job->finish_cb,
   drm_sched_entity_kill_jobs_cb);
if (!r)
@@ -415,8 +426,17 @@ static struct dma_fence *
 drm_sched_job_dependency(struct drm_sched_job *job,
 struct drm_sched_entity *entity)
 {
-   if (!xa_empty(&job->dependencies))
-   return xa_erase(&job->dependencies, job->last_dependency++);
+   struct dma_fence *f;
+
+   /* We keep the fence around, so we can iterate over all dependencies
+* in drm_sched_entity_kill_jobs_cb() to ensure all deps are signaled
+* before killing the job.
+*/
+   f = xa_load(&job->dependencies, job->last_dependency);
+   if (f) {
+   job->last_dependency++;
+   return dma_fence_get(f);
+   }
 
if (job->sched->ops->prepare_job)
return job->sched->ops->prepare_job(job, entity);
-- 
2.40.1



[PATCH] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-13 Thread Boris Brezillon
Drivers that can delegate waits to the firmware/GPU pass the scheduled
fence to drm_sched_job_add_dependency(), and issue wait commands to
the firmware/GPU at job submission time. For this to be possible, they
need all their 'native' dependencies to have a valid parent since this
is where the actual HW fence information are encoded.

In drm_sched_main(), we currently call drm_sched_fence_set_parent()
after drm_sched_fence_set_parent(), leaving a short period of time
during which the job depending on this fence can be submitted.

Since setting parent and signaling the fence are two things that are
kinda related (you can't have a parent if the job hasn't been scheduled),
it probably makes sense to pass the parent fence to
drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
before it signals the scheduled fence.

Signed-off-by: Boris Brezillon 
Cc: Frank Binns 
Cc: Sarah Walker 
Cc: Donald Robson 
Cc: Luben Tuikov 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: Sumit Semwal 
Cc: "Christian König" 

---
Christian, that's the last bit remaining from [1] after your suggestion
to pass scheduled fences for those native-deps we have. It does feel
like setting the parent after signaling the fence is racy, but you might
have a good reason to do it in that order. If that's the case, could you
help us find a solution for the race exposed here?

[1]https://lore.kernel.org/dri-devel/20230612182530.6214c...@collabora.com/T/#t
---
 drivers/gpu/drm/scheduler/sched_fence.c | 40 +++--
 drivers/gpu/drm/scheduler/sched_main.c  |  3 +-
 include/drm/gpu_scheduler.h |  5 ++--
 3 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_fence.c 
b/drivers/gpu/drm/scheduler/sched_fence.c
index ef120475e7c6..06cedfe4b486 100644
--- a/drivers/gpu/drm/scheduler/sched_fence.c
+++ b/drivers/gpu/drm/scheduler/sched_fence.c
@@ -48,8 +48,32 @@ static void __exit drm_sched_fence_slab_fini(void)
kmem_cache_destroy(sched_fence_slab);
 }
 
-void drm_sched_fence_scheduled(struct drm_sched_fence *fence)
+static void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
+  struct dma_fence *fence)
 {
+   /*
+* smp_store_release() to ensure another thread racing us
+* in drm_sched_fence_set_deadline_finished() sees the
+* fence's parent set before test_bit()
+*/
+   smp_store_release(&s_fence->parent, dma_fence_get(fence));
+   if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
+&s_fence->finished.flags))
+   dma_fence_set_deadline(fence, s_fence->deadline);
+}
+
+void drm_sched_fence_scheduled(struct drm_sched_fence *fence,
+  struct dma_fence *parent)
+{
+   /* Set the parent before signaling the scheduled fence, such that,
+* any waiter expecting the parent to be filled after the job has
+* been scheduled (which is the case for drivers delegating waits
+* to some firmware) doesn't have to busy wait for parent to show
+* up.
+*/
+   if (!IS_ERR_OR_NULL(parent))
+   drm_sched_fence_set_parent(fence, parent);
+
dma_fence_signal(&fence->scheduled);
 }
 
@@ -181,20 +205,6 @@ struct drm_sched_fence *to_drm_sched_fence(struct 
dma_fence *f)
 }
 EXPORT_SYMBOL(to_drm_sched_fence);
 
-void drm_sched_fence_set_parent(struct drm_sched_fence *s_fence,
-   struct dma_fence *fence)
-{
-   /*
-* smp_store_release() to ensure another thread racing us
-* in drm_sched_fence_set_deadline_finished() sees the
-* fence's parent set before test_bit()
-*/
-   smp_store_release(&s_fence->parent, dma_fence_get(fence));
-   if (test_bit(DRM_SCHED_FENCE_FLAG_HAS_DEADLINE_BIT,
-&s_fence->finished.flags))
-   dma_fence_set_deadline(fence, s_fence->deadline);
-}
-
 struct drm_sched_fence *drm_sched_fence_alloc(struct drm_sched_entity *entity,
  void *owner)
 {
diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
b/drivers/gpu/drm/scheduler/sched_main.c
index 394010a60821..27097772ad6e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1043,10 +1043,9 @@ static int drm_sched_main(void *param)
trace_drm_run_job(sched_job, entity);
fence = sched->ops->run_job(sched_job);
complete_all(&entity->entity_idle);
-   drm_sched_fence_scheduled(s_fence);
+   drm_sched_fence_scheduled(s_fence, fence);
 
if (!IS_ERR_OR_NULL(fence)) {
-   drm_sched_fence_set_parent(s_fence, fence);
/* Drop for original kref_init of the fence */
dma_

Re: [PATCH] drm/sched: Call drm_sched_fence_set_parent() from drm_sched_fence_scheduled()

2023-06-13 Thread Boris Brezillon
On Tue, 13 Jun 2023 11:44:24 +0200
Boris Brezillon  wrote:

> Drivers that can delegate waits to the firmware/GPU pass the scheduled
> fence to drm_sched_job_add_dependency(), and issue wait commands to
> the firmware/GPU at job submission time. For this to be possible, they
> need all their 'native' dependencies to have a valid parent since this
> is where the actual HW fence information are encoded.
> 
> In drm_sched_main(), we currently call drm_sched_fence_set_parent()
> after drm_sched_fence_set_parent(), leaving a short period of time

after drm_sched_fence_scheduled(), ...

> during which the job depending on this fence can be submitted.
> 
> Since setting parent and signaling the fence are two things that are
> kinda related (you can't have a parent if the job hasn't been scheduled),
> it probably makes sense to pass the parent fence to
> drm_sched_fence_scheduled() and let it call drm_sched_fence_set_parent()
> before it signals the scheduled fence.


Re: [PATCH v3] drm/sched: Make sure we wait for all dependencies in kill_jobs_cb()

2023-06-13 Thread Boris Brezillon
On Tue, 13 Jun 2023 11:28:45 +0200
Boris Brezillon  wrote:

> We also make sure we wait on drm_sched_fence::finished if we were asked
> to wait on drm_sched_fence::scheduled, but the intent was probably to
> delegate the wait to the GPU, but we want resources to be completely
> idle when killing jobs.

Uh, I need to rephrase that part:

"
We also make sure we wait on drm_sched_fence::finished if we were
originally asked to wait on drm_sched_fence::scheduled. In that case,
we assume the intent was to delegate the wait to the firmware/GPU or
rely on the pipelining done at the entity/scheduler level, but when
killing jobs, we really want to wait for completion not just scheduling.
"


Re: [PATCH drm-misc-next v9 01/11] drm/gem: fix lockdep check for dma-resv lock

2023-08-08 Thread Boris Brezillon
On Thu,  3 Aug 2023 18:52:20 +0200
Danilo Krummrich  wrote:

> When no custom lock is set to protect a GEMs GPUVA list, lockdep checks
> should fall back to the GEM objects dma-resv lock. With the current
> implementation we're setting the lock_dep_map of the GEM objects 'resv'
> pointer (in case no custom lock_dep_map is set yet) on
> drm_gem_private_object_init().
> 
> However, the GEM objects 'resv' pointer might still change after
> drm_gem_private_object_init() is called, e.g. through
> ttm_bo_init_reserved(). This can result in the wrong lock being tracked.
> 
> To fix this, call dma_resv_held() directly from
> drm_gem_gpuva_assert_lock_held() and fall back to the GEMs lock_dep_map
> pointer only if an actual custom lock is set.
> 
> Fixes: e6303f323b1a ("drm: manager to keep track of GPUs VA mappings")
> Signed-off-by: Danilo Krummrich 

Reviewed-by: Boris Brezillon 

but I'm wondering if it wouldn't be a good thing to add a
drm_gem_set_resv() helper, so the core can control drm_gem_object::resv
re-assignments (block them if it's happening after the GEM has been
exposed to the outside world or update auxiliary data if it's happening
before that).
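
A rough sketch of such a helper, assuming a GEM is considered 'exposed'
once it has a handle (the name and the check are assumptions; the core
might want additional constraints):

static inline int drm_gem_set_resv(struct drm_gem_object *obj,
				   struct dma_resv *resv)
{
	/* Once the object is visible to userspace, other users may already
	 * have cached obj->resv, so refuse to re-assign it at that point.
	 */
	if (WARN_ON(obj->handle_count))
		return -EBUSY;

	obj->resv = resv;
	return 0;
}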

> ---
>  include/drm/drm_gem.h | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/include/drm/drm_gem.h b/include/drm/drm_gem.h
> index c0b13c43b459..bc9f6aa2f3fe 100644
> --- a/include/drm/drm_gem.h
> +++ b/include/drm/drm_gem.h
> @@ -551,15 +551,17 @@ int drm_gem_evict(struct drm_gem_object *obj);
>   * @lock: the lock used to protect the gpuva list. The locking primitive
>   * must contain a dep_map field.
>   *
> - * Call this if you're not proctecting access to the gpuva list
> - * with the dma-resv lock, otherwise, drm_gem_gpuva_init() takes care
> - * of initializing lock_dep_map for you.
> + * Call this if you're not proctecting access to the gpuva list with the
> + * dma-resv lock, but with a custom lock.
>   */
>  #define drm_gem_gpuva_set_lock(obj, lock) \
> - if (!(obj)->gpuva.lock_dep_map) \
> + if (!WARN((obj)->gpuva.lock_dep_map, \
> +   "GEM GPUVA lock should be set only once.")) \
>   (obj)->gpuva.lock_dep_map = &(lock)->dep_map
>  #define drm_gem_gpuva_assert_lock_held(obj) \
> - lockdep_assert(lock_is_held((obj)->gpuva.lock_dep_map))
> + lockdep_assert((obj)->gpuva.lock_dep_map ? \
> +lock_is_held((obj)->gpuva.lock_dep_map) : \
> +dma_resv_held((obj)->resv))
>  #else
>  #define drm_gem_gpuva_set_lock(obj, lock) do {} while (0)
>  #define drm_gem_gpuva_assert_lock_held(obj) do {} while (0)
> @@ -573,11 +575,12 @@ int drm_gem_evict(struct drm_gem_object *obj);
>   *
>   * Calling this function is only necessary for drivers intending to support 
> the
>   * &drm_driver_feature DRIVER_GEM_GPUVA.
> + *
> + * See also drm_gem_gpuva_set_lock().
>   */
>  static inline void drm_gem_gpuva_init(struct drm_gem_object *obj)
>  {
>   INIT_LIST_HEAD(&obj->gpuva.list);
> - drm_gem_gpuva_set_lock(obj, &obj->resv->lock.base);
>  }
>  
>  /**



[PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs

2023-08-09 Thread Boris Brezillon
Hello,

This is the second version of the kernel driver meant to support new Mali
GPUs which are delegating the scheduling to a firmware.

The RFC has been dropped as the major blocking points have been addressed
(request to use drm_sched, request to implement a VM_BIND-like ioctl,
request to use drm_gpuva_mgr for the VM logic, lack of PM/devfreq support).

This series is based on drm-misc-next and depends on some drm_sched [1]
and iommu [2] changes.

A branch containing all those dependencies is available here[3], and
here [4] is another one containing all the patches needed to have
a working GPU on rk3588 on top. The CSF firmware binary can be found
here[5].

The mesa branch used to test this new driver is available here [6].
It's still under development and it's just a gallium driver right now,
but we are working on that ;-).

Here is a non-exhaustive changelog; check each commit for a detailed
changelog.

v2:
- Rename the driver (pancsf -> panthor)
- Split the commit adding the driver to ease review
- Use drm_sched for dependency tracking/job submission
- Add a VM_BIND ioctl
- Add the concept of exclusive VM for BOs that are only ever mapped to a
  single VM
- Document the code and uAPI
- Add a DT binding doc

I tried to Cc anyone that was involved in any development of the code
I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
change. If I missed someone, please let me know.

Best Regards,

Boris

[1]https://lore.kernel.org/dri-devel/20230801205103.627779-1-matthew.br...@intel.com/T/#t
[2]https://lore.kernel.org/linux-iommu/20230809121744.2341454-1-boris.brezil...@collabora.com/T/#t
[3]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor
[4]https://gitlab.freedesktop.org/panfrost/linux/-/tree/panthor+rk3588-evb1
[5]https://gitlab.com/firefly-linux/external/libmali/-/raw/firefly/firmware/g610/mali_csffw.bin
[6]https://gitlab.freedesktop.org/panfrost/mesa/-/tree/v10+panthor

Boris Brezillon (14):
  drm/shmem-helper: Make pages_use_count an atomic_t
  drm/panthor: Add uAPI
  drm/panthor: Add GPU register definitions
  drm/panthor: Add the device logical block
  drm/panthor: Add the GPU logical block
  drm/panthor: Add GEM logical block
  drm/panthor: Add the devfreq logical block
  drm/panthor: Add the MMU/VM logical block
  drm/panthor: Add the FW logical block
  drm/panthor: Add the heap logical block
  drm/panthor: Add the scheduler logical block
  drm/panthor: Add the driver frontend block
  drm/panthor: Allow driver compilation
  drm/panthor: Add an entry to MAINTAINERS

Liviu Dudau (1):
  dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor
driver

 .../bindings/gpu/arm,mali-valhall-csf.yaml|  148 +
 Documentation/gpu/driver-uapi.rst |5 +
 MAINTAINERS   |8 +
 drivers/gpu/drm/Kconfig   |2 +
 drivers/gpu/drm/Makefile  |1 +
 drivers/gpu/drm/drm_gem_shmem_helper.c|   28 +-
 drivers/gpu/drm/lima/lima_gem.c   |2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c   |2 +-
 drivers/gpu/drm/panthor/Kconfig   |   16 +
 drivers/gpu/drm/panthor/Makefile  |   15 +
 drivers/gpu/drm/panthor/panthor_devfreq.c |  281 ++
 drivers/gpu/drm/panthor/panthor_devfreq.h |   25 +
 drivers/gpu/drm/panthor/panthor_device.c  |  479 +++
 drivers/gpu/drm/panthor/panthor_device.h  |  354 ++
 drivers/gpu/drm/panthor/panthor_drv.c | 1540 
 drivers/gpu/drm/panthor/panthor_fw.c  | 1417 +++
 drivers/gpu/drm/panthor/panthor_fw.h  |  505 +++
 drivers/gpu/drm/panthor/panthor_gem.c |  229 ++
 drivers/gpu/drm/panthor/panthor_gem.h |   96 +
 drivers/gpu/drm/panthor/panthor_gpu.c |  463 +++
 drivers/gpu/drm/panthor/panthor_gpu.h |   52 +
 drivers/gpu/drm/panthor/panthor_heap.c|  550 +++
 drivers/gpu/drm/panthor/panthor_heap.h|   36 +
 drivers/gpu/drm/panthor/panthor_mmu.c | 2611 +
 drivers/gpu/drm/panthor/panthor_mmu.h |   81 +
 drivers/gpu/drm/panthor/panthor_regs.h|  229 ++
 drivers/gpu/drm/panthor/panthor_sched.c   | 3272 +
 drivers/gpu/drm/panthor/panthor_sched.h   |   50 +
 include/drm/drm_gem_shmem_helper.h|2 +-
 include/uapi/drm/panthor_drm.h|  862 +
 30 files changed, 13345 insertions(+), 16 deletions(-)
 create mode 100644 
Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
 create mode 100644 drivers/gpu/drm/panthor/Kconfig
 create mode 100644 drivers/gpu/drm/panthor/Makefile
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.h
 create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c
 create mode 100644 drivers/gpu/drm/pant

[PATCH v2 01/15] drm/shmem-helper: Make pages_use_count an atomic_t

2023-08-09 Thread Boris Brezillon
This way we can grab a pages ref without acquiring the resv lock when
pages_use_count > 0. This is needed to implement asynchronous maps
with the drm_gpuva_mgr, where a map/unmap operation can trigger a
mapping split, requiring the new left/right regions to grab an
additional page ref to guarantee that the pages stay pinned when the
middle section is unmapped.

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/drm_gem_shmem_helper.c  | 28 +
 drivers/gpu/drm/lima/lima_gem.c |  2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c |  2 +-
 include/drm/drm_gem_shmem_helper.h  |  2 +-
 4 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c 
b/drivers/gpu/drm/drm_gem_shmem_helper.c
index a783d2245599..ca6938ea1b82 100644
--- a/drivers/gpu/drm/drm_gem_shmem_helper.c
+++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
@@ -155,7 +155,7 @@ void drm_gem_shmem_free(struct drm_gem_shmem_object *shmem)
if (shmem->pages)
drm_gem_shmem_put_pages(shmem);
 
-   drm_WARN_ON(obj->dev, shmem->pages_use_count);
+   drm_WARN_ON(obj->dev, atomic_read(&shmem->pages_use_count));
 
dma_resv_unlock(shmem->base.resv);
}
@@ -172,14 +172,14 @@ static int drm_gem_shmem_get_pages(struct 
drm_gem_shmem_object *shmem)
 
dma_resv_assert_held(shmem->base.resv);
 
-   if (shmem->pages_use_count++ > 0)
+   if (atomic_inc_return(&shmem->pages_use_count) > 1)
return 0;
 
pages = drm_gem_get_pages(obj);
if (IS_ERR(pages)) {
drm_dbg_kms(obj->dev, "Failed to get pages (%ld)\n",
PTR_ERR(pages));
-   shmem->pages_use_count = 0;
+   atomic_set(&shmem->pages_use_count, 0);
return PTR_ERR(pages);
}
 
@@ -210,10 +210,10 @@ void drm_gem_shmem_put_pages(struct drm_gem_shmem_object 
*shmem)
 
dma_resv_assert_held(shmem->base.resv);
 
-   if (drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
+   if (drm_WARN_ON_ONCE(obj->dev, !atomic_read(&shmem->pages_use_count)))
return;
 
-   if (--shmem->pages_use_count > 0)
+   if (atomic_dec_return(&shmem->pages_use_count) > 0)
return;
 
 #ifdef CONFIG_X86
@@ -263,6 +263,10 @@ int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem)
 
drm_WARN_ON(obj->dev, obj->import_attach);
 
+   /* If we are the first owner, we need to grab the lock. */
+   if (atomic_inc_not_zero(&shmem->pages_use_count))
+   return 0;
+
ret = dma_resv_lock_interruptible(shmem->base.resv, NULL);
if (ret)
return ret;
@@ -286,6 +290,10 @@ void drm_gem_shmem_unpin(struct drm_gem_shmem_object 
*shmem)
 
drm_WARN_ON(obj->dev, obj->import_attach);
 
+   /* If we are the last owner, we need to grab the lock. */
+   if (atomic_add_unless(&shmem->pages_use_count, -1, 1))
+   return;
+
dma_resv_lock(shmem->base.resv, NULL);
drm_gem_shmem_unpin_locked(shmem);
dma_resv_unlock(shmem->base.resv);
@@ -543,18 +551,12 @@ static void drm_gem_shmem_vm_open(struct vm_area_struct 
*vma)
 
drm_WARN_ON(obj->dev, obj->import_attach);
 
-   dma_resv_lock(shmem->base.resv, NULL);
-
/*
 * We should have already pinned the pages when the buffer was first
 * mmap'd, vm_open() just grabs an additional reference for the new
 * mm the vma is getting copied into (ie. on fork()).
 */
-   if (!drm_WARN_ON_ONCE(obj->dev, !shmem->pages_use_count))
-   shmem->pages_use_count++;
-
-   dma_resv_unlock(shmem->base.resv);
-
+   drm_WARN_ON_ONCE(obj->dev, atomic_inc_return(&shmem->pages_use_count) 
== 1);
drm_gem_vm_open(vma);
 }
 
@@ -632,7 +634,7 @@ void drm_gem_shmem_print_info(const struct 
drm_gem_shmem_object *shmem,
if (shmem->base.import_attach)
return;
 
-   drm_printf_indent(p, indent, "pages_use_count=%u\n", 
shmem->pages_use_count);
+   drm_printf_indent(p, indent, "pages_use_count=%u\n", 
atomic_read(&shmem->pages_use_count));
drm_printf_indent(p, indent, "vmap_use_count=%u\n", 
shmem->vmap_use_count);
drm_printf_indent(p, indent, "vaddr=%p\n", shmem->vaddr);
 }
diff --git a/drivers/gpu/drm/lima/lima_gem.c b/drivers/gpu/drm/lima/lima_gem.c
index 4f9736e5f929..0116518b1601 100644
--- a/drivers/gpu/drm/lima/lima_gem.c
+++ b/drivers/gpu/drm/lima/lima_gem.c
@@ -47,7 +47,7 @@ int lima_heap_alloc(struct lima_bo *bo, struct lima_vm *vm)
}
 
bo->base.pages = pages;
-   bo->base.pages_use_count = 1;
+   atomic_

[PATCH v2 03/15] drm/panthor: Add GPU register definitions

2023-08-09 Thread Boris Brezillon
Those are the registers directly accessible through the MMIO range.

FW registers are exposed in panthor_fw.h.
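
As an illustration of how these definitions are meant to be consumed,
here is a tiny standalone sketch decoding the GPU_MMU_FEATURES fields
defined below (the register value is made up, and the macros are simply
re-stated in plain C):

#include <stdint.h>
#include <stdio.h>

/* Re-statement of the GPU_MMU_FEATURES field extraction from the header
 * below; 0x2830 is a made-up example (48-bit VA, 40-bit PA).
 */
#define MMU_FEATURES_VA_BITS(x)         ((x) & 0xff)
#define MMU_FEATURES_PA_BITS(x)         (((x) >> 8) & 0xff)

int main(void)
{
        uint32_t mmu_features = 0x00002830;

        printf("VA bits: %u, PA bits: %u\n",
               MMU_FEATURES_VA_BITS(mmu_features),
               MMU_FEATURES_PA_BITS(mmu_features));
        return 0;
}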

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_regs.h | 229 +
 1 file changed, 229 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_regs.h

diff --git a/drivers/gpu/drm/panthor/panthor_regs.h 
b/drivers/gpu/drm/panthor/panthor_regs.h
new file mode 100644
index ..00e149cf9eab
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_regs.h
@@ -0,0 +1,229 @@
+/* SPDX-License-Identifier: GPL-2.0 or MIT */
+/* Copyright 2018 Marty E. Plummer  */
+/* Copyright 2019 Linaro, Ltd, Rob Herring  */
+/* Copyright 2023 Collabora ltd. */
+/*
+ * Register definitions based on mali_kbase_gpu_regmap.h and
+ * mali_kbase_gpu_regmap_csf.h
+ * (C) COPYRIGHT 2010-2022 ARM Limited. All rights reserved.
+ */
+#ifndef __PANTHOR_REGS_H__
+#define __PANTHOR_REGS_H__
+
+#define GPU_ID 0x00
+#define GPU_L2_FEATURES0x004
+#define GPU_TILER_FEATURES 0x00C
+#define GPU_MEM_FEATURES   0x010
+#define   GROUPS_L2_COHERENT   BIT(0)
+
+#define GPU_MMU_FEATURES   0x014
+#define  GPU_MMU_FEATURES_VA_BITS(x)   ((x) & GENMASK(7, 0))
+#define  GPU_MMU_FEATURES_PA_BITS(x)   (((x) >> 8) & 
GENMASK(7, 0))
+#define GPU_AS_PRESENT 0x018
+#define GPU_CSF_ID 0x01C
+
+#define GPU_INT_RAWSTAT0x20
+#define GPU_INT_CLEAR  0x24
+#define GPU_INT_MASK   0x28
+#define GPU_INT_STAT   0x2c
+#define   GPU_IRQ_FAULTBIT(0)
+#define   GPU_IRQ_PROTM_FAULT  BIT(1)
+#define   GPU_IRQ_RESET_COMPLETED  BIT(8)
+#define   GPU_IRQ_POWER_CHANGEDBIT(9)
+#define   GPU_IRQ_POWER_CHANGED_ALLBIT(10)
+#define   GPU_IRQ_CLEAN_CACHES_COMPLETED   BIT(17)
+#define   GPU_IRQ_DOORBELL_MIRROR  BIT(18)
+#define   GPU_IRQ_MCU_STATUS_CHANGED   BIT(19)
+#define GPU_CMD0x30
+#define   GPU_CMD_DEF(type, payload)   ((type) | ((payload) << 
8))
+#define   GPU_SOFT_RESET   GPU_CMD_DEF(1, 1)
+#define   GPU_HARD_RESET   GPU_CMD_DEF(1, 2)
+#define   CACHE_CLEAN  BIT(0)
+#define   CACHE_INVBIT(1)
+#define   GPU_FLUSH_CACHES(l2, lsc, oth)   \
+ GPU_CMD_DEF(4, ((l2) << 0) | ((lsc) << 4) | ((oth) << 8))
+
+#define GPU_STATUS 0x34
+#define   GPU_STATUS_ACTIVEBIT(0)
+#define   GPU_STATUS_PWR_ACTIVEBIT(1)
+#define   GPU_STATUS_PAGE_FAULTBIT(4)
+#define   GPU_STATUS_PROTM_ACTIVE  BIT(7)
+#define   GPU_STATUS_DBG_ENABLED   BIT(8)
+
+#define GPU_FAULT_STATUS   0x3C
+#define GPU_FAULT_ADDR_LO  0x40
+#define GPU_FAULT_ADDR_HI  0x44
+
+#define GPU_PWR_KEY0x50
+#define  GPU_PWR_KEY_UNLOCK0x2968A819
+#define GPU_PWR_OVERRIDE0  0x54
+#define GPU_PWR_OVERRIDE1  0x58
+
+#define GPU_TIMESTAMP_OFFSET_LO0x88
+#define GPU_TIMESTAMP_OFFSET_HI0x8C
+#define GPU_CYCLE_COUNT_LO 0x90
+#define GPU_CYCLE_COUNT_HI 0x94
+#define GPU_TIMESTAMP_LO   0x98
+#define GPU_TIMESTAMP_HI   0x9C
+
+#define GPU_THREAD_MAX_THREADS 0xA0
+#define GPU_THREAD_MAX_WORKGROUP_SIZE  0xA4
+#define GPU_THREAD_MAX_BARRIER_SIZE0xA8
+#define GPU_THREAD_FEATURES0xAC
+
+#define GPU_TEXTURE_FEATURES(n)(0xB0 + ((n) * 
4))
+
+#define GPU_SHADER_PRESENT_LO  0x100
+#define GPU_SHADER_PRESENT_HI  0x104
+#define GPU_TILER_PRESENT_LO   0x110
+#define GPU_TILER_PRESENT_HI   0x114
+#define GPU_L2_PRESENT_LO  0x120
+#define GPU_L2_PRESENT_HI  

[PATCH v2 07/15] drm/panthor: Add the devfreq logical block

2023-08-09 Thread Boris Brezillon
Everything related to devfreq is placed in panthor_devfreq.c, and
helpers that can be called by other logical blocks are exposed through
panthor_devfreq.h.

This implementation is loosely based on the panfrost implementation,
the only difference being that we don't count device users, because
the idle/active state will be managed by the scheduler logic.
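
The busy/idle accounting done in this file is simple enough to be
summarized with a small userspace model (ktime and the spinlock are
replaced by clock_gettime(); this mirrors the panthor_devfreq logic but
is not the kernel code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Time since the last update is credited to either the busy or the idle
 * bucket, depending on the state the GPU was in.
 */
struct utilization {
        uint64_t busy_ns;
        uint64_t idle_ns;
        uint64_t last_update_ns;
        bool busy;
};

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void update(struct utilization *u)
{
        uint64_t now = now_ns();

        if (u->busy)
                u->busy_ns += now - u->last_update_ns;
        else
                u->idle_ns += now - u->last_update_ns;
        u->last_update_ns = now;
}

/* Called by the scheduler when it starts/stops feeding the GPU. */
static void set_busy(struct utilization *u, bool busy)
{
        update(u);
        u->busy = busy;
}

/* Equivalent of get_dev_status(): report the busy percentage and reset. */
static unsigned int busy_percent(struct utilization *u)
{
        uint64_t total;
        unsigned int pct;

        update(u);
        total = u->busy_ns + u->idle_ns;
        pct = total ? (unsigned int)(u->busy_ns * 100 / total) : 0;
        u->busy_ns = u->idle_ns = 0;
        return pct;
}

int main(void)
{
        struct utilization u = { .last_update_ns = now_ns() };

        set_busy(&u, true);
        /* ... jobs execute ... */
        set_busy(&u, false);
        printf("GPU busy: %u%%\n", busy_percent(&u));
        return 0;
}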

v2:
- Added in v2

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_devfreq.c | 281 ++
 drivers/gpu/drm/panthor/panthor_devfreq.h |  25 ++
 2 files changed, 306 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_devfreq.h

diff --git a/drivers/gpu/drm/panthor/panthor_devfreq.c 
b/drivers/gpu/drm/panthor/panthor_devfreq.c
new file mode 100644
index ..500ce342
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_devfreq.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2019 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "panthor_device.h"
+#include "panthor_devfreq.h"
+
+/**
+ * struct panthor_devfreq - Device frequency management
+ */
+struct panthor_devfreq {
+   /** @devfreq: devfreq device. */
+   struct devfreq *devfreq;
+
+   /** @gov_data: Governor data. */
+   struct devfreq_simple_ondemand_data gov_data;
+
+   /** @busy_time: Busy time. */
+   ktime_t busy_time;
+
+   /** @idle_time: Idle time. */
+   ktime_t idle_time;
+
+   /** @time_last_update: Last update time. */
+   ktime_t time_last_update;
+
+   /** @last_busy_state: True if the GPU was busy last time we updated the 
state. */
+   bool last_busy_state;
+
+   /*
+* Protect busy_time, idle_time, time_last_update and last_busy_state
+* because these can be accessed concurrently by 
panthor_devfreq_get_dev_status()
+* and panthor_devfreq_record_{busy,idle}().
+*/
+   spinlock_t lock;
+};
+
+static void panthor_devfreq_update_utilization(struct panthor_devfreq 
*pdevfreq)
+{
+   ktime_t now, last;
+
+   now = ktime_get();
+   last = pdevfreq->time_last_update;
+
+   if (pdevfreq->last_busy_state)
+   pdevfreq->busy_time += ktime_sub(now, last);
+   else
+   pdevfreq->idle_time += ktime_sub(now, last);
+
+   pdevfreq->time_last_update = now;
+}
+
+static int panthor_devfreq_target(struct device *dev, unsigned long *freq,
+ u32 flags)
+{
+   struct dev_pm_opp *opp;
+
+   opp = devfreq_recommended_opp(dev, freq, flags);
+   if (IS_ERR(opp))
+   return PTR_ERR(opp);
+   dev_pm_opp_put(opp);
+
+   return dev_pm_opp_set_rate(dev, *freq);
+}
+
+static void panthor_devfreq_reset(struct panthor_devfreq *pdevfreq)
+{
+   pdevfreq->busy_time = 0;
+   pdevfreq->idle_time = 0;
+   pdevfreq->time_last_update = ktime_get();
+}
+
+static int panthor_devfreq_get_dev_status(struct device *dev,
+ struct devfreq_dev_status *status)
+{
+   struct panthor_device *ptdev = dev_get_drvdata(dev);
+   struct panthor_devfreq *pdevfreq = ptdev->devfreq;
+   unsigned long irqflags;
+
+   status->current_frequency = clk_get_rate(ptdev->clks.core);
+
+   spin_lock_irqsave(&pdevfreq->lock, irqflags);
+
+   panthor_devfreq_update_utilization(pdevfreq);
+
+   status->total_time = ktime_to_ns(ktime_add(pdevfreq->busy_time,
+  pdevfreq->idle_time));
+
+   status->busy_time = ktime_to_ns(pdevfreq->busy_time);
+
+   panthor_devfreq_reset(pdevfreq);
+
+   spin_unlock_irqrestore(&pdevfreq->lock, irqflags);
+
+   drm_dbg(&ptdev->base, "busy %lu total %lu %lu %% freq %lu MHz\n",
+   status->busy_time, status->total_time,
+   status->busy_time / (status->total_time / 100),
+   status->current_frequency / 1000 / 1000);
+
+   return 0;
+}
+
+static struct devfreq_dev_profile panthor_devfreq_profile = {
+   .timer = DEVFREQ_TIMER_DELAYED,
+   .polling_ms = 50, /* ~3 frames */
+   .target = panthor_devfreq_target,
+   .get_dev_status = panthor_devfreq_get_dev_status,
+};
+
+int panthor_devfreq_init(struct panthor_device *ptdev)
+{
+   /* There are actually two regulators (mali and sram), but the OPP core only
+* supports one.
+*
+* We assume the sram regulator is coupled with the mali one and let
+* the coupling logic deal with voltage updates.
+*/
+   static const char *reg_names[] = { "mali", NULL };
+   struct thermal_cooling_device *cooling;
+   struct device *dev = ptdev->base.dev;
+   struct panthor_devfreq *pdevfreq;
+   struct

[PATCH v2 08/15] drm/panthor: Add the MMU/VM logical block

2023-08-09 Thread Boris Brezillon
MMU and VM management are closely related, and are therefore placed in the
same source file.

Page table updates are delegated to the io-pgtable-arm driver that's in
the iommu subsystem.

The VM management logic is based on drm_gpuva_mgr, and assumes the
VA space is mostly managed by the usermode driver, except for a reserved
portion of this VA-space that's used for kernel objects (like the heap
contexts/chunks).

Both asynchronous and synchronous VM operations are supported, and
internal helpers are exposed to allow other logical blocks to map their
buffers in the GPU VA space.

There's one VM_BIND queue per-VM (meaning the Vulkan driver can only
expose one sparse-binding queue), and this bind queue is managed with
a 1:1 drm_sched_entity:drm_gpu_scheduler, such that each VM gets its own
independent execution queue, avoiding VM operation serialization at the
device level (things are still serialized at the VM level).

The rest is just implementation details that are hopefully well explained
in the documentation.
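
One part of this patch that benefits from a simplified view is the
address space (AS) slot management described in the code below: VMs keep
their slot until another VM needs one and no slot is free, at which point
the least recently used VM is evicted. The toy model below illustrates
only that policy (no locking, no TLB/cache maintenance, names made up):

#include <stdio.h>

#define NUM_SLOTS 4

struct vm {
        int id;
        int as_slot;            /* -1 if not bound to any slot */
        unsigned long last_use; /* pseudo-timestamp for LRU ordering */
};

static struct vm *slots[NUM_SLOTS];
static unsigned long tick;

static int vm_bind_slot(struct vm *vm)
{
        int free = -1, lru = 0;

        vm->last_use = ++tick;
        if (vm->as_slot >= 0)
                return vm->as_slot;     /* already bound, nothing to do */

        for (int i = 0; i < NUM_SLOTS; i++) {
                if (!slots[i]) {
                        free = i;
                        break;
                }
                if (slots[i]->last_use < slots[lru]->last_use)
                        lru = i;
        }

        if (free < 0) {
                /* All slots used: evict the least recently used VM.
                 * The real driver would flush TLBs/caches here.
                 */
                printf("evicting vm%d from slot %d\n", slots[lru]->id, lru);
                slots[lru]->as_slot = -1;
                free = lru;
        }

        slots[free] = vm;
        vm->as_slot = free;
        return free;
}

int main(void)
{
        struct vm vms[6];

        for (int i = 0; i < 6; i++) {
                vms[i] = (struct vm){ .id = i, .as_slot = -1 };
                printf("vm%d -> slot %d\n", i, vm_bind_slot(&vms[i]));
        }
        return 0;
}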

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_gpuva_mgr
- Replace VM_MAP/UNMAP by VM_BIND
- Add support for asynchronous VM_BIND (VM_BIND queue implemented with
  drm_sched)
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Use the panthor_irq layer to manage/process IRQs

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_mmu.c | 2611 +
 drivers/gpu/drm/panthor/panthor_mmu.h |   81 +
 2 files changed, 2692 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_mmu.h

diff --git a/drivers/gpu/drm/panthor/panthor_mmu.c 
b/drivers/gpu/drm/panthor/panthor_mmu.c
new file mode 100644
index ..3ba784473023
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_mmu.c
@@ -0,0 +1,2611 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2019 Linaro, Ltd, Rob Herring  */
+/* Copyright 2023 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "panthor_device.h"
+#include "panthor_heap.h"
+#include "panthor_mmu.h"
+#include "panthor_sched.h"
+#include "panthor_gem.h"
+#include "panthor_regs.h"
+
+#define MAX_AS_SLOTS   32
+
+struct panthor_vm;
+
+/**
+ * struct panthor_as_slot - Address space slot
+ */
+struct panthor_as_slot {
+   /** @vm: VM bound to this slot. NULL if no VM is bound. */
+   struct panthor_vm *vm;
+
+   /** @lock: Lock used to serialize access to the AS registers. */
+   spinlock_t lock;
+};
+
+/**
+ * struct panthor_mmu - MMU related data
+ */
+struct panthor_mmu {
+   /** @irq: The MMU irq. */
+   struct panthor_irq irq;
+
+   /** @as: Address space related fields.
+*
+* The GPU has a limited number of address space (AS) slots, forcing
+* us to re-assign them on demand.
+*/
+   struct {
+   /** @slots_lock: Lock protecting access to all other AS fields. 
*/
+   struct mutex slots_lock;
+
+   /** @alloc_mask: Bitmask encoding the allocated slots. */
+   unsigned long alloc_mask;
+
+   /** @faulty_mask: Bitmask encoding the faulty slots. */
+   unsigned long faulty_mask;
+
+   /** @slots: VMs currently bound to the AS slots. */
+   struct panthor_as_slot slots[MAX_AS_SLOTS];
+
+   /**
+* @lru_list: List of least recently used VMs.
+*
+* We use this list to pick a VM to evict when all slots are
+* used.
+*
+* There should be no more active VMs than there are AS slots,
+* so this LRU is just here to keep VMs bound until there's
+* a need to release a slot, thus avoiding unnecessary TLB/cache
+* flushes.
+*/
+   struct list_head lru_list;
+   } as;
+
+   /** @vm: VMs management fields */
+   struct {
+   /** @lock: Lock protecting access to list. */
+   struct mutex lock;
+
+   /** @list: List containing all VMs. */
+   struct list_head list;
+
+   /** @reset_in_progress: True if a reset is in progress. */
+   bool reset_in_progress;
+
+   /** @wq: Workqueue used for the VM_BIND queues. */
+   struct workqueue_struct *wq;
+   } vm;
+};
+
+/**
+ * struct panthor_vm_pool - VM pool object
+ */
+struct panthor_vm_pool {
+   /** @xa: Array used for VM handle tracking. */
+   st

[PATCH v2 02/15] drm/panthor: Add uAPI

2023-08-09 Thread Boris Brezillon
Panthor follows the lead of other recently submitted drivers with
ioctls allowing us to support modern Vulkan features, like sparse memory
binding:

- Pretty standard GEM management ioctls (BO_CREATE and BO_MMAP_OFFSET),
  with the 'exclusive-VM' bit to speed-up BO reservation on job submission
- VM management ioctls (VM_CREATE, VM_DESTROY and VM_BIND). The VM_BIND
  ioctl is loosely based on the Xe model, and can handle both
  asynchronous and synchronous requests
- GPU execution context creation/destruction, tiler heap context creation
  and job submission. Those ioctls reflect how the hardware/scheduler
  works and are thus driver specific.

We also have a way to expose IO regions, such that the usermode driver
can directly access specific/well-isolated registers, like the
LATEST_FLUSH register used to implement cache-flush reduction.

This uAPI intentionally keeps usermode queues out of scope, which
explains why doorbell registers and command stream ring-buffers are not
directly exposed to userspace.
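
The extensibility rules listed in the uAPI header below (fields unknown
to userspace read as zero, extra bytes provided by a newer userspace are
only accepted when zero) follow the same logic as the kernel's
copy_struct_from_user(). Here is a small userspace model of that rule;
the helper and the args_v1/args_v2 structures are made up for
illustration, they are not part of the uAPI:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static int copy_versioned_struct(void *kern, size_t kern_size,
                                 const void *usr, size_t usr_size)
{
        size_t copy = usr_size < kern_size ? usr_size : kern_size;

        memset(kern, 0, kern_size);     /* fields unknown to userspace read as 0 */
        memcpy(kern, usr, copy);

        /* Fields unknown to the kernel must be zero, otherwise reject. */
        for (size_t i = kern_size; i < usr_size; i++) {
                if (((const uint8_t *)usr)[i] != 0)
                        return -1;
        }
        return 0;
}

/* v1 of a hypothetical ioctl argument, and a v2 that grew a field. */
struct args_v1 { uint32_t handle; uint32_t pad; };
struct args_v2 { uint32_t handle; uint32_t pad; uint64_t flags; };

int main(void)
{
        struct args_v2 usr = { .handle = 42, .flags = 0 };
        struct args_v1 kern;

        /* Old kernel (v1) parsing a newer userspace (v2) structure. */
        if (!copy_versioned_struct(&kern, sizeof(kern), &usr, sizeof(usr)))
                printf("handle=%u\n", kern.handle);
        return 0;
}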

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Turn the VM_{MAP,UNMAP} ioctls into a VM_BIND ioctl
- Add the concept of exclusive_vm at BO creation time
- Add missing padding fields
- Add documentation

Signed-off-by: Boris Brezillon 
---
 Documentation/gpu/driver-uapi.rst |   5 +
 include/uapi/drm/panthor_drm.h| 862 ++
 2 files changed, 867 insertions(+)
 create mode 100644 include/uapi/drm/panthor_drm.h

diff --git a/Documentation/gpu/driver-uapi.rst 
b/Documentation/gpu/driver-uapi.rst
index c08bcbb95fb3..7a667901830f 100644
--- a/Documentation/gpu/driver-uapi.rst
+++ b/Documentation/gpu/driver-uapi.rst
@@ -17,3 +17,8 @@ VM_BIND / EXEC uAPI
 :doc: Overview
 
 .. kernel-doc:: include/uapi/drm/nouveau_drm.h
+
+drm/panthor uAPI
+
+
+.. kernel-doc:: include/uapi/drm/panthor_drm.h
diff --git a/include/uapi/drm/panthor_drm.h b/include/uapi/drm/panthor_drm.h
new file mode 100644
index ..e217eb5ad198
--- /dev/null
+++ b/include/uapi/drm/panthor_drm.h
@@ -0,0 +1,862 @@
+/* SPDX-License-Identifier: MIT */
+/* Copyright (C) 2023 Collabora ltd. */
+#ifndef _PANTHOR_DRM_H_
+#define _PANTHOR_DRM_H_
+
+#include "drm.h"
+
+#if defined(__cplusplus)
+extern "C" {
+#endif
+
+/**
+ * DOC: Introduction
+ *
+ * This documentation describes the Panthor IOCTLs.
+ *
+ * Just a few generic rules about the data passed to the Panthor IOCTLs:
+ *
+ * - Structures must be aligned on 64-bit/8-byte. If the object is not
+ *   naturally aligned, a padding field must be added.
+ * - Fields must be explicitly aligned to their natural type alignment with
+ *   pad[0..N] fields.
+ * - All padding fields will be checked by the driver to make sure they are
+ *   zeroed.
+ * - Flags can be added, but not removed/replaced.
+ * - New fields can be added to the main structures (the structures
+ *   directly passed to the ioctl). Those fields can be added at the end of
+ *   the structure, or replace existing padding fields. Any new field being
+ *   added must preserve the behavior that existed before those fields were
+ *   added when a value of zero is passed.
+ * - New fields can be added to indirect objects (objects pointed by the
+ *   main structure), iff those objects are passed a size to reflect the
+ *   size known by the userspace driver (see drm_panthor_obj_array::stride
+ *   or drm_panthor_dev_query::size).
+ * - If the kernel driver is too old to know some fields, those will
+ *   be ignored (input) and set back to zero (output).
+ * - If userspace is too old to know some fields, those will be zeroed
+ *   (input) before the structure is parsed by the kernel driver.
+ * - Each new flag/field addition must come with a driver version update so
+ *   the userspace driver doesn't have to trial and error to know which
+ *   flags are supported.
+ * - Structures should not contain unions, as this would defeat the
+ *   extensibility of such structures.
+ * - IOCTLs can't be removed or replaced. New IOCTL IDs should be placed
+ *   at the end of the drm_panthor_ioctl_id enum.
+ */
+
+/**
+ * DOC: MMIO regions exposed to userspace.
+ *
+ * .. c:macro:: DRM_PANTHOR_USER_MMIO_OFFSET
+ *
+ * File offset for all MMIO regions being exposed to userspace. Don't use
+ * this value directly, use DRM_PANTHOR_USER__OFFSET values instead.
+ *
+ * .. c:macro:: DRM_PANTHOR_USER_FLUSH_ID_MMIO_OFFSET
+ *
+ * File offset for the LATEST_FLUSH_ID register. The Userspace driver controls
+ * GPU cache flushing through CS instructions, but the flush reduction
+ * mechanism requires a flush_id. This flush_id could be queried with an
+ * ioctl, but Arm provides a well-isolated register page containing only this
+ * read-only register, so let's expose this page through a static mmap offset
+ * and allow direct mapping of this MMIO region so we can

[PATCH v2 04/15] drm/panthor: Add the device logical block

2023-08-09 Thread Boris Brezillon
The panthor driver is designed in a modular way, where each logical
block is dealing with a specific HW-block or software feature. In order
for those blocks to communicate with each other, we need a central
panthor_device collecting all the blocks, and exposing some common
features, like interrupt handling, power management, reset, ...

This is what the panthor_device logical block is about.
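
The safe device removal handling (see the v2 changelog below) relies on
drm_dev_unplug()/drm_dev_enter()/drm_dev_exit(): once the device is
unplugged, drm_dev_enter() fails and the HW access is skipped. A minimal
sketch of the pattern the logical blocks use, where panthor_do_something()
is a made-up name:

#include <drm/drm_drv.h>

#include "panthor_device.h"

static int panthor_do_something(struct panthor_device *ptdev)
{
        int cookie, ret = -ENODEV;

        if (drm_dev_enter(&ptdev->base, &cookie)) {
                /* ... safe to touch registers here ... */
                ret = 0;
                drm_dev_exit(cookie);
        }

        return ret;
}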

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Add devfreq/PM support
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_device.c | 479 +++
 drivers/gpu/drm/panthor/panthor_device.h | 354 +
 2 files changed, 833 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_device.h

diff --git a/drivers/gpu/drm/panthor/panthor_device.c 
b/drivers/gpu/drm/panthor/panthor_device.c
new file mode 100644
index ..15f102116fa0
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_device.c
@@ -0,0 +1,479 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2018 Marty E. Plummer  */
+/* Copyright 2019 Linaro, Ltd, Rob Herring  */
+/* Copyright 2023 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "panthor_sched.h"
+#include "panthor_device.h"
+#include "panthor_devfreq.h"
+#include "panthor_gpu.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+#include "panthor_regs.h"
+
+static int panthor_clk_init(struct panthor_device *ptdev)
+{
+   ptdev->clks.core = devm_clk_get(ptdev->base.dev, NULL);
+   if (IS_ERR(ptdev->clks.core)) {
+   drm_err(&ptdev->base, "get 'core' clock failed %ld\n",
+   PTR_ERR(ptdev->clks.core));
+   return PTR_ERR(ptdev->clks.core);
+   }
+
+   ptdev->clks.stacks = devm_clk_get_optional(ptdev->base.dev, "stacks");
+   if (IS_ERR(ptdev->clks.stacks)) {
+   drm_err(&ptdev->base, "get 'stacks' clock failed %ld\n",
+   PTR_ERR(ptdev->clks.stacks));
+   return PTR_ERR(ptdev->clks.stacks);
+   }
+
+   ptdev->clks.coregroup = devm_clk_get_optional(ptdev->base.dev, 
"coregroup");
+   if (IS_ERR(ptdev->clks.coregroup)) {
+   drm_err(&ptdev->base, "get 'coregroup' clock failed %ld\n",
+   PTR_ERR(ptdev->clks.coregroup));
+   return PTR_ERR(ptdev->clks.coregroup);
+   }
+
+   drm_info(&ptdev->base, "clock rate = %lu\n", 
clk_get_rate(ptdev->clks.core));
+   return 0;
+}
+
+void panthor_device_unplug(struct panthor_device *ptdev)
+{
+   /* FIXME: This is racy. */
+   if (drm_dev_is_unplugged(&ptdev->base))
+   return;
+
+   drm_WARN_ON(&ptdev->base, pm_runtime_get_sync(ptdev->base.dev) < 0);
+
+   /* Call drm_dev_unplug() so any access to HW blocks happening after
+* that point gets rejected.
+*/
+   drm_dev_unplug(&ptdev->base);
+
+   /* Now, try to cleanly shutdown the GPU before the device resources
+* get reclaimed.
+*/
+   panthor_sched_unplug(ptdev);
+   panthor_fw_unplug(ptdev);
+   panthor_mmu_unplug(ptdev);
+   panthor_gpu_unplug(ptdev);
+
+   pm_runtime_dont_use_autosuspend(ptdev->base.dev);
+   pm_runtime_put_sync_suspend(ptdev->base.dev);
+}
+
+static void panthor_device_reset_cleanup(struct drm_device *ddev, void *data)
+{
+   struct panthor_device *ptdev = container_of(ddev, struct 
panthor_device, base);
+
+   cancel_work_sync(&ptdev->reset.work);
+   destroy_workqueue(ptdev->reset.wq);
+}
+
+static void panthor_device_reset_work(struct work_struct *work)
+{
+   struct panthor_device *ptdev = container_of(work, struct 
panthor_device, reset.work);
+   int ret, cookie;
+
+   if (!drm_dev_enter(&ptdev->base, &cookie))
+   return;
+
+   panthor_sched_pre_reset(ptdev);
+   panthor_fw_pre_reset(ptdev, true);
+   panthor_mmu_pre_reset(ptdev);
+   panthor_gpu_soft_reset(ptdev);
+   panthor_gpu_l2_power_on(ptdev);
+   panthor_mmu_post_reset(ptdev);
+   ret = panthor_fw_post_reset(ptdev);
+   if (ret)
+   goto out;
+
+   atomic_set(&ptdev->reset.pending, 0);
+   panthor_sched_post_reset(ptdev);
+   drm_dev_exit(cookie);
+
+out:
+   if (ret) {
+   panthor_device_unplug(ptdev);
+   drm_err(&ptdev->base, "Failed to boot MCU after reset, making 
device unusable.");
+ 

[PATCH v2 10/15] drm/panthor: Add the heap logical block

2023-08-09 Thread Boris Brezillon
Tiler heap growing requires some kernel driver involvement: when the
tiler runs out of heap memory, it will raise an exception which is
either directly handled by the firmware if some free heap chunks are
available in the heap context, or passed back to the kernel otherwise.
The heap helpers will be used by the scheduler logic to allocate more
heap chunks to a heap context, when such a situation happens.

Heap context creation is explicitly requested by userspace (using
the TILER_HEAP_CREATE ioctl), and the returned context is attached to a
queue through some command stream instruction.

All the kernel does is keep the list of heap chunks allocated to a
context, so they can be freed when TILER_HEAP_DESTROY is called, or
extended when the FW requests a new chunk.
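
When the FW reports a TILER_OOM event, the decision the kernel has to
make boils down to the sketch below. This is a toy model based on the
counters documented in this patch (in-flight passes are roughly
vt_started_count - frag_completed_count); the enum and function names are
made up:

#include <stdint.h>
#include <stdio.h>

struct heap_ctx {
        uint32_t chunk_count;
        uint32_t max_chunks;
        uint32_t vt_started_count;
        uint32_t frag_completed_count;
};

enum oom_action { GROW_HEAP, WAIT_FOR_FRAGMENT_JOBS, ABORT_RENDER_PASS };

static enum oom_action handle_tiler_oom(const struct heap_ctx *ctx)
{
        uint32_t in_flight = ctx->vt_started_count - ctx->frag_completed_count;

        /* Below the chunk limit: just allocate a new chunk. */
        if (ctx->chunk_count < ctx->max_chunks)
                return GROW_HEAP;

        /* Chunk limit reached: chunks can only come back from fragment jobs. */
        return in_flight > 0 ? WAIT_FOR_FRAGMENT_JOBS : ABORT_RENDER_PASS;
}

int main(void)
{
        struct heap_ctx ctx = {
                .chunk_count = 4, .max_chunks = 4,
                .vt_started_count = 3, .frag_completed_count = 1,
        };

        printf("action: %d\n", handle_tiler_oom(&ctx));
        return 0;
}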

v2:
- Rename the driver (pancsf -> panthor)
- Split the driver addition commit
- Document the code
- Fix various bugs

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_heap.c | 550 +
 drivers/gpu/drm/panthor/panthor_heap.h |  36 ++
 2 files changed, 586 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_heap.h

diff --git a/drivers/gpu/drm/panthor/panthor_heap.c 
b/drivers/gpu/drm/panthor/panthor_heap.c
new file mode 100644
index ..39244efc2eaa
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_heap.c
@@ -0,0 +1,550 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora ltd. */
+
+#include 
+#include 
+
+#include 
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_heap.h"
+#include "panthor_mmu.h"
+
+/**
+ * struct panthor_heap_gpu_ctx - Heap context used by the GPU/FW.
+ */
+struct panthor_heap_gpu_ctx {
+   /**
+* @first_heap_chunk: GPU VA of the first free heap chunk.
+*
+* This forms a singly-linked list, where each chunk points to the
+* next free chunk, and the last element points to NULL.
+*
+* Heap chunks get freed and returned to the heap context when fragment
+* jobs picking data from those heap chunks complete. When this happens
+* this field is updated to insert the heap chunks that were freed.
+*
+* When the tiler runs out of memory, it will first check if there
+* are free heap chunks in the heap context, and pick those if there 
are.
+*
+* When there are no free heap chunks left, the FW will raise a TILER_OOM
+* interrupt, letting the kernel driver allocate more heap chunks.
+*
+* If the heap context reached its heap chunk limit, the FW will wait
+* for fragment jobs to consume some data and return chunks to the
+* context.
+*
+* As a last resort, if there are no in-flight fragment jobs, the FW
+* will try to execute the exception handler set on the command stream.
+* This exception handler is expected to issue a fragment job to store
+* the partial rendering results and free up some heap chunks.
+*/
+   u64 first_heap_chunk;
+
+   /** @unused1: MBZ. */
+   u32 unused1[2];
+
+   /**
+* @vt_started_count: Number of vertex/tiling operations started.
+*
+* This marks the beginning of a render pass, and is explicitly
+* flagged with a HEAP_OPERATION.vt_start instruction. If the render 
pass
+* contains multiple vertex/tiler/IDVS jobs, this 
HEAP_OPERATION.vt_start
+* is only called once.
+*/
+   u32 vt_started_count;
+
+   /**
+* @vt_completed_count: Number of completed vertex/tiler jobs.
+*
+* This marks the end of the geometry processing part of a render
+* pass, and is explicitly flagged by the user command stream with
+* a HEAP_OPERATION.vt_completed instruction. If the render pass 
contains
+* multiple vertex/tiler/IDVS jobs, this HEAP_OPERATION.vt_end
+* instruction is only issued once.
+*/
+   u32 vt_completed_count;
+
+   /** @unused2: MBZ. */
+   u32 unused2;
+
+   /**
+* @frag_completed_count: Number of completed fragment jobs.
+*
+* @vt_started_count - @frag_completed_count is the number of in-flight
+* render targets that's used by the driver to determine if it's worth
+* allocating a new chunk or if we should instead wait for fragment jobs
+* to complete.
+*
+* Fragment completion is explicitly flagged by the user command stream
+* with a HEAP_OPERATION.frag_end or FINISH_FRAGMENT.frag_end 
instruction.
+*/
+   u32 frag_completed_count;
+};
+
+/**
+ * struct panthor_heap_chunk_header - Heap chunk header
+ */
+struct panthor_heap_chunk_header {
+   /**
+* @next: Next heap chunk in the list.
+*
+* This is a GPU VA.
+*/
+   u64 next

[PATCH v2 05/15] drm/panthor: Add the GPU logical block

2023-08-09 Thread Boris Brezillon
Handles everything that's not related to the FW, the MMU or the
scheduler. This is the block dealing with the GPU property retrieval,
the GPU block power on/off logic, and some global operations, like
global cache flushing.
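
As an example of the property retrieval done here, the GPU_ID register is
decoded into a version major/minor/status triplet with simple shifts, as
in the following standalone restatement of the logic found in
panthor_gpu_init_info() below (the register value is a made-up example):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t gpu_id = 0xa8672103;   /* hypothetical GPU_ID readout */

        uint32_t major  = (gpu_id >> 12) & 0xf;
        uint32_t minor  = (gpu_id >> 4) & 0xff;
        uint32_t status = gpu_id & 0xf;

        printf("version major %u, minor %u, status %u\n", major, minor, status);
        return 0;
}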

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Use the panthor_irq layer to manage/process IRQs

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_gpu.c | 463 ++
 drivers/gpu/drm/panthor/panthor_gpu.h |  52 +++
 2 files changed, 515 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gpu.h

diff --git a/drivers/gpu/drm/panthor/panthor_gpu.c 
b/drivers/gpu/drm/panthor/panthor_gpu.c
new file mode 100644
index ..47d15334b46e
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_gpu.c
@@ -0,0 +1,463 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2018 Marty E. Plummer  */
+/* Copyright 2019 Linaro, Ltd., Rob Herring  */
+/* Copyright 2019 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "panthor_device.h"
+#include "panthor_gpu.h"
+#include "panthor_regs.h"
+
+/**
+ * struct panthor_gpu - GPU block management data.
+ */
+struct panthor_gpu {
+   /** @irq: GPU irq. */
+   struct panthor_irq irq;
+
+   /** @reqs_lock: Lock protecting access to pending_reqs. */
+   spinlock_t reqs_lock;
+
+   /** @pending_reqs: Pending GPU requests. */
+   u32 pending_reqs;
+
+   /** @reqs_acked: GPU request wait queue. */
+   wait_queue_head_t reqs_acked;
+};
+
+/**
+ * struct panthor_model - GPU model description
+ */
+struct panthor_model {
+   /** @name: Model name. */
+   const char *name;
+
+   /** @id: Model ID. */
+   u32 id;
+};
+
+/**
+ * GPU_MODEL() - Define a GPU model.
+ */
+#define GPU_MODEL(_name, _id, ...) \
+{\
+   .name = __stringify(_name), \
+   .id = _id,  \
+}
+
+#define GPU_MODEL_ID_MASK  0xf00f
+
+static const struct panthor_model gpu_models[] = {
+   GPU_MODEL(g610, 0xa007),
+   {},
+};
+
+#define GPU_INTERRUPTS_MASK\
+   (GPU_IRQ_FAULT | \
+GPU_IRQ_PROTM_FAULT | \
+GPU_IRQ_RESET_COMPLETED | \
+GPU_IRQ_MCU_STATUS_CHANGED | \
+GPU_IRQ_CLEAN_CACHES_COMPLETED)
+
+static void panthor_gpu_init_info(struct panthor_device *ptdev)
+{
+   const struct panthor_model *model;
+   u32 major, minor, status;
+   unsigned int i;
+
+   ptdev->gpu_info.gpu_id = gpu_read(ptdev, GPU_ID);
+   ptdev->gpu_info.csf_id = gpu_read(ptdev, GPU_CSF_ID);
+   ptdev->gpu_info.gpu_rev = gpu_read(ptdev, GPU_REVID);
+   ptdev->gpu_info.l2_features = gpu_read(ptdev, GPU_L2_FEATURES);
+   ptdev->gpu_info.tiler_features = gpu_read(ptdev, GPU_TILER_FEATURES);
+   ptdev->gpu_info.mem_features = gpu_read(ptdev, GPU_MEM_FEATURES);
+   ptdev->gpu_info.mmu_features = gpu_read(ptdev, GPU_MMU_FEATURES);
+   ptdev->gpu_info.thread_features = gpu_read(ptdev, GPU_THREAD_FEATURES);
+   ptdev->gpu_info.max_threads = gpu_read(ptdev, GPU_THREAD_MAX_THREADS);
+   ptdev->gpu_info.thread_max_workgroup_size = gpu_read(ptdev, 
GPU_THREAD_MAX_WORKGROUP_SIZE);
+   ptdev->gpu_info.thread_max_barrier_size = gpu_read(ptdev, 
GPU_THREAD_MAX_BARRIER_SIZE);
+   ptdev->gpu_info.coherency_features = gpu_read(ptdev, 
GPU_COHERENCY_FEATURES);
+   for (i = 0; i < 4; i++)
+   ptdev->gpu_info.texture_features[i] = gpu_read(ptdev, 
GPU_TEXTURE_FEATURES(i));
+
+   ptdev->gpu_info.as_present = gpu_read(ptdev, GPU_AS_PRESENT);
+
+   ptdev->gpu_info.shader_present = gpu_read(ptdev, GPU_SHADER_PRESENT_LO);
+   ptdev->gpu_info.shader_present |= (u64)gpu_read(ptdev, 
GPU_SHADER_PRESENT_HI) << 32;
+
+   ptdev->gpu_info.tiler_present = gpu_read(ptdev, GPU_TILER_PRESENT_LO);
+   ptdev->gpu_info.tiler_present |= (u64)gpu_read(ptdev, 
GPU_TILER_PRESENT_HI) << 32;
+
+   ptdev->gpu_info.l2_present = gpu_read(ptdev, GPU_L2_PRESENT_LO);
+   ptdev->gpu_info.l2_present |= (u64)gpu_read(ptdev, GPU_L2_PRESENT_HI) 
<< 32;
+   ptdev->gpu_info.core_group_count = 
hweight64(ptdev->gpu_info.l2_present);
+
+   major = (ptdev->gpu_info.gpu_id >> 12) & 0xf;
+   minor = (ptdev->gpu_info.gpu_id >> 4) & 0xff;
+   status = ptdev->gpu_info.gpu_id & 0xf;
+
+   for (model = gpu_models; model->name; model++) {
+   if (model->id == (ptdev->gpu_info.gpu_id & GPU_MODEL_ID_MASK))
+   break;
+   

[PATCH v2 11/15] drm/panthor: Add the scheduler logical block

2023-08-09 Thread Boris Brezillon
This is the piece of software interacting with the FW scheduler, and
taking care of some scheduling aspects when the FW comes short of
scheduling slots. Indeed, the FW only exposes a few slots, and the kernel
has to give all submission contexts a chance to execute their jobs.

The kernel-side scheduler is timeslice-based, with a round-robin queue
per priority level.

Job submission is handled with a 1:1 drm_sched_entity:drm_gpu_scheduler,
allowing us to delegate the dependency tracking to the core.

All the gory details should be documented inline.
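
The rotation applied when there are more groups than FW slots is
essentially a round-robin over the group list, advanced on every
timeslice tick. Stripped of priorities, locking and idle detection, the
idea looks like this toy model (not the driver code):

#include <stdio.h>

#define NUM_GROUPS   5
#define NUM_FW_SLOTS 3

static unsigned int rr_cursor;

/* One timeslice tick: pick the next NUM_FW_SLOTS groups to hand to the FW. */
static void tick(void)
{
        printf("tick:");
        for (unsigned int i = 0; i < NUM_FW_SLOTS; i++)
                printf(" g%u", (rr_cursor + i) % NUM_GROUPS);
        rr_cursor = (rr_cursor + NUM_FW_SLOTS) % NUM_GROUPS;
        printf("\n");
}

int main(void)
{
        for (int t = 0; t < 4; t++)
                tick();
        return 0;
}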

v2:
- Rename the driver (pancsf -> panthor)
- Rename the file (_mcu -> _fw)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Move the ping logic to panthor_fw.c
- Fix various bugs

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_sched.c | 3272 +++
 drivers/gpu/drm/panthor/panthor_sched.h |   50 +
 2 files changed, 3322 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_sched.h

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c 
b/drivers/gpu/drm/panthor/panthor_sched.c
new file mode 100644
index ..c1a516454e5d
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -0,0 +1,3272 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora ltd. */
+
+#ifdef CONFIG_ARM_ARCH_TIMER
+#include 
+#endif
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "panthor_sched.h"
+#include "panthor_devfreq.h"
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_heap.h"
+#include "panthor_regs.h"
+#include "panthor_gpu.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+
+/**
+ * DOC: Scheduler
+ *
+ * Mali CSF hardware adopts a firmware-assisted scheduling model, where
+ * the firmware takes care of scheduling aspects, to some extent.
+ *
+ * The scheduling happens at the scheduling group level, each group
+ * contains 1 to N queues (N is FW/hardware dependent, and exposed
+ * through the firmware interface). Each queue is assigned a command
+ * stream ring buffer, which serves as a way to get jobs submitted to
+ * the GPU, among other things.
+ *
+ * The firmware can schedule a maximum of M groups (M is FW/hardware
+ * dependent, and exposed through the firmware interface). Past
+ * this maximum number of groups, the kernel must take care of
+ * rotating the groups passed to the firmware so every group gets
+ * a chance to have its queues scheduled for execution.
+ *
+ * The current implementation only supports kernel-mode queues.
+ * In other terms, userspace doesn't have access to the ring-buffer.
+ * Instead, userspace passes indirect command stream buffers that are
+ * called from the queue ring-buffer by the kernel using a pre-defined
+ * sequence of command stream instructions to ensure the userspace driver
+ * always gets consistent results (cache maintenance,
+ * synchronization, ...).
+ *
+ * We rely on the drm_gpu_scheduler framework to deal with job
+ * dependencies and submission. As any other driver dealing with a
+ * FW-scheduler, we use the 1:1 entity:scheduler mode, such that each
+ * entity has its own job scheduler. When a job is ready to be executed
+ * (all its dependencies are met), it is pushed to the appropriate
+ * queue ring-buffer, and the group is scheduled for execution if it
+ * wasn't already active.
+ *
+ * Kernel-side group scheduling is timeslice-based. When we have fewer
+ * groups than there are slots, the periodic tick is disabled and we
+ * just let the FW schedule the active groups. When there are more
+ * groups than slots, we give each group a chance to execute stuff for
+ * a given amount of time, and then re-evaluate and pick new groups
+ * to schedule. The group selection algorithm is based on
+ * priority+round-robin.
+ *
+ * Even though user-mode queues are out of scope right now, the
+ * current design takes them into account by avoiding any guess on the
+ * group/queue state that would be based on information we wouldn't have
+ * if userspace was in charge of the ring-buffer. That's also one of the
+ * reasons we don't do 'cooperative' scheduling (encoding FW group slot
+ * reservation as dma_fence that would be returned from the
+ * drm_gpu_scheduler::prepare_job() hook, and treating group rotation as
+ * a queue of waiters, ordered by job submission order). This approach
+ * would work for kernel-mode queues, but would make user-mode queues a
+ * lot more complicated to retrofit.
+ */
+
+#define JOB_TIMEOUT_MS 

[PATCH v2 09/15] drm/panthor: Add the FW logical block

2023-08-09 Thread Boris Brezillon
Contains everything that's FW related, which includes the code dealing
with the microcontroller unit (MCU) that's running the FW, and anything
related to allocating memory shared between the FW and the CPU.

A few global FW events are processed in the IRQ handler, the rest is
forwarded to the scheduler, since scheduling is the primary reason for
the FW existence, and also the main source of FW <-> kernel
interactions.
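
Part of this block is a parser for the FW binary sections. Each entry
starts with a 32-bit header encoding its type and size (see the
CSF_FW_BINARY_ENTRY_*() macros below), which makes it possible to skip
optional entries the driver doesn't know about. A standalone sketch of
that walk, operating on made-up data and simplified macros:

#include <stdint.h>
#include <stdio.h>

#define ENTRY_TYPE(ehdr)        ((ehdr) & 0xff)
#define ENTRY_SIZE(ehdr)        (((ehdr) >> 8) & 0xff)
#define ENTRY_OPTIONAL          (1u << 31)

static void walk_entries(const uint32_t *fw, size_t size_in_words)
{
        size_t pos = 0;

        while (pos < size_in_words) {
                uint32_t ehdr = fw[pos];
                uint32_t esize = ENTRY_SIZE(ehdr);      /* bytes, incl. header */

                if (esize < sizeof(uint32_t))
                        break;          /* malformed entry, stop walking */

                printf("entry type %u, size %u%s\n", ENTRY_TYPE(ehdr), esize,
                       (ehdr & ENTRY_OPTIONAL) ? " (optional)" : "");
                pos += esize / sizeof(uint32_t);
        }
}

int main(void)
{
        /* Two fake entries: an 8-byte type-0 entry and a 12-byte optional one. */
        const uint32_t fw[] = {
                (8u << 8) | 0, 0xdeadbeef,
                (12u << 8) | 3 | ENTRY_OPTIONAL, 0x1, 0x2,
        };

        walk_entries(fw, sizeof(fw) / sizeof(fw[0]));
        return 0;
}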

v2:
- Rename the driver (pancsf -> panthor)
- Rename the file (_mcu -> _fw)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Use the panthor_irq layer to manage/process IRQs

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_fw.c | 1417 ++
 drivers/gpu/drm/panthor/panthor_fw.h |  505 +
 2 files changed, 1922 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_fw.h

diff --git a/drivers/gpu/drm/panthor/panthor_fw.c 
b/drivers/gpu/drm/panthor/panthor_fw.c
new file mode 100644
index ..359a68f7af03
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_fw.c
@@ -0,0 +1,1417 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2023 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_gpu.h"
+#include "panthor_regs.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+#include "panthor_sched.h"
+
+#define CSF_FW_NAME "mali_csffw.bin"
+
+#define PING_INTERVAL_MS   12000
+#define PROGRESS_TIMEOUT_CYCLES(5ull * 500 * 1024 * 
1024)
+#define PROGRESS_TIMEOUT_SCALE_SHIFT   10
+#define IDLE_HYSTERESIS_US 800
+#define PWROFF_HYSTERESIS_US   1
+
+/**
+ * struct panthor_fw_mem - FW memory
+ */
+struct panthor_fw_mem {
+   /** @bo: Buffer object backing the FW memory. */
+   struct panthor_gem_object *bo;
+
+   /** @kmap: Kernel CPU mapping of the FW memory. */
+   void *kmap;
+
+   /** @va: MCU mapping of the FW memory. */
+   u64 va;
+};
+
+/**
+ * struct panthor_fw_binary_hdr - Firmware binary header.
+ */
+struct panthor_fw_binary_hdr {
+   /** @magic: Magic value to check binary validity. */
+   u32 magic;
+#define CSF_FW_BINARY_HEADER_MAGIC 0xc3f13a6e
+
+   /** @minor: Minor FW version. */
+   u8 minor;
+
+   /** @major: Major FW version. */
+   u8 major;
+#define CSF_FW_BINARY_HEADER_MAJOR_MAX 0
+
+   /** @padding1: MBZ. */
+   u16 padding1;
+
+   /** @version_hash: FW version hash. */
+   u32 version_hash;
+
+   /** @padding2: MBZ. */
+   u32 padding2;
+
+   /** @size: FW binary size. */
+   u32 size;
+};
+
+/**
+ * enum panthor_fw_binary_entry_type - Firmware binary entry type
+ */
+enum panthor_fw_binary_entry_type {
+   /** @CSF_FW_BINARY_ENTRY_TYPE_IFACE: Host <-> FW interface. */
+   CSF_FW_BINARY_ENTRY_TYPE_IFACE = 0,
+
+   /** @CSF_FW_BINARY_ENTRY_TYPE_CONFIG: FW config. */
+   CSF_FW_BINARY_ENTRY_TYPE_CONFIG = 1,
+
+   /** @CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST: Unit-tests. */
+   CSF_FW_BINARY_ENTRY_TYPE_FUTF_TEST = 2,
+
+   /** @CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER: Trace buffer interface. */
+   CSF_FW_BINARY_ENTRY_TYPE_TRACE_BUFFER = 3,
+
+   /** @CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA: Timeline metadata 
interface. */
+   CSF_FW_BINARY_ENTRY_TYPE_TIMELINE_METADATA = 4,
+};
+
+#define CSF_FW_BINARY_ENTRY_TYPE(ehdr) ((ehdr) 
& 0xff)
+#define CSF_FW_BINARY_ENTRY_SIZE(ehdr) 
(((ehdr) >> 8) & 0xff)
+#define CSF_FW_BINARY_ENTRY_UPDATE BIT(30)
+#define CSF_FW_BINARY_ENTRY_OPTIONAL   BIT(31)
+
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_RD
BIT(0)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_WR
BIT(1)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_EX
BIT(2)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_NONE   (0 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED (1 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_UNCACHED_COHERENT  (2 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_CACHED_COHERENT
(3 << 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_CACHE_MODE_MASK   
GENMASK(4, 3)
+#define CSF_FW_BINARY_IFACE_ENTRY_RD_PROT  BIT(5)
+#define CSF_FW_BINAR

[PATCH v2 06/15] drm/panthor: Add GEM logical block

2023-08-09 Thread Boris Brezillon
Anything relating to GEM object management is placed here. Nothing
particularly interesting here, given the implementation is based on
drm_gem_shmem_object, which is doing most of the work.

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_gem.c | 229 ++
 drivers/gpu/drm/panthor/panthor_gem.h |  96 +++
 2 files changed, 325 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.c
 create mode 100644 drivers/gpu/drm/panthor/panthor_gem.h

diff --git a/drivers/gpu/drm/panthor/panthor_gem.c 
b/drivers/gpu/drm/panthor/panthor_gem.c
new file mode 100644
index ..a441a68822ca
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_gem.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2019 Linaro, Ltd, Rob Herring  */
+/* Copyright 2023 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_mmu.h"
+
+static void panthor_gem_free_object(struct drm_gem_object *obj)
+{
+   struct panthor_gem_object *bo = to_panthor_bo(obj);
+
+   if (drm_WARN_ON(obj->dev, bo->va_node))
+   panthor_vm_free_va(bo->exclusive_vm, bo->va_node);
+
+   panthor_vm_put(bo->exclusive_vm);
+   drm_gem_free_mmap_offset(&bo->base.base);
+   mutex_destroy(&bo->gpuva_list_lock);
+   drm_gem_shmem_free(&bo->base);
+}
+
+/**
+ * panthor_gem_unmap_and_put() - Unmap and drop the reference on a GEM object
+ * @vm: VM to unmap the GEM from.
+ * @bo: GEM object to unmap/release.
+ * @gpu_va: GPU/MCU virtual address the GEM object was mapped at.
+ * @cpu_va: kernel mapping of the GEM object.
+ * Can be NULL if the GEM was not CPU mapped.
+ *
+ * Should be called to undo what was done in panthor_gem_create_and_map().
+ */
+void panthor_gem_unmap_and_put(struct panthor_vm *vm,
+  struct panthor_gem_object *bo,
+  u64 gpu_va, void *cpu_va)
+{
+   if (cpu_va) {
+   struct iosys_map map = IOSYS_MAP_INIT_VADDR(cpu_va);
+
+   drm_gem_vunmap_unlocked(&bo->base.base, &map);
+   }
+
+   drm_WARN_ON(bo->base.base.dev, panthor_vm_unmap_range(vm, gpu_va, 
bo->base.base.size));
+   panthor_vm_free_va(vm, bo->va_node);
+   bo->va_node = NULL;
+   drm_gem_object_put(&bo->base.base);
+}
+
+/**
+ * panthor_gem_create_and_map() - Create and map a GEM object to a VM
+ * @ptdev: Device.
+ * @vm: VM to map the GEM to.
+ * @bo_flags: Combination of drm_panthor_bo_flags flags.
+ * @vm_map_flags: Combination of drm_panthor_vm_bind_op_flags (only those
+ * that are related to map operations).
+ * @gpu_va: Pointer holding the GPU address assigned when mapping to the VM.
+ * If *gpu_va == PANTHOR_GEM_ALLOC_VA, a virtual address range will be 
allocated
+ * and the allocated address returned, otherwise *gpu_va is used directly.
+ * @cpu_va: Pointer holding the kernel CPU mapping. If NULL, the GEM object
+ * is not CPU-mapped.
+ *
+ * Return: A valid pointer in case of success, an ERR_PTR() otherwise.
+ */
+struct panthor_gem_object *
+panthor_gem_create_and_map(struct panthor_device *ptdev, struct panthor_vm *vm,
+  size_t size, u32 bo_flags, u32 vm_map_flags,
+  u64 *gpu_va, void **cpu_va)
+{
+   struct drm_gem_shmem_object *obj;
+   struct panthor_gem_object *bo;
+   int ret;
+
+   obj = drm_gem_shmem_create(&ptdev->base, size);
+   if (!obj)
+   return ERR_PTR(-ENOMEM);
+
+   bo = to_panthor_bo(&obj->base);
+   bo->flags = bo_flags;
+   bo->exclusive_vm = panthor_vm_get(vm);
+   bo->base.base.resv = panthor_vm_resv(vm);
+
+   if (*gpu_va == PANTHOR_GEM_ALLOC_VA) {
+   bo->va_node = panthor_vm_alloc_va(vm, obj->base.size);
+
+   if (IS_ERR(bo->va_node)) {
+   ret = PTR_ERR(bo->va_node);
+   bo->va_node = NULL;
+   goto err_put_obj;
+   }
+
+   *gpu_va = bo->va_node->start;
+   }
+
+   ret = panthor_vm_map_bo_range(vm, bo, 0, obj->base.size, *gpu_va, 
vm_map_flags);
+   if (ret)
+   goto err_put_obj;
+
+   if (cpu_va) {
+   struct iosys_map map;
+   int ret;
+
+   ret = drm_gem_vmap_unlocked(&obj->base, &map);
+   if (ret)
+   goto err_vm_unmap_range;
+
+   *cpu_va = map.vaddr;
+   }
+
+   return bo;
+
+err_vm_unmap_range:
+   panthor_vm_unmap_range(vm, *gpu_va, obj->base.size);
+
+err_put_obj:
+   drm_gem_object_put(&o

[PATCH v2 13/15] drm/panthor: Allow driver compilation

2023-08-09 Thread Boris Brezillon
Now that all blocks are available, we can add/update Kconfig/Makefile
files to allow compilation.

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Add new dependencies on GPUVA and DRM_SCHED

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/Kconfig  |  2 ++
 drivers/gpu/drm/Makefile |  1 +
 drivers/gpu/drm/panthor/Kconfig  | 16 
 drivers/gpu/drm/panthor/Makefile | 15 +++
 4 files changed, 34 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/Kconfig
 create mode 100644 drivers/gpu/drm/panthor/Makefile

diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
index 2a44b9419d4d..bddfbdb2ffee 100644
--- a/drivers/gpu/drm/Kconfig
+++ b/drivers/gpu/drm/Kconfig
@@ -358,6 +358,8 @@ source "drivers/gpu/drm/lima/Kconfig"
 
 source "drivers/gpu/drm/panfrost/Kconfig"
 
+source "drivers/gpu/drm/panthor/Kconfig"
+
 source "drivers/gpu/drm/aspeed/Kconfig"
 
 source "drivers/gpu/drm/mcde/Kconfig"
diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
index 215e78e79125..0a260727505f 100644
--- a/drivers/gpu/drm/Makefile
+++ b/drivers/gpu/drm/Makefile
@@ -188,6 +188,7 @@ obj-$(CONFIG_DRM_TVE200) += tve200/
 obj-$(CONFIG_DRM_XEN) += xen/
 obj-$(CONFIG_DRM_VBOXVIDEO) += vboxvideo/
 obj-$(CONFIG_DRM_LIMA)  += lima/
+obj-$(CONFIG_DRM_PANTHOR) += panthor/
 obj-$(CONFIG_DRM_PANFROST) += panfrost/
 obj-$(CONFIG_DRM_ASPEED_GFX) += aspeed/
 obj-$(CONFIG_DRM_MCDE) += mcde/
diff --git a/drivers/gpu/drm/panthor/Kconfig b/drivers/gpu/drm/panthor/Kconfig
new file mode 100644
index ..a9d17b1bbb75
--- /dev/null
+++ b/drivers/gpu/drm/panthor/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0 or MIT
+
+config DRM_PANTHOR
+   tristate "Panthor (DRM support for ARM Mali CSF-based GPUs)"
+   depends on DRM
+   depends on ARM || ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)
+   depends on MMU
+   select DRM_EXEC
+   select DRM_SCHED
+   select IOMMU_SUPPORT
+   select IOMMU_IO_PGTABLE_LPAE
+   select DRM_GEM_SHMEM_HELPER
+   select PM_DEVFREQ
+   select DEVFREQ_GOV_SIMPLE_ONDEMAND
+   help
+ DRM driver for ARM Mali CSF-based GPUs.
diff --git a/drivers/gpu/drm/panthor/Makefile b/drivers/gpu/drm/panthor/Makefile
new file mode 100644
index ..64193a484879
--- /dev/null
+++ b/drivers/gpu/drm/panthor/Makefile
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0 or MIT
+
+panthor-y := \
+   panthor_devfreq.o \
+   panthor_device.o \
+   panthor_drv.o \
+   panthor_gem.o \
+   panthor_gpu.o \
+   panthor_heap.o \
+   panthor_heap.o \
+   panthor_fw.o \
+   panthor_mmu.o \
+   panthor_sched.o
+
+obj-$(CONFIG_DRM_PANTHOR) += panthor.o
-- 
2.41.0



[PATCH v2 12/15] drm/panthor: Add the driver frontend block

2023-08-09 Thread Boris Brezillon
This is the last piece missing to expose the driver to the outside
world.

This is basically a wrapper between the ioctls and the other logical
blocks.
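
The user object copy helpers added by this patch pair up with arrays
described by a count/stride/pointer triplet on the userspace side, so old
and new userspace/kernel combinations can exchange arrays of differently
sized elements. Here is a sketch of what the userspace side looks like;
the structure and field names are illustrative, not the exact uAPI
definitions:

#include <stdint.h>
#include <stdio.h>

/* Descriptor mirroring the drm_panthor_obj_array idea. */
struct obj_array {
        uint32_t stride;        /* element size userspace was built with */
        uint32_t count;
        uint64_t array;         /* user pointer to the first element */
};

/* Hypothetical element type for a queue submission. */
struct queue_submit {
        uint32_t queue_index;
        uint32_t stream_size;
        uint64_t stream_addr;
};

int main(void)
{
        struct queue_submit submits[2] = {
                { .queue_index = 0, .stream_size = 256, .stream_addr = 0x1000 },
                { .queue_index = 1, .stream_size = 128, .stream_addr = 0x2000 },
        };
        struct obj_array arr = {
                .stride = sizeof(submits[0]),
                .count = 2,
                .array = (uint64_t)(uintptr_t)submits,
        };

        /* The kernel uses .stride to copy element by element if it differs
         * from its own element size (see panthor_get_uobj_array() below).
         */
        printf("submitting %u queue(s), stride %u\n", arr.count, arr.stride);
        return 0;
}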

v2:
- Rename the driver (pancsf -> panthor)
- Change the license (GPL2 -> MIT + GPL2)
- Split the driver addition commit
- Document the code
- Use drm_dev_{unplug,enter,exit}() to provide safe device removal
- Fix various bugs
- Refactored the code to make job submission re-usable for VM_BIND
  jobs
- Add user object copy helpers

Signed-off-by: Boris Brezillon 
---
 drivers/gpu/drm/panthor/panthor_drv.c | 1540 +
 1 file changed, 1540 insertions(+)
 create mode 100644 drivers/gpu/drm/panthor/panthor_drv.c

diff --git a/drivers/gpu/drm/panthor/panthor_drv.c 
b/drivers/gpu/drm/panthor/panthor_drv.c
new file mode 100644
index ..377ebea4c0e8
--- /dev/null
+++ b/drivers/gpu/drm/panthor/panthor_drv.c
@@ -0,0 +1,1540 @@
+// SPDX-License-Identifier: GPL-2.0 or MIT
+/* Copyright 2018 Marty E. Plummer  */
+/* Copyright 2019 Linaro, Ltd., Rob Herring  */
+/* Copyright 2019 Collabora ltd. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "panthor_sched.h"
+#include "panthor_device.h"
+#include "panthor_gem.h"
+#include "panthor_heap.h"
+#include "panthor_fw.h"
+#include "panthor_mmu.h"
+#include "panthor_gpu.h"
+#include "panthor_regs.h"
+
+/**
+ * DOC: user <-> kernel object copy helpers.
+ */
+
+/**
+ * panthor_set_uobj() - Copy kernel object to user object.
+ * @usr_ptr: User pointer.
+ * @usr_size: Size of the user object.
+ * @min_size: Minimum size for this object.
+ * @kern_size: Size of the kernel object.
+ * @in: Address of the kernel object to copy.
+ *
+ * Helper automating kernel -> user object copies.
+ *
+ * Don't use this function directly, use PANTHOR_UOBJ_SET() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_set_uobj(u64 usr_ptr, u32 usr_size, u32 min_size, u32 kern_size, const 
void *in)
+{
+   /* User size shouldn't be smaller than the minimal object size. */
+   if (usr_size < min_size)
+   return -EINVAL;
+
+   if (copy_to_user(u64_to_user_ptr(usr_ptr), in, min_t(u32, usr_size, 
kern_size)))
+   return -EFAULT;
+
+   /* When the kernel object is smaller than the user object, we fill the 
gap with
+* zeros.
+*/
+   if (usr_size > kern_size &&
+   clear_user(u64_to_user_ptr(usr_ptr + kern_size), usr_size - 
kern_size)) {
+   return -EFAULT;
+   }
+
+   return 0;
+}
+
+/**
+ * panthor_get_uobj_array() - Copy a user object array into a kernel 
accessible object array.
+ * @in: The object array to copy.
+ * @min_stride: Minimum array stride.
+ * @obj_size: Kernel object size.
+ * @out: Pointer to a variable that will hold the newly allocated object array.
+ *
+ * Helper automating user -> kernel object copies.
+ *
+ * Don't use this function directly, use PANTHOR_UOBJ_ARRAY_GET() instead.
+ *
+ * Return: 0 on success, a negative error code otherwise.
+ */
+static int
+panthor_get_uobj_array(const struct drm_panthor_obj_array *in, u32 min_stride,
+  u32 obj_size, void **out)
+{
+   int ret = 0;
+   void *out_alloc;
+
+   /* User stride must be at least the minimum object size, otherwise it 
might
+* lack useful information.
+*/
+   if (in->stride < min_stride)
+   return -EINVAL;
+
+   if (!in->count)
+   return 0;
+
+   out_alloc = kvmalloc_array(in->count, obj_size, GFP_KERNEL);
+   if (!out_alloc)
+   return -ENOMEM;
+
+   if (obj_size == in->stride) {
+   /* Fast path when user/kernel have the same uAPI header 
version. */
+   if (copy_from_user(out_alloc, u64_to_user_ptr(in->array),
+  (unsigned long)obj_size * in->count))
+   ret = -EFAULT;
+   } else {
+   void __user *in_ptr = u64_to_user_ptr(in->array);
+   void *out_ptr = out_alloc;
+
+   /* If the sizes differ, we need to copy elements one by one. */
+   for (u32 i = 0; i < in->count; i++) {
+   ret = copy_struct_from_user(out_ptr, obj_size, in_ptr, 
in->stride);
+   if (ret)
+   break;
+
+   out_ptr += obj_size;
+   in_ptr += in->stride;
+   }
+   }
+
+   if (ret) {
+   kvfree(out_alloc);
+   return ret;
+   }
+
+   *out = out_alloc;
+   return 0;
+}
+
+/**
+ * PANTHOR_UOBJ_MIN_SIZE_INTERNAL() - Get the minimum user object size
+ * @_typename: Object type.

[PATCH v2 14/15] dt-bindings: gpu: mali-valhall-csf: Add initial bindings for panthor driver

2023-08-09 Thread Boris Brezillon
From: Liviu Dudau 

Arm has introduced a new v10 GPU architecture that replaces the Job Manager
interface with a new Command Stream Frontend. It adds firmware driven
command stream queues that can be used by kernel and user space to submit
jobs to the GPU.

Add the initial schema for the device tree that is based on support for
RK3588 SoC. The minimum number of clocks is one for the IP, but on Rockchip
platforms they will tend to expose the semi-independent clocks for better
power management.

v2:
- New commit

Signed-off-by: Liviu Dudau 
Cc: Krzysztof Kozlowski 
Cc: Rob Herring 
Cc: Conor Dooley 
Cc: devicet...@vger.kernel.org
---
 .../bindings/gpu/arm,mali-valhall-csf.yaml| 148 ++
 1 file changed, 148 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml

diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml 
b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
new file mode 100644
index ..2b9f77aa0b7a
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/arm,mali-valhall-csf.yaml
@@ -0,0 +1,148 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/gpu/arm,mali-valhall-csf.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: ARM Mali Valhall GPU
+
+maintainers:
+  - Liviu Dudau 
+  - Boris Brezillon 
+
+properties:
+  $nodename:
+pattern: '^gpu@[a-f0-9]+$'
+
+  compatible:
+oneOf:
+  - items:
+  - enum:
+  - rockchip,rk3588-mali
+  - const: arm,mali-valhall-csf   # Mali Valhall GPU model/revision is 
fully discoverable
+
+  reg:
+maxItems: 1
+
+  interrupts:
+    items:
+      - description: Job interrupt
+      - description: MMU interrupt
+      - description: GPU interrupt
+
+  interrupt-names:
+    items:
+      - const: job
+      - const: mmu
+      - const: gpu
+
+  clocks:
+    minItems: 1
+    maxItems: 3
+
+  clock-names:
+    minItems: 1
+    items:
+      - const: core
+      - const: coregroup
+      - const: stacks
+
+  mali-supply: true
+
+  sram-supply: true
+
+  operating-points-v2: true
+
+  power-domains:
+    minItems: 1
+    maxItems: 5
+
+  power-domain-names:
+    minItems: 1
+    maxItems: 5
+
+  "#cooling-cells":
+    const: 2
+
+  dynamic-power-coefficient:
+    $ref: /schemas/types.yaml#/definitions/uint32
+    description:
+      A u32 value that represents the running time dynamic
+      power coefficient in units of uW/MHz/V^2. The
+      coefficient can either be calculated from power
+      measurements or derived by analysis.
+
+      The dynamic power consumption of the GPU is
+      proportional to the square of the Voltage (V) and
+      the clock frequency (f). The coefficient is used to
+      calculate the dynamic power as below -
+
+      Pdyn = dynamic-power-coefficient * V^2 * f
+
+      where voltage is in V, frequency is in MHz.
+
+  dma-coherent: true
+
+required:
+  - compatible
+  - reg
+  - interrupts
+  - interrupt-names
+  - clocks
+  - mali-supply
+
+additionalProperties: false
+
+allOf:
+  - if:
+      properties:
+        compatible:
+          contains:
+            const: rockchip,rk3588-mali
+    then:
+      properties:
+        clocks:
+          minItems: 3
+        clock-names:
+          items:
+            - const: core
+            - const: coregroup
+            - const: stacks
+
+examples:
+  - |
+    #include 
+    #include 
+    #include 
+    #include 
+
+    gpu: gpu@fb00 {
+        compatible = "rockchip,rk3588-mali", "arm,mali-valhall-csf";
+        reg = <0xfb00 0x20>;
+        interrupts = ,
+                     ,
+                     ;
+        interrupt-names = "job", "mmu", "gpu";
+        clock-names = "core", "coregroup", "stacks";
+        clocks = <&cru CLK_GPU>, <&cru CLK_GPU_COREGROUP>,
+                 <&cru CLK_GPU_STACKS>;
+        power-domains = <&power RK3588_PD_GPU>;
+        operating-points-v2 = <&gpu_opp_table>;
+        mali-supply = <&vdd_gpu_s0>;
+        sram-supply = <&vdd_gpu_mem_s0>;
+        status = "disabled";
+    };
+
+    gpu_opp_table: opp-table {
+        compatible = "operating-points-v2";
+        opp-3 {
+            opp-hz = /bits/ 64 <3>;
+            opp-microvolt = <675000 675000 85>;
+        };
+        opp-4 {
+            opp-hz = /bits/ 64 <4>;
+            opp-microvolt = <675000 675000 85>;
+        };
+    };
+
+...
-- 
2.41.0
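
As a quick sanity check of the dynamic-power-coefficient formula in the
binding above (all numbers here are assumed purely for illustration, they are
not taken from any real platform): with a coefficient of 1000 uW/MHz/V^2,
V = 0.675 V and f = 300 MHz,

	Pdyn = 1000 * 0.675^2 * 300 = 1000 * 0.455625 * 300 ~= 136687 uW

i.e. roughly 137 mW of dynamic power at that operating point.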



[PATCH v2 15/15] drm/panthor: Add an entry to MAINTAINERS

2023-08-09 Thread Boris Brezillon
Add an entry for the Panthor driver to the MAINTAINERS file.

v2:
- New commit

Signed-off-by: Boris Brezillon 
---

If anyone from Arm wants to volunteer to become a co-maintainer, that
would be highly appreciated.
---
 MAINTAINERS | 8 
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index cd882b87a3c6..6149ab68d461 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1624,6 +1624,14 @@ T:   git git://anongit.freedesktop.org/drm/drm-misc
 F: drivers/gpu/drm/panfrost/
 F: include/uapi/drm/panfrost_drm.h
 
+ARM MALI PANTHOR DRM DRIVER
+M: Boris Brezillon 
+L: dri-devel@lists.freedesktop.org
+S: Supported
+T: git git://anongit.freedesktop.org/drm/drm-misc
+F: drivers/gpu/drm/panthor/
+F: include/uapi/drm/panthor_drm.h
+
 ARM MALI-DP DRM DRIVER
 M: Liviu Dudau 
 S: Supported
-- 
2.41.0



Re: [PATCH] drm/shmem-helper: Reset vma->vm_ops before calling dma_buf_mmap()

2023-08-09 Thread Boris Brezillon
On Mon, 24 Jul 2023 13:26:10 +0200
Boris Brezillon  wrote:

> The dma-buf backend is supposed to provide its own vm_ops, but some
> implementations just have nothing special to do and leave vm_ops
> untouched, probably expecting this field to be zero initialized (this
> is the case with the system_heap implementation for instance).
> Let's reset vma->vm_ops to NULL to keep things working with these
> implementations.
> 
> Fixes: 26d3ac3cb04d ("drm/shmem-helpers: Redirect mmap for imported dma-buf")
> Cc: 
> Cc: Daniel Vetter 
> Reported-by: Roman Stratiienko 
> Signed-off-by: Boris Brezillon 

Queued to drm-misc-fixes.

> ---
>  drivers/gpu/drm/drm_gem_shmem_helper.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/drm_gem_shmem_helper.c b/drivers/gpu/drm/drm_gem_shmem_helper.c
> index 4ea6507a77e5..baaf0e0feb06 100644
> --- a/drivers/gpu/drm/drm_gem_shmem_helper.c
> +++ b/drivers/gpu/drm/drm_gem_shmem_helper.c
> @@ -623,7 +623,13 @@ int drm_gem_shmem_mmap(struct drm_gem_shmem_object *shmem, struct vm_area_struct
>   int ret;
>  
>   if (obj->import_attach) {
> + /* Reset both vm_ops and vm_private_data, so we don't end up with
> +  * vm_ops pointing to our implementation if the dma-buf backend
> +  * doesn't set those fields.
> +  */
>   vma->vm_private_data = NULL;
> + vma->vm_ops = NULL;
> +
>   ret = dma_buf_mmap(obj->dma_buf, vma, 0);
>  
>   /* Drop the reference drm_gem_mmap_obj() acquired.*/



Re: [PATCH 1/2] drm/exec: use unique instead of local label

2023-08-09 Thread Boris Brezillon
On Wed, 9 Aug 2023 08:37:55 -0700
Nathan Chancellor  wrote:

> Hi Christian,
> 
> Can this be applied to drm-misc? Other drivers are starting to make use
> of this API and our builds with clang-17 and clang-18 have been broken
> for some time due to this.

Queued to drm-misc-next.


Re: [PATCH v4] drm/panfrost: Sync IRQ by job's timeout handler

2023-08-09 Thread Boris Brezillon
On Mon,  7 Aug 2023 03:04:44 +0300
Dmitry Osipenko  wrote:

> The Panfrost IRQ handler may get stuck for a long time, for example this
> happens when there is a bad HDMI connection and the HDMI handler takes a
> long time to finish processing, holding off Panfrost. Make Panfrost's job
> timeout handler sync the IRQ before checking the fence signal status, in
> order to prevent spurious job timeouts due to slow IRQ processing.
> 
> Reviewed-by: Steven Price 
> Reviewed-by: Boris Brezillon 
> Reviewed-by: AngeloGioacchino Del Regno 
> 
> Tested-by: AngeloGioacchino Del Regno  # MediaTek MT8192 and MT8195 Chromebooks
> Signed-off-by: Dmitry Osipenko 

Queued to drm-misc-next.

Thanks,

Boris

> ---
> 
> Changelog:
> 
> v4: - Improved comment like was suggested by Boris and added his r-b.
> 
> v3: - Added comment to the code as was suggested by Boris
> 
> - Added r-b/t-b from Steven and Angelo
> 
> v2: - Moved synchronize_irq() after first signal-check to avoid unnecessary
>   blocking on syncing.
> 
> - Added warn message about high interrupt latency.
> 
>  drivers/gpu/drm/panfrost/panfrost_job.c | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index dbc597ab46fb..db6d9a17004f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -720,6 +720,22 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>   if (dma_fence_is_signaled(job->done_fence))
>   return DRM_GPU_SCHED_STAT_NOMINAL;
>  
> + /*
> +  * Panfrost IRQ handler may take a long time to process an interrupt
> +  * if there is another IRQ handler hogging the processing.
> +  * For example, the HDMI encoder driver might be stuck in the IRQ
> +  * handler for a significant time in a case of bad cable connection.
> +  * In order to catch such cases and not report spurious Panfrost
> +  * job timeouts, synchronize the IRQ handler and re-check the fence
> +  * status.
> +  */
> + synchronize_irq(pfdev->js->irq);
> +
> + if (dma_fence_is_signaled(job->done_fence)) {
> + dev_warn(pfdev->dev, "unexpectedly high interrupt latency\n");
> + return DRM_GPU_SCHED_STAT_NOMINAL;
> + }
> +
>   dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p",
>   js,
>   job_read(pfdev, JS_CONFIG(js)),



Re: [PATCH] drm: atmel-hlcdc: Support inverting the pixel clock polarity

2023-08-09 Thread Boris Brezillon
On Tue, 8 Aug 2023 08:33:38 +0200
Miquel Raynal  wrote:

> Hi Sam,
> 
> s...@ravnborg.org wrote on Mon, 7 Aug 2023 18:52:45 +0200:
> 
> > Hi Miquel,
> > 
> > On Mon, Aug 07, 2023 at 11:12:46AM +0200, Miquel Raynal wrote:  
> > > Hi Sam,
> > > 
> > > s...@ravnborg.org wrote on Sat, 10 Jun 2023 22:05:15 +0200:
> > > 
> > > > On Fri, Jun 09, 2023 at 04:48:43PM +0200, Miquel Raynal wrote:
> > > > > On the SoC host controller, the pixel clock can be:
> > > > > * standard: data is launched on the rising edge
> > > > > * inverted: data is launched on the falling edge
> > > > > 
> > > > > Some panels may need the inverted option to be used so let's support
> > > > > this DRM flag.
> > > > > 
> > > > > Signed-off-by: Miquel Raynal   
> > > > 
> > > > Hi Miquel,
> > > > 
> > > > the patch is:
> > > > Reviewed-by: Sam Ravnborg 
> > > > 
> > > > I hope someone else can pick it up and apply it to drm-misc as
> > > > my drm-misc setup is hopelessly outdated atm.
> > > 
> > > I haven't seen this patch being picked up. Is your tree still
> > > outdated, or can you take care of it?
> > 
> > I am still hopelessly behind on stuff.  
> 
> No problem.

I queued it to drm-misc-next this morning.

Regards,

Boris


Re: [PATCH v2 00/15] drm: Add a driver for FW-based Mali GPUs

2023-08-10 Thread Boris Brezillon
Hello Rob,

On Wed, 9 Aug 2023 14:22:59 -0600
Rob Herring  wrote:

> On Wed, Aug 9, 2023 at 10:53 AM Boris Brezillon
>  wrote:
> >
> > I tried to Cc anyone that was involved in any development of the code
> > I picked from panfrost, so they can acknowledge the GPL2 -> MIT+GPL2
> > change. If I missed someone, please let me know.  
> 
> Panfrost was largely based on etnaviv, vc4, v3d, and msm. Those are
> all GPL2 (or 2+) only.

Uh, I must have missed some copyright headers then. Note that not all
panfrost files were taken as a base for panthor:

- Makefile/Kconfig. I honestly hope there's nothing copyright-able in
  there, given there's no other way to define your driver and
  compilation rules.
- panthor_device.{c,h} copied from panfrost_device.{c,h} with quite a
  few modifications in the process. This one has your copyright, and
  Marty's one.
- a tiny part of panthor_drv.c was copied from panfrost_drv.c, but let's
  be honest, the part that was copied (ioctl wrappers, mostly), can't
  really be done differently. This one has your copyright, Marty's one,
  and Collabora's one.
- panthor_regs.h copied from panfrost_regs.h. This one has your
  copyright, Marty's one and Arm's one (definitions extracted from
  kbase). But again, I'm not even sure register definitions are
  copyright-able, given there's no other way to define them. If that
  makes a difference, I changed the prefix, and dropped definitions that
  do not exist on CSF HW.
- panthor_gpu.{c,h} copied from panfrost_gpu.{c,h}. These files have
  your copyright, Marty's one, and Collabora's one.
- panthor_{gem,mmu}.{c,h} copied from panfrost_{gem,mmu}.{c,h}. Those
  ones have your copyright only.
- panthor_devfreq.{c,h} copied from panfrost_devfreq.{c,h}. Collabora's
  copyright only.
- panthor_{heap,fw,sched}.{c,h}. Those are brand new files, that were
  written from scratch.

I also git-blamed the lines I copied to Cc any contributors to the
above files. I might have omitted someone, but I did my best to
try and spot people who have a say in this decision.

> How is relicensing that code okay?

Sorry, the copyright headers of the files I copied didn't mention that
:-/. If that's an omission, it would be good to have the headers updated
to reflect the actual chain of copyrights.

> Also,
> panfrost depends on drm_gem_shmem_helper.c (at least) which is GPL2.
> Does that get re-implemented in a MIT licensed environment?

Not only drm_gem_shmem, but drm_gpuva_mgr and drm_sched too. And yes,
any helper function/lib that's not GPL+MIT will have to be
re-implemented or replaced by something else.

> 
> Maybe some drivers are enough of a silo to get away with MIT
> licensing, but I wouldn't be comfortable claiming it.

Well, yes, re-using the code as-is is almost impossible, unless
someone rewrites the various GPL components we depend on. But if
someone wants to pick, say, the scheduling logic, and replace drm_sched
by something else, they can. Not saying it's worth it, just saying it's
possible.

Regards,

Boris



Re: [PATCH drm-next v6 02/13] drm: manager to keep track of GPUs VA mappings

2023-07-06 Thread Boris Brezillon
On Thu, 6 Jul 2023 17:06:08 +0200
Danilo Krummrich  wrote:

> Hi Boris,
> 
> On 6/30/23 10:02, Boris Brezillon wrote:
> > Hi Danilo,
> > 
> > On Fri, 30 Jun 2023 00:25:18 +0200
> > Danilo Krummrich  wrote:
> >   
> >> + *	int driver_gpuva_remap(struct drm_gpuva_op *op, void *__ctx)
> >> + *	{
> >> + *		struct driver_context *ctx = __ctx;
> >> + *
> >> + *		drm_gpuva_remap(ctx->prev_va, ctx->next_va, &op->remap);
> >> + *
> >> + *		drm_gpuva_unlink(op->remap.unmap->va);
> >> + *		kfree(op->remap.unmap->va);
> >> + *
> >> + *		if (op->remap.prev) {
> >> + *			drm_gpuva_link(ctx->prev_va);
> > 
> > I ended up switching to dma_resv-based locking for the GEMs and I
> > wonder what the locking is supposed to look like in the async-mapping
> > case, where we insert/remove the VA nodes in the drm_sched::run_job()
> > path.  
> 
> If you decide to pick the interface where you just call 
> drm_gpuva_sm_[un]map() and receive a callback for each operation it 
> takes to fulfill the request, you probably do this because you want to 
> do everything one shot, updating the VA space, link/unlink GPUVAs 
> to/from its corresponding backing GEMs, do the actual GPU mappings.
> 
> This has a few advantages over generating a list of operations when the 
> job is submitted. You've pointed out one of them, when you noticed that 
> with a list of operations one can't sneak in a synchronous job between 
> already queued up asynchronous jobs.
> 
> However, for the asynchronous path it has the limitation that the 
> dma-resv lock can't be used to link/unlink GPUVAs to/from its 
> corresponding backing GEMs, since this would happen in the fence 
> signalling critical path and we're not allowed to hold the dma-resv lock 
> there. Hence, as we discussed I added the option for drivers to provide 
> an external lock for that, just to be able to keep some lockdep checks.

Uh, okay, I guess that means I need to go back to a custom lock for VM
operations then.
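
(Purely as an illustration of what such a custom lock could look like; every
name below is hypothetical and the actual VA-space update call is stubbed out,
since the exact drm_gpuva_sm_[un]map() arguments depend on the driver:)

struct driver_vm {
	struct drm_gpuva_manager base;
	/* Serializes VA-space updates done from the run_job() path, where
	 * the dma-resv lock must not be taken.
	 */
	struct mutex op_lock;
};

static struct dma_fence *driver_vm_bind_run_job(struct drm_sched_job *sched_job)
{
	struct driver_vm_bind_job *job = to_driver_vm_bind_job(sched_job);
	struct driver_vm *vm = job->vm;

	mutex_lock(&vm->op_lock);
	/* driver_vm_apply_op() would call drm_gpuva_sm_[un]map(), whose
	 * callbacks insert/remove VA nodes and link/unlink GPUVAs.
	 */
	driver_vm_apply_op(vm, &job->op);
	mutex_unlock(&vm->op_lock);

	return dma_fence_get(job->done_fence);
}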

> 
> > 
> > What I have right now is something like:
> > 
> > dma_resv_lock(vm->resv);
> > 
> > // split done in drm_gpuva_sm_map(), each iteration
> > // of the loop is a call to the driver ->[re,un]map()
> > // hook
> > for_each_sub_op() {
> > 
> > 	// Private BOs have their resv field pointing to the
> > 	// VM resv and we take the VM resv lock before calling
> > 	// drm_gpuva_sm_map()
> > 	if (vm->resv != gem->resv)
> > 		dma_resv_lock(gem->resv);
> > 
> > 	drm_gpuva_[un]link(va);
> > 	gem_[un]pin(gem);
> > 
> > 	if (vm->resv != gem->resv)
> > 		dma_resv_unlock(gem->resv);
> > }
> > 
> > dma_resv_unlock(vm->resv);
> >   
> 
> I'm not sure I get this code right, reading "for_each_sub_op()" and 
> "drm_gpuva_sm_map()" looks a bit like things are mixed up?
> 
> Or do you mean to represent the sum of all callbacks with 
> "for_each_sub_op()"?

That ^.

> In this case I assume this code runs in 
> drm_sched::run_job() and hence isn't allowed to take the dma-resv lock.

Yeah, I didn't realize that taking the dma-resv lock in the
dma-signaling path was forbidden. I think it's fine for the drm_gpuva
destroy operation (which calls drm_gem_shmem_unpin(), which in turn
acquires the resv lock) because I can move that to a worker and get it
out of the dma-signaling path. The problem remains for remap operations
though. I need to call drm_gem_shmem_pin() so we retain the pages even
after the unmapped gpuva object that's in the middle of a mapping is
released. I guess one option would be to use an atomic_t for
drm_gem_shmem_object::pages_use_count, and
have something like:

int drm_gem_shmem_pin(struct drm_gem_shmem_object *shmem)
{
	int ret;

	if (atomic_inc_not_zero(&shmem->pages_use_count))
		return 0;

	dma_resv_lock(shmem->base.resv, NULL);
	ret = drm_gem_shmem_pin_locked(shmem);
	dma_resv_unlock(shmem->base.resv);

	return ret;
}

Given the object already had its pages pinned when we remap, we're sure
the fast path will be taken, and no dma-resv lock acquired.
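
For completeness, the unpin side could follow the same pattern, only dropping
the pages once the counter really hits zero, and only under the resv lock (the
drm_gem_shmem_unpin_locked() name is an assumption here, this is just a
sketch):

void drm_gem_shmem_unpin(struct drm_gem_shmem_object *shmem)
{
	/* Fast path: someone else still holds a pin reference. */
	if (atomic_add_unless(&shmem->pages_use_count, -1, 1))
		return;

	dma_resv_lock(shmem->base.resv, NULL);
	/* Re-check under the lock in case a concurrent pin came in. */
	if (atomic_dec_and_test(&shmem->pages_use_count))
		drm_gem_shmem_unpin_locked(shmem);
	dma_resv_unlock(shmem->base.resv);
}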

> 
> > In practice, I don't expect things to deadlock, because the VM resv is
> > not supposed to be taken outside the VM context and the locki
