Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Christian König

Am 22.07.21 um 18:47 schrieb Jingwen Chen:

On Thu Jul 22, 2021 at 06:24:28PM +0200, Christian König wrote:

Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:

On 2021-07-22 6:45 a.m., Jingwen Chen wrote:

On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:

On 2021-07-20 11:13 p.m., Jingwen Chen wrote:

[Why]
After embedding the hw_fence into amdgpu_job, we need to add TDR support
for this feature.

[How]
1. Add a resubmit_flag for resubmit jobs.
2. Clear job fence from RCU and force complete vm flush fences in
  pre_asic_reset
3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
  for guilty jobs.

Signed-off-by: Jack Zhang 
Signed-off-by: Jingwen Chen 
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
    drivers/gpu/drm/scheduler/sched_main.c |  1 +
    include/drm/gpu_scheduler.h    |  1 +
    5 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 40461547701a..fe0237f72a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct
amdgpu_device *adev)
    int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
     struct amdgpu_reset_context *reset_context)
    {
-    int i, r = 0;
+    int i, j, r = 0;
    struct amdgpu_job *job = NULL;
    bool need_full_reset =
    test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
@@ -4406,6 +4406,16 @@ int
amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
    if (!ring || !ring->sched.thread)
    continue;
+    /*clear job fence from fence drv to avoid force_completion
+ *leave NULL and vm flush fence in fence drv */
+    for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
+    struct dma_fence *old,**ptr;
+    ptr = &ring->fence_drv.fences[j];
+    old = rcu_dereference_protected(*ptr, 1);
+    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS,
&old->flags))) {
+    RCU_INIT_POINTER(*ptr, NULL);
+    }

Is this to avoid premature job free because of dma_fence_put inside
amdgpu_fence_process ?
I can't currently remember why but we probably want all the HW fences
currently in the ring to
be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
inside amdgpu_fence_process
and still do the signaling but not the dma_fence_put part

Andrey

Hi Andrey,

This is to avoid signaling the same fence twice. If we still do the
signaling, then the job in the pending list will be signaled first in
force_completion, and later signaled again on resubmit. This will hit the
BUG() in amdgpu_fence_process.
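
For reference, the loop in amdgpu_fence_process() that this refers to looks
roughly like the following (paraphrased from the driver at the time, not part
of this patch):

do {
        struct dma_fence *fence, **ptr;

        ++last_seq;
        last_seq &= drv->num_fences_mask;
        ptr = &drv->fences[last_seq];

        /* take the fence out of its slot and signal it */
        fence = rcu_dereference_protected(*ptr, 1);
        RCU_INIT_POINTER(*ptr, NULL);
        if (!fence)
                continue;

        /* dma_fence_signal() fails for an already signaled fence,
         * which is what ends up in BUG() */
        if (dma_fence_signal(fence))
                BUG();

        dma_fence_put(fence);
} while (last_seq != seq);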


Oh, I see. How about just adding a 'skip' flag to amdgpu_ring and setting
it before calling amdgpu_fence_driver_force_completion(), then resetting it
after? Inside amdgpu_fence_driver_force_completion() you can then just skip
the signaling part, based on this flag, for fences with
DMA_FENCE_FLAG_USER_BITS set.
Fewer lines of code, at least.
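
A minimal sketch of that 'skip' flag idea, only to illustrate the suggestion
(the field name skip_job_fences is made up here, this is not code from the
patch):

/* hypothetical flag in struct amdgpu_ring */
bool skip_job_fences;

/* in amdgpu_device_pre_asic_reset(), around the forced completion */
ring->skip_job_fences = true;
amdgpu_fence_driver_force_completion(ring);
ring->skip_job_fences = false;

/* in the signaling loop reached from the forced completion */
if (ring->skip_job_fences &&
    test_bit(DMA_FENCE_FLAG_USER_BITS, &fence->flags))
        continue;       /* job fence will be handled again on resubmit */
if (dma_fence_signal(fence))
        BUG();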

Still sounds quite a bit hacky.

I would rather suggest completely dropping the approach with
amdgpu_fence_driver_force_completion(). I could never see why we would want
that in the first place.

Regards,
Christian.


Hi Christian,

I keep the amdgpu_fence_driver_force_completion here to make sure the vm
flush fence is signaled and put.
So the key question is whether the fence of the ib test should be the first
fence to be signaled after reset.
If it should be, it means that not only fences with DMA_FENCE_FLAG_USER_BITS
but also vm flush fences should be removed from the RCU fence array before
the ib_test. Then we must still do the force_completion here for the vm
flush fences, otherwise there will be a memory leak, as no one will signal
and put them after we remove them from the RCU array.
If not, then the first fence to signal could be a vm flush fence, and I'm
OK with dropping amdgpu_fence_driver_force_completion here.


The key problem is that amdgpu_fence_driver_force_completion() goes over 
the RCU array and signals everything there.


What we should do instead is to go over the jobs, clean up the ones we
don't want to re-submit and keep the rest.
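
Purely as an illustration of that direction (not code from this series;
to_amdgpu_job() and the scheduler pending_list walk exist, but
job_is_wanted() and the embedded hw_fence field name are assumptions here),
cleaning up per job instead of per fence-array slot could look roughly like:

static void amdgpu_ring_cleanup_unwanted_jobs(struct amdgpu_ring *ring)
{
        struct drm_sched_job *s_job, *tmp;

        /* the schedulers are already stopped at this point of the reset */
        list_for_each_entry_safe(s_job, tmp,
                                 &ring->sched.pending_list, list) {
                struct amdgpu_job *job = to_amdgpu_job(s_job);

                if (job_is_wanted(job))         /* hypothetical predicate */
                        continue;

                /* signal the job we don't want to re-submit so waiters
                 * don't hang; reference and pending_list bookkeeping is
                 * omitted in this sketch */
                dma_fence_set_error(&job->hw_fence, -ECANCELED);
                dma_fence_signal(&job->hw_fence);
        }
}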


And please drop any usage of DMA_FENCE_FLAG_USER_BITS. That is not 
something which should be abused for reset handling.


Regards,
Christian.




Best Regards,
JingWen Chen

Andrey



+    }
    /* after all hw jobs are reset, hw fence is
meaningless, so force_completion */
    amdgpu_fence_driver_force_completion(ring);
    }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index eecf21d8ec33..815776c9a013 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -156,11 +156,17 @@ int amdgpu_fe

Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Jingwen Chen
On Thu Jul 22, 2021 at 01:50:09PM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-07-22 1:27 p.m., Jingwen Chen wrote:
> > On Thu Jul 22, 2021 at 01:17:13PM -0400, Andrey Grodzovsky wrote:
> > > On 2021-07-22 12:47 p.m., Jingwen Chen wrote:
> > > > On Thu Jul 22, 2021 at 06:24:28PM +0200, Christian König wrote:
> > > > > Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:
> > > > > > On 2021-07-22 6:45 a.m., Jingwen Chen wrote:
> > > > > > > On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
> > > > > > > > On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
> > > > > > > > > [Why]
> > > > > > > > > After embeded hw_fence to amdgpu_job, we need to add tdr 
> > > > > > > > > support
> > > > > > > > > for this feature.
> > > > > > > > > 
> > > > > > > > > [How]
> > > > > > > > > 1. Add a resubmit_flag for resubmit jobs.
> > > > > > > > > 2. Clear job fence from RCU and force complete vm flush 
> > > > > > > > > fences in
> > > > > > > > >    pre_asic_reset
> > > > > > > > > 3. skip dma_fence_get for resubmit jobs and add a 
> > > > > > > > > dma_fence_put
> > > > > > > > >    for guilty jobs.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jack Zhang 
> > > > > > > > > Signed-off-by: Jingwen Chen 
> > > > > > > > > ---
> > > > > > > > >      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 
> > > > > > > > > +++-
> > > > > > > > >      drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 
> > > > > > > > > +++-
> > > > > > > > >      drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
> > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c |  1 +
> > > > > > > > >      include/drm/gpu_scheduler.h    |  1 +
> > > > > > > > >      5 files changed, 27 insertions(+), 7 deletions(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > index 40461547701a..fe0237f72a09 100644
> > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > > > @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct
> > > > > > > > > amdgpu_device *adev)
> > > > > > > > >      int amdgpu_device_pre_asic_reset(struct amdgpu_device 
> > > > > > > > > *adev,
> > > > > > > > >       struct amdgpu_reset_context 
> > > > > > > > > *reset_context)
> > > > > > > > >      {
> > > > > > > > > -    int i, r = 0;
> > > > > > > > > +    int i, j, r = 0;
> > > > > > > > >      struct amdgpu_job *job = NULL;
> > > > > > > > >      bool need_full_reset =
> > > > > > > > >      test_bit(AMDGPU_NEED_FULL_RESET, 
> > > > > > > > > &reset_context->flags);
> > > > > > > > > @@ -4406,6 +4406,16 @@ int
> > > > > > > > > amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > > > > > >      if (!ring || !ring->sched.thread)
> > > > > > > > >      continue;
> > > > > > > > > +    /*clear job fence from fence drv to avoid 
> > > > > > > > > force_completion
> > > > > > > > > + *leave NULL and vm flush fence in fence drv */
> > > > > > > > > +    for (j = 0; j <= ring->fence_drv.num_fences_mask; j 
> > > > > > > > > ++) {
> > > > > > > > > +    struct dma_fence *old,**ptr;
> > > > > > > > > +    ptr = &ring->fence_drv.fences[j];
> > > > > > > > > +    old = rcu_dereference_protected(*ptr, 1);
> > > > > > > > > +    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS,
> > > > > > > > > &old->flags))) {
> > > > > > > > > +    RCU_INIT_POINTER(*ptr, NULL);
> > > > > > > > > +    }
> > > > > > > > Is this to avoid premature job free because of dma_fence_put 
> > > > > > > > inside
> > > > > > > > amdgpu_fence_process ?
> > > > > > > > I can't currently remember why but we probably want all the HW 
> > > > > > > > fences
> > > > > > > > currently in the ring to
> > > > > > > > be forced signaled - maybe better to test for 
> > > > > > > > DMA_FENCE_FLAG_USER_BITS
> > > > > > > > inside amdgpu_fence_process
> > > > > > > > and still do the signaling but not the dma_fence_put part
> > > > > > > > 
> > > > > > > > Andrey
> > > > > > > Hi Andrey,
> > > > > > > 
> > > > > > > This is to avoid signaling the same fence twice. If we still do 
> > > > > > > the
> > > > > > > signaling, then the job in the pending list will be signaled 
> > > > > > > first in
> > > > > > > force_completion, and later be signaled in resubmit. This will go 
> > > > > > > to
> > > > > > > BUG() in amdgpu_fence_process.
> > > > > > Oh, i see, how about just adding 'skip' flag to amdgpu_ring and 
> > > > > > setting
> > > > > > it before calling
> > > > > > amdgpu_fence_driver_force_completion and resetting it after, then 
> > > > > > inside
> > > > > > amdgpu_fence_driver_force_completion
> > > > > > you can just skip the signaling part with this flag for fences with
> > > > > > DMA_FENCE_FLAG_USER_BITS set
> > > > > > Le

Re: [PATCH v4 10/13] lib: test_hmm add module param for zone device type

2021-07-22 Thread Jason Gunthorpe
On Thu, Jul 22, 2021 at 11:59:17AM -0500, Sierra Guiza, Alejandro (Alex) wrote:
> 
> On 7/22/2021 7:23 AM, Jason Gunthorpe wrote:
> > On Sat, Jul 17, 2021 at 02:21:32PM -0500, Alex Sierra wrote:
> > > In order to configure device generic in test_hmm, two
> > > module parameters should be passed, which correspon to the
> > > SP start address of each device (2) spm_addr_dev0 &
> > > spm_addr_dev1. If no parameters are passed, private device
> > > type is configured.
> > I don't think tests should need configuration like this, is it really
> > necessary? How can people with normal HW run this test?
> Hi Jason,
> The idea was to add an easy way to validate the codepaths touched by this
> patch series, which make modifications to the migration helpers for device
> generic type pages. We're using CONFIG_EFI_FAKE_MEMMAP to create fake SPM
> devices inside system memory. No special HW needed. And passing the kernel
> parameter efi_fake_mem. Ex. efi_fake_mem=1G@0x1:0x4. I should
> probably need to include a small example of how to set this in the
> test_hmm.sh
> usage().

I don't think anything about hmm is sensitive to how the pages are
acquired - you can't create device generic pages without relying on
FAKE_MEMMAP?

Jason


Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Andrey Grodzovsky


On 2021-07-22 1:27 p.m., Jingwen Chen wrote:

On Thu Jul 22, 2021 at 01:17:13PM -0400, Andrey Grodzovsky wrote:

On 2021-07-22 12:47 p.m., Jingwen Chen wrote:

On Thu Jul 22, 2021 at 06:24:28PM +0200, Christian König wrote:

Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:

On 2021-07-22 6:45 a.m., Jingwen Chen wrote:

On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:

On 2021-07-20 11:13 p.m., Jingwen Chen wrote:

[Why]
After embedding the hw_fence into amdgpu_job, we need to add TDR support
for this feature.

[How]
1. Add a resubmit_flag for resubmit jobs.
2. Clear job fence from RCU and force complete vm flush fences in
   pre_asic_reset
3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
   for guilty jobs.

Signed-off-by: Jack Zhang 
Signed-off-by: Jingwen Chen 
---
     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
     drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
     drivers/gpu/drm/scheduler/sched_main.c |  1 +
     include/drm/gpu_scheduler.h    |  1 +
     5 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 40461547701a..fe0237f72a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct
amdgpu_device *adev)
     int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
      struct amdgpu_reset_context *reset_context)
     {
-    int i, r = 0;
+    int i, j, r = 0;
     struct amdgpu_job *job = NULL;
     bool need_full_reset =
     test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
@@ -4406,6 +4406,16 @@ int
amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
     if (!ring || !ring->sched.thread)
     continue;
+    /*clear job fence from fence drv to avoid force_completion
+ *leave NULL and vm flush fence in fence drv */
+    for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
+    struct dma_fence *old,**ptr;
+    ptr = &ring->fence_drv.fences[j];
+    old = rcu_dereference_protected(*ptr, 1);
+    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS,
&old->flags))) {
+    RCU_INIT_POINTER(*ptr, NULL);
+    }

Is this to avoid premature job free because of dma_fence_put inside
amdgpu_fence_process ?
I can't currently remember why but we probably want all the HW fences
currently in the ring to
be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
inside amdgpu_fence_process
and still do the signaling but not the dma_fence_put part

Andrey

Hi Andrey,

This is to avoid signaling the same fence twice. If we still do the
signaling, then the job in the pending list will be signaled first in
force_completion, and later be signaled in resubmit. This will go to
BUG() in amdgpu_fence_process.

Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting
it before calling
amdgpu_fence_driver_force_completion and resetting it after, then inside
amdgpu_fence_driver_force_completion
you can just skip the signaling part with this flag for fences with
DMA_FENCE_FLAG_USER_BITS set
Less lines of code at least.

Still sounds quite a bit hacky.

I would rather suggest to completely drop the approach with
amdgpu_fence_driver_force_completion(). I could never see why we would want
that in the first place.

Regards,
Christian.


Hi Christian,

I keep the amdgpu_fence_driver_force_completion here to make sure the vm
flush fence is signaled and put.


Right, so we need to do force completion for the sake of all the fences
without parent jobs, since there is code which waits directly on them.



So the key question is whether the fence of ib test should be the first
fence to be signaled after reset.


What do you mean by 'after reset' - at this point in the code there was
no ASIC reset yet, we just stopped the schedulers and detached the
HW fences from their parent jobs (sched_fence)

I mean the ASIC reset. There will be an ib_test for each ring after the
ASIC reset.



Then why wouldn't they be the first? They will emit new fences into the
ring which will be signaled right away, because the ASIC went through reset
and is not hung anymore. All the previous fences, including the VM flush
ones from before the reset, will already be signaled by this time from
amdgpu_fence_driver_force_completion.






If it should be, it means not only fences with DMA_FENCE_FLAG_USER_BITS
but also vm flush fences should be removed from RCU fence array before
ib_test.



As I mentioned above, not clear to me why VM fences should get special 
treatment here.


Andrey




Which ib_test is it, the one after ASIC reset ?


Yes


Best Regards,
JingWen Chen

Andrey


   Then we must do the force_completion here for vm flush
fence, otherwise the

Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Jingwen Chen
On Thu Jul 22, 2021 at 01:17:13PM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-07-22 12:47 p.m., Jingwen Chen wrote:
> > On Thu Jul 22, 2021 at 06:24:28PM +0200, Christian König wrote:
> > > Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:
> > > > On 2021-07-22 6:45 a.m., Jingwen Chen wrote:
> > > > > On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
> > > > > > On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
> > > > > > > [Why]
> > > > > > > After embeded hw_fence to amdgpu_job, we need to add tdr support
> > > > > > > for this feature.
> > > > > > > 
> > > > > > > [How]
> > > > > > > 1. Add a resubmit_flag for resubmit jobs.
> > > > > > > 2. Clear job fence from RCU and force complete vm flush fences in
> > > > > > >   pre_asic_reset
> > > > > > > 3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
> > > > > > >   for guilty jobs.
> > > > > > > 
> > > > > > > Signed-off-by: Jack Zhang 
> > > > > > > Signed-off-by: Jingwen Chen 
> > > > > > > ---
> > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
> > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 
> > > > > > > +++-
> > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
> > > > > > >     drivers/gpu/drm/scheduler/sched_main.c |  1 +
> > > > > > >     include/drm/gpu_scheduler.h    |  1 +
> > > > > > >     5 files changed, 27 insertions(+), 7 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > index 40461547701a..fe0237f72a09 100644
> > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > > > @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct
> > > > > > > amdgpu_device *adev)
> > > > > > >     int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > > > >      struct amdgpu_reset_context *reset_context)
> > > > > > >     {
> > > > > > > -    int i, r = 0;
> > > > > > > +    int i, j, r = 0;
> > > > > > >     struct amdgpu_job *job = NULL;
> > > > > > >     bool need_full_reset =
> > > > > > >     test_bit(AMDGPU_NEED_FULL_RESET, 
> > > > > > > &reset_context->flags);
> > > > > > > @@ -4406,6 +4406,16 @@ int
> > > > > > > amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > > > >     if (!ring || !ring->sched.thread)
> > > > > > >     continue;
> > > > > > > +    /*clear job fence from fence drv to avoid 
> > > > > > > force_completion
> > > > > > > + *leave NULL and vm flush fence in fence drv */
> > > > > > > +    for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
> > > > > > > +    struct dma_fence *old,**ptr;
> > > > > > > +    ptr = &ring->fence_drv.fences[j];
> > > > > > > +    old = rcu_dereference_protected(*ptr, 1);
> > > > > > > +    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS,
> > > > > > > &old->flags))) {
> > > > > > > +    RCU_INIT_POINTER(*ptr, NULL);
> > > > > > > +    }
> > > > > > Is this to avoid premature job free because of dma_fence_put inside
> > > > > > amdgpu_fence_process ?
> > > > > > I can't currently remember why but we probably want all the HW 
> > > > > > fences
> > > > > > currently in the ring to
> > > > > > be forced signaled - maybe better to test for 
> > > > > > DMA_FENCE_FLAG_USER_BITS
> > > > > > inside amdgpu_fence_process
> > > > > > and still do the signaling but not the dma_fence_put part
> > > > > > 
> > > > > > Andrey
> > > > > Hi Andrey,
> > > > > 
> > > > > This is to avoid signaling the same fence twice. If we still do the
> > > > > signaling, then the job in the pending list will be signaled first in
> > > > > force_completion, and later be signaled in resubmit. This will go to
> > > > > BUG() in amdgpu_fence_process.
> > > > 
> > > > Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting
> > > > it before calling
> > > > amdgpu_fence_driver_force_completion and resetting it after, then inside
> > > > amdgpu_fence_driver_force_completion
> > > > you can just skip the signaling part with this flag for fences with
> > > > DMA_FENCE_FLAG_USER_BITS set
> > > > Less lines of code at least.
> > > Still sounds quite a bit hacky.
> > > 
> > > I would rather suggest to completely drop the approach with
> > > amdgpu_fence_driver_force_completion(). I could never see why we would 
> > > want
> > > that in the first place.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > Hi Christian,
> > 
> > I keep the amdgpu_fence_driver_force_completion here to make sure the vm
> > flush fence is signaled and put.
> 
> 
> Right, so we need to do force completion for the sake of all the fences
> without parent jobs since there is code which wait directly on them.
> 
> 
> > So the key question is whether the fence of ib test should be the first
> >

Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Andrey Grodzovsky


On 2021-07-22 12:47 p.m., Jingwen Chen wrote:

On Thu Jul 22, 2021 at 06:24:28PM +0200, Christian König wrote:

Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:

On 2021-07-22 6:45 a.m., Jingwen Chen wrote:

On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:

On 2021-07-20 11:13 p.m., Jingwen Chen wrote:

[Why]
After embedding the hw_fence into amdgpu_job, we need to add TDR support
for this feature.

[How]
1. Add a resubmit_flag for resubmit jobs.
2. Clear job fence from RCU and force complete vm flush fences in
  pre_asic_reset
3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
  for guilty jobs.

Signed-off-by: Jack Zhang 
Signed-off-by: Jingwen Chen 
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
    drivers/gpu/drm/scheduler/sched_main.c |  1 +
    include/drm/gpu_scheduler.h    |  1 +
    5 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 40461547701a..fe0237f72a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct
amdgpu_device *adev)
    int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
     struct amdgpu_reset_context *reset_context)
    {
-    int i, r = 0;
+    int i, j, r = 0;
    struct amdgpu_job *job = NULL;
    bool need_full_reset =
    test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
@@ -4406,6 +4406,16 @@ int
amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
    if (!ring || !ring->sched.thread)
    continue;
+    /*clear job fence from fence drv to avoid force_completion
+ *leave NULL and vm flush fence in fence drv */
+    for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
+    struct dma_fence *old,**ptr;
+    ptr = &ring->fence_drv.fences[j];
+    old = rcu_dereference_protected(*ptr, 1);
+    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS,
&old->flags))) {
+    RCU_INIT_POINTER(*ptr, NULL);
+    }

Is this to avoid premature job free because of dma_fence_put inside
amdgpu_fence_process ?
I can't currently remember why but we probably want all the HW fences
currently in the ring to
be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
inside amdgpu_fence_process
and still do the signaling but not the dma_fence_put part

Andrey

Hi Andrey,

This is to avoid signaling the same fence twice. If we still do the
signaling, then the job in the pending list will be signaled first in
force_completion, and later be signaled in resubmit. This will go to
BUG() in amdgpu_fence_process.


Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting
it before calling
amdgpu_fence_driver_force_completion and resetting it after, then inside
amdgpu_fence_driver_force_completion
you can just skip the signaling part with this flag for fences with
DMA_FENCE_FLAG_USER_BITS set
Less lines of code at least.

Still sounds quite a bit hacky.

I would rather suggest to completely drop the approach with
amdgpu_fence_driver_force_completion(). I could never see why we would want
that in the first place.

Regards,
Christian.


Hi Christian,

I keep the amdgpu_fence_driver_force_completion here to make sure the vm
flush fence is signaled and put.



Right, so we need to do force completion for the sake of all the fences
without parent jobs, since there is code which waits directly on them.



So the key question is whether the fence of ib test should be the first
fence to be signaled after reset.



What do you mean by 'after reset' - at this point in the code there was
no ASIC reset yet, we just stopped the schedulers and detached the
HW fences from their parent jobs (sched_fence)



If it should be, it means not only fences with DMA_FENCE_FLAG_USER_BITS
but also vm flush fences should be removed from RCU fence array before
ib_test.



Which ib_test is it, the one after ASIC reset ?

Andrey


  Then we must do the force_completion here for vm flush
fence, otherwise there will be a memory leak here as no one will signal
and put it after we remove it from RCU.
If not, then the first fence to signal could be vm flush fence. And I'm
OK with drop amdgpu_fence_driver_force_completion here.


Best Regards,
JingWen Chen

Andrey



+    }
    /* after all hw jobs are reset, hw fence is
meaningless, so force_completion */
    amdgpu_fence_driver_force_completion(ring);
    }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index eecf21d8ec33..815776c9a013 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -156,11 +156

Re: [PATCH] drm/amdgpu: Add msix restore for pass-through mode

2021-07-22 Thread Deucher, Alexander

Acked-by: Alex Deucher 

From: Chengzhe Liu 
Sent: Thursday, July 22, 2021 1:49 AM
To: amd-gfx@lists.freedesktop.org 
Cc: Tuikov, Luben ; Koenig, Christian 
; Deucher, Alexander ; 
Xiao, Jack ; Zhang, Hawking ; Xu, 
Feifei ; Wang, Kevin(Yang) ; Liu, Cheng 
Zhe 
Subject: [PATCH] drm/amdgpu: Add msix restore for pass-through mode

In pass-through mode, after a mode 1 reset, the MSI-X enablement status is
lost and the device never receives interrupts again. So we should restore
the MSI-X status after a mode 1 reset.

Signed-off-by: Chengzhe Liu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 83af307e97cd..e1aa4a5e6a98 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -584,7 +584,7 @@ void amdgpu_irq_gpu_reset_resume_helper(struct 
amdgpu_device *adev)
 {
 int i, j, k;

-   if (amdgpu_sriov_vf(adev))
+   if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
 amdgpu_restore_msix(adev);

 for (i = 0; i < AMDGPU_IRQ_CLIENTID_MAX; ++i) {
--
2.25.1



Re: [PATCH] drm/amdgpu: Check pmops for desired suspend state

2021-07-22 Thread Deucher, Alexander

I sent a similar patch out a while ago, but never had a chance to follow up on 
it.  The problem is users might change the default.
https://www.spinics.net/lists/amd-gfx/msg60578.html


Alex


From: Vishwakarma, Pratik 
Sent: Thursday, July 22, 2021 1:27 AM
To: Deucher, Alexander ; 
amd-gfx@lists.freedesktop.org 
Cc: Vishwakarma, Pratik 
Subject: [PATCH] drm/amdgpu: Check pmops for desired suspend state

[Why]
A user might set mem_sleep to deep, and it will result
in amdgpu resume errors.

[How]
Check with the PM core for the default suspend state.

Signed-off-by: Pratik Vishwakarma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index af1710971ff3..d92196429741 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1468,7 +1468,8 @@ static int amdgpu_pmops_suspend(struct device *dev)
 struct amdgpu_device *adev = drm_to_adev(drm_dev);
 int r;

-   if (amdgpu_acpi_is_s0ix_supported(adev))
+   if (amdgpu_acpi_is_s0ix_supported(adev)
+   && pm_suspend_default_s2idle())
 adev->in_s0ix = true;
 adev->in_s3 = true;
 r = amdgpu_device_suspend(drm_dev, true);
--
2.25.1



Re: [PATCH v4 10/13] lib: test_hmm add module param for zone device type

2021-07-22 Thread Sierra Guiza, Alejandro (Alex)



On 7/22/2021 7:23 AM, Jason Gunthorpe wrote:

On Sat, Jul 17, 2021 at 02:21:32PM -0500, Alex Sierra wrote:

In order to configure the device generic type in test_hmm, two
module parameters should be passed, which correspond to the
SP start address of each device (2): spm_addr_dev0 &
spm_addr_dev1. If no parameters are passed, the private device
type is configured.

I don't think tests should need configuration like this, is it really
necessary? How can people with normal HW run this test?

Hi Jason,
The idea was to add an easy way to validate the codepaths touched by this
patch series, which makes modifications to the migration helpers for device
generic type pages. We're using CONFIG_EFI_FAKE_MEMMAP to create fake SPM
devices inside system memory, so no special HW is needed, and passing the
kernel parameter efi_fake_mem. Ex. efi_fake_mem=1G@0x1:0x4. I should
probably include a small example of how to set this in the test_hmm.sh
usage().

Alex S.


Jason



Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Jingwen Chen
On Thu Jul 22, 2021 at 06:24:28PM +0200, Christian König wrote:
> Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:
> > 
> > On 2021-07-22 6:45 a.m., Jingwen Chen wrote:
> > > On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
> > > > On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
> > > > > [Why]
> > > > > After embeded hw_fence to amdgpu_job, we need to add tdr support
> > > > > for this feature.
> > > > > 
> > > > > [How]
> > > > > 1. Add a resubmit_flag for resubmit jobs.
> > > > > 2. Clear job fence from RCU and force complete vm flush fences in
> > > > >  pre_asic_reset
> > > > > 3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
> > > > >  for guilty jobs.
> > > > > 
> > > > > Signed-off-by: Jack Zhang 
> > > > > Signed-off-by: Jingwen Chen 
> > > > > ---
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
> > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
> > > > >    drivers/gpu/drm/scheduler/sched_main.c |  1 +
> > > > >    include/drm/gpu_scheduler.h    |  1 +
> > > > >    5 files changed, 27 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > index 40461547701a..fe0237f72a09 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > > @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct
> > > > > amdgpu_device *adev)
> > > > >    int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > >     struct amdgpu_reset_context *reset_context)
> > > > >    {
> > > > > -    int i, r = 0;
> > > > > +    int i, j, r = 0;
> > > > >    struct amdgpu_job *job = NULL;
> > > > >    bool need_full_reset =
> > > > >    test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
> > > > > @@ -4406,6 +4406,16 @@ int
> > > > > amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > > >    if (!ring || !ring->sched.thread)
> > > > >    continue;
> > > > > +    /*clear job fence from fence drv to avoid force_completion
> > > > > + *leave NULL and vm flush fence in fence drv */
> > > > > +    for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
> > > > > +    struct dma_fence *old,**ptr;
> > > > > +    ptr = &ring->fence_drv.fences[j];
> > > > > +    old = rcu_dereference_protected(*ptr, 1);
> > > > > +    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS,
> > > > > &old->flags))) {
> > > > > +    RCU_INIT_POINTER(*ptr, NULL);
> > > > > +    }
> > > > 
> > > > Is this to avoid premature job free because of dma_fence_put inside
> > > > amdgpu_fence_process ?
> > > > I can't currently remember why but we probably want all the HW fences
> > > > currently in the ring to
> > > > be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
> > > > inside amdgpu_fence_process
> > > > and still do the signaling but not the dma_fence_put part
> > > > 
> > > > Andrey
> > > Hi Andrey,
> > > 
> > > This is to avoid signaling the same fence twice. If we still do the
> > > signaling, then the job in the pending list will be signaled first in
> > > force_completion, and later be signaled in resubmit. This will go to
> > > BUG() in amdgpu_fence_process.
> > 
> > 
> > Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting
> > it before calling
> > amdgpu_fence_driver_force_completion and resetting it after, then inside
> > amdgpu_fence_driver_force_completion
> > you can just skip the signaling part with this flag for fences with
> > DMA_FENCE_FLAG_USER_BITS set
> > Less lines of code at least.
> 
> Still sounds quite a bit hacky.
> 
> I would rather suggest to completely drop the approach with
> amdgpu_fence_driver_force_completion(). I could never see why we would want
> that in the first place.
> 
> Regards,
> Christian.
> 
Hi Christian,

I keep the amdgpu_fence_driver_force_completion here to make sure the vm
flush fence is signaled and put.
So the key question is whether the fence of the ib test should be the first
fence to be signaled after reset.
If it should be, it means that not only fences with DMA_FENCE_FLAG_USER_BITS
but also vm flush fences should be removed from the RCU fence array before
the ib_test. Then we must still do the force_completion here for the vm
flush fences, otherwise there will be a memory leak, as no one will signal
and put them after we remove them from the RCU array.
If not, then the first fence to signal could be a vm flush fence, and I'm
OK with dropping amdgpu_fence_driver_force_completion here.


Best Regards,
JingWen Chen
> > 
> > Andrey
> > 
> > 
> > > 
> > > > > +    }
> > > > >    /* after all hw jobs are reset, hw fence is
> > > > > meaningless, so force_completion */
> > > > >    amdgpu_fence_driver_

Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Jingwen Chen
On Thu Jul 22, 2021 at 10:45:40AM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-07-22 6:45 a.m., Jingwen Chen wrote:
> > On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
> > > On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
> > > > [Why]
> > > > After embeded hw_fence to amdgpu_job, we need to add tdr support
> > > > for this feature.
> > > > 
> > > > [How]
> > > > 1. Add a resubmit_flag for resubmit jobs.
> > > > 2. Clear job fence from RCU and force complete vm flush fences in
> > > >  pre_asic_reset
> > > > 3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
> > > >  for guilty jobs.
> > > > 
> > > > Signed-off-by: Jack Zhang 
> > > > Signed-off-by: Jingwen Chen 
> > > > ---
> > > >drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
> > > >drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
> > > >drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +++-
> > > >drivers/gpu/drm/scheduler/sched_main.c |  1 +
> > > >include/drm/gpu_scheduler.h|  1 +
> > > >5 files changed, 27 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > index 40461547701a..fe0237f72a09 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > > @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct 
> > > > amdgpu_device *adev)
> > > >int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> > > >  struct amdgpu_reset_context 
> > > > *reset_context)
> > > >{
> > > > -   int i, r = 0;
> > > > +   int i, j, r = 0;
> > > > struct amdgpu_job *job = NULL;
> > > > bool need_full_reset =
> > > > test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
> > > > @@ -4406,6 +4406,16 @@ int amdgpu_device_pre_asic_reset(struct 
> > > > amdgpu_device *adev,
> > > > if (!ring || !ring->sched.thread)
> > > > continue;
> > > > +   /*clear job fence from fence drv to avoid 
> > > > force_completion
> > > > +*leave NULL and vm flush fence in fence drv */
> > > > +   for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) 
> > > > {
> > > > +   struct dma_fence *old,**ptr;
> > > > +   ptr = &ring->fence_drv.fences[j];
> > > > +   old = rcu_dereference_protected(*ptr, 1);
> > > > +   if (old && test_bit(DMA_FENCE_FLAG_USER_BITS, 
> > > > &old->flags))) {
> > > > +   RCU_INIT_POINTER(*ptr, NULL);
> > > > +   }
> > > 
> > > Is this to avoid premature job free because of dma_fence_put inside
> > > amdgpu_fence_process ?
> > > I can't currently remember why but we probably want all the HW fences
> > > currently in the ring to
> > > be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
> > > inside amdgpu_fence_process
> > > and still do the signaling but not the dma_fence_put part
> > > 
> > > Andrey
> > Hi Andrey,
> > 
> > This is to avoid signaling the same fence twice. If we still do the
> > signaling, then the job in the pending list will be signaled first in
> > force_completion, and later be signaled in resubmit. This will go to
> > BUG() in amdgpu_fence_process.
> 
> 
> Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting it
> before calling
> amdgpu_fence_driver_force_completion and resetting it after, then inside
> amdgpu_fence_driver_force_completion
> you can just skip the signaling part with this flag for fences with
> DMA_FENCE_FLAG_USER_BITS set
> Less lines of code at least.
> 
> Andrey
Hi Andrey,

In this way, this issue still exists. If we just skip it in
force_completion, these fences are still in the RCU fence array. So when
the FLR finishes, the eop interrupt of the ib_test will call
amdgpu_fence_process, and the skipped fences will be signaled along with
the ib_test fences, then signaled again during resubmission.


Best Regards,
JingWen Chen
> 
> > 
> > > > +   }
> > > > /* after all hw jobs are reset, hw fence is 
> > > > meaningless, so force_completion */
> > > > amdgpu_fence_driver_force_completion(ring);
> > > > }
> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > index eecf21d8ec33..815776c9a013 100644
> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > > @@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
> > > > struct dma_fence **f, struct amd
> > > > job->ring = ring;
> > > > }
> > > > -   seq = ++ring->fence_drv.sync_seq;
> > > > -   dma_fence_init(fence, &amdgpu_fence_ops,
> > > > -  

Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Andrey Grodzovsky



On 2021-07-22 6:45 a.m., Jingwen Chen wrote:

On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:

On 2021-07-20 11:13 p.m., Jingwen Chen wrote:

[Why]
After embedding the hw_fence into amdgpu_job, we need to add TDR support
for this feature.

[How]
1. Add a resubmit_flag for resubmit jobs.
2. Clear job fence from RCU and force complete vm flush fences in
 pre_asic_reset
3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
 for guilty jobs.

Signed-off-by: Jack Zhang 
Signed-off-by: Jingwen Chen 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +++-
   drivers/gpu/drm/scheduler/sched_main.c |  1 +
   include/drm/gpu_scheduler.h|  1 +
   5 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 40461547701a..fe0237f72a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct amdgpu_device *adev)
   int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 struct amdgpu_reset_context *reset_context)
   {
-   int i, r = 0;
+   int i, j, r = 0;
struct amdgpu_job *job = NULL;
bool need_full_reset =
test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
@@ -4406,6 +4406,16 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device 
*adev,
if (!ring || !ring->sched.thread)
continue;
+   /*clear job fence from fence drv to avoid force_completion
+*leave NULL and vm flush fence in fence drv */
+   for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
+   struct dma_fence *old,**ptr;
+   ptr = &ring->fence_drv.fences[j];
+   old = rcu_dereference_protected(*ptr, 1);
+   if (old && test_bit(DMA_FENCE_FLAG_USER_BITS, 
&old->flags))) {
+   RCU_INIT_POINTER(*ptr, NULL);
+   }


Is this to avoid premature job free because of dma_fence_put inside
amdgpu_fence_process ?
I can't currently remember why but we probably want all the HW fences
currently in the ring to
be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
inside amdgpu_fence_process
and still do the signaling but not the dma_fence_put part

Andrey

Hi Andrey,

This is to avoid signaling the same fence twice. If we still do the
signaling, then the job in the pending list will be signaled first in
force_completion, and later be signaled in resubmit. This will go to
BUG() in amdgpu_fence_process.



Oh, i see, how about just adding 'skip' flag to amdgpu_ring and setting 
it before calling
amdgpu_fence_driver_force_completion and resetting it after, then inside 
amdgpu_fence_driver_force_completion
you can just skip the signaling part with this flag for fences with 
DMA_FENCE_FLAG_USER_BITS set

Less lines of code at least.

Andrey





+   }
/* after all hw jobs are reset, hw fence is meaningless, so 
force_completion */
amdgpu_fence_driver_force_completion(ring);
}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index eecf21d8ec33..815776c9a013 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct 
dma_fence **f, struct amd
job->ring = ring;
}
-   seq = ++ring->fence_drv.sync_seq;
-   dma_fence_init(fence, &amdgpu_fence_ops,
-  &ring->fence_drv.lock,
-  adev->fence_context + ring->idx,
-  seq);
+   if (job != NULL && job->base.resubmit_flag == 1) {
+   /* reinit seq for resubmitted jobs */
+   seq = ++ring->fence_drv.sync_seq;
+   fence->seqno = seq;
+   } else {
+   seq = ++ring->fence_drv.sync_seq;


Seems like you could do the above line only once, above the if-else, as it
was before.

Sure, I will modify this.
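
With the increment hoisted out as suggested, that hunk would presumably end
up looking something like this (illustrative only, not the posted patch):

seq = ++ring->fence_drv.sync_seq;
if (job != NULL && job->base.resubmit_flag == 1) {
        /* resubmitted job: the embedded fence is already initialized,
         * only move it to the new sequence number */
        fence->seqno = seq;
} else {
        dma_fence_init(fence, &amdgpu_fence_ops,
                       &ring->fence_drv.lock,
                       adev->fence_context + ring->idx,
                       seq);
}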


Best Regards,
JingWen Chen

+   dma_fence_init(fence, &amdgpu_fence_ops,
+   &ring->fence_drv.lock,
+   adev->fence_context + ring->idx,
+   seq);
+   }
if (job != NULL) {
/* mark this fence has a parent job */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 7c426e225b24..d6f848adc3f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -241,6 +241,7 @@ static struct dma_fence *

Re: [PATCH] drm/amdgpu: fix build error

2021-07-22 Thread Luben Tuikov
Please push your patch now that it's been reviewed.

Regards,
Luben

On 2021-07-22 10:22 a.m., Chen, Guchun wrote:
> [Public]
>
> I guess your branch does not have below patch:
>
> 7afefb81b72c drm/amdgpu: Rename flag which prevents HW access
>
> Regards,
> Guchun
>
> -Original Message-
> From: Tuikov, Luben  
> Sent: Thursday, July 22, 2021 10:14 PM
> To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Zhang, 
> Hawking ; Deucher, Alexander 
> 
> Subject: Re: [PATCH] drm/amdgpu: fix build error
>
> On 2021-07-22 5:25 a.m., Guchun Chen wrote:
>> drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c: In function 
>> 'smu_cmn_send_msg_without_waiting':
>> drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c:259:15: error: 'struct 
>> amdgpu_device' has no member named 'in_pci_err_recovery'
>>if (smu->adev->in_pci_err_recovery)
>>^~
>>   CC [M]  drivers/staging/rtl8192u/ieee80211/ieee80211_tx.o
>> scripts/Makefile.build:272: recipe for target 
>> 'drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o' failed
>> make[7]: *** [drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o] Error 1
>> scripts/Makefile.build:515: recipe for target 'drivers/gpu/drm/amd/amdgpu' 
>> failed
>> make[6]: *** [drivers/gpu/drm/amd/amdgpu] Error 2
>> make[6]: *** Waiting for unfinished jobs
>>
>> Fixes: e070ba49f3a7 drm/amd/pm: Fix a bug communicating with the SMU (v5)
>>
>> Signed-off-by: Guchun Chen 
> Generally, there is no empty lines between tags.
>
> Not sure what happened here? Rename? This compiled and worked fine on my 
> machine.
>
> Reviewed-by: Luben Tuikov 
>
> Regards,
> Luben
>
>> ---
>>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
>> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> index a48f18932357..a0e2111eb783 100644
>> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
>> @@ -256,7 +256,7 @@ int smu_cmn_send_msg_without_waiting(struct smu_context 
>> *smu,
>>  u32 reg;
>>  int res;
>>  
>> -if (smu->adev->in_pci_err_recovery)
>> +if (smu->adev->no_hw_access)
>>  return 0;
>>  
>>  mutex_lock(&smu->message_lock);



RE: [PATCH] drm/amdgpu: fix build error

2021-07-22 Thread Chen, Guchun

I guess your branch does not have below patch:

7afefb81b72c drm/amdgpu: Rename flag which prevents HW access

Regards,
Guchun

-Original Message-
From: Tuikov, Luben  
Sent: Thursday, July 22, 2021 10:14 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Zhang, 
Hawking ; Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: fix build error

On 2021-07-22 5:25 a.m., Guchun Chen wrote:
> drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c: In function 
> 'smu_cmn_send_msg_without_waiting':
> drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c:259:15: error: 'struct 
> amdgpu_device' has no member named 'in_pci_err_recovery'
>if (smu->adev->in_pci_err_recovery)
>^~
>   CC [M]  drivers/staging/rtl8192u/ieee80211/ieee80211_tx.o
> scripts/Makefile.build:272: recipe for target 
> 'drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o' failed
> make[7]: *** [drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o] Error 1
> scripts/Makefile.build:515: recipe for target 'drivers/gpu/drm/amd/amdgpu' 
> failed
> make[6]: *** [drivers/gpu/drm/amd/amdgpu] Error 2
> make[6]: *** Waiting for unfinished jobs
>
> Fixes: e070ba49f3a7 drm/amd/pm: Fix a bug communicating with the SMU (v5)
>
> Signed-off-by: Guchun Chen 

Generally, there are no empty lines between tags.

Not sure what happened here? Rename? This compiled and worked fine on my 
machine.

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index a48f18932357..a0e2111eb783 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -256,7 +256,7 @@ int smu_cmn_send_msg_without_waiting(struct smu_context 
> *smu,
>   u32 reg;
>   int res;
>  
> - if (smu->adev->in_pci_err_recovery)
> + if (smu->adev->no_hw_access)
>   return 0;
>  
>   mutex_lock(&smu->message_lock);


Re: [PATCH] drm/amdgpu: fix build error

2021-07-22 Thread Luben Tuikov
On 2021-07-22 5:25 a.m., Guchun Chen wrote:
> drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c: In function 
> 'smu_cmn_send_msg_without_waiting':
> drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c:259:15: error: 'struct 
> amdgpu_device' has no member named 'in_pci_err_recovery'
>if (smu->adev->in_pci_err_recovery)
>^~
>   CC [M]  drivers/staging/rtl8192u/ieee80211/ieee80211_tx.o
> scripts/Makefile.build:272: recipe for target 
> 'drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o' failed
> make[7]: *** [drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o] Error 1
> scripts/Makefile.build:515: recipe for target 'drivers/gpu/drm/amd/amdgpu' 
> failed
> make[6]: *** [drivers/gpu/drm/amd/amdgpu] Error 2
> make[6]: *** Waiting for unfinished jobs
>
> Fixes: e070ba49f3a7 drm/amd/pm: Fix a bug communicating with the SMU (v5)
>
> Signed-off-by: Guchun Chen 

Generally, there are no empty lines between tags.

Not sure what happened here? Rename? This compiled and worked fine on my 
machine.

Reviewed-by: Luben Tuikov 

Regards,
Luben

> ---
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
> b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> index a48f18932357..a0e2111eb783 100644
> --- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> +++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
> @@ -256,7 +256,7 @@ int smu_cmn_send_msg_without_waiting(struct smu_context 
> *smu,
>   u32 reg;
>   int res;
>  
> - if (smu->adev->in_pci_err_recovery)
> + if (smu->adev->no_hw_access)
>   return 0;
>  
>   mutex_lock(&smu->message_lock);



Re: [PATCH v4 10/13] lib: test_hmm add module param for zone device type

2021-07-22 Thread Jason Gunthorpe
On Sat, Jul 17, 2021 at 02:21:32PM -0500, Alex Sierra wrote:
> In order to configure device generic in test_hmm, two
> module parameters should be passed, which correspon to the
> SP start address of each device (2) spm_addr_dev0 &
> spm_addr_dev1. If no parameters are passed, private device
> type is configured.

I don't think tests should need configuration like this, is it really
necessary? How can people with normal HW run this test?

Jason


Re: ThinkPad T14 Gen 1 AMD switching to lowest brightness during boot process with 5.14-rc2

2021-07-22 Thread Alex Deucher
On Thu, Jul 22, 2021 at 10:04 AM Martin Steigerwald  wrote:
>
> Not subscribed, so please Cc on answer.
>
> Hi!
>
> Just compiled 5.14-rc2 for my T14 Gen 1 AMD with "AMD Ryzen 7 PRO 4750U
> with Radeon Graphics" with Low Power Display (400 Nits).
>
> Using that kernel, the display starts with the usual high brightness.
> However, during boot the brightness goes down to what appears to be the
> minimum. In KDE Plasma however the brightness of the display is shown as
> maximum.
>
> If I move around the brightness slider just a few percent down from 100
> to maybe 98% or something like that the brightness is immediately reset
> to its actual value display by the slider.
>
> This did not happen with 5.13.

Please file a ticket here:
https://gitlab.freedesktop.org/drm/amd/-/issues
Can you bisect?

Alex

>
> I am still on X11 as some KDE Plasma features are not yet available for
> Wayland.
>
> This is on Devuan Ceres – basically Debian Sid without Systemd – with
> Runit as init system, so there no automatic Systemd brightness logic
> should be involved.
>
> If attached kernel configuration is stripped, please let me know.
>
> Best,
> --
> Martin


Re: [PATCH 2/2] drm: add tdr support for embedded hw_fence

2021-07-22 Thread Christian König

Am 22.07.21 um 16:45 schrieb Andrey Grodzovsky:


On 2021-07-22 6:45 a.m., Jingwen Chen wrote:

On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:

On 2021-07-20 11:13 p.m., Jingwen Chen wrote:

[Why]
After embedding the hw_fence into amdgpu_job, we need to add TDR support
for this feature.

[How]
1. Add a resubmit_flag for resubmit jobs.
2. Clear job fence from RCU and force complete vm flush fences in
 pre_asic_reset
3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
 for guilty jobs.

Signed-off-by: Jack Zhang 
Signed-off-by: Jingwen Chen 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  4 +++-
   drivers/gpu/drm/scheduler/sched_main.c |  1 +
   include/drm/gpu_scheduler.h    |  1 +
   5 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 40461547701a..fe0237f72a09 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct 
amdgpu_device *adev)

   int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
    struct amdgpu_reset_context *reset_context)
   {
-    int i, r = 0;
+    int i, j, r = 0;
   struct amdgpu_job *job = NULL;
   bool need_full_reset =
   test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
@@ -4406,6 +4406,16 @@ int amdgpu_device_pre_asic_reset(struct 
amdgpu_device *adev,

   if (!ring || !ring->sched.thread)
   continue;
+    /*clear job fence from fence drv to avoid force_completion
+ *leave NULL and vm flush fence in fence drv */
+    for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
+    struct dma_fence *old,**ptr;
+    ptr = &ring->fence_drv.fences[j];
+    old = rcu_dereference_protected(*ptr, 1);
+    if (old && test_bit(DMA_FENCE_FLAG_USER_BITS, 
&old->flags))) {

+    RCU_INIT_POINTER(*ptr, NULL);
+    }


Is this to avoid premature job free because of dma_fence_put inside
amdgpu_fence_process ?
I can't currently remember why but we probably want all the HW fences
currently in the ring to
be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
inside amdgpu_fence_process
and still do the signaling but not the dma_fence_put part

Andrey

Hi Andrey,

This is to avoid signaling the same fence twice. If we still do the
signaling, then the job in the pending list will be signaled first in
force_completion, and later be signaled in resubmit. This will go to
BUG() in amdgpu_fence_process.



Oh, i see, how about just adding 'skip' flag to amdgpu_ring and 
setting it before calling
amdgpu_fence_driver_force_completion and resetting it after, then 
inside amdgpu_fence_driver_force_completion
you can just skip the signaling part with this flag for fences with 
DMA_FENCE_FLAG_USER_BITS set

Less lines of code at least.


Still sounds quite a bit hacky.

I would rather suggest to completely drop the approach with 
amdgpu_fence_driver_force_completion(). I could never see why we would 
want that in the first place.


Regards,
Christian.



Andrey





+    }
   /* after all hw jobs are reset, hw fence is meaningless, 
so force_completion */

   amdgpu_fence_driver_force_completion(ring);
   }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index eecf21d8ec33..815776c9a013 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring 
*ring, struct dma_fence **f, struct amd

   job->ring = ring;
   }
-    seq = ++ring->fence_drv.sync_seq;
-    dma_fence_init(fence, &amdgpu_fence_ops,
-   &ring->fence_drv.lock,
-   adev->fence_context + ring->idx,
-   seq);
+    if (job != NULL && job->base.resubmit_flag == 1) {
+    /* reinit seq for resubmitted jobs */
+    seq = ++ring->fence_drv.sync_seq;
+    fence->seqno = seq;
+    } else {
+    seq = ++ring->fence_drv.sync_seq;


Seems like you could do the above line only once above if-else as it 
was

before

Sure, I will modify this.


Best Regards,
JingWen Chen

+    dma_fence_init(fence, &amdgpu_fence_ops,
+    &ring->fence_drv.lock,
+    adev->fence_context + ring->idx,
+    seq);
+    }
   if (job != NULL) {
   /* mark this fence has a parent job */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

index 7c426e225b24..d6f848adc3f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -241,6 +241,7 @@ static struct dma_fen

Re: [PATCH 2/2] drm: add tdr support for embeded hw_fence

2021-07-22 Thread Jingwen Chen
On Wed Jul 21, 2021 at 12:53:51PM -0400, Andrey Grodzovsky wrote:
> 
> On 2021-07-20 11:13 p.m., Jingwen Chen wrote:
> > [Why]
> > After embeded hw_fence to amdgpu_job, we need to add tdr support
> > for this feature.
> > 
> > [How]
> > 1. Add a resubmit_flag for resubmit jobs.
> > 2. Clear job fence from RCU and force complete vm flush fences in
> > pre_asic_reset
> > 3. skip dma_fence_get for resubmit jobs and add a dma_fence_put
> > for guilty jobs.
> > 
> > Signed-off-by: Jack Zhang 
> > Signed-off-by: Jingwen Chen 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 12 +++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 16 +++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c|  4 +++-
> >   drivers/gpu/drm/scheduler/sched_main.c |  1 +
> >   include/drm/gpu_scheduler.h|  1 +
> >   5 files changed, 27 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 40461547701a..fe0237f72a09 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -4382,7 +4382,7 @@ int amdgpu_device_mode1_reset(struct amdgpu_device 
> > *adev)
> >   int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
> >  struct amdgpu_reset_context *reset_context)
> >   {
> > -   int i, r = 0;
> > +   int i, j, r = 0;
> > struct amdgpu_job *job = NULL;
> > bool need_full_reset =
> > test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
> > @@ -4406,6 +4406,16 @@ int amdgpu_device_pre_asic_reset(struct 
> > amdgpu_device *adev,
> > if (!ring || !ring->sched.thread)
> > continue;
> > +   /*clear job fence from fence drv to avoid force_completion
> > +*leave NULL and vm flush fence in fence drv */
> > +   for (j = 0; j <= ring->fence_drv.num_fences_mask; j ++) {
> > +   struct dma_fence *old,**ptr;
> > +   ptr = &ring->fence_drv.fences[j];
> > +   old = rcu_dereference_protected(*ptr, 1);
> > +   if (old && test_bit(DMA_FENCE_FLAG_USER_BITS, 
> > &old->flags)) {
> > +   RCU_INIT_POINTER(*ptr, NULL);
> > +   }
> 
> 
> Is this to avoid premature job free because of dma_fence_put inside
> amdgpu_fence_process ?
> I can't currently remember why but we probably want all the HW fences
> currently in the ring to
> be forced signaled - maybe better to test for DMA_FENCE_FLAG_USER_BITS
> inside amdgpu_fence_process
> and still do the signaling but not the dma_fence_put part
> 
> Andrey

Hi Andrey,

This is to avoid signaling the same fence twice. If we still do the
signaling, then the job in the pending list will be signaled first in
force_completion, and later be signaled in resubmit. This will go to
BUG() in amdgpu_fence_process.
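
For reference, the signaling loop in amdgpu_fence_process() ends roughly like
this (paraphrased), and dma_fence_signal() returns -EINVAL for a fence that is
already signaled, which is exactly what the second signal would hit:

	r = dma_fence_signal(fence);
	if (r)
		BUG();

	dma_fence_put(fence);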

> 
> > +   }
> > /* after all hw jobs are reset, hw fence is meaningless, so 
> > force_completion */
> > amdgpu_fence_driver_force_completion(ring);
> > }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index eecf21d8ec33..815776c9a013 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -156,11 +156,17 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
> > struct dma_fence **f, struct amd
> > job->ring = ring;
> > }
> > -   seq = ++ring->fence_drv.sync_seq;
> > -   dma_fence_init(fence, &amdgpu_fence_ops,
> > -  &ring->fence_drv.lock,
> > -  adev->fence_context + ring->idx,
> > -  seq);
> > +   if (job != NULL && job->base.resubmit_flag == 1) {
> > +   /* reinit seq for resubmitted jobs */
> > +   seq = ++ring->fence_drv.sync_seq;
> > +   fence->seqno = seq;
> > +   } else {
> > +   seq = ++ring->fence_drv.sync_seq;
> 
> 
> Seems like you could do the above line only once above if-else as it was
> before

Sure, I will modify this.
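
I.e. something like this (untested, just to show the shape):

	seq = ++ring->fence_drv.sync_seq;
	if (job && job->base.resubmit_flag) {
		/* reuse the already embedded fence, only refresh its seqno */
		fence->seqno = seq;
	} else {
		dma_fence_init(fence, &amdgpu_fence_ops,
			       &ring->fence_drv.lock,
			       adev->fence_context + ring->idx,
			       seq);
	}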


Best Regards,
JingWen Chen
> 
> > +   dma_fence_init(fence, &amdgpu_fence_ops,
> > +   &ring->fence_drv.lock,
> > +   adev->fence_context + ring->idx,
> > +   seq);
> > +   }
> > if (job != NULL) {
> > /* mark this fence has a parent job */
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 7c426e225b24..d6f848adc3f4 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -241,6 +241,7 @@ static struct dma_fence *amdgpu_job_run(struct 
> > drm_sched_job *sched_job)
> > dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if 
> > VRAM lost */
> > if (finished->erro

Re: drm/amdgpu: Check pmops for desired suspend state

2021-07-22 Thread Raul E Rangel
On Thu, Jul 22, 2021 at 10:57:14AM +0530, Pratik Vishwakarma wrote:
> [Why]
> The user might set mem_sleep to deep, and that results
> in amdgpu resume errors.
> 
> [How]
> Check with the PM core for the default suspend state
> 
> Signed-off-by: Pratik Vishwakarma 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index af1710971ff3..d92196429741 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1468,7 +1468,8 @@ static int amdgpu_pmops_suspend(struct device *dev)
>   struct amdgpu_device *adev = drm_to_adev(drm_dev);
>   int r;
>  
> - if (amdgpu_acpi_is_s0ix_supported(adev))
> + if (amdgpu_acpi_is_s0ix_supported(adev)
> + && pm_suspend_default_s2idle())

Why pm_suspend_default_s2idle instead of pm_suspend_via_firmware?
>   adev->in_s0ix = true;
>   adev->in_s3 = true;
>   r = amdgpu_device_suspend(drm_dev, true);
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


ThinkPad T14 Gen 1 AMD switching to lowest brightness during boot process with 5.14-rc2

2021-07-22 Thread Martin Steigerwald
Not subscribed, so please Cc on answer.

Hi!

Just compiled 5.14-rc2 for my T14 Gen 1 AMD with "AMD Ryzen 7 PRO 4750U 
with Radeon Graphics" with Low Power Display (400 Nits).

Using that kernel, the display starts with the usual high brightness.
However, during boot the brightness goes down to what appears to be the
minimum. In KDE Plasma, though, the display brightness is shown as
maximum.

If I move the brightness slider down just a few percent from 100 to maybe
98% or so, the brightness is immediately reset to the actual value
displayed by the slider.

This did not happen with 5.13.

I am still on X11 as some KDE Plasma features are not yet available for 
Wayland.

This is on Devuan Ceres – basically Debian Sid without Systemd – with
Runit as the init system, so no automatic Systemd brightness logic
should be involved.

If attached kernel configuration is stripped, please let me know.

Best,
-- 
Martin

config-5.14.0-rc2-t14.xz
Description: application/xz
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: fix build error

2021-07-22 Thread Guchun Chen
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c: In function 
'smu_cmn_send_msg_without_waiting':
drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.c:259:15: error: 'struct 
amdgpu_device' has no member named 'in_pci_err_recovery'
   if (smu->adev->in_pci_err_recovery)
   ^~
  CC [M]  drivers/staging/rtl8192u/ieee80211/ieee80211_tx.o
scripts/Makefile.build:272: recipe for target 
'drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o' failed
make[7]: *** [drivers/gpu/drm/amd/amdgpu/../pm/swsmu/smu_cmn.o] Error 1
scripts/Makefile.build:515: recipe for target 'drivers/gpu/drm/amd/amdgpu' 
failed
make[6]: *** [drivers/gpu/drm/amd/amdgpu] Error 2
make[6]: *** Waiting for unfinished jobs

Fixes: e070ba49f3a7 ("drm/amd/pm: Fix a bug communicating with the SMU (v5)")

Signed-off-by: Guchun Chen 
---
 drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
index a48f18932357..a0e2111eb783 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c
@@ -256,7 +256,7 @@ int smu_cmn_send_msg_without_waiting(struct smu_context 
*smu,
u32 reg;
int res;
 
-   if (smu->adev->in_pci_err_recovery)
+   if (smu->adev->no_hw_access)
return 0;
 
mutex_lock(&smu->message_lock);
-- 
2.17.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 1/2] drm/amd/pm: restore user customized OD settings properly for NV1x

2021-07-22 Thread Lazar, Lijo




On 7/22/2021 2:03 PM, Quan, Evan wrote:

[AMD Official Use Only]




-Original Message-
From: Lazar, Lijo 
Sent: Thursday, July 22, 2021 4:10 PM
To: Quan, Evan ; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander 
Subject: Re: [PATCH 1/2] drm/amd/pm: restore user customized OD settings
properly for NV1x



On 7/22/2021 8:50 AM, Evan Quan wrote:

The customized OD settings can be divided into two parts: those
committed ones and non-committed ones.
- For those changes which had been fed to SMU before S3/S4/Runpm
  suspend kicked, they are committed changes. They should be properly
  restored and fed to SMU on S3/S4/Runpm resume.
- For those non-committed changes, they are restored only without feeding
  to SMU.

Change-Id: Iea7cf7908dfcd919a4d0205e10bff91b1149a440
Signed-off-by: Evan Quan 
---
   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  8 +++
   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  9 
   .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 52 ++-
   .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c| 12 +
   4 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3e89852e4820..8ba53f16d2d9 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -231,6 +231,7 @@ struct smu_user_dpm_profile {
uint32_t power_limit;
uint32_t fan_speed_percent;
uint32_t flags;
+   uint32_t committed_od;

/* user clock state information */
uint32_t clk_mask[SMU_CLK_COUNT];
@@ -352,6 +353,7 @@ struct smu_table_context

void*overdrive_table;
void*boot_overdrive_table;
+   void*committed_overdrive_table;


Maybe rename it to user_overdrive_table - OD table with user settings?

[Quan, Evan] Actually "overdrive_table" is the user_overdrive_table. It contains all the
modifications, including those not yet committed (the commit is performed by echo "c" >
/sys/class/drm/card1/device/pp_od_clk_voltage).
The new member added refers to only those user-customized settings which have been
committed. That's why it was named "committed_overdrive_table". Any good suggestion for the naming?
naming?


As far as I understand, the problem is that overdrive_table can hold
intermediate settings, as the edit/commit is a two-step process. At any
point we can have user_od_table keep the latest user mode settings. That
is true when the user restores the default settings as well; at that point
we can just copy the boot settings back to the user_od table and keep the
flag false, indicating that there are no custom settings.
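
For the restore-to-defaults case that could be as simple as the following in
the od_edit_dpm_table handler (rough sketch, untested, reusing the names from
this series; the flag name would of course follow whatever we settle on):

	case PP_OD_RESTORE_DEFAULT_TABLE:
		memcpy(table_context->overdrive_table,
		       table_context->boot_overdrive_table,
		       sizeof(OverDriveTable_t));
		smu->user_dpm_profile.committed_od = false;
		break;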





uint32_tgpu_metrics_table_size;
void*gpu_metrics_table;
@@ -623,6 +625,12 @@ struct pptable_funcs {
 enum PP_OD_DPM_TABLE_COMMAND type,
 long *input, uint32_t size);

+   /**
+* @restore_committed_od_settings: Restore the customized and committed
+* OD settings on S3/S4/Runpm resume.
+*/
+   int (*restore_committed_od_settings)(struct smu_context *smu);
+
/**
 * @get_clock_by_type_with_latency: Get the speed and latency of a clock
 *  domain.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index ebe672142808..5f7d98e99f76 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -416,6 +416,15 @@ static void smu_restore_dpm_user_profile(struct smu_context *smu)
}
}

+   /* Restore customized and committed OD settings */
+   if (smu->user_dpm_profile.committed_od) {
+   if (smu->ppt_funcs->restore_committed_od_settings) {
+   ret = smu->ppt_funcs->restore_committed_od_settings(smu);
+   if (ret)
+   dev_err(smu->adev->dev, "Failed to upload customized OD settings\n");
+   }
+   }
+


Just thinking if there is a need to handle it ASIC specific. The flags and table
pointer are maintained in common structure. So can't we do the restore of
user OD settings directly in this common flow instead of having each ASIC to
implement the callback?

[Quan, Evan] Restoring the OD settings is ASIC specific, as it is performed on the OD
table and that (OD table) is ASIC specific.


I was thinking in terms of the final logic that is being done. The below
structures are available in the common table context and we have common
logic to update the table. If there are no custom OD settings available,
we could handle it with the flag.


+   struct smu_table_context *table_context = &smu->smu_table;
+   void *od_table = table_context->committed_overdrive_ta

RE: [PATCH] drm/amdgpu: Change the imprecise output

2021-07-22 Thread Zhou, Peng Ju
[AMD Official Use Only]

Reviewed-by: Peng Ju Zhou 

> -Original Message-
> From: amd-gfx  On Behalf Of Roy Sun
> Sent: Thursday, July 22, 2021 3:14 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Sun, Roy 
> Subject: [PATCH] drm/amdgpu: Change the imprecise output
> 
> The failure reason is that the vfgate is disabled
> 
> Signed-off-by: Roy Sun 
> ---
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> index 9f3d82dfb79c..f94ef15b3166 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
> @@ -56,7 +56,7 @@
>  #define GFX10_NUM_GFX_RINGS_Sienna_Cichlid   1
>  #define GFX10_MEC_HPD_SIZE   2048
> 
> -#define RLCG_INTERFACE_NOT_ENABLED   0x400
> +#define RLCG_VFGATE_DISABLED 0x400
>  #define RLCG_WRONG_OPERATION_TYPE0x200
>  #define RLCG_NOT_IN_RANGE0x100
> 
> @@ -1571,8 +1571,8 @@ static u32 gfx_v10_rlcg_rw(struct amdgpu_device
> *adev, u32 offset, u32 v, uint32
> 
>   if (i >= retries) {
>   if (RLCG_ERROR_REPORT_ENABLED(adev)) {
> - if (tmp & RLCG_INTERFACE_NOT_ENABLED)
> - pr_err("The interface is not enabled,
> program reg:0x%05x failed!\n", offset);
> + if (tmp & RLCG_VFGATE_DISABLED)
> + pr_err("The vfgate is disabled,
> program reg:0x%05x failed!\n", offset);
>   else if (tmp &
> RLCG_WRONG_OPERATION_TYPE)
>   pr_err("Wrong operation type,
> program reg:0x%05x failed!\n", offset);
>   else if (tmp & RLCG_NOT_IN_RANGE)
> --
> 2.32.0
> 
> ___
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


RE: [PATCH 1/2] drm/amd/pm: restore user customized OD settings properly for NV1x

2021-07-22 Thread Quan, Evan
[AMD Official Use Only]



> -Original Message-
> From: Lazar, Lijo 
> Sent: Thursday, July 22, 2021 4:10 PM
> To: Quan, Evan ; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander 
> Subject: Re: [PATCH 1/2] drm/amd/pm: restore user customized OD settings
> properly for NV1x
> 
> 
> 
> On 7/22/2021 8:50 AM, Evan Quan wrote:
> > The customized OD settings can be divided into two parts: those
> > committed ones and non-committed ones.
> >- For those changes which had been fed to SMU before S3/S4/Runpm
> >  suspend kicked, they are committed changes. They should be properly
> >  restored and fed to SMU on S3/S4/Runpm resume.
> >- For those non-committed changes, they are restored only without
> feeding
> >  to SMU.
> >
> > Change-Id: Iea7cf7908dfcd919a4d0205e10bff91b1149a440
> > Signed-off-by: Evan Quan 
> > ---
> >   drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  8 +++
> >   drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  9 
> >   .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 52
> ++-
> >   .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c| 12 +
> >   4 files changed, 69 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > index 3e89852e4820..8ba53f16d2d9 100644
> > --- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > +++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
> > @@ -231,6 +231,7 @@ struct smu_user_dpm_profile {
> > uint32_t power_limit;
> > uint32_t fan_speed_percent;
> > uint32_t flags;
> > +   uint32_t committed_od;
> >
> > /* user clock state information */
> > uint32_t clk_mask[SMU_CLK_COUNT];
> > @@ -352,6 +353,7 @@ struct smu_table_context
> >
> > void*overdrive_table;
> > void*boot_overdrive_table;
> > +   void*committed_overdrive_table;
> 
> Maybe rename it to user_overdrive_table - OD table with user settings?
[Quan, Evan] Actually "overdrive_table" is the user_overdrive_table. It
contains all the modifications, including those not yet committed (the commit
is performed by echo "c" > /sys/class/drm/card1/device/pp_od_clk_voltage).
The new member added refers to only those user-customized settings which have
been committed. That's why it was named "committed_overdrive_table". Any good
suggestion for the naming?
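
For example, from user space the two steps look like this (illustrative only,
written in C instead of the echo commands; the card index and the clock value
are just examples):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int od_write(const char *cmd)
{
	int fd = open("/sys/class/drm/card1/device/pp_od_clk_voltage", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, cmd, strlen(cmd));
	close(fd);
	return n == (ssize_t)strlen(cmd) ? 0 : -1;
}

int main(void)
{
	od_write("s 1 1900\n");	/* staged edit: lands only in overdrive_table */
	return od_write("c\n");	/* commit: this is when it is fed to the SMU */
}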
> 
> > uint32_tgpu_metrics_table_size;
> > void*gpu_metrics_table;
> > @@ -623,6 +625,12 @@ struct pptable_funcs {
> >  enum PP_OD_DPM_TABLE_COMMAND
> type,
> >  long *input, uint32_t size);
> >
> > +   /**
> > +* @restore_committed_od_settings: Restore the customized and
> committed
> > +* OD settings on S3/S4/Runpm resume.
> > +*/
> > +   int (*restore_committed_od_settings)(struct smu_context *smu);
> > +
> > /**
> >  * @get_clock_by_type_with_latency: Get the speed and latency of
> a clock
> >  *  domain.
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > index ebe672142808..5f7d98e99f76 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
> > @@ -416,6 +416,15 @@ static void smu_restore_dpm_user_profile(struct
> smu_context *smu)
> > }
> > }
> >
> > +   /* Restore customized and committed OD settings */
> > +   if (smu->user_dpm_profile.committed_od) {
> > +   if (smu->ppt_funcs->restore_committed_od_settings) {
> > +   ret = smu->ppt_funcs-
> >restore_committed_od_settings(smu);
> > +   if (ret)
> > +   dev_err(smu->adev->dev, "Failed to upload
> customized OD settings\n");
> > +   }
> > +   }
> > +
> 
> Just thinking if there is a need to handle it ASIC specific. The flags and 
> table
> pointer are maintained in common structure. So can't we do the restore of
> user OD settings directly in this common flow instead of having each ASIC to
> implement the callback?
[Quan, Evan] Restoring the OD settings is ASIC specific, as it is performed on
the OD table and that (OD table) is ASIC specific.
> 
> > /* Disable restore flag */
> > smu->user_dpm_profile.flags &=
> ~SMU_DPM_USER_PROFILE_RESTORE;
> >   }
> > diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> > b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> > index 59ea59acfb00..4752299d7f91 100644
> > --- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> > +++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
> > @@ -2296,39 +2296,45 @@ static int
> navi10_set_default_od_settings(struct smu_context *smu)
> > (OverDriveTable_t *)smu->smu_table.boot_overdrive_table;
> > int ret = 0;
> >
> > -   ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0,
> (void *)

Re: [PATCH 1/2] drm/amd/pm: restore user customized OD settings properly for NV1x

2021-07-22 Thread Lazar, Lijo




On 7/22/2021 8:50 AM, Evan Quan wrote:

The customized OD settings can be divided into two parts: those
committed ones and non-committed ones.
   - For those changes which had been fed to SMU before S3/S4/Runpm
 suspend kicked, they are committed changes. They should be properly
 restored and fed to SMU on S3/S4/Runpm resume.
   - For those non-committed changes, they are restored only without feeding
 to SMU.

Change-Id: Iea7cf7908dfcd919a4d0205e10bff91b1149a440
Signed-off-by: Evan Quan 
---
  drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  8 +++
  drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  9 
  .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 52 ++-
  .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c| 12 +
  4 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3e89852e4820..8ba53f16d2d9 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -231,6 +231,7 @@ struct smu_user_dpm_profile {
uint32_t power_limit;
uint32_t fan_speed_percent;
uint32_t flags;
+   uint32_t committed_od;
  
  	/* user clock state information */

uint32_t clk_mask[SMU_CLK_COUNT];
@@ -352,6 +353,7 @@ struct smu_table_context
  
  	void*overdrive_table;

void*boot_overdrive_table;
+   void*committed_overdrive_table;


Maybe rename it to user_overdrive_table - OD table with user settings?


uint32_tgpu_metrics_table_size;
void*gpu_metrics_table;
@@ -623,6 +625,12 @@ struct pptable_funcs {
 enum PP_OD_DPM_TABLE_COMMAND type,
 long *input, uint32_t size);
  
+	/**

+* @restore_committed_od_settings: Restore the customized and committed
+* OD settings on S3/S4/Runpm resume.
+*/
+   int (*restore_committed_od_settings)(struct smu_context *smu);
+
/**
 * @get_clock_by_type_with_latency: Get the speed and latency of a clock
 *  domain.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index ebe672142808..5f7d98e99f76 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -416,6 +416,15 @@ static void smu_restore_dpm_user_profile(struct 
smu_context *smu)
}
}
  
+	/* Restore customized and committed OD settings */

+   if (smu->user_dpm_profile.committed_od) {
+   if (smu->ppt_funcs->restore_committed_od_settings) {
+   ret = 
smu->ppt_funcs->restore_committed_od_settings(smu);
+   if (ret)
+   dev_err(smu->adev->dev, "Failed to upload customized 
OD settings\n");
+   }
+   }
+


Just thinking if there is a need to handle it ASIC specific. The flags 
and table pointer are maintained in common structure. So can't we do the 
restore of user OD settings directly in this common flow instead of 
having each ASIC to implement the callback?
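
E.g. something along these lines from the common code (rough sketch, untested,
reusing the names introduced by this patch):

static int smu_restore_user_od_settings(struct smu_context *smu)
{
	struct smu_table_context *table_context = &smu->smu_table;
	void *od_table = table_context->committed_overdrive_table;
	int ret;

	if (!smu->user_dpm_profile.committed_od)
		return 0;

	ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0, od_table, true);
	if (ret)
		dev_err(smu->adev->dev, "Failed to import overdrive table!\n");

	return ret;
}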



/* Disable restore flag */
smu->user_dpm_profile.flags &= ~SMU_DPM_USER_PROFILE_RESTORE;
  }
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 59ea59acfb00..4752299d7f91 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2296,39 +2296,45 @@ static int navi10_set_default_od_settings(struct 
smu_context *smu)
(OverDriveTable_t *)smu->smu_table.boot_overdrive_table;
int ret = 0;
  
-	ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0, (void *)od_table, false);

+   ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0, (void 
*)boot_od_table, false);
if (ret) {
dev_err(smu->adev->dev, "Failed to get overdrive table!\n");
return ret;
}
  
-	if (!od_table->GfxclkVolt1) {

+   if (!boot_od_table->GfxclkVolt1) {
ret = navi10_overdrive_get_gfx_clk_base_voltage(smu,
-   
&od_table->GfxclkVolt1,
-   
od_table->GfxclkFreq1);
+   
&boot_od_table->GfxclkVolt1,
+   
boot_od_table->GfxclkFreq1);
if (ret)
return ret;
}
  
-	if (!od_table->GfxclkVolt2) {

+   if (!boot_od_table->GfxclkVolt2) {
ret = navi10_overdrive_get_gfx_clk_base_voltage(smu,
-   
&od_tabl

[PATCH] drm/amdgpu: Change the imprecise output

2021-07-22 Thread Roy Sun
The failure reason is that the vfgate is disabled

Signed-off-by: Roy Sun 
---
 drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c 
b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index 9f3d82dfb79c..f94ef15b3166 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -56,7 +56,7 @@
 #define GFX10_NUM_GFX_RINGS_Sienna_Cichlid 1
 #define GFX10_MEC_HPD_SIZE 2048
 
-#define RLCG_INTERFACE_NOT_ENABLED 0x400
+#define RLCG_VFGATE_DISABLED   0x400
 #define RLCG_WRONG_OPERATION_TYPE  0x200
 #define RLCG_NOT_IN_RANGE  0x100
 
@@ -1571,8 +1571,8 @@ static u32 gfx_v10_rlcg_rw(struct amdgpu_device *adev, 
u32 offset, u32 v, uint32
 
if (i >= retries) {
if (RLCG_ERROR_REPORT_ENABLED(adev)) {
-   if (tmp & RLCG_INTERFACE_NOT_ENABLED)
-   pr_err("The interface is not enabled, 
program reg:0x%05x failed!\n", offset);
+   if (tmp & RLCG_VFGATE_DISABLED)
+   pr_err("The vfgate is disabled, program 
reg:0x%05x failed!\n", offset);
else if (tmp & RLCG_WRONG_OPERATION_TYPE)
pr_err("Wrong operation type, program 
reg:0x%05x failed!\n", offset);
else if (tmp & RLCG_NOT_IN_RANGE)
-- 
2.32.0

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Add msix restore for pass-through mode

2021-07-22 Thread Chengzhe Liu
In pass-through mode, after a mode 1 reset, the MSI-X enablement status is
lost and the driver never receives interrupts again. So we should restore
the MSI-X status after a mode 1 reset.

Signed-off-by: Chengzhe Liu 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 83af307e97cd..e1aa4a5e6a98 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -584,7 +584,7 @@ void amdgpu_irq_gpu_reset_resume_helper(struct 
amdgpu_device *adev)
 {
int i, j, k;
 
-   if (amdgpu_sriov_vf(adev))
+   if (amdgpu_sriov_vf(adev) || amdgpu_passthrough(adev))
amdgpu_restore_msix(adev);
 
for (i = 0; i < AMDGPU_IRQ_CLIENTID_MAX; ++i) {
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH] drm/amdgpu: Check pmops for desired suspend state

2021-07-22 Thread Pratik Vishwakarma
[Why]
The user might set mem_sleep to deep, and that results
in amdgpu resume errors.

[How]
Check with the PM core for the default suspend state

Signed-off-by: Pratik Vishwakarma 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index af1710971ff3..d92196429741 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1468,7 +1468,8 @@ static int amdgpu_pmops_suspend(struct device *dev)
struct amdgpu_device *adev = drm_to_adev(drm_dev);
int r;
 
-   if (amdgpu_acpi_is_s0ix_supported(adev))
+   if (amdgpu_acpi_is_s0ix_supported(adev)
+   && pm_suspend_default_s2idle())
adev->in_s0ix = true;
adev->in_s3 = true;
r = amdgpu_device_suspend(drm_dev, true);
-- 
2.25.1

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 1/2] drm/amd/pm: restore user customized OD settings properly for NV1x

2021-07-22 Thread Evan Quan
The customized OD settings can be divided into two parts: those
committed ones and non-committed ones.
  - For those changes which had been fed to SMU before S3/S4/Runpm
suspend kicked, they are committed changes. They should be properly
restored and fed to SMU on S3/S4/Runpm resume.
  - For those non-committed changes, they are restored only without feeding
to SMU.

Change-Id: Iea7cf7908dfcd919a4d0205e10bff91b1149a440
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h   |  8 +++
 drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c |  9 
 .../gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c   | 52 ++-
 .../gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c| 12 +
 4 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h 
b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
index 3e89852e4820..8ba53f16d2d9 100644
--- a/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
+++ b/drivers/gpu/drm/amd/pm/inc/amdgpu_smu.h
@@ -231,6 +231,7 @@ struct smu_user_dpm_profile {
uint32_t power_limit;
uint32_t fan_speed_percent;
uint32_t flags;
+   uint32_t committed_od;
 
/* user clock state information */
uint32_t clk_mask[SMU_CLK_COUNT];
@@ -352,6 +353,7 @@ struct smu_table_context
 
void*overdrive_table;
void*boot_overdrive_table;
+   void*committed_overdrive_table;
 
uint32_tgpu_metrics_table_size;
void*gpu_metrics_table;
@@ -623,6 +625,12 @@ struct pptable_funcs {
 enum PP_OD_DPM_TABLE_COMMAND type,
 long *input, uint32_t size);
 
+   /**
+* @restore_committed_od_settings: Restore the customized and committed
+* OD settings on S3/S4/Runpm resume.
+*/
+   int (*restore_committed_od_settings)(struct smu_context *smu);
+
/**
 * @get_clock_by_type_with_latency: Get the speed and latency of a clock
 *  domain.
diff --git a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c 
b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
index ebe672142808..5f7d98e99f76 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/amdgpu_smu.c
@@ -416,6 +416,15 @@ static void smu_restore_dpm_user_profile(struct 
smu_context *smu)
}
}
 
+   /* Restore customized and committed OD settings */
+   if (smu->user_dpm_profile.committed_od) {
+   if (smu->ppt_funcs->restore_committed_od_settings) {
+   ret = 
smu->ppt_funcs->restore_committed_od_settings(smu);
+   if (ret)
+   dev_err(smu->adev->dev, "Failed to upload 
customized OD settings\n");
+   }
+   }
+
/* Disable restore flag */
smu->user_dpm_profile.flags &= ~SMU_DPM_USER_PROFILE_RESTORE;
 }
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 59ea59acfb00..4752299d7f91 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2296,39 +2296,45 @@ static int navi10_set_default_od_settings(struct 
smu_context *smu)
(OverDriveTable_t *)smu->smu_table.boot_overdrive_table;
int ret = 0;
 
-   ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0, (void 
*)od_table, false);
+   ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0, (void 
*)boot_od_table, false);
if (ret) {
dev_err(smu->adev->dev, "Failed to get overdrive table!\n");
return ret;
}
 
-   if (!od_table->GfxclkVolt1) {
+   if (!boot_od_table->GfxclkVolt1) {
ret = navi10_overdrive_get_gfx_clk_base_voltage(smu,
-   
&od_table->GfxclkVolt1,
-   
od_table->GfxclkFreq1);
+   
&boot_od_table->GfxclkVolt1,
+   
boot_od_table->GfxclkFreq1);
if (ret)
return ret;
}
 
-   if (!od_table->GfxclkVolt2) {
+   if (!boot_od_table->GfxclkVolt2) {
ret = navi10_overdrive_get_gfx_clk_base_voltage(smu,
-   
&od_table->GfxclkVolt2,
-   
od_table->GfxclkFreq2);
+   
&boot_od_table->GfxclkVolt2,
+   
boot_od_table->GfxclkFreq2);
if (ret)

[PATCH 2/2] drm/amd/pm: restore user customized OD settings properly for Sienna Cichlid

2021-07-22 Thread Evan Quan
Properly restore those committed and non-committed user customized OD
settings.

Change-Id: I25396df0b3ecdd7a0d9fc77ed220b0abf1fde020
Signed-off-by: Evan Quan 
---
 drivers/gpu/drm/amd/pm/inc/smu_v11_0.h   |  2 ++
 drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c  | 15 +--
 .../drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c  | 16 +---
 drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c   | 13 +
 4 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h 
b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
index 385b2ea5379c..7bf25efc3936 100644
--- a/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
+++ b/drivers/gpu/drm/amd/pm/inc/smu_v11_0.h
@@ -302,5 +302,7 @@ void smu_v11_0_interrupt_work(struct smu_context *smu);
 
 int smu_v11_0_set_light_sbr(struct smu_context *smu, bool enable);
 
+int smu_v11_0_restore_committed_od_settings(struct smu_context *smu);
+
 #endif
 #endif
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
index 4752299d7f91..fbd29129550a 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/navi10_ppt.c
@@ -2513,19 +2513,6 @@ static int navi10_od_edit_dpm_table(struct smu_context 
*smu, enum PP_OD_DPM_TABL
return ret;
 }
 
-static int navi10_restore_committed_od_settings(struct smu_context *smu)
-{
-   struct smu_table_context *table_context = &smu->smu_table;
-   void *od_table = table_context->committed_overdrive_table;
-   int ret = 0;
-
-   ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE, 0, (void 
*)od_table, true);
-   if (ret)
-   dev_err(smu->adev->dev, "Failed to import overdrive table!\n");
-
-   return ret;
-}
-
 static int navi10_run_btc(struct smu_context *smu)
 {
int ret = 0;
@@ -3289,7 +3276,7 @@ static const struct pptable_funcs navi10_ppt_funcs = {
.set_soft_freq_limited_range = smu_v11_0_set_soft_freq_limited_range,
.set_default_od_settings = navi10_set_default_od_settings,
.od_edit_dpm_table = navi10_od_edit_dpm_table,
-   .restore_committed_od_settings = navi10_restore_committed_od_settings,
+   .restore_committed_od_settings = 
smu_v11_0_restore_committed_od_settings,
.run_btc = navi10_run_btc,
.set_power_source = smu_v11_0_set_power_source,
.get_pp_feature_mask = smu_cmn_get_pp_feature_mask,
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
index e0638dc3f617..f0a7dc1d1640 100644
--- a/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
+++ b/drivers/gpu/drm/amd/pm/swsmu/smu11/sienna_cichlid_ppt.c
@@ -1957,15 +1957,16 @@ static int 
sienna_cichlid_set_default_od_settings(struct smu_context *smu)
int ret = 0;
 
ret = smu_cmn_update_table(smu, SMU_TABLE_OVERDRIVE,
-  0, (void *)od_table, false);
+  0, (void *)boot_od_table, false);
if (ret) {
dev_err(smu->adev->dev, "Failed to get overdrive table!\n");
return ret;
}
 
-   memcpy(boot_od_table, od_table, sizeof(OverDriveTable_t));
+   sienna_cichlid_dump_od_table(smu, boot_od_table);
 
-   sienna_cichlid_dump_od_table(smu, od_table);
+   if (!smu->adev->in_suspend)
+   memcpy(od_table, boot_od_table, sizeof(OverDriveTable_t));
 
return 0;
 }
@@ -2136,6 +2137,14 @@ static int sienna_cichlid_od_edit_dpm_table(struct 
smu_context *smu,
dev_err(smu->adev->dev, "Failed to import overdrive 
table!\n");
return ret;
}
+   if (memcmp(table_context->overdrive_table, 
table_context->boot_overdrive_table,
+   sizeof(OverDriveTable_t))) {
+   smu->user_dpm_profile.committed_od = true;
+   memcpy(table_context->committed_overdrive_table, 
table_context->overdrive_table,
+   sizeof(OverDriveTable_t));
+   } else {
+   smu->user_dpm_profile.committed_od = false;
+   }
break;
 
case PP_OD_EDIT_VDDGFX_OFFSET:
@@ -3902,6 +3911,7 @@ static const struct pptable_funcs 
sienna_cichlid_ppt_funcs = {
.set_soft_freq_limited_range = smu_v11_0_set_soft_freq_limited_range,
.set_default_od_settings = sienna_cichlid_set_default_od_settings,
.od_edit_dpm_table = sienna_cichlid_od_edit_dpm_table,
+   .restore_committed_od_settings = 
smu_v11_0_restore_committed_od_settings,
.run_btc = sienna_cichlid_run_btc,
.set_power_source = smu_v11_0_set_power_source,
.get_pp_feature_mask = smu_cmn_get_pp_feature_mask,
diff --git a/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c 
b/drivers/gpu/drm/amd/pm/swsmu/smu11/smu_v11_0.c
index 28fc3f17c1b1..323126a7ca49