amdgpu: recreate retired page's bo if vram get lost in gpu reset

Zhou1, Tao Mon, 30 Sep 2019 02:07:25 -0700

Hi Christian:

So the allocation information of a bo's vram page could be reserved after vram 
lost?
In fact, this series could be dropped if the result of 
amdgpu_bo_create_reserved is guaranteed to be failed after vram lost (protect 
bad pages from reallocating).


Hi Andrey:

A bad page's allocation information will not be lost in gpu reset according to 
Christian's comments. Do you have any other concern?

Regards,
Tao

> -----Original Message-----
> From: Christian König <ckoenig.leichtzumer...@gmail.com>
> Sent: 2019年9月30日 16:35
> To: Zhou1, Tao <tao.zh...@amd.com>; amd-gfx@lists.freedesktop.org;
> Grodzovsky, Andrey <andrey.grodzov...@amd.com>; Chen, Guchun
> <guchun.c...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
> Subject: Re: [PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get
> lost in gpu reset
> 
> Am 30.09.19 um 05:19 schrieb Zhou1, Tao:
> > the info of retired page's bo may be lost if vram lost is encountered
> > in gpu reset (gpu page table in vram may be lost), force to recreate
> > all bos
> >
> > v2: simplify NULL pointer check
> >      add more comments
> 
> Repeating on the v2 of the patch set, this is complete nonsense.
> 
> When VRAM is lost the BOs and their reservation still exits, only the content
> is lost because the MC is has been resetted.
> 
> Regards,
> Christian.
> 
> >
> > Signed-off-by: Tao Zhou <tao.zh...@amd.com>
> > Suggested-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  1 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 48
> ++++++++++++++++++++++
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
> >   3 files changed, 50 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index 62955156653c..a89f6d053b3f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3675,6 +3675,7 @@ static int amdgpu_do_asic_reset(struct
> amdgpu_hive_info *hive,
> >                             if (vram_lost) {
> >                                     DRM_INFO("VRAM is lost due to GPU
> reset!\n");
> >                                     amdgpu_inc_vram_lost(tmp_adev);
> > +
>       amdgpu_ras_recovery_reset(tmp_adev);
> >                             }
> >
> >                             r = amdgpu_gtt_mgr_recover(
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 486568ded6d6..3f2a2f13e4c6 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -1573,6 +1573,54 @@ static int amdgpu_ras_recovery_fini(struct
> > amdgpu_device *adev)
> >
> >     return 0;
> >   }
> > +
> > +/*
> > + * the info of retired page's bo may be lost if vram lost is
> > +encountered
> > + * in gpu reset (gpu page table in vram may be lost), force to
> > +recreate
> > + * all bos
> > + */
> > +void amdgpu_ras_recovery_reset(struct amdgpu_device *adev) {
> > +   struct amdgpu_ras *con = amdgpu_ras_get_context(adev);
> > +   struct ras_err_handler_data *data;
> > +   uint64_t bp;
> > +   struct amdgpu_bo *bo = NULL;
> > +   int i;
> > +
> > +   if (!con || !con->eh_data)
> > +           return;
> > +
> > +   /* Note: It's called in gpu reset, recovery_lock may be acquired
> before
> > +    * gpu reset. The following code is out of protect of con-
> >recovery_lock
> > +    * in case of deadlock and bps_bo should be recreated (if successfully)
> > +    * whether recovery_lock is locked or not.
> > +    * We assume there is no other ras recovery operation simultaneous
> during
> > +    * gpu reset.
> > +    */
> > +   data = con->eh_data;
> > +   /* force to reserve all retired page again */
> > +   data->last_reserved = 0;
> > +
> > +   for (i = data->last_reserved; i < data->count; i++) {
> > +           bp = data->bps[i].retired_page;
> > +
> > +           /* There are three cases of reserve error should be ignored:
> > +            * 1) a ras bad page has been allocated (used by someone);
> > +            * 2) a ras bad page has been reserved (duplicate error
> injection
> > +            *    for one page);
> > +            * 3) bo info isn't lost in gpu reset
> > +            */
> > +           if (amdgpu_bo_create_kernel_at(adev, bp <<
> AMDGPU_GPU_PAGE_SHIFT,
> > +                                          AMDGPU_GPU_PAGE_SIZE,
> > +                                          AMDGPU_GEM_DOMAIN_VRAM,
> > +                                          &bo, NULL))
> > +                   DRM_NOTE("RAS NOTE: recreate bo for retired page
> 0x%llx fail\n", bp);
> > +           else
> > +                   data->bps_bo[i] = bo;
> > +           data->last_reserved = i + 1;
> > +           bo = NULL;
> > +   }
> > +}
> >   /* recovery end */
> >
> >   /* return 0 if ras will reset gpu and repost.*/ diff --git
> > a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index f80fd3428c98..7a606d3be806 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -479,6 +479,7 @@ static inline int amdgpu_ras_is_supported(struct
> amdgpu_device *adev,
> >   }
> >
> >   int amdgpu_ras_recovery_init(struct amdgpu_device *adev);
> > +void amdgpu_ras_recovery_reset(struct amdgpu_device *adev);
> >   int amdgpu_ras_request_reset_on_boot(struct amdgpu_device *adev,
> >             unsigned int block);
> >

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 1/3] drm/amdgpu: recreate retired page's bo if vram get lost in gpu reset

Reply via email to