-----Original Message-----
From: Lazar, Lijo <lijo.la...@amd.com> 
Sent: Thursday, January 20, 2022 3:32 PM
To: Chai, Thomas <yipeng.c...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Zhou1, Tao <tao.zh...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>; 
Clements, John <john.cleme...@amd.com>
Subject: Re: [PATCH V2 1/2] drm/amdgpu: Move xgmi ras initialization from 
.late_init to .early_init



On 1/20/2022 12:57 PM, Chai, Thomas wrote:
> 
> -----Original Message-----
> From: Lazar, Lijo <lijo.la...@amd.com>
> Sent: Thursday, January 20, 2022 1:49 PM
> To: Chai, Thomas <yipeng.c...@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Zhou1, Tao <tao.zh...@amd.com>; Zhang, Hawking 
> <hawking.zh...@amd.com>; Clements, John <john.cleme...@amd.com>; Chai, 
> Thomas <yipeng.c...@amd.com>
> Subject: Re: [PATCH V2 1/2] drm/amdgpu: Move xgmi ras initialization 
> from .late_init to .early_init
> 
> 
> 
> On 1/20/2022 8:48 AM, yipechai wrote:
>> Move xgmi ras initialization from .late_init to .early_init, which 
>> let xgmi ras can be initialized only once.
>>
>> Signed-off-by: yipechai <yipeng.c...@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 15 ++++++++++-----
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h |  1 +
>>    drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c  |  5 +++++
>>    drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c   |  5 +++++
>>    4 files changed, 21 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> index 3483a82f5734..788c0257832d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>> @@ -436,6 +436,16 @@ void amdgpu_gmc_filter_faults_remove(struct 
>> amdgpu_device *adev, uint64_t addr,
>>      } while (fault->timestamp < tmp);
>>    }
>>    
>> +int amdgpu_gmc_ras_early_init(struct amdgpu_device *adev) {
>> +    if (!adev->gmc.xgmi.connected_to_cpu) {
>> +            adev->gmc.xgmi.ras = &xgmi_ras;
>> +            amdgpu_ras_register_ras_block(adev, 
>> &adev->gmc.xgmi.ras->ras_block);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>    int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
>>    {
>>      int r;
>> @@ -452,11 +462,6 @@ int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev)
>>                      return r;
>>      }
>>    
>> -    if (!adev->gmc.xgmi.connected_to_cpu) {
>> -            adev->gmc.xgmi.ras = &xgmi_ras;
>> -            amdgpu_ras_register_ras_block(adev, 
>> &adev->gmc.xgmi.ras->ras_block);
>> -    }
>> -
>>      if (adev->gmc.xgmi.ras && adev->gmc.xgmi.ras->ras_block.ras_late_init) {
>>              r = adev->gmc.xgmi.ras->ras_block.ras_late_init(adev, NULL);
>>              if (r)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> index 0001631cfedb..ac4c0e50b45c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
>> @@ -318,6 +318,7 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device *adev,
>>                            uint16_t pasid, uint64_t timestamp);
>>    void amdgpu_gmc_filter_faults_remove(struct amdgpu_device *adev, uint64_t 
>> addr,
>>                                   uint16_t pasid);
>> +int amdgpu_gmc_ras_early_init(struct amdgpu_device *adev);
>>    int amdgpu_gmc_ras_late_init(struct amdgpu_device *adev);
>>    void amdgpu_gmc_ras_fini(struct amdgpu_device *adev);
>>    int amdgpu_gmc_allocate_vm_inv_eng(struct amdgpu_device *adev); 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> index 4f8d356f8432..7a6ad5d467b2 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
>> @@ -719,6 +719,7 @@ static void gmc_v10_0_set_gfxhub_funcs(struct 
>> amdgpu_device *adev)
>>    
>>    static int gmc_v10_0_early_init(void *handle)
>>    {
>> +    int r;
>>      struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>    
>>      gmc_v10_0_set_mmhub_funcs(adev);
>> @@ -734,6 +735,10 @@ static int gmc_v10_0_early_init(void *handle)
>>      adev->gmc.private_aperture_end =
>>              adev->gmc.private_aperture_start + (4ULL << 30) - 1;
>>    
>> +    r = amdgpu_gmc_ras_early_init(adev);
>> +    if (r)
>> +            return r;
>> +
> 
>> At this point it's unknown if RAS is applicable for the SKU. I think this 
>> failure check shouldn't be there (here and below one).
> 
>> amdgpu_gmc_ras_early_init is return 0 always, that way also this check is 
>> not needed.
> 
> [Thomas]  Just like calling amdgpu_gmc_ras_late_init,  checking the return 
> status may make the code extensible.
>          In amdgpu_gmc_ras_early_init,  the xgmi ras initialization may 
> always return 0, but it may add functions that need to check the return 
> status in future.
> 

>At this point, it's unknown

>1) If the device is part of XGMI hive or not.
>2) If the device supports RAS.

>For such devices, it doesn't make any sense to fail here based on this 
>function.

[Thomas] The current code in amdgpu_gmc_ras_early_init has no effect for the 
devices,  whether the device has xgmi hive or not, whether it support RAS or 
not.
                  Checking the return status is just for 
amdgpu_gmc_ras_early_init to make it easy to add future functions that need to 
check the return status.

> Thanks,
> Lijo
> 
>>      return 0;
>>    }
>>    
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index c76ffd1a70cd..3cdd3d459d51 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -1318,6 +1318,7 @@ static void gmc_v9_0_set_mca_funcs(struct 
>> amdgpu_device *adev)
>>    
>>    static int gmc_v9_0_early_init(void *handle)
>>    {
>> +    int r;
>>      struct amdgpu_device *adev = (struct amdgpu_device *)handle;
>>    
>>      /* ARCT and VEGA20 don't have XGMI defined in their IP discovery 
>> tables */ @@ -1347,6 +1348,10 @@ static int gmc_v9_0_early_init(void *handle)
>>      adev->gmc.private_aperture_end =
>>              adev->gmc.private_aperture_start + (4ULL << 30) - 1;
>>    
>> +    r = amdgpu_gmc_ras_early_init(adev);
>> +    if (r)
>> +            return r;
>> +
>>      return 0;
>>    }
>>    
>>

Reply via email to